International Journal of Computational Intelligence Systems

Volume 14, Issue 1, 2021, Pages 1633 - 1641

The Value Function with Regret Minimization Algorithm for Solving the Nash Equilibrium of Multi-Agent Stochastic Game

Authors
Luping Liu, Wensheng Jia*
College of Mathematics and Statistics, Guizhou University, Guizhou, Guiyang, 550025, China
*Corresponding author. Email: wsjia@gzu.edu.cn
Received 2 February 2021, Accepted 13 May 2021, Available Online 26 May 2021.
DOI
10.2991/ijcis.d.210520.001
Keywords
Regret minimization; Multi-agent; Stochastic game; Nash equilibrium; Spatial prisoner's dilemma
Abstract

In this paper, we study the value function with regret minimization algorithm for solving the Nash equilibrium of the multi-agent stochastic game (MASG). To begin with, the idea of regret minimization is introduced into the value function, and the value function with regret minimization algorithm is designed. Furthermore, we analyze the effect of the discount factor on the expected payoff. Finally, the single-agent stochastic game and the spatial prisoner's dilemma (SPD) are investigated in order to support the theoretical results. The simulation results show that when the temptation parameter is small, the cooperation strategy is dominant, whereas when the temptation parameter is large, the defection strategy is dominant. Therefore, the level of cooperation between agents can be improved by setting an appropriate temptation parameter.

Copyright
© 2021 The Authors. Published by Atlantis Press B.V.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

Fudenberg and Levine [1] put forward the theory of learning in games, in which boundedly rational agents aim to maximize their long-term payoff and minimize regret by constantly adjusting their strategies based on the information available to them. Recently, many scholars have focused on the Nash equilibrium (NE) of the multi-agent stochastic game (MASG). Yang and Wang [2] studied multi-agent reinforcement learning (MARL) from the perspective of game theory. Rubinstein [3] investigated how boundedly rational participants continuously modify their beliefs in repeated games, comparing current strategies with previous ones to make optimal strategy choices. Asienkiewicz and Balbus [4] presented an existence analysis of the NE for stochastic games under certain conditions. Watkins [5,6] first proposed the Q-learning method and proved its convergence. Littman [7] proposed minimax Q-learning for the two-person zero-sum stochastic game. Shoham et al. [8] discussed Nash-Q learning in general-sum stochastic games. Therefore, Q-learning and its various improved learning algorithms play an important role in computing the NE of the MASG.

Reinforcement learning [9] solves the MASG by interacting with a complex environment and learning from experience. Bowling and Veloso [10,11] proposed a classic method to evaluate MARL algorithms. MARL has recently been extensively used in wireless sensor networks [12], event-triggered consensus systems [13], traffic signal controllers [14], numerical algorithms [15], comparative analysis [16], integrodifferential algebraic systems [17], computational algorithms [18], and other fields. However, the majority of MARL algorithms either lack a rigorous convergence guarantee [19], converge only under strong assumptions such as the existence of a unique NE [20,21], or are provably nonconvergent in all cases [22]. Zinkevich [23] identified the nonconvergent behavior of the value-function method in general-sum stochastic games. Minagawa [24] considered a sufficient condition for the uniqueness of the NE in strategic-form games. Hansen et al. [25] proposed the concept of no regret to measure convergence, which provides a new criterion for evaluating convergence in zero-sum self-play [26,27]. Regret minimization has been used in a variety of games in recent years [28]. Inspired by the research works mentioned above, we mainly study the value function with regret minimization algorithm for solving the NE of the MASG. The central idea of regret minimization is that after an agent takes an action and obtains a payoff during the learning process, it can look back on the history of actions and payoffs taken so far and regret not having taken another action, namely, the best action in hindsight. The agents' goal is to minimize the cumulative regret, written as the sum $\sum_{t=1}^{T}\big(V^*(s,a)-V_t(s,a)\big)$ of the differences between the optimal value $V^*$ and the value $V_t$ of action $a$ at time $t$. Different from [29], which considered the expected time-average payoff and a finite state/action space, this paper considers the expected sum of discounted payoffs over an unlimited time horizon, which means that regret minimization can be regarded as a discounted-expected-payoff optimization criterion. In this paper, the idea of regret minimization is introduced into the value function, and the value function with regret minimization algorithm is designed. Furthermore, we analyze the effect of the discount factor on the expected payoff. Finally, the single-agent stochastic game (SASG) and the spatial prisoner's dilemma (SPD) are investigated in order to support the theoretical results. The simulation results show that when the temptation parameter is small, the cooperation strategy is dominant, whereas when the temptation parameter is large, the defection strategy is dominant. Therefore, the level of cooperation between agents can be improved by setting an appropriate temptation parameter.

The remainder of this paper is structured as follows: in Section 2, we introduce the model of the MASG and analyze the influence of the discount factor on the discounted expected payoff. In Section 3, the idea of regret minimization is introduced into the value function, and the value function with regret minimization algorithm is designed. In Section 4, a simple stochastic game and the SPD game are investigated in order to support the theoretical results. Finally, we present some brief conclusions.

2. PROBLEM DESCRIPTION AND PREREQUISITES

2.1. The Model of Multi-Agent Stochastic Game

A framework of MASG is given as follows [7]:

Let $N=\{1,\ldots,n\}$ denote the set of all agents. A MASG is a tuple $\langle N, S, A_i, D, R_i, \gamma\rangle$, where

  • $N$ is the set of agents;

  • S is the set of states;

  • $A_i$, $i\in N$, is the action set of the $i$-th agent, and $A = A_1 \times \cdots \times A_n$ denotes the joint action set of all agents;

  • $D: S \times A \times S \to [0,1]$ (with $\sum_{s'\in S} D(s'|s,a)=1$ for all $s\in S$, $a\in A$) is the state transition probability function, where $s'$ represents the possible state at the next moment;

  • $R_i: S \times A \times S \to \mathbb{R}$, $i\in N$, is the payoff function of the $i$-th agent, giving the expected payoff received by the agent under the joint action in each state;

  • $\gamma\in(0,1)$ denotes the discount factor. When $\gamma\to 0$, the agent is regarded as myopic, which means that it only cares about the immediate payoff. When $\gamma\to 1$, the agent is regarded as farsighted, which means that it is more interested in future payoffs.

In an infinite-horizon process [9], agent $i$'s discounted payoff from time step $t$ onward is

$$R_i^t = r_i^{t+1} + \gamma r_i^{t+2} + \gamma^2 r_i^{t+3} + \cdots = \sum_{l=1}^{\infty} \gamma^{l-1} r_i^{t+l}. \qquad (1)$$

The model of the MASG is shown in Figure 1.

Figure 1

The interaction of multi-agent and environment.

$\pi_i: S \to A_i$ denotes the strategy of agent $i$. Let $\pi=(\pi_1,\ldots,\pi_n)$ be the joint strategy of all agents. The value function $V_i^\pi$ defines the long-term cumulative payoff of agent $i$ in any state $s$ at time $t$, taking action $a$ under the joint strategy $\pi$, as follows:

$$V_i^t(s,a) = \sum_{a\in A} \pi_i(s,a)\Big(R_i^t(s'|s,a) + \gamma \sum_{s'\in S} D(s'|s,a)\, V_i^\pi(s')\Big), \quad \forall s\in S,\ t\in\{1,2,\ldots,T\}, \qquad (2)$$
where $T$ denotes the terminal time, i.e., the horizon. Equation (2) is referred to as the Bellman update equation of $V^\pi$ for agent $i$, and records the payoff obtained along the Markov chain $S_0, S_1, \ldots, S_t, S_{t+1}, \ldots$ with state $s$ as the initial state. The term $R_i^t(s'|s,a_t) + \gamma V_i^\pi(s')$ denotes, starting from state $s$ and taking action $a_t$ at time $t$, agent $i$'s payoff obtained by a one-step transition to $s'$ plus the discounted expected payoff collected from state $s'$. To solve the MASG, Equation (2) can be rewritten as an iterative dynamic programming equation as follows:
$$V^{k+1}(s,a) = R(s'|s,a) + \gamma \sum_{s'\in S} D(s'|s,a)\, V^{k}(s,a). \qquad (3)$$

The optimal value function of agent i is defined by

$$V_i^*(s,a) = \max_{a\in A}\Big(R_i(s'|s,a) + \gamma \sum_{s'\in S} D(s'|s,a)\, V_i^*(s,a)\Big), \qquad (4)$$
As rational agents, they attempt to find the best-response policy over all of their states.
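To make the iteration of Equations (3) and (4) concrete, the following is a minimal Python sketch of the dynamic-programming update for a single agent; the array layout (a transition tensor D[s, a, s'] and a payoff tensor R[s, a, s']), the tolerance, and the function name value_iteration are illustrative assumptions rather than notation taken from the paper.

```python
import numpy as np

def value_iteration(D, R, gamma, tol=1e-6, max_iter=10_000):
    """Iterate the Bellman update of Equations (3)-(4) for a single agent.

    D : array of shape (S, A, S), D[s, a, s'] = transition probability.
    R : array of shape (S, A, S), R[s, a, s'] = immediate payoff.
    Returns the optimal state values and a greedy (best-response) policy.
    """
    n_states = D.shape[0]
    V = np.zeros(n_states)                      # start from V_0 = 0 in every state
    for _ in range(max_iter):
        # Q[s, a] = sum_{s'} D[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("sat,sat->sa", D, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                   # maximize over actions, as in Eq. (4)
        if np.max(np.abs(V_new - V)) < tol:     # stop once the update is negligible
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)
```

Repeated application of this update is exactly the iteration of Equation (3); taking the maximum over actions recovers the optimal value function of Equation (4).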

Definition 2.1.

(Nash equilibrium (NE) of the MASG) Let $\langle N, S, A_i, D, R_i, \gamma\rangle$ be a MASG. A policy $\pi^* = (\pi_1^*, \ldots, \pi_i^*, \ldots, \pi_n^*)$ is a NE if, writing $\pi_{-i} = (\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_n)$, for all $a_i\in A_i$ $(i\in N)$ and all $s\in S$ the following inequality holds:

$$V_i(s, \pi_i^*, \pi_{-i}^*) \geq V_i(s, \pi_i, \pi_{-i}^*), \quad \forall \pi_i \in \Pi_i,$$
where $\Pi_i$ is the strategy space of agent $i$ and $V_i(s, \pi_i^*, \pi_{-i}^*)$ denotes the discounted cumulative payoff. $\pi^*$ is the NE of the MASG in the sense that each individual strategy $\pi_i^*$ is a best response to the others. At the NE of the MASG, each agent maximizes its own discounted expected payoff, and no agent can obtain a higher payoff by unilaterally changing its strategy as long as all other agents keep their strategies unchanged.

2.2. The Analysis of Discount Factor

For a Markov decision process (MDP) system, we consider the influence of the discount factor on the expected payoff in the MASG. We now carry out a simple experiment on a MASG $\langle N, S, A_i, D, R, \gamma\rangle$ with different discount factors $\gamma$, where the transition probability function $D$ and payoff function $R$ are as follows:

$$D = \begin{pmatrix} 2/3 & 1/2 & 0 & 0 & 0\\ 1/3 & 0 & 1/3 & 0 & 1/3\\ 0 & 0 & 0 & 1/3 & 2/3\\ 1/4 & 1/2 & 0 & 1/4 & 0\\ 0 & 1/4 & 1/2 & 1/4 & 0 \end{pmatrix}, \qquad R = \begin{pmatrix} 2 & 1 & 0 & 0 & 3\\ 0 & 2 & 1 & 0 & 1\\ 1 & 0 & 2 & 1 & 0\\ 2 & 0 & 0 & 2 & 1\\ 0 & 2 & 0 & 2 & 1 \end{pmatrix}.$$

Meanwhile, $V_0=[0,0,0,0,0]^{\mathrm{T}}$ denotes the initial value function vector, where $\mathrm{T}$ represents the transpose operator. The convergence behavior of the value function under different discount factors is shown in Figure 2.

Figure 2

The diagram of the convergence of the value function under different discount factor.

According to Figure 2(a–f), we can observe that, starting from $V_0$, the value function $V$ converges as the number of iteration steps increases, and the optimal value function is unique. Through this experiment, we see that the value function is sensitive to the value of the discount factor. When $\gamma\to 0$, the agent is myopic and the expected payoff is small. When $\gamma\to 1$, the agent is farsighted and the expected payoff is large. Consequently, myopic agents only care about immediate benefits, while farsighted agents are more likely to obtain higher benefits in the future. Another point that needs to be noted is that the expected payoff does not converge when $\gamma\geq 1$.
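The $\gamma$-sensitivity experiment of this subsection can be reproduced with a short sweep over discount factors, reusing the value_iteration sketch from Section 2.1; the particular $\gamma$ grid below is an illustrative choice, not the one used for Figure 2.

```python
import numpy as np

# Reuses value_iteration from the sketch in Section 2.1.
def gamma_sweep(D, R, gammas=(0.1, 0.3, 0.5, 0.7, 0.9, 0.99)):
    """Compare converged optimal values for several discount factors, starting from V_0 = 0."""
    for gamma in gammas:
        V, _ = value_iteration(D, R, gamma)
        # Myopic agents (gamma near 0) settle on small values, while farsighted
        # agents (gamma near 1) accumulate much larger discounted payoffs.
        print(f"gamma = {gamma:4.2f}  ->  V* = {np.round(V, 3)}")
```

Consistent with the remark above, the iteration only converges for $\gamma<1$; for $\gamma\geq 1$ the discounted sum in Equation (1) is no longer finite.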

3. THE VALUE FUNCTION WITH REGRET MINIMIZATION ALGORITHM

3.1. The Cumulative Regret Minimization

Assume a finite horizon; a policy $\pi$ is required to approach the optimal strategy at each iteration. The regret $\vartheta_t^k$ is the difference between the optimal value function $V^*$ of (4) and the value function $V^\pi$ of (2) at each time step:

$$\vartheta_t^k = V^*(s,a) - V_t(s,a), \qquad (5)$$

where $\vartheta_t^k$ denotes the regret degree of adopting strategy $\pi$ in state $s$. Our goal is to minimize the agents' cumulative regret,

$$\Theta_T^K = \sum_{t<T} \vartheta_t^K, \qquad (6)$$
where K denotes the terminal time step.

In Bubeck [30], formula (6) is the normalized objective. The loss $\vartheta_t^K$ is defined as follows [31]:

$$\vartheta_t^K = \min_{k<K} \vartheta_t^k. \qquad (7)$$

We define the cumulative regret minimization of all agents in the MASG as follows:

$$\Theta_T^K = \sum_{t<T}\sum_{k<K} \vartheta_t^k, \qquad (8)$$
where $\Theta_T^K$ denotes an upper bound on the agent's regret, i.e., the gap to be minimized between the value of the current strategy and the value of the optimal strategy.

3.2. Arithmetic Flow of the Value Function with Regret Minimization Algorithm

The implementation steps of the value function with regret minimization algorithm are as follows:

Step 1: Initialize the parameters. Randomly generate some initial strategies, set the discount factor, and set the cumulative regret threshold to 0.0001.

Step 2: Each agent's value function $V_t(s,a)$ is calculated by Equation (2), and the optimal value function $V^*(s,a)$ is calculated by Equation (4).

Step 3: The cumulative regret $\Theta_T^K = \sum_{t<T}\sum_{k<K}\vartheta_t^k$ is calculated by Equation (8).

Step 4: Stopping condition: does the cumulative regret satisfy $\Theta_T^K < 0.0001$ for all agents? If yes, we output the optimal strategy $\pi^*$; otherwise, we return to Step 1.

Once the cumulative regret criterion is satisfied and the optimal strategy is obtained, the value of the discounted expected payoff is determined. The flow of the value function with regret minimization algorithm is shown in Figure 3.

Figure 3

Flow chart of value function under regret minimization algorithm.
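As a rough illustration of Steps 1–4, the loop below evaluates randomly generated strategies against the optimal value function (reusing the value_iteration sketch from Section 2.1) and stops once the cumulative regret of Equation (8) falls below the 0.0001 threshold. The pure random-search step and the single-agent policy evaluation are simplifying assumptions made for this sketch; they are not the paper's exact procedure.

```python
import numpy as np

def evaluate_policy(D, R, gamma, policy, tol=1e-8, max_iter=10_000):
    """Value of a fixed deterministic policy: Equation (2) without the max."""
    n_states = D.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        V_new = np.array([D[s, policy[s]] @ (R[s, policy[s]] + gamma * V)
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new

def regret_minimization(D, R, gamma, threshold=1e-4, seed=0):
    """Steps 1-4: resample strategies until the cumulative regret falls below the threshold."""
    rng = np.random.default_rng(seed)
    V_star, _ = value_iteration(D, R, gamma)             # Step 2: optimal values, Eq. (4)
    n_states, n_actions = D.shape[0], D.shape[1]
    while True:
        policy = rng.integers(n_actions, size=n_states)  # Step 1: a randomly generated strategy
        V_pi = evaluate_policy(D, R, gamma, policy)      # Step 2: value of the strategy, Eq. (2)
        regret = np.sum(V_star - V_pi)                   # Step 3: cumulative regret, Eqs. (5)-(8)
        if regret < threshold:                           # Step 4: stopping condition
            return policy, V_pi
```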

4. THE NE OF A MASG

4.1. A Simple Stochastic Game

The aim of each agent is to maximize its long-term discounted expected payoff while responding to the other agents. We give an example of a SASG as follows:

Example 1.

Let $N=1$; the agent's state set is $S=\{s_1, s_2, s_3\}$. In states $s_1$ and $s_2$, we select an action from the action sets $A(s_1)=A(s_2)=\{a_1, a_2\}$; in state $s_3$, we choose the action from $A(s_3)=\{a_2\}$. If we select action $a_1$ in state $s_1$, then the payoff is $R(s_1,a_1)=2$ and we move to state $s_2$ with probability 1. If we choose action $a_2$ in state $s_1$, then the payoff is $R(s_1,a_2)=3$ and we remain in state $s_1$ with probability 1. In state $s_2$, if we select $a_1$, then we receive $R(s_2,a_1)=5$ and move to state $s_1$ with probability 1, whereas choosing action $a_2$ yields $R(s_2,a_2)=10$, and we shift to state $s_3$ with probability 0.5 or remain in state $s_2$ with probability 0.5. In state $s_3$ we can only select $a_2$, which means $R(s_3,a_2)=0$, and we remain in state $s_3$ with probability 1. Assume that the agent is sufficiently farsighted, so the discount factor is $\gamma=0.9$ by the analysis of the discount factor in Section 2.2.

The above description is summarized in Table 1. We suppose that the horizon is $T$ and the final payoff is $r_T(s)$, $s\in S$.

Agent (N):                   1
State (S):                   s1               s1               s2               s2               s3
Action (A):                  a1               a2               a1               a2               a2
Payoff (R):                  2                3                5                10               0
Transition probability (D):  (0.0, 1.0, 0.0)  (1.0, 0.0, 0.0)  (1.0, 0.0, 0.0)  (0.0, 0.5, 0.5)  (0.0, 0.0, 1.0)

Table 1

The SASG described in Example 1.

Assume that decisions are made at $t=0,1,2$, i.e., $T=3$. Moreover, $R_3(s_1)=R_3(s_2)=R_3(s_3)=0$. In state $s_3$, $\pi^{t,*}(s_3)=a_2$ and $V^{t,*}(s_3)=0$ for all $t$. At the terminal time, the agent's payoff is $R_T(s_1)=0$ and $V^3(s_1)=V^3(s_2)=0$.

By the backward induction method, at time t=2 in state s1, we have

$$V^2(s_1,a_1) = 2 + V^3(s_2) = 2, \quad V^2(s_1,a_2) = 3 + V^3(s_1) = 3.$$

So $V^{2,*}(s_1)=3$ and the optimal strategy is $\pi^*(s_1,2)=a_2$. In state $s_2$, we have

$$V^2(s_2,a_1) = 5 + V^3(s_1) = 5, \quad V^2(s_2,a_2) = 10 + 0.9V^3(s_2) + 0.9V^{3,*}(s_3) = 10.$$

So $V^{2,*}(s_2)=10$ and the optimal strategy is $\pi^*(s_2,2)=a_2$.

At time t=1 in state s1, we have

$$V^1(s_1,a_1) = 2 + V^{2,*}(s_2) = 12, \quad V^1(s_1,a_2) = 3 + V^{2,*}(s_1) = 6.$$

So $V^{1,*}(s_1)=12$ and the optimal strategy is $\pi^*(s_1,1)=a_1$. In state $s_2$, we have

$$V^1(s_2,a_1) = 5 + V^{2,*}(s_1) = 8, \quad V^1(s_2,a_2) = 10 + 0.9V^{2,*}(s_2) + 0.9V^{2,*}(s_3) = 19.$$

So $V^{1,*}(s_2)=19$ and the optimal strategy is $\pi^*(s_2,1)=a_2$.

At time t=0 in state s1, we have

$$V^0(s_1,a_1) = 2 + V^{1,*}(s_2) = 21, \quad V^0(s_1,a_2) = 3 + V^{1,*}(s_1) = 15.$$

So $V^{0,*}(s_1)=21$ and the optimal strategy is $\pi^*(s_1,0)=a_1$. In state $s_2$, we have

$$V^0(s_2,a_1) = 5 + V^{1,*}(s_1) = 17, \quad V^0(s_2,a_2) = 10 + 0.9V^{1,*}(s_2) + 0.9V^{1,*}(s_3) = 27.1.$$

So $V^{0,*}(s_2)=27.1$ and the optimal strategy is $\pi^*(s_2,0)=a_2$.

Therefore, the optimal value function and the optimal strategy in every state are as follows:

$$V^* = \begin{array}{c|ccc} & t=0 & t=1 & t=2 \\ \hline s_1 & 21 & 12 & 3 \\ s_2 & 27.1 & 19 & 10 \\ s_3 & 0 & 0 & 0 \end{array}, \qquad \pi^* = \begin{array}{c|ccc} & t=0 & t=1 & t=2 \\ \hline s_1 & a_1 & a_1 & a_2 \\ s_2 & a_2 & a_2 & a_2 \\ s_3 & a_2 & a_2 & a_2 \end{array},$$
where the payoff is $V^*(s_1)=21$ in state $s_1$, $V^*(s_2)=27.1$ in state $s_2$, and $V^*(s_3)=0$ in state $s_3$. Over the decision horizon, the NE of the SASG is $(a_1,a_1,a_2)$ in state $s_1$, $(a_2,a_2,a_2)$ in state $s_2$, and $(a_2,a_2,a_2)$ in state $s_3$. Therefore, the agent's learning behavior is able to converge to the NE of the SASG, and the agent has no regret under a fixed discount factor.
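The backward induction above can be mechanized as follows. This is a minimal sketch of the finite-horizon recursion applied to the data of Table 1; the sketch applies the discount $\gamma=0.9$ to every continuation value, so its intermediate numbers need not coincide exactly with the hand computation above.

```python
# States s1, s2, s3 -> indices 0, 1, 2; data taken from Table 1.
actions = {0: ["a1", "a2"], 1: ["a1", "a2"], 2: ["a2"]}
payoff = {(0, "a1"): 2, (0, "a2"): 3, (1, "a1"): 5, (1, "a2"): 10, (2, "a2"): 0}
trans = {(0, "a1"): [0.0, 1.0, 0.0], (0, "a2"): [1.0, 0.0, 0.0],
         (1, "a1"): [1.0, 0.0, 0.0], (1, "a2"): [0.0, 0.5, 0.5],
         (2, "a2"): [0.0, 0.0, 1.0]}

def backward_induction(gamma=0.9, horizon=3):
    """V^t(s) = max_a [R(s, a) + gamma * sum_s' D(s'|s, a) * V^{t+1}(s')], solved backwards."""
    V = [0.0, 0.0, 0.0]                      # terminal payoff R_T(s) = 0 for every state
    plan = []
    for t in reversed(range(horizon)):       # t = 2, 1, 0
        Q = {s: {a: payoff[s, a] + gamma * sum(p * v for p, v in zip(trans[s, a], V))
                 for a in actions[s]}
             for s in range(3)}
        pi_t = {s: max(Q[s], key=Q[s].get) for s in range(3)}   # greedy action per state
        V = [Q[s][pi_t[s]] for s in range(3)]
        plan.append((t, pi_t, V))
    return list(reversed(plan))

for t, pi_t, V in backward_induction():
    print(f"t = {t}: policy {pi_t}, values {[round(v, 2) for v in V]}")
```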

4.2. The Spatial Prisoners' Dilemma

The SPD [32] can be regarded as a two-agent, two-action stochastic game $\langle N, S, A_i, D, R_i, \gamma\rangle$, where $N=2$, the agents' state set $S$ corresponds to different temptation factors, and the agents' action set is $A_i=\{C,D\}$ $(i=1,2)$. Once all agents fix their strategies, each obtains one of four possible payoffs: R (Reward), S (Sucker), T (Temptation), and P (Penalty). In the multi-agent setting, if all agents select Cooperation (C), then they receive R (Reward); if all agents choose Defection (D), then they obtain P (Penalty); if some agents select Cooperation (C) and some Defection (D), the cooperators obtain S (Sucker) and the defectors gain T (Temptation). The four payoff values of the SPD satisfy the inequalities $T>R>P>S$ and $2R>T+S$.

Example 2.

Let $N=\{1,2\}$ be the set of two agents, the agents' state set is $S=\{s_1,s_2\}$, and $b$ $(b>1)$ denotes the temptation parameter. In states $s_1$ and $s_2$, each agent can choose an action from $A(s_1)=A(s_2)=\{C,D\}$. In state $s_1$, the immediate payoff of agent 1 is $R(s_1,C,C)=1$, $R(s_1,C,D)=0$, $R(s_1,D,C)=b$, and $R(s_1,D,D)=0$; the immediate payoff of agent 2 is $R(s_1,C,C)=1$, $R(s_1,C,D)=0$, $R(s_1,D,C)=b$, and $R(s_1,D,D)=0$ (here each action pair is written with the agent's own action first). If the agents choose the action pair $\{C,D\}$, $\{D,C\}$, or $\{D,D\}$ in state $s_1$, then they move to state $s_2$ with probability 1; if the agents select the action pair $\{C,C\}$ in state $s_1$, then they remain in state $s_1$ with probability 1. In state $s_2$, the immediate payoff of agent 1 is $R(s_2,C,C)=1$, $R(s_2,C,D)=0$, $R(s_2,D,C)=b$, and $R(s_2,D,D)=0$; the immediate payoff of agent 2 is $R(s_2,C,C)=1$, $R(s_2,C,D)=0$, $R(s_2,D,C)=b$, and $R(s_2,D,D)=0$. Once the agents reach state $s_2$, they remain in state $s_2$ with probability 1. In some cases, a finite-horizon problem for the SPD must be specified by identifying the horizon $T$ and the terminal payoff $R_T(s)$, $s\in S$. The above description is summarized in Table 2.

Table 2

The description of the MASG: state s1 (left) and state s2 (right).

The game starts in state $s_2$, where the NE of the agents is $(D,D)$. The MASG is a prisoners' dilemma, and no agent can obtain a higher payoff by unilaterally changing its strategy as long as the other agent keeps its strategy unchanged. Analyzing the discounted payoff in state $s_2$, we obtain

$$V_i^*(s_2) = \frac{1}{1-\gamma}.$$

In state $s_1$, the MASG can be expressed as the following matrix game:

$$\begin{array}{c|c|c} \text{Agent 1} \backslash \text{Agent 2} & C & D \\ \hline C & 1+\gamma V_1^*(s_1),\ 1+\gamma V_2^*(s_1) & \gamma V_1^*(s_2),\ b+\gamma V_2^*(s_2) \\ \hline D & b+\gamma V_1^*(s_2),\ \gamma V_2^*(s_2) & \gamma V_1^*(s_2),\ \gamma V_2^*(s_2) \end{array}$$

or, equivalently,

$$\begin{array}{c|c|c} \text{Agent 1} \backslash \text{Agent 2} & C & D \\ \hline C & 1+\gamma V_1^*(s_1),\ 1+\gamma V_2^*(s_1) & \frac{\gamma}{1-\gamma},\ b+\frac{\gamma}{1-\gamma} \\ \hline D & b+\frac{\gamma}{1-\gamma},\ \frac{\gamma}{1-\gamma} & \frac{1}{1-\gamma},\ \frac{1}{1-\gamma} \end{array}$$

where $b$ $(b>1)$ represents the temptation factor for agents using the Defection strategy.

Evidently, $(D,C)$ and $(C,D)$ are not equilibria of the MASG, since the agents are motivated to change their strategies, and $(D,D)$ is a NE for all values of $\gamma$. The pair of actions $(C,C)$ is a NE if we have

$$1+\gamma V_i(s_1) \geq b + \frac{\gamma}{1-\gamma} \iff V_i(s_1) \geq \frac{(2-b)\gamma + b - 1}{\gamma(1-\gamma)}, \quad (i=1,2).$$

Assume that both agents select Cooperation in state $s_1$; then

$$V_i(s_1) = 1 + \gamma\,\pi_i(s_1)\,V_i(s_1) = \frac{1}{1-\gamma}.$$

Meanwhile,

$$\frac{1}{1-\gamma} \geq \frac{(2-b)\gamma+b-1}{\gamma(1-\gamma)} \iff (\gamma-1)\big[(1-b)\gamma-(1-b)\big] \geq 0 \iff \gamma \geq 1.$$

On the one hand, $(D,D)$ is a NE of the MASG, but $(C,C)$ is not a NE unless $\gamma\geq 1$. On the other hand, $\gamma\geq 1$ is outside the range of the discount factor, so the SPD does not converge to the cooperative NE; that is, the agents do not converge to the strategy $(C,C)$ [32,33], just as in the classic prisoner's dilemma game. Thus, the payoff of the SPD is essentially independent of the discount factor, and we instead consider the influence of different temptation factors in order to raise the level of cooperation among all agents.
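The cooperation condition derived above can also be checked numerically. The sketch below plugs the continuation values $V_i^*(s_2)=\frac{1}{1-\gamma}$ and $V_i(s_1)=\frac{1}{1-\gamma}$ (mutual cooperation) into the stage game at $s_1$ and tests whether a unilateral deviation to D pays; the grids of $\gamma$ and $b$ values are illustrative.

```python
def cooperation_is_equilibrium(b, gamma):
    """Test 1 + gamma * V_i(s1) >= b + gamma / (1 - gamma), with V_i(s1) = 1 / (1 - gamma)."""
    v_coop_s1 = 1.0 / (1.0 - gamma)          # V_i(s1) under mutual cooperation
    v_star_s2 = 1.0 / (1.0 - gamma)          # V_i*(s2) from the analysis above
    stay_with_C = 1.0 + gamma * v_coop_s1    # keep cooperating, stay in s1
    deviate_to_D = b + gamma * v_star_s2     # defect once, then be absorbed in s2
    return stay_with_C >= deviate_to_D

for gamma in (0.5, 0.9, 0.99):
    for b in (1.1, 1.5, 1.9):
        print(f"gamma = {gamma}, b = {b}: (C, C) sustainable? {cooperation_is_equilibrium(b, gamma)}")
```

For every $b>1$ and $\gamma\in(0,1)$ the check returns False, in line with the conclusion that $(C,C)$ is not an equilibrium, which is why the temptation parameter $b$ is studied directly in the simulation below.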

Therefore, we simulate the SPD on a 300 × 300 grid with an even 50-50 split between cooperators and defectors randomly distributed on the grid, and the simulation runs for 300 generations. We set the temptation parameter to b = 1.1, 1.3, 1.5, 1.7, 1.8, and 1.9; the temptation parameter is independent of the discount factor γ. In the graphs of the final state, cooperators are shown in yellow, defectors in blue, cooperators that switch to defectors (C to D) in red, and defectors that switch to cooperators (D to C) in green.
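The paper does not spell out the update rule of the grid simulation, so the following is a minimal sketch assuming the standard Nowak–May dynamics: each cell plays the weak prisoner's dilemma (R = 1, T = b, S = P = 0) with its eight Moore neighbors and then synchronously imitates the strategy of the best-scoring cell in its neighborhood, itself included. The function name spatial_pd and the reduced grid size in the usage line are illustrative choices.

```python
import numpy as np

def spatial_pd(b, size=300, generations=300, seed=0):
    """Spatial prisoner's dilemma on a torus with payoffs R=1, T=b, S=P=0.

    grid[i, j] = 1 marks a cooperator, 0 a defector. Each generation every
    cell collects payoffs from its eight Moore neighbors and then copies the
    strategy of the best-scoring cell in its neighborhood (itself included).
    """
    rng = np.random.default_rng(seed)
    grid = (rng.random((size, size)) < 0.5).astype(int)   # even 50-50 random split
    shifts = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    coop_density = []
    for _ in range(generations):
        coop_neighbors = sum(np.roll(grid, (di, dj), axis=(0, 1)) for di, dj in shifts)
        # A cooperator earns 1 per cooperating neighbor; a defector earns b.
        score = np.where(grid == 1, coop_neighbors.astype(float), b * coop_neighbors)
        best_score, best_strategy = score.copy(), grid.copy()
        for di, dj in shifts:
            neighbor_score = np.roll(score, (di, dj), axis=(0, 1))
            neighbor_strategy = np.roll(grid, (di, dj), axis=(0, 1))
            better = neighbor_score > best_score
            best_score = np.where(better, neighbor_score, best_score)
            best_strategy = np.where(better, neighbor_strategy, best_strategy)
        grid = best_strategy                                # synchronous update
        coop_density.append(grid.mean())                    # fraction of cooperators
    return grid, coop_density

# Usage with a reduced grid for speed (the paper uses 300 x 300 and 300 generations).
final_grid, density = spatial_pd(b=1.5, size=100, generations=100)
print(f"cooperator density after the last generation: {density[-1]:.3f}")
```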

From Figure 4(a–f) we can see that the number of cooperators first drops suddenly, owing to agents being isolated and surrounded by defectors; the few cooperators that survive then form small clusters and rise in number, so when the temptation factor b is low the cooperators quickly become dominant over the defectors after a few generations.

Figure 4

The agents' strategy changes for b = 1.1, b = 1.3, and b = 1.7 (left); the numbers of cooperators and defectors (right).

Up to b = 1.7 the cooperators are dominant, but the defectors outnumber the cooperators as the temptation becomes still higher. This means that there exists a transition point between b = 1.7 and b = 1.9 at which the defectors overtake the cooperators. We can investigate how the parameter b influences the model by looking at the density of cooperators in each round and over time, and see how the model changes behavior for different values of b, especially in the region 1.7 < b < 1.9.

According to Figure 5(a–d), when b ≥ 1.8 the number of defectors keeps rising, and the defectors become dominant. Different temptation factors correspond to different states of the game, and the strategy adopted by the agents is closely related to the temptation factor. When 1 < b < 1.8, the agents' cooperation strategy is dominant and the level of cooperation is high. When b > 1.8, the agents' defection strategy is dominant, and the level of defection increases as b becomes larger. Obviously, we can improve the level of cooperation between agents by setting an appropriate temptation parameter with 1 < b < 1.8 and b close to 1.

Figure 5

The agents' strategy changes for different values of b (left); the numbers of cooperators and defectors (right).

5. CONCLUSION

In this paper, we make a new attempt to solve the NE of the MASG by using the value function with regret minimization algorithm. We take the discounted expected payoff as the optimization criterion for the agents. To begin with, the idea of regret minimization is introduced into the value function, and the value function with regret minimization algorithm is designed. Furthermore, we analyze the effect of the discount factor on the discounted expected payoff. Finally, the simulation results show that when the temptation parameter is small, the cooperation strategy is dominant, whereas when the temptation parameter is large, the defection strategy is dominant; we can improve the level of cooperation between agents by setting an appropriate temptation parameter with 1 < b < 1.8 and b close to 1. Hence, the value function with regret minimization algorithm is an effective way to solve the NE of the stochastic game. In future work, we are interested in exploring whether the value function with regret minimization algorithm can be used to solve more complex stochastic games with large-scale action sets or continuous action spaces.

CONFLICTS OF INTEREST

The authors declare they have no conflicts of interest.

AUTHORS' CONTRIBUTIONS

All authors read and approved the final manuscript.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (Grant Nos. [12061020], [71961003]), the Science and Technology Foundation of Guizhou Province (Grant Nos. 20201Y284, 20205016, 2021088), and the Foundation of Guizhou University (Grant Nos. [201405], [201811]). The authors acknowledge this support.

REFERENCES

1. D. Fudenberg and D.K. Levine, The Theory of Learning in Games, MIT Press, Cambridge, MA, USA, 1998.
2. Y.D. Yang and J. Wang, An overview of multi-agent reinforcement learning from game theoretical perspective, Multiagent Syst., 2021. arXiv: 2011.00583
5. C.J.C.H. Watkins, Learning from Delayed Rewards, PhD Thesis, University of Cambridge, Cambridge, England, 1989.
8. Y. Shoham, R. Powers, and T. Grenager, Multi-agent Reinforcement Learning: A Critical Survey, Technical Report, Stanford University, 2003. https://www.cc.gatech.edu/classes/AY2008/cs7641_spring/handouts/MALearning_ACriticalSurvey_2003_0516.pdf
9. R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA, 1998.
10. M. Bowling and M. Veloso, Rational and convergent learning in stochastic games, in International Joint Conference on Artificial Intelligence, 2001, vol. 17, pp. 1021-1026. http://www.cs.cmu.edu/~mmv/papers/01ijcai-mike.pdf
19. K. Zhang, Z. Yang, and T. Başar, Multi-agent reinforcement learning: a selective overview of theories and algorithms, 2019. arXiv: 1911.10635
22. E. Mazumdar, L.J. Ratliff, S. Sastry, et al., Policy gradient in linear quadratic dynamic games has no convergence guarantees, in Smooth Games Optimization and Machine Learning Workshop (Bridging Game Theory and Deep Learning), 2019. arXiv: 1907.03712
23. M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, in Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 2003, pp. 928-936. http://www.stanford.edu/class/cs369/files/Zinkevich-GradDescent-ICML03.pdf
29. K. Zhang, Z. Yang, H. Liu, et al., Fully decentralized multi-agent reinforcement learning with networked agents, in Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, 2018, pp. 5872-5881. arXiv: 1802.08757