Can Simulation Entice Cooperation in Non-Zero-Sum Games?

In honor of Joe Halpern's 70th birthday, Vincent Conitzer posted this puzzle a year ago: https://www.cs.cmu.edu/~conitzer/halpernbirthday.pdf

Can you solve it?



[Image: the games from Conitzer's puzzle]

We provide a solution for Game 1.

Let us denote by p_B the probability that Player B plays 'left,' and by s in [0,1] the probability that a simulated action of Player B is presented to Player A. Then, we can formalize the expected outcomes and rewards as follows:

  • The probability p_A that Player A plays 'top' is determined by three other probabilities p_0, p_1, p_2 in [0,1] which Player A must set as her strategy: p_0 is the probability of playing 'top' when no simulation takes place; p_1 is the probability of playing 'top' when simulation took place and Player B's action sample is 'left'; p_2, finally, is the probability of playing 'top' when simulation took place and Player B's action sample is 'right.'

Based on the strategy (p_0, p_1, p_2), we can determine

p_A = s (p_B p_1 + (1 - p_B) p_2) + (1 - s) p_0.     (Equation 1)

In other words, the strategy for Player A has become a function of the strategy of Player B due to the occasional simulation.
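For concreteness, here is a minimal plain-Python sketch of Equation 1 (not Seeker code; the function name prob_top is purely illustrative):

def prob_top(p0, p1, p2, p_B, s):
    # Probability that Player A plays 'top': with probability 1 - s there is
    # no simulation and A plays 'top' with probability p0; with probability s
    # a simulated action of B is shown, which is 'left' with probability p_B
    # (A then plays 'top' with probability p1) and 'right' otherwise
    # (A then plays 'top' with probability p2).
    return s * (p_B * p1 + (1 - p_B) * p2) + (1 - s) * p0

# With the strategy (1, 1, 0) discussed below and p_B = 1, this yields 1.0
# regardless of s.
print(prob_top(1.0, 1.0, 0.0, p_B=1.0, s=0.5))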

  • Denote by B in R^{2 x 2} the reward matrix for Player B. The expected reward for Player B then is

E[R_B] = p_A p_B B_11 + p_A (1 - p_B) B_12 + (1 - p_A) p_B B_21 + (1 - p_A) (1 - p_B) B_22.

Using Equation 1, and setting

q = p_1 - p_2,   x = B_11 - B_12 - B_21 + B_22,   y = B_12 - B_22,   z = B_21 - B_22,

we get

E[R_B] = (s x q) p_B^2 + (s y q + s x p_2 + (1 - s) x p_0 + z) p_B + y (s p_2 + (1 - s) p_0) + B_22.

Therefore, given Player A’s strategy (p_0, p_1, p_2), Player B should choose p_B such that the above quadratic term is maximized.
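To make this best-response computation concrete, here is a minimal plain-Python sketch (not Seeker code; the helper name best_response_B is ours) that maximizes the expression above over p_B in [0, 1]:

def best_response_B(B, p0, p1, p2, s):
    # Player B's best response to Player A's strategy (p0, p1, p2):
    # maximize a * p_B^2 + b * p_B (+ constant) over p_B in [0, 1].
    q = p1 - p2
    x = B[0][0] - B[0][1] - B[1][0] + B[1][1]
    y = B[0][1] - B[1][1]
    z = B[1][0] - B[1][1]
    a = s * x * q                                      # quadratic coefficient
    b = s * y * q + s * x * p2 + (1 - s) * x * p0 + z  # linear coefficient
    if a < 0:
        # concave parabola: interior optimum, clipped to [0, 1]
        return min(1.0, max(0.0, -b / (2 * a)))
    # convex or linear: the optimum lies at an endpoint
    return 1.0 if a + b > 0 else 0.0

For instance, with the 'Practical Gameplay' matrices introduced further below and s = 0.7, best_response_B([[5, 9], [0, 2]], 1, 1, 0, 0.7) returns roughly 0.82.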

  • In our game, we have

[Image: the reward matrices of Game 1]

and therefore (the sums on the two main diagonals of B are equal here, so x = 0 and the expected reward is linear in p_B with slope s y q + z)

[Image: Player B's expected reward for Game 1]

As long as p_1 - p_2 = q > 2/3, Player B will want to play the (pure) strategy p_B = 1, which means Player B will always play 'left.'

  • Consider the strategy (p_0 = 1, p_1 = 1, p_2 = 0) for Player A, which implies q = 1 and therefore p_B = 1. Since Player B now only plays 'left,' Player A will always play 'top' (p_2 has become immaterial since no simulated sample will ever be 'right'). This is also the optimal outcome for Player A, since no other cell offers her a greater reward than top-left. Therefore, Player A has no incentive to change her strategy, and, as we have seen, given this strategy for Player A, Player B has no incentive to change his strategy either. We have therefore found an equilibrium with rewards (2, 3). Simulation has led to cooperation in this example.


Practical Gameplay

In the game above, we found that the reward matrix of the simulated player had equal sums on its two main diagonals, leading to x = 0, which turned the quadratic optimization problem for Player B into a linear one. Now, consider a game where this is not the case:

                     Player B: 'left'    Player B: 'right'
Player A: 'top'           (3, 5)              (0, 9)
Player A: 'bottom'       (-1, 0)              (2, 2)

(Each cell lists the rewards as (Player A, Player B); these are the matrices A and B used in the program below.)

This time, we will simulate an action of Player B with probability 0.7.

a) Can you devise a strategy for Player A? Is there an equilibrium whose rewards are strictly greater than (2, 2) for both players?

b) Consider a variant where Player B continues to want to maximize his expected reward, while Player A also wants to manage her risk of receiving a negative reward. In particular, what strategy should Player A play if she wants to keep her expected reward above 2.09 while minimizing her chance of running a loss?



Answering these questions in theory is very demanding. In practice, we may therefore resort to a tool for optimizing our strategy. The stochastic optimizer InsideOpt Seeker, for example, can answer both questions with the following simple program:

import seeker as skr

def halpern(s, A, B):
    # coefficients from the derivation above:
    # z = B_21 - B_22, y = B_12 - B_22, x = B_11 - B_12 - B_21 + B_22
    z = B[1][0] - B[1][1]
    y = B[0][1] - B[1][1]
    x = B[0][0] - B[1][0] - y

    env = skr.Env("license.sio", stochastic=True)
    env.set_restart_likelihood(0.05)
    p = [env.continuous(0, 1) for _ in range(3)]
    q = p[1] - p[2]
    # Player B maximizes a quadratic in p_B (see the derivation above)
    quadratic_coeff = s * x * q
    mixed_case = quadratic_coeff < 0
    linear_coeff = (s * y * q + (s * x) * p[2] +
                    ((1 - s) * x) * p[0] + z)
    # concave case: interior optimum, clipped to [0, 1]
    b_mixed = env.max([env.convert(0), env.min([env.convert(1),
                       -linear_coeff /
                       (2 * quadratic_coeff)])])
    # otherwise the optimum lies at one of the endpoints p_B = 0 or 1
    b_extreme = env.if_(quadratic_coeff + linear_coeff > 0,
                        1, 0)
    p_B = env.if_(mixed_case, b_mixed, b_extreme)
    p_A = s * (p_B * q + p[2]) + (1 - s) * p[0]
    draw = [env.continuous_uniform(0, 1) for _ in range(2)]
    plays = [env.if_(draw[0] < p_A, 0, 1),
             env.if_(draw[1] < p_B, 0, 1)]
    reward_A = env.index([plays[0], plays[1]], env.tensor(A))
    reward_B = env.index([plays[0], plays[1]], env.tensor(B))
    exp_reward_A = env.aggregate_mean(reward_A)

    # maximize reward for Player A for 60 seconds
    env.maximize(exp_reward_A, 60)
    exp_reward_B = env.aggregate_mean(reward_B)
    prob_A = env.aggregate_relative_frequency_geq(reward_A, 0)
    env.evaluate()
    print("Maximizing Expected Profit")
    print("Prob A >= 0:", prob_A.get_value())
    print("Reward A:", exp_reward_A.get_value())
    print("Reward B:", exp_reward_B.get_value())
    print("Strategy A:", p_A.get_value(), 
              [p[i].get_value() for i in range(3)])
    print("Strategy B:", p_B.get_value())
    print("Evaluations", env.get_number_evaluations())
    print()

    # constrain the expected profit
    env.enforce_geq(exp_reward_A, 2.09)
    # maximize probability to achieve at least profit 0
    env.maximize(prob_A, 60)
    print("Maximizing Chance of Profit>=0 (exp profit>=2.09)")
    print("Prob A >= 0:", prob_A.get_value())
    print("Reward A:", exp_reward_A.get_value())
    print("Reward B:", exp_reward_B.get_value())
    print("Strategy A:", p_A.get_value(), 
              [p[i].get_value() for i in range(3)])
    print("Strategy B:", p_B.get_value())


if __name__ == "__main__":
    A = [[3, 0], [-1, 2]]
    B = [[5, 9], [0, 2]]
    halpern(0.7, A, B)        

As a result, we get:

[Image: Seeker output for both optimization runs]

In the first case, where we simply maximize Player A's expected profit, we find that the solver chooses to set p_0 = 0.877. Since x = -2 in this game, lowering p_0 is an interesting lever for getting Player B to cooperate more: we see that he plays 'left' with probability 85.66%. The equilibrium is achieved at (2.1055, 4.8377).
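As a rough sanity check (a plain-Python sketch of the formulas above, not Seeker output; we assume p_1 is roughly 1 and p_2 roughly 0 alongside the reported p_0 = 0.877), we can plug the reported strategy back into the closed-form expressions:

A = [[3, 0], [-1, 2]]
B = [[5, 9], [0, 2]]
s = 0.7
p0, p1, p2 = 0.877, 1.0, 0.0   # approximately the strategy reported by Seeker

q = p1 - p2
x = B[0][0] - B[0][1] - B[1][0] + B[1][1]   # -2
y = B[0][1] - B[1][1]                       #  7
z = B[1][0] - B[1][1]                       # -2

# Player B's best response: concave quadratic, interior optimum clipped to [0, 1]
a = s * x * q
b = s * y * q + s * x * p2 + (1 - s) * x * p0 + z
p_B = min(1.0, max(0.0, -b / (2 * a)))

# Player A's induced mixed strategy and both players' expected rewards
p_A = s * (p_B * q + p2) + (1 - s) * p0

def expected(M):
    return (p_A * p_B * M[0][0] + p_A * (1 - p_B) * M[0][1]
            + (1 - p_A) * p_B * M[1][0] + (1 - p_A) * (1 - p_B) * M[1][1])

print(p_B, expected(A), expected(B))   # roughly 0.85, 2.10, 4.85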

[Figure: Seeker-provided reward densities for Player A after the risk minimization]


In the case where we minimize Player A's risk of receiving a negative reward, we find that Seeker increases p_0 to 1, which makes it beneficial for Player B to defect more often. As a result, the expected reward for Player A declines slightly, but the risk of running a loss is reduced by 1.2%.


Conclusion

As you see, InsideOpt Seeker allows us to devise optimized strategies in competitive environments, even in the face of stochasticity and non-linearities. After all, not everyone is lucky enough to employ a Joe Halpern!

Happy Birthday, Joe!
