%Paper: ewp-game/9401002
%From: Eitan Altman <Eitan.Altman@sophia.inria.fr>
%Date: Tue, 25 Jan 1994 14:07:49 +0100

% Subject: 0-sum game and flow control; 20 Dec 1991

\documentstyle[11pt]{article}
\setlength{\unitlength}{1mm}

\setlength {\topmargin}{0in}
\setlength {\textheight}{9.0truein}
\setlength {\oddsidemargin}{0in}
\setlength {\evensidemargin}{0in}
\setlength {\textwidth}{6.5in}
% \setlength {\parindent}{0pt}
\parskip10pt
\footheight 0.15truein
\footskip 0.6truein
\headheight 0.0truein
\headsep 0.0truein
\newcommand{\ls}[1]
   {\dimen0=\fontdimen6\the\font 
    \lineskip=#1\dimen0
    \advance\lineskip.5\fontdimen5\the\font
    \advance\lineskip-\dimen0
    \lineskiplimit=.9\lineskip
    \baselineskip=\lineskip
    \advance\baselineskip\dimen0
    \normallineskip\lineskip
    \normallineskiplimit\lineskiplimit
    \normalbaselineskip\baselineskip
    \ignorespaces
   }
\newcommand {\beq} {\begin{equation}}
\newcommand {\eeq} {\end{equation}}
\newcommand {\bear} {\begin{eqnarray}} 
\newcommand {\eear} {\end{eqnarray}} 
\newcommand {\done} {\quad\vrule height4pt width4pt}
\newcommand {\phantomI} {\vrule height20pt width0pt}
\newcommand {\N} {\rm I\!N}
\newcommand {\R} {\rm I\!R}
\def \bPa {$ {\bf \Pi_1} $}
\def \bPb {$ {\bf \Pi_2} $}
\def\limi{\mathop{\underline{\rm lim}}}
\def\lims{\mathop{\overline{\rm lim}}}
\def\sp{\newline\noindent}
\def\sqr#1#2{{\vcenter{\hrule height.#2pt
      \hbox{\vrule width.#2pt height#1pt \kern#1pt
        \vrule width.#2pt}
    \hrule height.2pt}}}
\def\Prf{\noindent{\bf Proof. }}
\def\vb{v_\beta}
\def\K{{\cal K}}
\def\cA{{\cal A}}
\def\fS{{\bf S}}
\def\cD{{\cal D}}
\newtheorem{definition}{Definition}[section]
\newtheorem{lemma}{Lemma}[section]
\newtheorem{proposition}{Proposition}[section]
\newtheorem{theorem}{Theorem}[section]
\newtheorem{remark}{Remark}[section]
\newtheorem{corollary}{Corollary}[section]
\ls{1.4}

\begin{document}
\title{\bf A MARKOV GAME APPROACH FOR 
OPTIMAL ROUTING INTO A QUEUEING NETWORK}
\author{Eitan ALTMAN\\
        INRIA \\
	2004 Route des Lucioles \\
        BP93, 06902 Sophia-Antipolis Cedex, France\\
	altman@martingale.inria.fr, tel. 93 65 76 73}
\date{January 1994}

\maketitle
\begin{abstract}
We study a dynamic optimal routing problem, where
a controller has to decide to which of two queues
arriving customers (representing packets, messages,
calls, etc.) should be sent.
The service rate in each queue may depend on the state of the system, 
may change in time and is unknown to the controller. 
The goal of the controller is to design a strategy that 
guarantees the best performance under the worst case
service conditions. The payoff is composed of a holding
cost, an admission cost,
and a cost that depends on the quality of the service. 
We consider both the finite and infinite horizon discounted
cost and the expected average cost.  The problem is studied 
in the framework of zero-sum Markov games where the server, 
called player 1, is assumed to play against the router, called
player 2. Each player is assumed to have the information of all
previous actions of both players as well as the current and past
states of the system. 
We show that there exist pure optimal strategies for both players.
A value iteration algorithm is used to establish properties 
of the value of the game, which are related
to supermodularity and to convexity. This is then shown to
imply the existence of optimal strategies described by 
monotone switching curves for both players.
\bigskip

\noindent
{\bf Keywords:} zero-sum stochastic games,
value iteration, monotone switching curve strategies,
control of queueing networks, routing control.

\end{abstract}
%
\section{Introduction}
We consider in this paper a min-max type optimal control problem
for the routing of customers into two infinite capacity queues. Whenever a customer
arrives, the controller has to decide to which queue it will be sent. 
The service rate in each queue is known to remain within some interval.
However, the presence of customers arriving from other 
controlled sources as well as 
congestion phenomena is modeled by allowing the service rate in
each queue to depend on the state of the system, and to change in time
in a way that is unknown to the router. 
The goal of the router is to design a strategy that 
guarantees the best performance under the worst case
service conditions. We formulate this problem as a zero-sum 
stochastic game, where the server, called
player 1 (or ``nature"), is assumed to play against the router, called
player 2. Each player is assumed to have the information of all
previous actions of both players as well as the current and past
states of the system, namely, the length of the queues.

Our main result is to identify optimal 
strategies for both players that have a simple structure,
which implies that the optimal min-max strategy for the 
router is easy to implement.
We show that the router has an optimal strategy of
monotone switching curve type (see \cite{hajek}). This strategy has
the following monotonicity property. If it is optimal to route
a customer to queue $1$ for a given length of the queues $s=(s_1, s_2)$,
then it is also optimal to route the customer to queue $1$ 
when the length of the queues is $t=(t_1, t_2)$ provided that
$t_1 \leq s_1 $ and $t_2 \geq s_2 $. A similar monotonicity 
property holds for routing to queue 2.
We then identify worst-case service conditions,
under which each server uses a bang-bang strategy,
i.e., depending on the state of the system, either the largest
or the smallest service rate is chosen. Moreover, the decision
rule of each server, between the highest and lowest service rates,
is again characterized by a monotone switching curve strategy.

In order to establish the structure of the optimal strategies 
we use the following approach. We first identify properties
that the value function may have which would imply the desired
structure of the optimal strategies. We then use value 
iteration in order to show that
the value function indeed possesses these properties. 
This approach was used in the past to obtain structural 
results in several other stochastic games arising in queueing models. 
In \cite{me} and \cite{me2} 
optimal threshold and optimal monotone
strategies are shown to exist for 
min-max flow control problems into a single
queue with unknown service rate; 
a similar structure is obtained when service rate is controlled \cite{AHS}.
In all cases, the property of the value function that induces
the structure of the optimal strategies was convexity.
A min-max routing problem was considered by Altman and Shimkin \cite{AS},
where the router has to decide to which of $N$ queues an
arriving customer should be routed. A symmetric setting was considered,
where the service rates in all queues are the same. In addition,
an extra service capacity was assumed to be allocated to the
queues in a way unknown to the router. Routing to the shortest
queue was identified as an optimal strategy. The property of the
value function that induces
the structure of the optimal strategies was Schur-convexity.
A related routing problem as well as a scheduling problem was
considered by Altman and Koole \cite{AK}.
In the present paper, the properties that will induce the structure
of the optimal strategies are related to supermodularity and convexity.

Structural properties of optimal strategies in queueing systems were
also obtained in non-zero sum games using different techniques,
see e.g. Hsiao and Lazar \cite{HL}, Altman and Shimkin \cite{an}, \cite{AS}.

The structure of the paper is as follows:
in Section \ref{sec:model} we describe the model.
In Section \ref{sec:fhz} we solve the finite horizon problem,
and in Section \ref{sec:infh} we solve the infinite horizon
problem. Generalizations are discussed finally in Section
\ref{sec:cr}.
%
\section{The model}
\label{sec:model}
Consider two infinite queues. Customers arrive to the system
according to a Poisson process with rate $\lambda'$. 
Upon arrival of a customer, the router chooses an action 1 or 2, with
the interpretation that action $i$ corresponds to routing the
customer to queue $i$. In each queue, customers are served according
to the order of FCFS: first come first served. The service duration
of a customer in queue $i$ is exponentially distributed with a
parameter $a_1(i)$
that lies in the interval $[\underline \mu'_i , \overline \mu'_i ]$.
This parameter, called the service rate, may change in time in
a way unknown to the router. 

At first sight, it seems that in order to model this
process as an MDP (Markov Decision Process), the state 
space should include
not only the queues' lengths, but also the identity of the last
event that occurred, since the router can take decisions
only at arrival events, whereas the servers can
take decisions at any time. We construct, however,
a simpler equivalent MDP model by allowing the router
to take decisions at every event. If the router chooses
action $i$ at time $t$, it has the interpretation that
if the following event is going to be an arrival, then the customer
that will arrive will be routed to queue $i$. With this interpretation,
it will suffice to consider the length of the queues as the state of the 
system.
Let $\cA_i s$, $i=1,2$ denote the state obtained by an arrival of a customer
to queue $i$ when the state was $s$, and let $\cD_i s $,
$i=1,2$ denote the state obtained by a departure of a customer
from queue $i$ when the state was $s$. If queue $i$ is empty,
then we understand $\cD_i s = s$.

Consider a time interval $\Delta $.
The probability to be in state $s'$ at time $t + \Delta$
if at time $t$ the state is $s$ and the actions are $a_1,a_2$ is
given by:
\[
q_{\Delta} ( s' \,|\,s,a_1 ,a_2 ):=
\left\{ 
\begin{array}{lr}
\lambda' \Delta + o(\Delta), & \mbox{ if } a_2 = i , s' = \cA_i s , i=1,2;\\
a_1(i) \Delta  + o(\Delta), & \mbox{ if }  s' = \cD_i s , s_i \not=0, i=1,2;\\
1- ( \lambda' + \sum_{i=1}^2 a_1 (i) 1\{ s_i > 0 \} ) \Delta + 
o(\Delta), & \mbox{ if }  s' =  s ;\\
o(\Delta) & \mbox{ otherwise}.
                               \end{array}
                       \right.
\]
We shall use a time-discretized model by fixing some small $\Delta$.
Define $\lambda = \lambda' \Delta $, 
$\underline \mu_i = \underline \mu'_i \Delta $,
$\overline \mu_i = \overline \mu'_i \Delta $, $i=1,2$.
We thus obtain the following approximating discrete time MDP:

\underline{The state space} is $\fS=\N^2$, where $\N$ denotes the set of
nonnegative integers, so that the state records the number of customers in the queues.
An element $s \in \fS$ is thus a two-dimensional vector denoted
by $s=(s_1,s_2)$, where $s_i$ is the number of customers in queue $i$.

\underline{The action space of the servers (player 1)} is the product of
the compact intervals $A_1 = A_1 (1) \times A_1 (2) =
[ \underline \mu_1 , \overline \mu_1 ] 
\times [ \underline \mu_2 , \overline \mu_2 ] $. (Note that although
there are two servers, each of which is a controller, they can be
considered together as a single player since they have a common
objective). An action $a$ in $A_1$ is thus a two-dimensional
vector: $a = ( a(1) , a(2) )$.

\underline{The action space of the router (player 2)} is 
$A_2 = \{ 1 , 2 \}$.

\underline{The transition law:} 
\[
q ( s' \,|\,s,a_1 ,a_2 ):=
\left\{ 
\begin{array}{lr}
\lambda, & \mbox{ if } a_2 = i , s' = \cA_i s, i=1,2 ;\\
a_1(i), & \mbox{ if }  s' = \cD_i s, i=1,2 ;\\
1- ( \lambda + \sum_{i=1}^2 a_1 (i) 1\{ s_i > 0 \} ), 
& \mbox{ if }  s' =  s .
                               \end{array}
                       \right.
\]
We note that this law of motion is additive \cite{RF}:
$q$ can be expressed as 
$ q ( s' \,|\,s,a_1 ,a_2 )= q_1 ( s' \,|\,s,a_1 ) +
q_2 ( s' \,|\,s,a_2 ) $ where
\[
q_1 ( s' \,|\,s,a_1 ):=
\left\{ 
\begin{array}{lr}
a_1(i), & \mbox{ if }  s' = \cD_i s, i=1,2 ;\\
1- (\lambda + \sum_{i=1}^2 a_1 (i) 1\{ s_i > 0 \} ) , 
& \mbox{ if }  s' =  s ,
                               \end{array}
                       \right.
\]
and
$ q_2 ( s' \,|\,s ,a_2 ):=
\lambda \mbox{ if } a_2 = i , s' = \cA_i s, i=1,2 $,
and $ q_2 ( s' \,|\,s ,a_2 ) := 0 $ otherwise.
Moreover, $q$ is continuous in the actions.
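For concreteness, the law of motion above is easily tabulated. The following sketch (the state, rates and arrival probability used below are illustrative assumptions, not values from the paper) enumerates $q(\cdot \,|\, s, a_1, a_2)$ and verifies that it sums to one:

```python
# Sketch of the discrete-time transition law q(. | s, a1, a2).
# All numerical values are illustrative assumptions.

def transitions(s, a1, a2, lam):
    """Return {s': prob} for state s=(s1,s2), service rates a1=(a1(1),a1(2)),
    routing action a2 in {1,2}, and arrival probability lam."""
    q = {}
    # Arrival routed to queue a2 (operator A_{a2}) with probability lam.
    arr = (s[0] + 1, s[1]) if a2 == 1 else (s[0], s[1] + 1)
    q[arr] = q.get(arr, 0.0) + lam
    stay = 1.0 - lam
    # Departure from queue i (operator D_i) with probability a1(i);
    # an empty queue generates no departure (the convention D_i s = s).
    for i in (0, 1):
        if s[i] > 0:
            dep = (s[0] - 1, s[1]) if i == 0 else (s[0], s[1] - 1)
            q[dep] = q.get(dep, 0.0) + a1[i]
            stay -= a1[i]
    q[s] = q.get(s, 0.0) + stay
    return q

probs = transitions(s=(2, 0), a1=(0.2, 0.3), a2=1, lam=0.1)
# probs == {(3, 0): 0.1, (1, 0): 0.2, (2, 0): 0.7}
```

Note that the probabilities sum to one and that the departure from the empty queue 2 is suppressed, matching the convention $\cD_i s = s$.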

\underline{The immediate payoff:}
We assume that the payoff $r(s,a_1,a_2)$, which the
router has to pay at each step if the state is $s$ and
the actions are $a_1,a_2$, is separable, and has the form:
\[
r(s,a_1,a_2) = c(s) 
+ \sum_{i=1}^2 \theta_i ( a_1 (i) )
+ \sum_{i=1}^2 d_i 1 \{ a_2 = i \} .
\]
It is composed of a holding cost $c$, 
a cost $\theta_i$ that depends on the quality of the service
at queue $i$, and an admission cost $d_i$  if a customer is
to be admitted to queue $i$. We assume that $\theta_i$ (and thus
$r$) are continuous in the actions.

\underline{The strategies:} We refer to \cite{RF} for the
definition of strategies. 

\underline{The finite horizon and infinite horizon costs:}
For given strategies $f,g$ for the players and an initial state
$s$, let $r_n(f,g)(s)$ be the corresponding expected payoff 
(that is paid by the router) at stage $n$.
For a fixed discount factor $0 < \beta < 1$,
we shall consider the finite horizon discounted cost 
$\phi_\beta (n;f,g)(s) = \sum_{m=1}^n \beta^{m-1} r_m (f,g)(s)$,
and the infinite horizon discounted cost
$\phi_\beta (f,g)(s) = \sum_{m=1}^\infty \beta^{m-1} r_m (f,g)(s)$.
Let $v_\beta(n;s)$ and $v_\beta(s)$ denote the values of the
finite and infinite horizon discounted games, respectively,
i.e.,
\[
v_\beta(n;s) = \sup_f \inf_g \phi_\beta (n;f,g)(s) 
= \inf_g \sup_f \phi_\beta (n;f,g)(s) ,
\]
\[
v_\beta(s)
= \sup_f \inf_g \phi_\beta (f,g)(s) 
= \inf_g \sup_f \phi_\beta (f,g)(s) .
\]
A strategy $f^*$ that satisfies
\(
v_\beta(n;s) = \inf_g \phi_\beta (n;f^*,g )(s) 
\)
is said to be optimal for player 1 for the finite horizon problem.
Optimality for the infinite horizon problem and optimality for
player 2 are defined similarly.
Since for any fixed finite horizon $n$ and initial state $s$,
only a finite number of states can be reached, it follows that
the finite horizon problem is equivalent to a stochastic game with
a finite number of states, and the payoffs can thus
be considered to be bounded. The existence of a value and of optimal
Markov strategies for both players then follows from well known
results, see e.g. \cite{nowak} Theorem 4.1. 
We show in Section \ref{sec:infh}
that a value exists also for the infinite horizon problem
within the stationary strategies.

\begin{remark}
Combining the additivity of the transition probabilities
and the separability of the payoffs, we conclude that
the stochastic game has the AR-AT structure, see \cite{RF}.
This implies that pure Markov optimal strategies exist
for both players in the finite horizon case, and pure
stationary optimal strategies exist
for both players in the infinite horizon case.
This results from the fact that in the dynamic
programming equation, the value operator decomposes to
the sum of separate maximization and minimization operations.
\end{remark}

\section{The finite horizon problem}
\label{sec:fhz}

\subsection{The structure of optimal strategies and
required properties of the value function}

Consider a horizon of $n$ steps.  Define $\vb (0,s)=0$
for all $s \in \fS$, and denote 
\[
R_i^m ( a , s ) = a [ \vb (m, \cD_i s )- \vb (m, s ) ]
+ \theta_i (a) + (1-\lambda) \vb(m,s), \quad a \in A_1(i)
\]
\[
S^m ( i , s ) = \vb (m, \cA_i s ) + d_i, \quad i=1,2.
\]
for $m=0,1,2,...,n$.

We shall make use of the following well-known result (\cite{nowak},
Theorem 4.1):

\begin{lemma}
\label{dp} 
(i)
The value function satisfies the following
DP equation 
\bear
\label{dp1}
\vb (m+1,s) &=& 
\mbox{\bf val}_{a_1,a_2} \left \{ r(s,a_1,a_2) +
\beta \sum_{t \in \fS} q (t | s,a_1,a_2 ) \vb (m,t) \right \} \\
\nonumber &=&
c(s) + \beta \sum_{i=1,2} 
\max_{a_1(i) \in A_1(i) } R_i^m ( a_1 (i) , s ) 
+ \beta \lambda \min_{a_2 = 1,2 } S^m ( a_2 , s ),
\eear
for $m=0,...,n-1.$ 
\sp
(ii) Any Markov strategies $f^*$ and $g^*$ for the
two players that use, at time $k=n-m$, actions
achieving the max and min in (\ref{dp1}), respectively,
are optimal.
\end{lemma}
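The recursion of Lemma \ref{dp} can be carried out numerically. The sketch below iterates the value operator of the first line of (\ref{dp1}) on a truncated grid $\{0,\dots,N\}^2$, performing the maximization and minimization separately, as the additive structure of the game permits; the truncation level, the costs $c$, $\theta_i$, $d_i$, and the rate bounds are all illustrative assumptions, and $\theta_i$ is taken convex so that scanning the two endpoint (bang-bang) service rates suffices:

```python
import itertools

# Hedged numerical sketch of value iteration for the routing game on the
# truncated grid {0,...,N}^2.  All parameter values are assumptions.
N, lam, beta = 10, 0.1, 0.9
mu_lo, mu_hi = (0.05, 0.05), (0.30, 0.40)   # service-rate bounds per queue
d = (1.0, 2.0)                              # admission costs

def c(s):                        # holding cost: nondecreasing in s
    return float(s[0] + s[1])

def theta(i, a):                 # convex service-quality cost (assumption)
    return 2.0 * a * a

STATES = list(itertools.product(range(N + 1), repeat=2))

def dp_step(v):
    """One application of the value operator: the servers maximize, the
    router minimizes; the additive structure lets us do this separately."""
    w = {}
    for s in STATES:
        tot = c(s)
        # server i: max over the two endpoint rates (theta convex)
        for i in (0, 1):
            dep = (max(s[0] - 1, 0), s[1]) if i == 0 else (s[0], max(s[1] - 1, 0))
            tot += max(theta(i, a) + beta * a * (v[dep] - v[s])
                       for a in (mu_lo[i], mu_hi[i]))
        tot += beta * (1.0 - lam) * v[s]
        # router: min over the two admission choices (arrivals capped at N)
        routing = []
        for i in (0, 1):
            arr = (min(s[0] + 1, N), s[1]) if i == 0 else (s[0], min(s[1] + 1, N))
            routing.append(d[i] + beta * lam * v[arr])
        tot += min(routing)
        w[s] = tot
    return w

v = {s: 0.0 for s in STATES}     # v_beta(0, .) = 0
for _ in range(30):
    v = dp_step(v)
```

On this example the computed value function is nondecreasing in each coordinate, and reading off the minimizing routing action at each state yields the router's decision rule, whose monotone structure is established below.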

We first establish a simple sufficient
condition for the router to have an optimal strategy of
a monotone switching curve type.  We say that a strategy is of the
monotone switching curve type (see \cite{hajek})
if it has the following monotonicity property. 
It is described by a curve in $\fS$ with a monotone
slope, which separates $\fS$ into two connected regions, $\fS_1$
and $\fS_2$, such that
in region $\fS_j$ it is optimal to use action $j$, $j=1,2$. 

We shall show, in particular, that the router has an optimal nondecreasing
switching curve policy, i.e., 
if it is optimal to route
a customer to queue $1$ for a given length of the queues $s=(s_1, s_2)$,
then it is also optimal to route the customer to queue $1$
when the length of the queues is $t=(t_1, t_2)$ provided that
$t_1 \leq s_1 $ and $t_2 \geq s_2 $. (This implies indeed that the curve
that separates $\fS_1$ and $\fS_2$ is nondecreasing.)
A similar monotonicity property holds for routing to queue 2. 

Consider the following
property that a function $z:\fS \to \R $ may possess:
\sp
\bPa :
$ \qquad z(\cA^2_i s) - z(\cA_i \cA_j s) \geq z( \cA_i s) - z( \cA_j s ),
\qquad i,j=1,2, \ i\not=j $.
\sp
It is easy to see that if $\vb (m, \cdot)$ satisfies
\bPa , then there exists an action for the minimizer in (\ref{dp1}) which
has the above monotone switching curve structure.

Next we identify sufficient conditions for a similar
structure to exist for optimal service strategies.
Consider the following
property that a function $z:\fS \to \R $ may possess:
\sp
\bPb :
$ \qquad z(\cA_i \cA_j s) - z( \cA_j s) \geq z( \cA_i s) - z( s ),
\qquad i,j=1,2 $.
\sp
It is easy to see that if $\vb (m, \cdot)$ satisfies
\bPb , then both maximizers in (\ref{dp1}) have actions achieving
the maximum which are monotone nonincreasing in $s$. Moreover, if
$\theta_i$ is convex for some $i=1,2$,
then server $i$ has an action achieving the maximum 
in (\ref{dp1}) which is among $ \{ \underline \mu_i , \overline \mu_i \}$
(strategies having this property are called bang-bang strategies).
In that case, if $\vb (m, \cdot)$ satisfies \bPb , server $i$ 
has a nonincreasing monotone switching curve structure
(this means in particular that if at some state $s$ server
$i$ uses $\underline \mu_i$, then it also uses $\underline \mu_i$
for all states $t \geq s$, componentwise).

Summarizing the above and using Lemma \ref{dp},
we get the following:

\begin{lemma}
\label{sufficient}
If $\vb (m, \cdot), \ m=1,...,n$ satisfy \bPa , 
then the router has an optimal pure Markov policy
$g=(g_1,...,g_n)$, such that $g_m$ has
a monotone nondecreasing switching curve structure,
$m=1,...,n$.
If $\vb (m, \cdot)$ satisfy \bPb , 
then both servers have optimal pure Markov policies
$f^i=(f_1^i,...,f_n^i)$, $i=1,2$, such that all $f_m^i$
are monotone nonincreasing in $s$.
If, moreover, $\theta_i$ is convex for some $i=1,2$,
then for all $m=1,...,n$, server $i$ has an optimal pure bang-bang
Markov policy with a monotone nonincreasing switching curve structure.
\end{lemma}

We note that property \bPb\ is equivalent to
(integer-)convexity of the value function in each $s_i$
together with supermodularity. 
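These properties are easy to test by brute force. The following sketch (the grid size and the two test functions are illustrative assumptions) checks \bPa\ and \bPb\ for two integer functions: $z(s)=(s_1+s_2)^2$ satisfies both, while $z(s)=s_1 s_2$ is supermodular and coordinatewise convex, hence satisfies \bPb , but violates \bPa :

```python
# Brute-force check of properties P1 and P2 on a finite grid.
# P2 with i = j is coordinatewise convexity; P2 with i != j is
# supermodularity.  Grid size and test functions are assumptions.

def A(i, s):
    """Arrival operator A_i: add one customer to queue i (i = 0 or 1)."""
    return (s[0] + 1, s[1]) if i == 0 else (s[0], s[1] + 1)

def holds_P1(z, n=8):
    """z(A_i^2 s) - z(A_i A_j s) >= z(A_i s) - z(A_j s), i != j."""
    return all(z(A(i, A(i, s))) - z(A(i, A(j, s))) >= z(A(i, s)) - z(A(j, s))
               for s in [(a, b) for a in range(n) for b in range(n)]
               for (i, j) in ((0, 1), (1, 0)))

def holds_P2(z, n=8):
    """z(A_i A_j s) - z(A_j s) >= z(A_i s) - z(s), all i, j (i = j allowed)."""
    return all(z(A(i, A(j, s))) - z(A(j, s)) >= z(A(i, s)) - z(s)
               for s in [(a, b) for a in range(n) for b in range(n)]
               for i in (0, 1) for j in (0, 1))

square = lambda s: (s[0] + s[1]) ** 2   # satisfies P1 and P2
prod = lambda s: s[0] * s[1]            # satisfies P2 but not P1
```

The second example shows that \bPb\ alone does not imply \bPa , so both properties are indeed needed in Lemma \ref{sufficient}.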
% We also notice by applying twice \bPa to a function
% $z$, that the following property holds:
% $ z(\cA_i^2 s) + z( \cA_j^2 s) \geq 2 z( \cA_i \cA_j s) 
% \qquad i,j=1,2 \ j \not= i $.

Next we use value iteration to
show that $\vb (m , \cdot ) , \ m=1,...,n$, indeed possess
properties \bPa\ and \bPb.

\subsection{The properties of the value function}

\begin{lemma}
\label{vi}
Assume that $c$ is nondecreasing and satisfies \bPa\ and \bPb.
Then $\vb (m, \cdot), \ m=1,...,n$, are nondecreasing
and satisfy \bPa\ and \bPb.
\end{lemma}

\Prf
Clearly, $\vb (0, \cdot) $ is nondecreasing
and satisfies \bPa\ and \bPb.
Assume that $\vb (m, \cdot) $ is nondecreasing
and satisfies \bPa\ and \bPb\ for some $m=0,...,n-1$.
We show that this implies that $\vb (m+1, \cdot )$ also
satisfies these properties.
It suffices to show that both 
$ \max_{a \in A_1(i)} R_i^m (a, s)$  and
$ \min_{a_2=1,2} S^m (a_2 , s) $ have these properties. 

We begin with \bPa.
Let $ k \in \{1,2\}$ be an action achieving the minimum in
$\min_a S^m (a, \cA_i^2 s) $ 
and let $l \in \{ 1 , 2 \} $ be an action achieving the minimum in
$\min_a S^m (a, \cA_j s) $. Then
\bear
\nonumber
\lefteqn{
\min_{a \in A_2} S^m (a, \cA_i^2 s)
- \min_{a \in A_2} S^m (a, \cA_i \cA_j s)
- \min_{a \in A_2} S^m (a, \cA_i s)
+ \min_{a \in A_2} S^m (a, \cA_j s) } \\
\nonumber
& \geq &
S^m (k , \cA_i^2 s)
- S^m (k , \cA_i \cA_j s)
- S^m (l , \cA_i s)
+ S^m (l , \cA_j s) 
\nonumber \\
& = & 
\vb (m , \cA_k \cA_i^2 s)
- \vb (m , \cA_k \cA_i \cA_j s)
- \vb (m , \cA_l \cA_i s)
+ \vb (m , \cA_l \cA_j s) \geq 0 .
\label{lbl1}
\eear
The last inequality follows for $k=l$ 
since $\vb ( m, \cdot ) $ satisfies \bPa\ (with $s'= \cA_l s $). 
It holds also for the other two cases
$(l=i,k=j)$ and $(k=i,l=j)$ since
\bear
\vb (m, \cA_i^3 s) - \vb (m, \cA_j \cA_i^2 s) 
&\geq &
\vb (m, \cA_i^2 s) - \vb (m, \cA_j \cA_i s) 
\label{l1} \\
& \geq &
\vb (m, \cA_j \cA_i^2 s) - \vb (m, \cA_j^2 \cA_i s) 
\label{l2} \\
& \geq &
\vb (m, \cA_j \cA_i s) - \vb (m, \cA_j^2 s) 
\label{l3}
\eear
where both (\ref{l1}) and (\ref{l2}) 
follow from the fact that
$\vb (m, \cdot )$ satisfies \bPa\ with $s'=
\cA_i s $, and (\ref{l3}) 
follows from the fact that
$\vb (m, \cdot )$ satisfies \bPa\ with $s'=
\cA_j s $. 
Hence $ \min_{a_2=1,2} S^m (a_2 , s) $ satisfies \bPa .

Fix $i\in \{1,2 \}$.
Let $ \alpha \in A_1 (i) $ be an action achieving the maximum in
$\max_a R_i^m (a, \cA_i \cA_j s) $
and let $\gamma \in A_1(i)$ be an action achieving the maximum in
$\max_a R_i^m (a, \cA_i s) $. Then
\bear
\nonumber
\lefteqn{
\max_{a \in A_1(i)} R_i^m (a, \cA_i^2 s)
- \max_{a \in A_1(i)} R_i^m (a, \cA_i \cA_j s)
- \max_{a \in A_1(i)} R_i^m (a, \cA_i s)
+ \max_{a \in A_1(i)} R_i^m (a, \cA_j s) } \\
\nonumber
& \geq &
R_i^m (\alpha , \cA_i^2 s)
- R_i^m (\alpha , \cA_i \cA_j s)
- R_i^m (\gamma , \cA_i s)
+ R_i^m (\gamma , \cA_j s)
\nonumber \\
& = &
\nonumber
\alpha \left[
\vb (m , \cA_i s)
- \vb (m , \cA_i^2 s)
- \vb (m , \cA_j s)
+ \vb (m , \cA_i \cA_j s)
\right] \\ &&
-
\nonumber
\gamma
\left[
\vb (m , \cD_i \cA_i s)
- \vb (m , \cA_i s)
- \vb (m , \cA_j \cD_i s)
+ \vb (m , \cA_j s)
\right] \\ &&
+
\nonumber
(1-\lambda)
\left[
\vb (m , \cA_i^2 s)
- \vb (m , \cA_i \cA_j s)
- \vb (m , \cA_i s)
+ \vb (m , \cA_j s) 
\right] \\
&=&
(1-\lambda - \alpha)
\left[
\vb (m , \cA_i^2 s)
- \vb (m , \cA_i \cA_j s)
- \vb (m , \cA_i s)
+ \vb (m , \cA_j s) 
\right]
\label{l4}
\\ && + \gamma
\left[
\vb (m , \cA_i s)
- \vb (m , s )
- \vb (m , \cA_j s)
+ \vb (m , \cA_j \cD_i s)
\right] 
\label{l5}
\\ & \geq & 0.
\nonumber
\eear
Indeed, (\ref{l4}) itself is nonnegative since
$1 - \lambda - \alpha $ is nonnegative, and since
the term in square brackets is nonnegative as
$\vb (m, \cdot )$ satisfies \bPa . To see that (\ref{l5})
is nonnegative we distinguish between two cases.
If $s_i$ is zero (queue $i$ is empty)
then $\cD_i s = s$, and (\ref{l5}) reduces to
$\gamma \left[ \vb (m , \cA_i s) - \vb (m , s )
\right]$. This is nonnegative by the monotonicity of
$\vb (m,\cdot)$. To see that (\ref{l5}) is nonnegative
if $s_i$ is nonzero, we use
\bPa\ for $\vb (m , \cdot )$ evaluated at
$s'= \cD_i s $. This establishes that
$ \max_{a \in A_1(i)} R_i^m (a, s)$ satisfies \bPa . 

Next we establish \bPb .
Let $ k \in \{1,2\}$ be an action achieving the minimum in
$\min_a S^m (a, \cA_i \cA_j s) $
and let $l \in \{ 1 , 2 \} $ be an action achieving the minimum in
$\min_a S^m (a, s) $. If $l=k$ then
\bear
\nonumber
\lefteqn{
\min_{a \in A_2} S^m (a, \cA_i \cA_j s)
- \min_{a \in A_2} S^m (a, \cA_i s)
- \min_{a \in A_2} S^m (a, \cA_j s)
+ \min_{a \in A_2} S^m (a, s) } \\
\nonumber
& \geq &
S^m (k , \cA_i \cA_j s)
- S^m (k , \cA_i s)
- S^m (l , \cA_j s)
+ S^m (l , s)
\nonumber \\
& = &
\vb (m , \cA_i \cA_j (\cA_k s) )
- \vb (m , \cA_i (\cA_k s) )
- \vb (m , \cA_j (\cA_k s) )
+ \vb (m , (\cA_k s) )
\nonumber \\ &\geq& 0 ,
\nonumber
\eear
since $\vb (m, \cdot )$ satisfies \bPb . 
If $l=j$ then
\bear
\nonumber
\lefteqn{
\min_{a \in A_2} S^m (a, \cA_i \cA_j s)
- \min_{a \in A_2} S^m (a, \cA_j s)
- \min_{a \in A_2} S^m (a, \cA_i s)
+ \min_{a \in A_2} S^m (a, s) } \\
\nonumber
& \geq &
S^m (k , \cA_i \cA_j s)
- S^m (k , \cA_j s)
- S^m (l , \cA_i s)
+ S^m (l , s)
\nonumber \\
& = &
\vb (m , \cA_i \cA_k (\cA_j s))
- \vb (m , \cA_k (\cA_j s))
- \vb (m , \cA_i ( \cA_j s))
+ \vb (m , ( \cA_j s))
\nonumber \\ &\geq& 0 ,
\nonumber
\eear
since $\vb (m, \cdot )$ satisfies \bPb\ (with $s'= \cA_j s $).
A similar argument holds for $l=i$.
It remains to check the case $l\not= i$, $l\not= j$, which can
occur only when $i=j$ (recall that $i$ and $j$ need not be
distinct in \bPb ).
We have
\bear
\lefteqn{
\min_{a \in A_2} S^m (a, \cA_i^2 s)
- 2 \min_{a \in A_2} S^m (a, \cA_i s)
+ \min_{a \in A_2} S^m (a, s) } 
\nonumber \\
& \geq &
S^m (k , \cA_i \cA_i s)
- S^m (k , \cA_i s)
- S^m (l , \cA_i s)
+ S^m (l , s) 
\nonumber \\
& = &
\vb (m , \cA_i^2 \cA_k s)
- \vb (m , \cA_i \cA_k s)
- \vb (m , \cA_i \cA_l s)
+ \vb (m , \cA_l s)
\nonumber \\
& = &
\vb (m , \cA_i \cA_k ( \cA_i s) )
- \vb (m , \cA_k ( \cA_i s ) )
- \vb (m , \cA_i ( \cA_i s ) )
+ \vb (m , ( \cA_i s ) )
\label{l8}
\\ && +
\vb (m , \cA_i^2 s)
- \vb (m , \cA_i \cA_l s)
- \vb (m , \cA_i s)
+ \vb (m , \cA_l s) 
\label{l9} \\
\nonumber
& \geq & 0 \ .
\eear
(\ref{l8}) is nonnegative due to \bPb , and
(\ref{l9}) is nonnegative due to \bPa .
Hence, $ \min_{a_2=1,2} S^m (a_2 , s) $ satisfies \bPb .

Next we show that
$ \max_{a \in A_1(i)} R_i^m (a, s)$ satisfies \bPb . 
Fix $i\in \{1,2 \}$.
Let $ \alpha \in A_1 (i) $ be an action achieving the maximum in
$\max_a R_i^m (a, \cA_j s) $
and let $\gamma \in A_1(i)$ be an action achieving the maximum in
$\max_a R_i^m (a, \cA_i s) $. Then
\bear
\nonumber
\lefteqn{
\max_{a \in A_1(i)} R_i^m (a, \cA_i \cA_j s)
- \max_{a \in A_1(i)} R_i^m (a, \cA_i s)
- \max_{a \in A_1(i)} R_i^m (a, \cA_j s)
+ \max_{a \in A_1(i)} R_i^m (a, s) } \\
\nonumber
& \geq &
R_i^m (\gamma , \cA_i \cA_j  s)
- R_i^m (\gamma , \cA_i s)
- R_i^m (\alpha , \cA_j s)
+ R_i^m (\alpha , s)
\nonumber \\
& = &
\nonumber
\gamma \left[
\vb (m , \cA_j s)
- \vb (m , \cA_j \cA_i s)
- \vb (m , s)
+ \vb (m , \cA_i s)
\right] \\ &&
-
\nonumber
\alpha
\left[
\vb (m , \cD_i \cA_j s)
- \vb (m , \cA_j s)
- \vb (m , \cD_i s)
+ \vb (m , s)
\right] \\ &&
+
\nonumber
(1-\lambda)
\left[
\vb (m , \cA_i \cA_j s)
- \vb (m , \cA_j s)
- \vb (m , \cA_i s)
+ \vb (m , s) 
\right] \\
&=&
\label{l10}
(1-\lambda - \gamma)
\left[
\vb (m , \cA_i \cA_j s)
- \vb (m , \cA_j s)
- \vb (m , \cA_i s)
+ \vb (m , s) 
\right] \\
\label{l11}
&& + 
\alpha
\left[
\vb (m , \cA_j s)
- \vb (m , \cA_j \cD_i s)
- \vb (m , s)
+ \vb (m , \cD_i s)
\right] \\ \nonumber & \geq & 0 \ .
\eear
Indeed, (\ref{l10}) is itself nonnegative due
to \bPb , and (\ref{l11}) is zero if $s_i=0$ (since then
$\cD_i s = s$), and is otherwise nonnegative due to \bPb\
(with $s'= \cD_i s $).

It remains to establish the monotonicity of
$ \max_{a \in A_1(i)} R_i^m (a, s)$  and
$ \min_{a_2=1,2} S^m (a_2 , s) $. 
Choose some $i \in \{ 1,2 \}$,
and let $ j \in \{1,2\}$ be an action achieving the minimum in
$\min_a S^m (a, \cA_i s) $. Then
\bear
\nonumber
\lefteqn{
\min_{a \in A_2} S^m (a, \cA_i s)
- \min_{a \in A_2} S^m (a, s)} \\
\nonumber
& \geq &
S^m (j , \cA_i s) - S^m (j , s)
\nonumber \\
& = &
\vb (m , \cA_i (\cA_j s) )
- \vb (m , (\cA_j s) )
\nonumber \\ &\geq& 0 ,
\nonumber
\eear
since $\vb (m, \cdot )$ is monotone nondecreasing.
Next we show that
$ \max_{a \in A_1(i)} R_i^m (a, s)$ is 
monotone nondecreasing.  Fix $i\in \{1,2 \}$.
Let $ \alpha \in A_1 (i) $ be an action achieving the maximum in
$\max_a R_i^m (a, s) $. Then
\bear
\nonumber
\lefteqn{
\max_{a \in A_1(i)} R_i^m (a, \cA_j s)
- \max_{a \in A_1(i)} R_i^m (a, s) } \\
\nonumber
& \geq &
R_i^m (\alpha , \cA_j s)
- R_i^m (\alpha , s)
\nonumber \\
& = &
\nonumber
\alpha
\left[
\vb (m , \cD_i \cA_j s)
- \vb (m , \cA_j s)
- \vb (m , \cD_i s)
+ \vb (m , s)
\right] 
+
(1-\lambda)
\left[
\vb (m , \cA_j s)
- \vb (m , s)
\right] \\
&=&
\nonumber
(1-\lambda - \alpha)
\left[
\vb (m , \cA_j s)
- \vb (m , s)
\right] +
\alpha
\left[
\vb (m , \cD_i \cA_j s)
- \vb (m , \cD_i s)
\right]  \\
& \geq & 0 \ ,
\eear
again by the monotonicity of  $\vb (m, \cdot )$.
This establishes the monotonicity of
$ \max_{a \in A_1(i)} R_i^m (a, s)$  and
$ \min_{a_2=1,2} S^m (a_2 , s) $, and hence completes the proof.
\done

Combining Lemma \ref{sufficient} with Lemma \ref{vi}
we obtain the main result of the section:
%
\begin{proposition}
\label{rslt1}
Consider the finite horizon discounted problem.
Assume that $c$ is nondecreasing and satisfies \bPa\ and \bPb.
Then the router has an optimal pure Markov policy which has
at each step a monotone nondecreasing switching curve 
structure.
Moreover, both servers have optimal pure Markov policies
which are monotone nonincreasing in $s$ at each time.
If, moreover, $\theta_i$ is convex for some $i=1,2$,
then server $i$ has an optimal pure bang-bang
Markov policy at each step,
with a monotone nonincreasing switching curve structure.
\end{proposition}

\section{The infinite horizon problem}
\label{sec:infh}

Under some mild growth conditions on the immediate cost,
optimal stationary strategies exist for both players, with the same
structure as described in Proposition \ref{rslt1}.
This is summarized in the following:
%
\begin{proposition}
\label{rslt2}
Consider the infinite horizon discounted problem
with discount factor $\beta < 1$.
Assume that $c$ is nondecreasing and satisfies \bPa\ and \bPb.
Assume moreover that for some $1< \gamma < \beta^{-1}$,
\beq
\label{asmp}
\sup_{s \in \fS}
\frac{|c(s)|}{\gamma^{(s_1 + s_2)}} < \infty .
\eeq
Then the router has an optimal pure stationary policy
with a monotone nondecreasing switching curve structure.
Moreover, both servers have optimal pure stationary policies
which are monotone nonincreasing in $s$.
If, moreover, $\theta_i$ is convex for some $i=1,2$,
then server $i$ has an optimal pure bang-bang
stationary policy with a monotone nonincreasing switching curve structure.
\end{proposition}

\Prf
Let $f_m,g_m$ be maximizing and minimizing
actions in the DP (\ref{dp1}), $m=1,2,...$.
It follows from Proposition \ref{rslt1} that
$f_m,g_m$ can be chosen to be pure and monotone.
Choose any pure stationary policies $f,g$ obtained
as some limit points of $f_m,g_m$.
It is easily seen that $f,g$ inherit the structural 
properties of $f_m,g_m$. Moreover, by Theorem 3.3 in \cite{AHS},
$f$ and $g$ are optimal, provided that some conditions
are satisfied, which we check below. This establishes the proof.

It remains thus to check the following conditions
of Theorem 3.3 in \cite{AHS}.
There exists some function $\mu:\fS \to [1,\infty)$
such that 
\sp
(i) 
$ \displaystyle 
\beta \times \sup_{s \in \fS, a_1 \in A_1 , a_2 \in A_2} 
\left\{
\mu^{-1}_s \sum_{t \in \fS} q(t|s,a_1 , a_2 ) \mu_t
\right\} < 1 $,
\sp
(ii)
$ \displaystyle 
\sup_{s \in \fS, a_1 \in A_1 , a_2 \in A_2} 
\left\{
\mu^{-1}_s | r(s,a_1,a_2 ) |
\right\} < \infty $.

We choose $\mu_s = \gamma^{(s_1 + s_2 )}$.
Then (i) is satisfied since 
\[
\beta \frac{ \sum_{t \in \fS} q(t|s,a_1 , a_2 ) \mu_t}
{\mu_s}
=
\beta \frac
{ \sum_{t \in \fS} q(t|s,a_1 , a_2 ) \gamma^{t_1 + t_2 }}
{ \gamma^{s_1 + s_2}}
\leq
\beta \frac{ \gamma^{s_1 + s_2 + 1}}{
\gamma^{s_1 + s_2 }} = {\beta}{\gamma} < 1.
\]
(ii) holds by (\ref{asmp}). 
\done
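The bound above can also be spot-checked numerically. In the following sketch the values of $\beta$, $\gamma$, $\lambda$ and the rate bounds are illustrative assumptions satisfying $\beta \gamma < 1$; the code evaluates the contraction factor $\beta \, \mu_s^{-1} \sum_t q(t|s,a_1,a_2)\mu_t$ over a range of states and endpoint actions:

```python
# Spot-check of contraction condition (i) with mu_s = gamma^(s1+s2):
# beta * sum_t q(t|s,a1,a2) mu_t / mu_s <= beta * gamma < 1.
# All parameter values are illustrative assumptions.

beta, gamma, lam = 0.9, 1.05, 0.1

def transitions(s, a1, a2):
    """Discrete-time transition law q(. | s, a1, a2) as a dict {s': prob}."""
    q = {}
    arr = (s[0] + 1, s[1]) if a2 == 1 else (s[0], s[1] + 1)
    q[arr] = q.get(arr, 0.0) + lam
    stay = 1.0 - lam
    for i in (0, 1):
        if s[i] > 0:                       # D_i s = s on an empty queue
            dep = (s[0] - 1, s[1]) if i == 0 else (s[0], s[1] - 1)
            q[dep] = q.get(dep, 0.0) + a1[i]
            stay -= a1[i]
    q[s] = q.get(s, 0.0) + stay
    return q

def mu(s):
    return gamma ** (s[0] + s[1])

def contraction_factor(s, a1, a2):
    return beta * sum(p * mu(t) for t, p in transitions(s, a1, a2).items()) / mu(s)

# worst factor over a sample of states and endpoint service rates
worst = max(contraction_factor((s1, s2), (m1, m2), a2)
            for s1 in range(6) for s2 in range(6)
            for m1 in (0.05, 0.30) for m2 in (0.05, 0.40)
            for a2 in (1, 2))
```

As the proof shows, the factor never exceeds $\beta\gamma$, since an arrival raises $s_1+s_2$ by at most one and $\gamma > 1$.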

\begin{remark}
Using \cite{AHS}, it easily follows that the
results in Proposition \ref{rslt2} hold also for the expected average cost,
provided that the following strong stability condition holds
instead of (\ref{asmp}):
$\lambda < \underline \mu_i$, $i=1,2$. 
\end{remark}

\section{Concluding remarks}
\label{sec:cr}
In this paper we established conditions for obtaining
monotonicity properties of strategies in general, and monotone
switching curve strategies in particular, in
the context of stochastic games, focusing on a routing
problem. This extends to the game setting methods
already known in the context of control, see e.g.
\cite{hajek}. Our results can easily be extended
to handle more involved control problems. Indeed,
the main results remain valid in the presence of
additional uncontrolled Poissonian flows of customers 
to both queues. Moreover, one may consider 
a situation where, with some probability, customers have 
to be rerouted to get extra service in one of the
queues, as in the model studied by Hajek \cite{hajek}.
Extra controllers then have to be designed
to decide to which queue such customers should
be rerouted. Assuming that the input router,
which we considered in this paper, and the new routers
have the same common objective, it can again be shown,
as in the control case studied by Hajek \cite{hajek},
that all routers have monotone switching curve
strategies. 

An interesting observation is that a flow control problem
similar to the discrete time flow control models studied in \cite{me,me2}
can be considered as a special
case of the routing model which we solved. Indeed,
if the holding cost, the admission cost,
and the service cost at queue 2 are zero, then
the routing problem reduces to a flow control
problem for the first queue. (This becomes even more
apparent when the service rate in the second queue
is extremely high, so that it is almost always
empty). 


\begin{thebibliography}{9999}

\bibitem{me}
E.~Altman, ``Flow control using the theory of
zero-sum Markov games'', 
To appear in {\it IEEE Trans. Automatic Control}, 1994.

\bibitem{me2}
E. Altman,
``Monotonicity of optimal policies in a zero sum game:
a flow control model", to appear in
{\it Advances in Dynamic Games and Applications}, 1993.

\bibitem{AK}
E. Altman and G. Koole,
``Stochastic Scheduling Games with Markov
Decision Arrival Processes", 
{\it Journal Computers and Mathematics with Appl.},
3rd special issue on Differential Games, pp. 141-148, 1993.

\bibitem{an}
E. Altman and N. Shimkin,
``Individually Optimal Dynamic Routing
in a Processor Sharing System: Stochastic Game Analysis",
EE Pub No. 849, August 1992.
Submitted to {\it Operations Research}.

\bibitem{AS}
E. Altman and N. Shimkin,
``Worst-case and Nash routing policies in
parallel queues with uncertain service allocations",
IMA Preprint Series No. 1120,
Institute for Mathematics and Applications,
University of Minnesota,  Minneapolis, USA, 1993,
submitted to {\it Operations Research}.

\bibitem{AHS}
E. Altman, A. Hordijk and F. M. Spieksma,
``Contraction conditions for average and 
$\alpha$-discounted optimality in countable state
Markov games with unbounded rewards",
submitted to {\it MOR}, 1994.

\bibitem{hajek}
B. Hajek,
``Optimal control of two interacting service stations",
{\it IEEE Trans. Automatic Control}, {\bf 29},
No. 6, pp. 491-499, 1984.

\bibitem{HL}
M. T. Hsiao and A. A. Lazar,
``A game theoretic approach to
decentralized flow control
of Markovian queueing Networks",
{\it Performance '87}, Courtois \& Latouche (eds.),
pp. 55-73, 1988.

\bibitem{nowak}
A. S. Nowak,
``On zero-sum stochastic games with general state space I",
{\it Probability and Mathematical Statistics}, Vol. IV, Fasc. 1,
pp. 13-32, 1984.

\bibitem{RF}
T.E.S. Raghavan and J.A. Filar,
``Algorithms for Stochastic Games - A survey",
{\it Zeitschrift f\"ur OR}, vol 35, pp. 437-472, 1991.

\end{thebibliography}
\end{document}

