Combinations and mixtures of optimal policies in unichain MDPs are optimal
We show that combinations of optimal (stationary) policies in unichain Markov decision processes are again optimal. More precisely, let M be a unichain Markov decision process with state space S and action space A, and let pi_1*, ..., pi_n*: S -> A (1 <= j <= n) be stationary policies achieving optimal average infinite-horizon reward. Then any combination pi of these policies, that is, any policy such that for each state i in S there is a j with pi(i) = pi_j*(i), is optimal as well. Furthermore, we prove that any mixture of optimal policies, in which at each visit to a state i an arbitrary action pi_j*(i) of one of the optimal policies may be chosen, also yields optimal average reward.
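The statement can be illustrated on a small hand-built unichain MDP. The example below is our own illustration, not from the paper: a 3-state MDP whose optimal gain is g = 1, with bias values h = (0, 1, 0), so that every action satisfying r(s,a) + h(s') = g + h(s) is optimal. Two distinct optimal policies, every combination of them, and a per-visit mixture all attain average reward 1.

```python
import random

# A 3-state unichain MDP (hypothetical illustration, not from the paper).
# P[s][a] = next state, R[s][a] = reward; transitions are deterministic.
P = {0: {0: 1, 1: 2}, 1: {0: 2, 1: 0}, 2: {0: 0, 1: 0}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 2.0}, 2: {0: 1.0, 1: 0.0}}

def average_reward(policy, start=0, steps=60_000):
    """Empirical long-run average reward; policy(s) returns the action in s."""
    s, total = start, 0.0
    for _ in range(steps):
        a = policy(s)
        total += R[s][a]
        s = P[s][a]
    return total / steps

# Two optimal stationary policies (gain 1): each action used satisfies the
# average-reward Bellman equation r(s,a) + h(s') = g + h(s) with g = 1, h = (0, 1, 0).
pi1 = {0: 0, 1: 0, 2: 0}    # cycle 0 -> 1 -> 2 -> 0, rewards 0, 2, 1
pi2 = {0: 1, 1: 1, 2: 0}    # cycle 0 -> 2 -> 0, rewards 1, 1

# A combination pi with pi(s) in {pi1(s), pi2(s)} for every s:
combo = {0: 0, 1: 1, 2: 0}  # cycle 0 -> 1 -> 0, rewards 0, 2

# A mixture: on each visit to s, choose pi1(s) or pi2(s) at random.
mix = lambda s: random.choice([pi1[s], pi2[s]])

# A genuinely suboptimal policy for contrast (uses the bad action in state 2):
bad = {0: 0, 1: 0, 2: 1}    # cycle 0 -> 1 -> 2 -> 0, rewards 0, 2, 0

print(average_reward(pi1.__getitem__))            # -> 1.0
print(average_reward(pi2.__getitem__))            # -> 1.0
print(average_reward(combo.__getitem__))          # -> 1.0
print(round(average_reward(mix), 2))              # -> 1.0
print(round(average_reward(bad.__getitem__), 2))  # -> 0.67
```

The mixture's average is within 1/steps of the optimal gain because the rewards along any trajectory through optimal actions telescope: sum of r equals steps * g + h(s_0) - h(s_T).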