robust_RL_multi_adversary
We investigate the effect of populations on finding good solutions to the robust MDP
Reinforcement Learning (RL) is an effective tool for controller design but can struggle with issues of robustness, failing catastrophically when the underlying system dynamics are perturbed. The Robust RL formulation tackles this by adding worst-case adversarial noise to the dynamics and constructing the noise distribution as the solution to a zero-sum minimax game. However, existing work on learning solutions to the Robust RL formulation has primarily focused on training a single RL agent against a single adversary. In this work, we demonstrate that using a single adversary does not consistently yield robustness to dynamics variations under standard parametrizations of the adversary; the resulting policy is highly exploitable by new adversaries. We propose a population-based augmentation to the Robust RL formulation in which we randomly initialize a population of adversaries and sample from the population uniformly during training. We empirically validate across robotics benchmarks that the use of an adversarial population results in a more robust policy that also improves out-of-distribution generalization. Finally, we demonstrate that this approach provides robustness and generalization comparable to domain randomization on these benchmarks while avoiding a ubiquitous domain randomization failure mode.
Developing controllers that work effectively across a wide range of potential deployment environments is one of the core challenges in engineering. The complexity of the physical world means that the models used to design controllers are often inaccurate. Optimization-based control design approaches, such as reinforcement learning (RL), have no notion of model inaccuracy and can lead to controllers that fail catastrophically under mismatch. In this work, we aim to demonstrate an effective method for training reinforcement learning policies that are robust to model inaccuracy by designing controllers that are effective in the presence of worst-case adversarial noise in the dynamics.
One effective approach to induce robustness has been domain randomization Tobin et al. (2017); Jakobi (1997), a method where a designer with expertise identifies the components of the model that they are uncertain about. They then construct a set of training environments where the uncertain components are randomized, ensuring that the agent is robust on average to this set. However, this requires careful parametrization of the uncertainty set as well as hand-designing of the environments.
A more easily automated approach is to formulate the problem as a zero-sum game and learn an adversary that perturbs the transition dynamics Tessler et al. (2019); Kamalaruban et al. (2020); Pinto et al. (2017). If a global Nash equilibrium of this problem is found, then that equilibrium provides a worst-case performance bound under the specified set of perturbations. Besides the benefit of removing user design once the perturbation mechanism is specified, this approach is maximally conservative, which is useful for safety-critical applications.
However, the aforementioned literature on learning an adversary predominantly uses a single stochastic adversary. This raises a puzzling question: the minimax problem does not necessarily have any pure Nash equilibria (see Appendix C Tessler et al. (2019)) but the existing robust RL literature mostly appears to attempt to solve for pure Nash equilibria. That is, the most general form of the minimax problem searches over distributions of adversary and agent policies
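$$\sup_{p_\theta} \inf_{p_\phi} \; \mathbb{E}_{\theta \sim p_\theta,\ \phi \sim p_\phi} \left[ F(\theta, \phi) \right] \tag{1}$$
where $p_\theta$ and $p_\phi$ are distributions over agent and adversary policies and $F(\theta, \phi)$ is a score function (for example, expected cumulative reward). However, this problem is approximated in the literature by the fixed-policy problem
$$\sup_{\theta} \inf_{\phi} \; F(\theta, \phi) \tag{2}$$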
We contend that this reduction to a single adversary approach can sometimes fail to result in improved robustness under standard parametrizations of the adversary policy.
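The gap between the distributional problem and the fixed-policy problem can be seen in a toy matrix game. The sketch below uses matching pennies, which is our illustrative choice and not an environment from this work; the function names are hypothetical. It compares the best worst-case payoff when the agent commits to a single action against the best worst-case payoff when the agent may randomize.

```python
# Payoff to the agent (row player) in matching pennies, a zero-sum game
# with no pure Nash equilibrium. The adversary (column player) gets the negative.
PAYOFF = [[1.0, -1.0],
          [-1.0, 1.0]]


def pure_maximin(payoff):
    """Best worst-case payoff when the agent commits to one fixed action."""
    return max(min(row) for row in payoff)


def mixed_maximin(payoff, grid=1001):
    """Best worst-case expected payoff over agent mixtures (p, 1 - p).

    Uses a simple grid search over p; against a fixed mixture the
    adversary's best response is always a pure column.
    """
    best = float("-inf")
    for k in range(grid):
        p = k / (grid - 1)
        worst = min(p * payoff[0][c] + (1 - p) * payoff[1][c] for c in range(2))
        best = max(best, worst)
    return best


pure_value = pure_maximin(PAYOFF)
mixed_value = mixed_maximin(PAYOFF)
print(pure_value, mixed_value)
```

In this toy game the fixed-action agent is fully exploitable while the randomizing agent is not, which is the kind of gap that restricting the search to single policies can introduce.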
The following example provides some intuition for why using a single adversary can decrease robustness. Consider a robot trying to learn to walk eastwards while an adversary outputs a force representing wind coming from the north or the south. For a fixed, deterministic adversary the agent knows that the wind will come from either south or north and can simply apply a counteracting force at each state. Once the adversary is removed, the robot will still apply the compensatory forces and possibly become unstable. Stochastic Gaussian policies (which are ubiquitous in continuous control) offer little improvement: low entropy policies can be counteracted whereas high entropy policies would endow the robot with the prior that the wind cancels on average. Under these standard policy parametrizations, which cannot represent a distribution over policies, we cannot use an adversary to endow the agent with a prior that a persistent, strong wind could come either from north or south. This leaves the agent exploitable to this class of perturbations.
The use of a single adversary in the robustness literature is in contrast to the multiplayer game literature. In multiplayer games, large sets of adversaries are used to ensure that an agent cannot easily be exploited Vinyals et al. (2019); Czarnecki et al. (2020); Brown and Sandholm (2019). Drawing inspiration from this literature, we introduce RAP (Robustness via Adversary Populations): a randomly initialized population of adversaries that we sample from at each rollout and train alongside the agent. Returning to our example of a robot perturbed by wind, if the robot learns to cancel any one of the adversaries effectively, then that opens a niche for an adversary to exploit by applying forces in another direction. As the number of adversaries increases, the robot is eventually endowed with the prior that a strong wind could come from either direction and that it must walk carefully to avoid being toppled over.
Our contributions are as follows:
Using a set of continuous control tasks, we provide evidence that a single adversary does not have a consistent positive impact on the robustness of an RL policy while the use of an adversary population provides improved robustness across all considered examples.
We investigate the source of the robustness and show that the single adversary policy is exploitable by new adversaries whereas policies trained with RAP are robust to new adversaries.
We demonstrate that adversary populations can be competitive with domain randomization while avoiding potential failure modes of domain randomization.
This work builds upon robust control Zhou and Doyle (1998), a branch of control theory focused on finding optimal controllers under worst-case perturbations of the system dynamics. The Robust Markov Decision Process (RMDP) formulation extends this worst-case model uncertainty to uncertainty sets on the transition dynamics of an MDP and demonstrates that computationally tractable solutions exist for small, tabular MDPs Nilim and El Ghaoui (2005); Lim et al. (2013). For larger or continuous MDPs, one successful approach has been to use function approximation to compute approximate solutions to the RMDP problem Tamar et al. (2014).
One prominent variant of the RMDP literature is to interpret the perturbations as an adversary and attempt to learn the distribution of the perturbation under a minimax objective. Two variants of this idea that tie in closely to our work are Robust Adversarial Reinforcement Learning (RARL) Pinto et al. (2017) and Noisy Robust Markov Decision Processes (NRMDP) Tessler et al. (2019), which differ in how they parametrize the adversaries: RARL picks out specific robot joints that the adversary acts on while NRMDP adds the adversary action to the agent action. Both of these works attempt to find an equilibrium of the minimax objective using a single adversary; in contrast, our work uses a large set of adversaries and shows improved robustness relative to a single adversary.
An alternative to the minimax objective, domain randomization, asks a designer to explicitly define a distribution over environments that the agent should be robust to. For example, Peng et al. (2018) vary simulator friction, mass, table height, and controller gain (along with several other parameters) to train a robot to robustly push a puck to a target location in the real world; Antonova et al. (2017) added noise to friction and actions to transfer an object pivoting policy directly from simulation to a Baxter robot. Additionally, domain randomization has been successfully used to build accurate object detectors solely from simulated data Tobin et al. (2017) and to zero-shot transfer a quadcopter flight policy from simulation Sadeghi and Levine (2016).
However, as we discuss in Sec. 6, a policy that performs well on average across simulation domains is not necessarily robust as it may trade off performance on one set of parameters to maximize performance in another. EPOpt Rajeswaran et al. (2016) addresses this by replacing the uniform average across distributions with the conditional value at risk (CVaR) Chow et al. (2015); Tamar et al. (2015), a soft version of the minimax objective in which the optimization is only performed over a small percentage of the worst performing parameters. This is an interesting approach to align the domain randomization objective with the minimax objective and could be made compatible with our approach by only training using a subset of the strongest adversaries.
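To make the difference between the uniform average and a CVaR-style objective concrete, the following minimal sketch computes the mean of the worst fraction of a set of episode returns. It is an illustration under our own assumptions (the function name, the `epsilon` parameter, and the toy returns are hypothetical) and is not the EPOpt implementation.

```python
import math


def cvar_lower_tail(returns, epsilon):
    """Mean of the worst ceil(epsilon * N) returns (lower-tail CVaR estimate)."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    k = max(1, math.ceil(epsilon * len(returns)))
    worst = sorted(returns)[:k]  # ascending order: lowest returns first
    return sum(worst) / k


# Toy returns from four hypothetical environment parametrizations.
toy_returns = [100.0, 200.0, 300.0, 1000.0]
uniform_average = cvar_lower_tail(toy_returns, 1.0)  # epsilon = 1 recovers the mean
worst_half = cvar_lower_tail(toy_returns, 0.5)
print(uniform_average, worst_half)
```

A single high-return parametrization dominates the uniform average here, whereas the tail average only reflects the poorly performing cases.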
Our demonstration of overfitting to a single adversary is not new; there is extensive work establishing that agents trained independently in multi-agent settings can result in non-robust policies. Gleave et al. (2019) show that in zero-sum games adversary pairs trained via RL are not robust to replacement of the adversary with a different adversary policy. Lanctot et al. (2017) extend this idea to general-sum games by training a population of agent-agent pairs and showing that taking two pairs and swapping the agents in them leads to failure to accomplish the objective. Shapley (1964) establishes that even in tabular settings (in this case, a general-sum version of Rock Paper Scissors), iterated best response to pure Nash strategies can lead to cyclical behavior and a failure to converge to equilibrium.
The use of population-based training is also a standard technique in multi-agent settings. AlphaStar, the grandmaster-level StarCraft bot, uses a population of "exploiter" agents that fine-tune against the bot to prevent it from developing exploitable strategies Vinyals et al. (2019). Czarnecki et al. (2020) establish a set of sufficient geometric conditions on games under which the use of multiple adversaries will ensure gradual improvement in the strength of the agent policy. They empirically demonstrate that learning in games can often fail to converge without populations. Finally, Active Domain Randomization Mehta et al. (2019) is a very close approach to ours, as they use a population of adversaries to select domain randomization parameters whereas we use a population of adversaries to directly perturb the agent actions. Additionally, they use a Stein Variational Policy Gradient Liu et al. (2017) to ensure diversity in their adversaries and a discriminator reward instead of a minimax reward whereas our work does not have any explicit coupling between the adversary gradient updates and uses a simpler zero-sum reward function.
In this work we use the framework of a multi-agent, finite-horizon, discounted Markov Decision Process (MDP) Puterman (1990) defined by a tuple $\langle \mathcal{A}_a \times \mathcal{A}_{adv}, \mathcal{S}, \mathcal{T}, r, \gamma \rangle$. Here $\mathcal{A}_a$ is the set of actions for the agent, $\mathcal{A}_{adv}$ is the set of actions for the adversary, $\mathcal{S}$ is a set of states, $\mathcal{T}$ is a transition function, $r$ is a reward function and $\gamma$ is a discount factor. $\mathcal{S}$ is shared between the adversaries as they share a state space with the agent. The goal for a given MDP is to find a policy $\pi_\theta$ parametrized by $\theta$ that maximizes the expected cumulative discounted reward $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t) \mid \pi_\theta\right]$. The conditional in this expression is a shorthand to indicate that the actions in the MDP are sampled via $a_t \sim \pi_\theta(s_t)$. We denote the agent policy parametrized by weights $\theta$ as $\pi_\theta$ and the policy of adversary $i$ as $\bar{\pi}_{\phi_i}$. Actions sampled from the adversary policy $\bar{\pi}_{\phi_i}$ will be written as $\bar{a}^i_t$. We use $\xi$ to denote the parametrization of the system dynamics (e.g. different values of friction, mass, wind, etc.) and the system dynamics for a given state and action as $s_{t+1} \sim f_\xi(s_t, a_t)$.
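As a small worked instance of the cumulative discounted reward used throughout, the sketch below evaluates the discounted sum for one finite trajectory of rewards. The function name and the toy reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite trajectory.

    Accumulates from the last step backwards so that each step costs
    one multiplication and one addition.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of unit reward with discount 0.5: 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)
```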
Here we outline prior work and the approaches that will be compared with RAP. Our baselines consist of a single adversary and domain randomization.
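Our adversary formulation uses the Noisy Action Robust MDP Tessler et al. (2019) in which the adversary adds its actions onto the agent actions. The objective is
$$\max_\theta \min_\phi \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t + \bar{a}_t) \,\middle|\, \pi_\theta, \bar{\pi}_\phi\right] \tag{3}$$
where $a_t \sim \pi_\theta$ is the agent action and $\bar{a}_t \sim \bar{\pi}_\phi$ is the adversary action.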
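The additive perturbation can be sketched as follows. This is a minimal illustration rather than the implementation used in our experiments: the scaling coefficient `alpha`, the clipping to action bounds, and the function name are assumptions introduced here.

```python
def perturbed_action(agent_action, adversary_action, alpha, low=-1.0, high=1.0):
    """Combine agent and adversary actions element-wise as a + alpha * a_bar.

    alpha is a hypothetical scale on the adversary's perturbation, and the
    result is clipped to [low, high]; both are assumptions of this sketch.
    """
    if len(agent_action) != len(adversary_action):
        raise ValueError("agent and adversary actions must have the same length")
    return [
        min(high, max(low, a + alpha * a_bar))
        for a, a_bar in zip(agent_action, adversary_action)
    ]


# The second component is pushed past the lower bound and clipped.
applied = perturbed_action([0.5, -0.9], [1.0, -1.0], alpha=0.25)
print(applied)
```

Because the adversary can only act through the agent's action channel, the set of perturbed dynamics it can produce is limited to what such an additive term can express.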
We note two important restrictions inherent to this adversarial model. First, since the adversary is only able to attack the agent through the actions, there is a restricted class of dynamical systems that it can represent; this set of dynamical systems may not necessarily align with the set of dynamical systems that the agent may be tested in. This is simply a restriction caused by the choice of adversarial perturbation and could be alleviated by using different adversarial parametrizations e.g. perturbing the transition function directly.
In addition to the restricted set of dynamical systems that the NRMDP can represent, there is a limitation induced by standard RL agent parametrizations. In particular, agents are often parametrized by either having deterministic actions or having their actions drawn from a probability distribution (i.e. we pass a state through our policy, it outputs parameters of a distribution and we sample the actions from that distribution). The single adversary cannot represent all the systems that we intend the agent to be robust to as a consequence of the parametrization. For example, suppose the agent is currently at some state $s_t$ and the adversary outputs an action $\bar{a}_t$. In the deterministic case, the agent knows that the adversary will never output a different action $\bar{a}'_t \neq \bar{a}_t$ even though $\bar{a}'_t$ is clearly in the class of possible perturbations.
Domain randomization is the setting in which the user specifies a set of environments which the agent should be robust to. This allows the user to directly encode knowledge about the likely deviations between training and testing domains. For example, the user may believe that friction is hard to measure precisely and wants to ensure that their agent is robust to variations in friction; they then specify that the agent will be trained with a wide range of possible friction values. We use $\xi$ to denote some vector that parametrizes the set of training environments (e.g. friction, masses, system dynamics, etc.). We denote the domain from which $\xi$ is drawn as $\Xi$ and use $\rho(\Xi)$ to denote some probability distribution over $\xi$. The domain randomization objective is
$$\max_\theta \; \mathbb{E}_{\xi \sim \rho(\Xi)}\left[\mathbb{E}_{s_{t+1} \sim f_\xi(s_t, a_t)}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t) \,\middle|\, \pi_\theta\right]\right] \tag{4}$$
Here the goal is to find an agent that performs well on average across the distribution of training environments. Most commonly, and in this work, the parameters $\xi$ are sampled uniformly over $\Xi$.
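The domain randomization objective can be estimated by Monte Carlo: sample a parameter vector uniformly, roll out the policy, and average the returns. The sketch below assumes hypothetical parameter ranges and a toy return function standing in for a simulator rollout; neither is taken from our experiments.

```python
import random


def sample_params(ranges, rng):
    """Draw one parameter vector uniformly from a box of (low, high) ranges."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


def domain_randomized_objective(episode_return, ranges, num_rollouts, seed=0):
    """Monte Carlo estimate of the average return over sampled parameters."""
    rng = random.Random(seed)
    returns = [episode_return(sample_params(ranges, rng)) for _ in range(num_rollouts)]
    return sum(returns) / len(returns)


# Hypothetical ranges, chosen only for illustration.
toy_ranges = {"friction": (0.8, 1.2), "mass": (0.8, 1.2)}


def toy_return(params):
    """Toy stand-in for a rollout: return drops as friction moves away from 1.0."""
    return 100.0 - 50.0 * abs(params["friction"] - 1.0)


estimate = domain_randomized_objective(toy_return, toy_ranges, num_rollouts=1000)
print(estimate)
```

Because the objective is an average, a policy can score well on it while still performing poorly on a subset of the sampled parameters.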
RAP extends the minimax objective with a population-based approach. Instead of a single adversary, at each rollout we will sample uniformly from a population of adversaries. By using a population, the agent is forced to be robust to a wide variety of potential perturbations instead of a single perturbation. If the agent begins to overfit to any one adversary, this opens up a potential niche for another adversary to exploit. For problems with only one failure mode, we expect the adversaries to all come out identical to the minimax adversary, but as the number of failure modes increases the adversaries should begin to diversify to exploit the agent. To induce this diversity, we will rely on randomness in the gradient estimates and randomness in the initializations of the adversary networks rather than any explicit term that induces diversity. While the idea of using populations does not preclude explicit terms in the loss to encourage diversity, we find that our chosen sources of diversity are sufficient for our purposes.
Denoting $\bar{\pi}_{\phi_i}$ as the $i$-th adversary and $i \sim U(1, n)$ as the discrete uniform distribution defined on 1 through $n$, the objective becomes
$$\max_\theta \min_{\phi_1, \ldots, \phi_n} \; \mathbb{E}_{i \sim U(1, n)}\left[\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t + \bar{a}^i_t) \,\middle|\, \pi_\theta, \bar{\pi}_{\phi_i}\right]\right] \tag{5}$$
For a single adversary, this is equivalent to the minimax adversary described in Sec. 3.2.1.
We will optimize this objective by converting the problem into the equivalent zero-sum game. At the start of each rollout, we will sample an adversary index $i$ from the uniform distribution and collect a trajectory using the agent and the selected adversary. For notational simplicity, we assume the trajectory is of length $M$ and that adversary $i$ will participate in $J_i$ total trajectories while, since the agent participates in every rollout, it will receive $J$ total trajectories. We denote the $j$-th collected trajectory for the agent as $\tau_j = (s_0, a_0, r_0, s_1) \times \cdots \times (s_M, a_M, r_M, s_{M+1})$ and the associated trajectory for adversary $i$ as $\tau^i_j = (s_0, \bar{a}^i_0, -r_0, s_1) \times \cdots \times (s_M, \bar{a}^i_M, -r_M, s_{M+1})$. Note that the adversary reward is simply the negative of the agent reward.
We will use Proximal Policy Optimization (PPO) Schulman et al. (2017) to update our policies. We caution that we have overloaded notation slightly here: for adversary $i$, $\tau^i_j$ refers only to the trajectories in which the adversary was selected, so adversaries will only be updated using trajectories where they were active. At the end of a training iteration, we update all our policies using gradient descent. The algorithm is summarized below:
We call the agent trained to optimize this objective using Algorithm 1 the RAP agent.
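The data-collection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our RLlib implementation: `run_rollout` is a hypothetical callable standing in for an environment rollout, the toy rollout returns only a reward list, and the policy update itself is omitted.

```python
import random


def rap_collect_iteration(num_adversaries, num_rollouts, run_rollout, rng):
    """Collect one training iteration, sampling one adversary uniformly per rollout.

    Returns the agent batch (every trajectory) and per-adversary batches
    (only trajectories where that adversary was active, rewards negated).
    """
    agent_batch = []
    adversary_batches = [[] for _ in range(num_adversaries)]
    for _ in range(num_rollouts):
        i = rng.randrange(num_adversaries)  # uniform over the population
        rewards = run_rollout(i)            # agent rewards along the trajectory
        agent_batch.append(rewards)
        adversary_batches[i].append([-r for r in rewards])  # zero-sum reward
    return agent_batch, adversary_batches


def toy_rollout(adversary_index):
    """Hypothetical stand-in for a rollout against the chosen adversary."""
    return [1.0, 0.5, float(adversary_index)]


rng = random.Random(0)
agent_batch, adversary_batches = rap_collect_iteration(3, 12, toy_rollout, rng)
# A policy-gradient update (PPO in the text) would now use agent_batch for the
# agent and adversary_batches[i] for adversary i; that step is not shown here.
print(len(agent_batch), [len(b) for b in adversary_batches])
```

The agent sees every rollout while each adversary sees only its own share, which is why the per-adversary batch shrinks as the population grows when the total number of environment steps is fixed.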
In this section we present experiments on continuous control tasks from the OpenAI Gym Suite Brockman et al. (2016); Todorov et al. (2012). We compare with our baselines and evaluate the efficacy of a population of learned adversaries across a wide range of state and action space sizes. We investigate the following hypotheses:
H1. Agents are more likely to overfit to a single adversary than a population of adversaries, leaving them more exploitable on in-distribution tasks.
H2. Agents trained against a population of adversaries will generalize better, leading to improved performance on out-of-distribution tasks.
H3. Naive parametrization of domain randomization can result in a brittle policy, even when evaluated on the same distribution it was trained on.
H4. While larger adversary populations can represent more varied dynamics, there will be diminishing returns due to the decreased environment steps each adversary receives.
In-distribution tasks refer to the agent playing against perturbations that are in the training distribution: adversaries that add their actions onto the agent. However, the particular form of the adversaries and their restricted perturbation magnitude mean that there are many dynamical systems that they cannot represent (for example, significant variations of joint mass and friction). These tasks are denoted as out-of-distribution tasks. All of the tasks in the test set described in Sec. 5.1 are likely out-of-distribution tasks.
While we provide exact details of the hyperparameters in the Appendix, adversarial settings require additional complexity in hyperparameter selection. In the standard RL procedure, optimal hyperparameters are selected on the basis of maximum expected cumulative reward. However, if an agent playing against an adversary achieves a large cumulative reward, it is possible that the agent was simply playing against a weak adversary. Conversely, a low score does not necessarily indicate a strong adversary nor robustness: it could simply mean that we trained a weak agent.
To address this, we adopt a version of the train-validate-test split from supervised learning. We use the mean policy performance on a suite of validation tasks to select the hyperparameters, then we train the policy across ten seeds and report the resultant mean and standard deviation over twenty trajectories. Finally, we evaluate the seeds on a holdout test set of eight additional model-mismatch tasks. These tasks vary significantly in difficulty; for visual clarity we report only the average across tasks in this paper and report the full breakdown across tasks in the Appendix.
We experiment with the Hopper, Ant, and HalfCheetah continuous control environments shown in Fig. 1. To generate the validation model mismatch, we predefine ranges of mass and friction coefficients as follows: for Hopper, mass and friction ; Half Cheetah and Ant, mass and friction . We scale the friction of every Mujoco geom and the mass of the torso with the same (respective) coefficients. We compare the robustness of agents trained via RAP against: 1) agents trained against a single adversary in a zerosum game, 2) agents trained using domain randomization, and 3) an agent trained only using PPO and no perturbation mechanism. To train the domain randomization oracle, at each rollout we uniformly sample a friction and mass coefficient from the validation set ranges. We then scale the friction of all geoms and the mass of the torso by their respective coefficients; this constitutes directly training on the validation set which creates a strong baseline. To generate the test set of model mismatch, we take both the highest and lowest friction coefficients from the validation range and apply them to different combinations of individual geoms. For the exact selected combinations, please refer to Appendix B.
All of our experiments are run on c4.8xlarge 36 vCPU instances on AWS EC2. Our full paper can be reproduced for a cost of (full breakdown in Appendix) but we provide all of the trained policies, data, and code at https://github.com/eugenevinitsky/robust_RL_multi_adversary to simplify reproducibility. For our RL algorithms we use the RLlib 0.8.0 Liang et al. (2017) implementation of PPO Schulman et al. (2017). For exact hyperparameters, please refer to the Appendix. Since both gradient computations and forwards passes can be batched across the adversaries, there is no additional runtime cost relative to using a single adversary.
1 Analysis of Overfitting
A globally minimax optimal adversary should be unexploitable and have a lower bound on performance against any adversary in the adversary class. We investigate the optimality of our policy by asking whether the minimax agent is robust to swaps of adversaries from different training runs, i.e. different seeds. Fig. 2 shows the result of these swaps for the one adversary and three adversary case. The diagonal corresponds to playing against the adversaries the agent was trained with while every other square corresponds to playing against adversaries from a different seed. To simplify presentation, in the three adversary case, each square is the average performance against all the adversaries from that seed.
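The swap experiment amounts to filling in a cross-play matrix and comparing its diagonal to its off-diagonal entries. The sketch below is illustrative only: `evaluate` is a hypothetical callable returning the agent's mean reward for a given (agent seed, adversary seed) pair, and the toy scores are made up rather than taken from Fig. 2.

```python
def cross_play_matrix(num_seeds, evaluate):
    """Entry [a][d] is the score of the agent from seed a against adversaries from seed d."""
    return [[evaluate(a, d) for d in range(num_seeds)] for a in range(num_seeds)]


def diagonal_gap(matrix):
    """Mean diagonal score minus mean off-diagonal score (requires at least 2 seeds)."""
    n = len(matrix)
    diagonal = sum(matrix[i][i] for i in range(n)) / n
    off_diagonal = sum(
        matrix[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return diagonal - off_diagonal


def toy_evaluate(agent_seed, adversary_seed):
    """Made-up scores: strong against the training adversary, weak otherwise."""
    return 3000.0 if agent_seed == adversary_seed else 500.0


matrix = cross_play_matrix(4, toy_evaluate)
gap = diagonal_gap(matrix)
print(gap)
```

A large positive gap indicates an agent that only performs well against the adversaries it was trained with, while a gap near zero indicates robustness to swaps.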
We observe that the agent trained against three adversaries is robust under swaps while the single adversary case is not. For the single adversary case, the mean performance of the agent in each seed is high against its own set of adversaries (the diagonal). This corresponds to the mean reward that would be reported at the end of training. Looking at just the reward is deceptive: the agent is still highly exploitable, as can be seen by its extremely subpar performance against an adversary from any other seed. Since the off-diagonal adversaries are feasible adversaries, this suggests that we have found a poor local optimum of the objective.
In contrast, the three adversary case is generally robust regardless of which adversary it plays against, suggesting that the use of additional adversaries has made the agent more robust. Of course, it’s possible that the adversaries are simply weaker, but as we discuss in Sec. 2, the improved performance on transfer tasks suggests that the robustness across seed swaps is indicative of a genuine improvement in robustness.
2 Adversary Population Performance
Here we present the results from the validation and holdout test sets described in Section 5.1. We compare the performance of training with adversary populations of size three and five against vanilla PPO, the domain randomization oracle, and the single minimax adversary.
Fig. 3 shows the average reward (the average of ten seeds across the validation or test sets respectively) for each environment. Table 1 gives the corresponding numerical values and the percent change of each policy from the baseline. Standard deviations are omitted on the test set due to wide variation in task difficulty; the individual tests that we aggregate here are reported in the Appendix Sec. C with appropriate error bars. In all environments we achieve a higher reward across both the validation and holdout test set using RAP of size three and/or five when compared to the single minimax adversary case. These results from testing on new environments with altered dynamics support hypothesis 1 that training with a population of adversaries leads to more robust policies than training with a single adversary.
For a more detailed comparison of robustness across the validation set, Fig. 4 shows heatmaps of the performance across all the mass, friction coefficient combinations. Here we highlight the heatmaps for Hopper and Half Cheetah for vanilla PPO, domain randomization, single adversary, and best adversary population size. Additional heatmaps for other adversary population sizes and the Ant environment can be found in Appendix Sec. C. Note that Fig. 3(a) is an example of a case where a single adversary has negligible effect on or slightly reduces the performance of the resultant policy on the validation set. This supports our hypothesis that a single adversary can actually lower the robustness of an agent. This result is in contrast to those observed in Pinto et al. (2017); we conjecture that their hand-designed parametrization of the adversary (forces applied to carefully selected leg joints) may be the cause of the difference.
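| Hopper | 0 Adv | DR | 1 Adv | 3 Adv | 5 Adv |
| --- | --- | --- | --- | --- | --- |
| Validation, Mean Rew. | 1182 | 2662 | 1094 | 2039 | 2021 |
| Validation, % Change | | 125 | -7.4 | 72.6 | 71 |
| Test, Mean Rew. | 472 | 1636 | 913 | 1598 | 1565 |
| Test, % Change | | 246 | 93.4 | 238 | 231 |

| Cheetah | 0 Adv | DR | 1 Adv | 3 Adv | 5 Adv |
| --- | --- | --- | --- | --- | --- |
| Validation, Mean Rew. | 5659 | 3864 | 5593 | 5912 | 6323 |
| Validation, % Change | | -32 | -1.2 | 4.5 | 11.7 |
| Test, Mean Rew. | 5592 | 3656 | 5664 | 6046 | 6406 |
| Test, % Change | | -35 | 1.3 | 8.1 | 14.6 |

| Ant | 0 Adv | DR | 1 Adv | 3 Adv | 5 Adv |
| --- | --- | --- | --- | --- | --- |
| Validation, Mean Rew. | 6336 | 6743 | 6349 | 6432 | 6438 |
| Validation, % Change | | 6.4 | 0.2 | 1.5 | 1.6 |
| Test, Mean Rew. | 2908 | 3613 | 3206 | 3272 | 3203 |
| Test, % Change | | 24.3 | 10.2 | 12.5 | 10.2 |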
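The percent change rows in Table 1 are relative to the 0 Adv (vanilla PPO) baseline. As a check, the short sketch below recomputes them for the Hopper validation means; the function and variable names are ours, and because the inputs are the rounded means from the table, the last digit can differ slightly from the reported values.

```python
def percent_change(value, baseline):
    """Relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Hopper validation means as reported in Table 1.
hopper_validation = {"0 Adv": 1182, "DR": 2662, "1 Adv": 1094, "3 Adv": 2039, "5 Adv": 2021}
baseline = hopper_validation["0 Adv"]
changes = {
    name: round(percent_change(mean, baseline), 1)
    for name, mean in hopper_validation.items()
    if name != "0 Adv"
}
print(changes)
```

The single adversary entry comes out negative, i.e. the single-adversary policy scores below the vanilla PPO baseline on the Hopper validation set.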


3 Effect of Domain Randomization Parametrization
From Fig. 3, we see that in the Ant and Hopper domains, the oracle achieves the highest transfer reward in the validation set as expected since the oracle is trained directly on the validation set. Interestingly, we found that the domain randomization policy performed much worse on the Half Cheetah environment, despite having access to the mass and friction coefficients during training. Looking at the performance for each mass and friction combination in Fig. 3(b), we found that the DR agent was able to perform much better at the low friction coefficients and learned to prioritize those values at the cost of significantly worse performance on average. This highlights a potential issue with domain randomization: while training across a wide variety of dynamics parameters can increase robustness, naive parametrizations can cause the policy to exploit subsets of the randomized domain and lead to a brittle policy.


We hypothesize that this is due to the DR objective in Eq. 4 optimizing in expectation over the sampling range. To test this, we created a separate range of ‘good’ friction parameters and compared the robustness of a DR policy trained with a ‘good’ range against a DR policy trained with a ‘bad’ range in Fig. 5. Here we see that a ‘good’ parametrization leads to the expected result where domain randomization is the most robust. We observe that domain randomization under the ‘bad’ parametrization underperforms adversarial training on the validation set despite the validation set literally constituting the training set for domain randomization. This suggests that underlying optimization difficulties are partially to blame for the poor performance of domain randomization. Notably, the adversary-based methods are not susceptible to the same parametrization issues.
Prior work, EPOpt Rajeswaran et al. (2016), has addressed this issue by replacing the uniform average across distributions with the conditional value at risk (CVaR) Chow et al. (2015); Tamar et al. (2015), a soft version of the minimax objective where the optimization is only performed over a small percentage of the worst performing parameters. This interesting approach to align the domain randomization objective with the minimax objective could be made compatible with our approach by training using a subset of the strongest adversaries.
4 Increasing Adversary Population Size
We investigate whether RAP is robust to adversary number as this would be a useful property to minimize hyperparameter search.
Here we hypothesize that while having more adversaries can represent a wider range of dynamics to learn to be robust to, there will be diminishing returns due to the decreased batch size that each adversary receives (the total number of environment steps is held constant across all training variations). We expect decreasing batch size to lead to worse agent policies since the batch will contain under-trained adversary policies that the agent will learn to exploit. We cap the number of adversaries at eleven as our machines ran out of memory at this value. We run ten seeds for every adversary value and Fig. 7 shows the results for Hopper. Agent robustness on the test set increases monotonically up to three adversaries and roughly begins to decrease after that point. This suggests that a trade-off between adversary number and performance exists although we do not definitively show that diminishing batch sizes are the source of this trade-off. However, we observe in Fig. 3 that both three and five adversaries perform well across all studied MuJoCo domains.
In this work we demonstrate that the use of a single adversary to approximate the solution to a minimax problem does not consistently lead to improved robustness. We propose a solution through the use of multiple adversaries (RAP), and demonstrate that this provides robustness across a variety of robotics benchmarks. We also compare RAP with domain randomization and demonstrate that while DR can lead to a more robust policy, it requires careful parametrization of the domain we sample from to ensure robustness. RAP does not require this tuning, allowing for use in domains where appropriate tuning requires extensive prior knowledge or expertise.
There are several open questions stemming from this work. While we empirically demonstrate the effects of RAP, we do not have a compelling theoretical understanding of why multiple adversaries are helping. Perhaps RAP helps approximate a mixed Nash equilibrium as discussed in Sec. 1 or perhaps population based training increases the likelihood that one of the adversaries is strong? Would the benefits of RAP disappear if a single adversary had the ability to represent mixed Nash (for example, by adding a source of randomness to the adversary state)? Another interesting question to ask is whether the minimax games described here satisfy the "gamesofskill" hypothesis Czarnecki et al. (2020) which would provide an optimizationbased reason for including adversary populations.
There are some interesting extensions of this work that we would like to pursue. We have looked at the robustness of our approach in simulated settings; future work will examine whether this robustness transfers to real-world settings. Additionally, our agents are currently memoryless and therefore cannot perform adversary identification; it would be worthwhile to see if the auxiliary task of adversary identification leads to a robust system-identification procedure that improves transfer performance. Our adversaries can also be viewed as forming a task distribution, allowing them to be used in continual learning approaches like MAML (Nagabandi et al., 2018), where domain randomization is frequently used to construct task distributions.
Finally, here we apply adversary populations to the noisy robust MDP; applying the adversary action to the agent action represents a restricted class of dynamical systems. The transfer tests used in this work may not even be included in the set of dynamical systems that can be represented by this adversary class; this restriction may be reducing the transfer performance of the minimax approach. In future work we would like to consider a wider range of dynamical systems by using a more powerful adversary class that can control the dynamics directly.
The authors would like to thank Lerrel Pinto for help understanding and reproducing "Robust Adversarial Reinforcement Learning" as well as insightful discussions of our problem. Additionally, we would like to thank Natasha Jaques and Michael Dennis who helped us develop intuition for why the single adversary case might be flawed. Eugene Vinitsky is a recipient of an NSF Graduate Research Fellowship and funded by X. Yuqing Du is funded by a Berkeley AI Research fellowship & ONR through PECASE N000141612723. Computational resources for this work were provided by an AWS Machine Learning Research grant.
We use the Mujoco ant, cheetah, and hopper environments as a test of the efficacy of our strategy versus the 0 adversary, 1 adversary, and domain randomization baselines. We use the Noisy Action Robust MDP formulation [26] for our adversary parametrization. If the normal system dynamics are
$$s_{k+1} = s_k + f(s_k, a_k)\,\Delta t,$$
the system dynamics under the adversary are
$$s_{k+1} = s_k + f(s_k, a_k + a^{\text{adv}}_k)\,\Delta t,$$
where $a^{\text{adv}}_k$ is the adversary action at time $k$.
The notion here is that the adversary action is passed through the dynamics function and represents some additional set of dynamics. It is standard to clip actions within some boundary but clipping the sum would allow the agent to "cancel" the adversary by always keeping its action at the bounds of the action space. Since we want the adversary to always affect the dynamics irrespective of agent action, we clip the agent and adversary actions separately. The agent is clipped between in all environments and the adversary is clipped between .
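The difference between clipping the summed action and clipping the two actions separately can be sketched as follows. The bound values used in the example (1.0 for the agent, 0.25 for the adversary) are placeholders chosen for illustration, not the values used in the experiments, and the function names are hypothetical.

```python
def clip(values, low, high):
    """Clip each element of a list of floats into [low, high]."""
    return [min(max(v, low), high) for v in values]


def combined_action_sum_clipped(agent_action, adversary_action, agent_bound):
    """Clip only the sum: a saturating agent can cancel the adversary."""
    total = [a + d for a, d in zip(agent_action, adversary_action)]
    return clip(total, -agent_bound, agent_bound)


def combined_action_separate(agent_action, adversary_action,
                             agent_bound, adversary_bound):
    """Clip agent and adversary separately, then add them."""
    a = clip(agent_action, -agent_bound, agent_bound)
    d = clip(adversary_action, -adversary_bound, adversary_bound)
    return [x + y for x, y in zip(a, d)]


if __name__ == "__main__":
    agent = [100.0]      # agent pushes far beyond its bound
    adversary = [-0.25]  # adversary tries to pull the action back
    # Sum clipping: the adversary has no effect on the applied action.
    print(combined_action_sum_clipped(agent, adversary, agent_bound=1.0))
    # Separate clipping: the adversary still shifts the applied action.
    print(combined_action_separate(agent, adversary, 1.0, 0.25))
```

In the first variant the applied action is pinned at the agent bound regardless of what the adversary does, which is the "cancel" behavior described above; in the second the perturbation always reaches the dynamics.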
The MDP through which we train the agent policy is characterized by the following states, actions, and rewards. The state is an observation returned by the environment, and the action is the action taken by the agent. For the reward, we use the standard rewards provided by the OpenAI Gym Mujoco environments at https://github.com/openai/gym/tree/master/gym/envs/mujoco. For the exact functions, please refer to the code at https://github.com/eugenevinitsky/robust_RL_multi_adversary.
The MDP for each adversary is the following. The adversary sees the same states as the agent. The adversary reward is the negative of the agent reward.
For our domain randomization Hopper baseline, we use the following randomization: at each rollout, we scale the friction of all joints by a single value uniformly sampled from [0.7, 1.3]. We also randomly scale the mass of the ’torso’ link by a single value sampled from [0.7, 1.3]. For HalfCheetah and Ant the range for friction is [0.1, 0.9] and for mass the range is [0.5, 1.5].
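The per-rollout sampling described above can be sketched as below. The ranges are the ones stated in the text; the dictionary layout and function name are hypothetical, and sampling the mass scale uniformly is an assumption (the text states a uniform distribution only for friction). Applying the scales to the simulator is left out.

```python
import random

# Friction and mass scaling ranges as stated in the text.
RANDOMIZATION_RANGES = {
    "Hopper": {"friction": (0.7, 1.3), "mass": (0.7, 1.3)},
    "HalfCheetah": {"friction": (0.1, 0.9), "mass": (0.5, 1.5)},
    "Ant": {"friction": (0.1, 0.9), "mass": (0.5, 1.5)},
}


def sample_domain_scales(env_name, rng):
    """Draw one friction scale and one mass scale for a new rollout.

    Assumption: both scales are drawn independently and uniformly.
    """
    ranges = RANDOMIZATION_RANGES[env_name]
    return {
        "friction_scale": rng.uniform(*ranges["friction"]),
        "mass_scale": rng.uniform(*ranges["mass"]),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    for env_name in RANDOMIZATION_RANGES:
        print(env_name, sample_domain_scales(env_name, rng))
```

A single friction scale is drawn per rollout and shared by all joints, matching the description above, so each rollout corresponds to one point in a two-dimensional parameter box.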
In this section we describe in detail all of the holdout tests used.
Table 2: Hopper holdout tests.

| Test | Body with friction coeff. 1.3 | Body with friction coeff. 0.7 |
|---|---|---|
| A | Torso, Leg | Floor, Thigh, Foot |
| B | Floor, Thigh | Torso, Leg, Foot |
| C | Foot, Leg | Floor, Torso, Thigh |
| D | Torso, Thigh, Floor | Foot, Leg |
| E | Torso, Foot | Floor, Thigh, Leg |
| F | Floor, Thigh, Leg | Torso, Foot |
| G | Floor, Foot | Torso, Thigh, Leg |
| H | Thigh, Leg | Floor, Torso, Foot |
The Mujoco geom properties that we modified are attached to a particular body and determine its appearance and collision properties. For the Mujoco holdout transfer tests we pick a subset of the hopper ‘geom’ elements and scale the contact friction values by the maximum friction coefficient, 1.3. Likewise, for the rest of the ‘geom’ elements, we scale the contact friction by the minimum value of 0.7. The body geoms and their names are visible in Fig. 8.
The exact combinations and the corresponding test name are indicated in Table 2 for Hopper.
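The construction of a single holdout test can be sketched as a mapping from geom names to friction scales. The geom names and the coefficients 1.3 and 0.7 are taken from the Hopper table; the function name and data layout are hypothetical, and writing the scaled values into the Mujoco model is not shown.

```python
HOPPER_GEOMS = ("Floor", "Torso", "Thigh", "Leg", "Foot")


def holdout_friction_scales(high_geoms, all_geoms, high_coeff, low_coeff):
    """Map each geom to its friction scale for one holdout test.

    Geoms in `high_geoms` get `high_coeff`; all remaining geoms get
    `low_coeff`.
    """
    unknown = set(high_geoms) - set(all_geoms)
    if unknown:
        raise ValueError(f"unknown geoms: {sorted(unknown)}")
    return {
        geom: (high_coeff if geom in high_geoms else low_coeff)
        for geom in all_geoms
    }


if __name__ == "__main__":
    # Hopper Test A: Torso and Leg are scaled by 1.3, the rest by 0.7.
    test_a = holdout_friction_scales({"Torso", "Leg"}, HOPPER_GEOMS, 1.3, 0.7)
    print(test_a)
```

Every geom receives one of the two extreme coefficients, so each test is a corner of the friction range rather than a sample from its interior.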
Table 3: HalfCheetah holdout tests.

| Test | Geom with friction coeff. 0.9 |
|---|---|
| A | Torso, Head, Fthigh |
| B | Floor, Head, Fshin |
| C | Bthigh, Bshin, Bfoot |
| D | Floor, Torso, Head |
| E | Floor, Bshin, Ffoot |
| F | Bthigh, Bfoot, Ffoot |
| G | Bthigh, Fthigh, Fshin |
| H | Head, Fshin, Ffoot |
The Mujoco geom properties that we modified are attached to a particular body and determine its appearance and collision properties. For the Mujoco holdout transfer tests we pick a subset of the cheetah ‘geom’ elements and scale the contact friction values by the maximum friction coefficient, 0.9. Likewise, for the rest of the ‘geom’ elements, we scale the contact friction by the minimum value of 0.1. The body geoms and their names are visible in Fig. 9.
The exact combinations and the corresponding test name are indicated in Table 3 for HalfCheetah.
Table 4: Ant holdout tests.

| Test | Geom with friction coeff. 0.9 |
|---|---|
| A | FrontLegLeft, AuxFrontLeft, AuxBackLeft |
| B | Torso, AuxFrontLeft, BackLegRight |
| C | FrontLegRight, AuxFrontRight, BackLegLeft |
| D | Torso, FrontLegLeft, AuxFrontLeft |
| E | FrontLegLeft, AuxFrontRight, AuxBackRight |
| F | FrontLegRight, BackLegLeft, AuxBackRight |
| G | FrontLegLeft, AuxBackLeft, BackLegRight |
| H | AuxFrontLeft, BackLegRight, AuxBackRight |
We will use ‘torso’ to indicate the head piece, ‘leg’ to refer to one of the four legs that contact the ground, and ‘aux’ to indicate the geom that connects the leg to the torso. Since the ant is symmetric we adopt a convention that two of the legs are front-left and front-right and two legs are back-left and back-right. Fig. 10 depicts the convention. For the Mujoco holdout transfer tests we pick a subset of the ant ‘geom’ elements and scale the contact friction values by the maximum friction coefficient, 0.9. Likewise, for the rest of the ‘geom’ elements, we scale the contact friction by the minimum value of 0.1.
The exact combinations and the corresponding test name are indicated in Table 4 for Ant.
Here we recompute the values of all the results and display them with appropriate standard deviations in tabular form. Tables 5, 6, 7 contain the test results with appropriate standard deviations for Hopper, HalfCheetah, and Ant respectively.
[Table 5: Hopper results for Tests A to H. Columns: 0 Adv, 1 Adv, 3 Adv, 5 Adv, Domain Rand. Cell values missing.]
[Table 6: HalfCheetah results for Tests A to H. Columns: 0 Adv, 1 Adv, 3 Adv, 5 Adv, Domain Rand. Cell values missing.]
[Table 7: Ant results for Tests A to H. Columns: 0 Adv, 1 Adv, 3 Adv, 5 Adv, Domain Rand. Cell values missing.]
Finally, we place the heatmaps for Ant here for reference in Fig. 11.
Here we reproduce the hyperparameters we used in each experiment and compute the expected runtime and cost of each experiment. Where several values are listed for a parameter, each value was used for one run. Otherwise the parameter was kept fixed at the indicated value.
For Mujoco the hyperparameters are:
Learning rate:
for half cheetah
for hopper
Bounds on adversary action space:
Generalized Advantage Estimation
for half cheetah
for hopper and ant
Discount factor
Training batch size:
SGD minibatch size:
Number of SGD steps per iteration:
Number of iterations:
We set the seed to 0 for all hyperparameter runs.
The maximum horizon is 1000 steps.
For the validation across seeds we used 10 seeds ranging from 0 to 9. Values of hyperparameters selected for each adversary number can be found by consulting the codebase. All other hyperparameters are the default values in RLlib [11] 0.8.0.
For all of our experiments we used AWS EC2 c4.8xlarge instances which come with 36 virtual CPUs. For the Mujoco experiments, we use 2 nodes and 11 CPUs per hyperparameter, leading to one full hyperparameter sweep fitting onto the 72 CPUs. We run the following set of experiments and ablations, each of which takes 8 hours.
0 adversaries
1 adversary
3 adversaries
5 adversaries
Domain randomization
for a total of 5 experiments for each of Hopper, Cheetah, Ant. For the best hyperparameters and each experiment listed above we run a seed search with 6 CPUs used per seed, a process which takes about 12 hours. This leads to a total of node hours and CPU hours. At a cost of dollars per node per hour for EC2 spot instances, this gives dollars to fully reproduce our results for this experiment. If the chosen hyperparameters are used and only the seeds are swept, this is dollars.
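The bookkeeping behind such an estimate can be sketched as follows. The node size, node count, experiment count, and durations come from the text above; the assumptions are that all 10 seeds of one experiment run concurrently, that the 12-hour seed search applies to each of the 15 experiments, and that the spot price is supplied by the reader (no price is assumed here). The function name is hypothetical.

```python
import math

CPUS_PER_NODE = 36        # c4.8xlarge virtual CPUs, as stated in the text
NUM_ENVS = 3              # Hopper, Cheetah, Ant
EXPERIMENTS_PER_ENV = 5   # 0, 1, 3, 5 adversaries and domain randomization


def estimate_compute(price_per_node_hour, seeds=10, cpus_per_seed=6,
                     sweep_nodes=2, sweep_hours=8, seed_hours=12):
    """Estimate node hours, CPU hours, and cost for the Mujoco runs."""
    runs = NUM_ENVS * EXPERIMENTS_PER_ENV
    sweep_node_hours = runs * sweep_nodes * sweep_hours
    # Assumption: all seeds of one experiment run at the same time.
    seed_nodes = math.ceil(seeds * cpus_per_seed / CPUS_PER_NODE)
    seed_node_hours = runs * seed_nodes * seed_hours
    total_node_hours = sweep_node_hours + seed_node_hours
    return {
        "total_node_hours": total_node_hours,
        "total_cpu_hours": total_node_hours * CPUS_PER_NODE,
        "full_cost": total_node_hours * price_per_node_hour,
        "seeds_only_cost": seed_node_hours * price_per_node_hour,
    }


if __name__ == "__main__":
    # A price of 1.0 is a placeholder so the costs read as node hours.
    print(estimate_compute(price_per_node_hour=1.0))
```

Under these assumptions the hyperparameter sweeps account for 240 node hours and the seed searches for 360, so the seed searches dominate the total.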