1 Introduction
The task of learning from demonstrations (LfD) lies at the heart of many artificial intelligence applications
[28, 37]. By observing the expert’s behavior, an agent learns a mapping between world states and actions. This so-called policy enables the agent to select and perform an action, given the current world state. Although this policy can be learned directly from the expert’s behaviors, inferring the reward function underlying the policy is generally considered the most succinct, robust, and transferable methodology for the LfD task [1]. Inferring the reward function, which is the objective of Inverse Reinforcement Learning (IRL), is often very challenging in real-world scenarios. The demonstrations come from multiple experts who can have different intentions, and their behaviors are consequently not well modeled with a single reward function. Therefore, in this study, we extend the concept of the mixture of conditional maximum entropy models and propose a deep IRL framework to infer an a priori unknown number of reward functions from experts’ demonstrations without intention labels.

Standard IRL can be described as the problem of extracting a reward function that is consistent with the observed behaviors [33]. Obtaining the exact reward function is an ill-posed problem, since many different reward functions can explain the same observed behaviors [25, 39]. Ziebart et al. [39] tackled this ambiguity by employing the principle of maximum entropy
[15]. The principle states that the probability distribution that best represents the current state of knowledge is the one with the largest entropy
[15]. Therefore, Ziebart et al. [39] chose the distribution with maximal information entropy to model the experts’ behaviors. Maximum entropy IRL has been widely employed in various applications [34, 16]. However, this method suffers from the strong assumption that the experts have one single intention in all demonstrations. In this study, we explore the principle of the mixture of maximum entropy models [30], which inherits the advantages of the maximum entropy principle while at the same time being capable of modeling multi-intention behaviors.

In many real-world applications, the demonstrations are often collected from multiple experts whose intentions are potentially different from each other [9, 2, 3, 5]. This leads to multiple reward functions, which is in direct contradiction with the single-reward assumption in traditional IRL. To address this problem, Babes et al. [5] proposed a clustering-IRL scheme where the class of each demonstration is learned jointly with the respective reward function. Despite the recovery of multiple reward functions, the number of clusters in this method is assumed to be known a priori. To overcome this assumption, Choi et al. [9] presented a nonparametric Bayesian approach using the Dirichlet Process Mixture (DPM) to infer an unknown number of reward functions from unlabeled demonstrations. However, their method is formulated under the assumption that the reward functions are linear combinations of a set of world state features. In our work, we do not make this linearity assumption and model the reward functions using deep neural networks.

DPM is a stochastic process in the Bayesian nonparametric framework that deals with mixture models with a countably infinite number of mixture components
[24]. In general, full Bayesian inference in DPM models is not feasible; instead, approximate methods like Markov chain Monte Carlo (MCMC)
[19, 4] and variational inference [7] are employed. When deep neural networks are involved in a DPM (e.g., deep nonlinear reward functions in IRL), approximate methods may not be able to scale to high-dimensional parameter spaces. MCMC sampling methods are known to converge slowly [29, 7], and variational inference algorithms suffer from restrictions on the distribution family of the observable data, as well as various truncation assumptions for the variational distribution to yield a finite-dimensional representation [23, 11]. These limitations make approximate Bayesian inference methods impractical for DPM models with deep neural networks. Apart from that, algorithms for maximum likelihood estimation like standard EM are no longer tractable when dealing with DPM models. The main reason is that the number of mixture components with nonzero probabilities grows exponentially, so that after some iterations the Expectation step is no longer available in closed form. However, inspired by two variants of the EM algorithm that cope with an infeasible Expectation step
[36, 8], we propose two solutions in which the Expectation step is either estimated numerically with sampling (based on Monte Carlo EM [36]) or computed analytically and then replaced with a sample from it (based on stochastic EM [8]).

This study’s main contribution is to develop an IRL framework that benefits from 1) the maximum entropy principle, 2) deep nonlinear reward functions, and 3) accounting for an unknown number of experts’ intentions. To the best of our knowledge, we are the first to present an approach that combines all three capabilities. In our proposed framework, the experts’ behavioral distribution is modeled as a mixture of conditional maximum entropy models. The reward functions are parameterized as a deep reward network consisting of two parts: 1) a base reward model, and 2) an adaptively growing set of intention-specific reward models. The base reward model takes as input the state features and outputs a set of reward features shared by all intention-specific reward models. The intention-specific reward models take the reward features and output the rewards for the respective expert’s intention. A novel adaptive approach, based on the concept of the Chinese Restaurant Process (CRP), is proposed to infer the number of experts’ intentions from unlabeled demonstrations. To train the framework, we propose and compare two novel EM algorithms, one based on stochastic EM and the other on Monte Carlo EM. In Section 3, the problem of multi-intention IRL is defined, followed by our two novel EM algorithms in Section 4. The results are evaluated on three available simulated benchmarks, two of which are extended in this paper for multi-intention IRL, and compared with two baselines [5, 9]. These experimental results are reported in Section 5, and Section 6 is devoted to conclusions. The source code to reproduce the experiments is publicly available at https://github.com/tuemps/damiirl.

2 Related Works
In the past decades, a number of studies have addressed the problem of multi-intention IRL. A comparison of various methods for multi-intention IRL, together with our approach, is depicted in Table 1. In an early work, Dimitrakakis and Rothkopf [10] formulated the problem of learning from unlabeled demonstrations as a multi-task learning problem. Generalizing the Bayesian IRL approach of Ramachandran and Amir [32], they assumed that each observed trajectory corresponds to one specific reward function, all of which share a common prior. The same approach has also been employed by Noothigattu et al. [27], who assumed that each expert’s reward function is a random permutation of one shared reward function. Babes et al. [5] took a different approach and addressed the problem as a clustering task with IRL. They proposed an EM approach that clusters the observed trajectories by inferring the reward function for each cluster. Using maximum likelihood, they estimated the reward parameters for each cluster. The main limitation of the EM clustering approach is that the number of clusters has to be specified as an input parameter [5, 26]. To overcome this assumption, Choi and Kim [9] employed a nonparametric Bayesian approach via the DPM model. Using an MCMC sampler, they were able to infer an unknown number of reward functions, which are linear combinations of state features. Other authors have employed the same methodology in the literature [31, 22, 2]. All of the above methods are developed on the basis of model-based reinforcement learning (RL), in which the model of the environment is assumed to be known. In the past few years, several approximate, model-free methods have been developed for IRL with multiple reward functions [13, 20, 14, 21]. Such methods aim to solve large-scale problems by approximating the Bellman optimality equation with model-free RL.
In this study, we constrain ourselves to model-based RL and propose a multi-intention IRL approach to infer an unknown number of experts’ intentions and corresponding nonlinear reward functions from unlabeled demonstrations.
Type  Features  
Models 
Dimitrakakis and Rothkopf [10]  ✓  ✓  
Babes et al. [5]  ✓  ✓  
Nguyen et al. [26]  ✓  ✓  
Choi and Kim [9]  ✓  ✓  ✓  
Rajasekaran et al. [31]  ✓  ✓  ✓  
Li et al. [20]  ✓  ✓  ✓  
Hausman et al. [13]  ✓  ✓  ✓  
Lin and Zhang [21]  ✓  ✓  
Hsiao et al. [14]  ✓  ✓  ✓  
Ours  ✓  ✓  ✓  ✓ 
3 Problem Definition
In this section, the problem of multi-intention IRL is defined. To facilitate the flow, we first formalize the multi-intention RL problem. For both problems, we follow the conventional modelling of the environment as a Markov Decision Process (MDP). A finite-state MDP in a multi-intention RL problem is a tuple (S, A, T, γ, b0, {R_k}_{k=1}^{K}), where S is the state space, A is the action space, T is the transition probability function, γ is the discount factor, b0 is the probability of starting in a state, and R_k is the k-th reward function, with K the total number of intentions. A policy is a mapping function π : S → A. The value of a policy π with respect to the reward function R_k is the expected discounted reward for following the policy, V^π(R_k) = E[Σ_t γ^t R_k(s_t) | π]. The optimal policy π*_k for the reward function R_k is the policy that maximizes the value function for all states and satisfies the respective Bellman optimality equation [35].

In multi-intention IRL, the context of this study, a finite-state MDP\R is a tuple (S, A, T, γ, b0, {τ_n}_{n=1}^{N}), where τ_n is the n-th demonstration and N is the total number of demonstrations. In this work, it is assumed that there is a total of K intentions, each of which corresponds to one reward function, so that each demonstration τ_n is generated from the optimal policy π*_k of the k-th reward function. It is further assumed that the demonstrations are without intention labels, i.e., they are unlabeled. Therefore, the goal is to infer the number of intentions K and the respective reward function of each intention. In the next section, we model the experts’ behaviors as a mixture of conditional maximum entropy models, parameterize the reward functions via deep neural networks, and propose a novel approach to infer an unknown number of experts’ intentions from unlabeled demonstrations.

4 Approach
In the proposed framework for multi-intention IRL, the experts’ behavioral distribution is modeled as a mixture of conditional maximum entropy models. The mixture of conditional maximum entropy models is a generalization of the standard maximum entropy formulation for cases where the data distribution arises from a mixture of simpler underlying latent distributions [30]. According to this principle, a mixture of conditional maximum entropy models is a promising candidate to model the multi-intention behaviors of the experts. The experts’ behavior under the k-th intention is defined via a conditional maximum entropy distribution:
P(τ_n | z_n = k, θ) = exp(R_{θ_k}(τ_n)) / Z_{θ_k}   (1)
where z_n is the latent intention vector of trajectory τ_n, R_{θ_k}(τ_n) = Σ_{s ∈ τ_n} r_{θ_k}(s) is the reward of the trajectory with respect to the k-th reward function, with r_{θ_k}(s) as the state reward value, and Z_{θ_k} is the partition function. We define the reward function as a deep neural network with a finite set of parameters θ_k, which consists of a base reward model and an intention-specific reward model (see Fig. 1). The base reward model, with a finite set of parameters φ, takes the state feature vector and outputs the state reward feature vector. The state reward feature vector that is produced by the base reward model is input to all intention-specific reward models. The intention-specific reward model, with a finite set of parameters ψ_k, takes the state reward feature vector and outputs the state reward value. Therefore, the total set of reward parameters is θ = {φ, ψ_1, …, ψ_K}. The reward of the trajectory with respect to the k-th reward function can be further obtained as R_{θ_k}(τ_n) = μ_{τ_n}ᵀ r_{θ_k}, where μ_{τ_n} is the expected State Visitation Frequency (SVF) vector for trajectory τ_n and r_{θ_k} is the vector of reward values of all states with respect to the k-th reward function.

In order to infer the number of intentions K, we propose an adaptive approach in which the number of intentions changes adaptively whenever a trajectory is visited/revisited. For this purpose, at each iteration we first assume to have N − 1 demonstrated trajectories that are already assigned to K intentions with known latent intention vectors z_{¬n}. Then, we visit/revisit a demonstrated trajectory τ_n, and the task is to obtain its latent intention vector z_n, which can be assigned to a new intention K + 1, and to update the reward parameters θ. As emphasized before, our work aims to develop a method in which K, the number of intentions, is a priori unknown and can, in theory, be arbitrarily large. Now we define the predictive distribution for the trajectory τ_n as a mixture of conditional maximum entropy models:

P(τ_n | z_{¬n}, θ) = Σ_{k=1}^{K+1} P(z_n = k | z_{¬n}) P(τ_n | z_n = k, θ)   (2)
where P(z_n = k | z_{¬n}) is the prior intention assignment probability for trajectory τ_n, given all other latent intention vectors. In the case of K intentions, we define a multinomial prior distribution over all latent intention vectors z_{1:N}:
P(z_{1:N} | π) = ∏_{k=1}^{K} π_k^{N_k}   (3)
where N_k is the number of trajectories with intention k and π = (π_1, …, π_K) is the vector of mixing coefficients with Dirichlet prior distribution Dir(η/K, …, η/K), where η is the concentration parameter. The main problematic parameters are the mixing coefficients. Marginalizing out the mixing coefficients and separating the latent intention vector for trajectory τ_n yields (see Appendix A for the full derivation):
P(z_n = k | z_{¬n}) = N_k^{¬n} / (N − 1 + η) for an existing intention k ≤ K, and η / (N − 1 + η) for a new intention k = K + 1   (4)
where N_k^{¬n} is the number of trajectories assigned to intention k excluding the n-th trajectory, N_k^{¬n} / (N − 1 + η) is the prior probability of assigning the new trajectory τ_n to an existing intention k, and η / (N − 1 + η) is the prior probability of assigning the new trajectory to a new intention K + 1. Equation (4) is known as the CRP representation of the DPM [24]. Considering the exchangeability property [12], the following optimization problem is defined:

θ* = argmax_θ Σ_{n=1}^{N} log Σ_{k=1}^{K+1} P(z_n = k | z_{¬n}) P(τ_n | z_n = k, θ)   (5)
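As a concrete sketch, the CRP prior of Eq. (4) assigns a trajectory to an existing intention with probability proportional to the number of trajectories already assigned to it, or to a fresh intention with probability proportional to the concentration parameter. A few lines of Python illustrate this (the function name and interface are ours, not from the paper):

```python
import numpy as np

def crp_prior(counts, eta):
    """CRP prior: probability of assigning a trajectory to each existing
    intention (proportional to its current trajectory count) or to a
    brand-new intention (proportional to the concentration parameter eta)."""
    counts = np.asarray(counts, dtype=float)
    probs = np.append(counts, eta) / (counts.sum() + eta)
    return probs  # probs[:-1]: existing intentions, probs[-1]: new intention
```

For example, with three trajectories on one intention, one on another, and η = 1, the prior is (0.6, 0.2, 0.2): popular intentions attract new trajectories, but some mass is always reserved for opening a new one.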
The parameters θ can be estimated via Expectation Maximization (EM) [6]. Differentiating with respect to θ yields the following E-step and M-step (see Appendix B for the full derivation):

4.0.1 E-step
Evaluation of the posterior distribution over the latent intention vector z_n:

ν_{n,k} := P(z_n = k | τ_n, z_{¬n}, θ) = N_k^{¬n} P(τ_n | z_n = k, θ) / C_n,  for k = 1, …, K   (6)

and for k = K + 1:

ν_{n,K+1} = η P(τ_n | z_n = K + 1, θ) / C_n   (7)

where we have defined the normalizing constant C_n = Σ_{j=1}^{K} N_j^{¬n} P(τ_n | z_n = j, θ) + η P(τ_n | z_n = K + 1, θ).
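A numerically stable sketch of this E-step, assuming the per-intention trajectory log-likelihoods are already available (the last entry corresponding to a freshly initialized intention; the interface is ours, not the paper's):

```python
import numpy as np

def e_step(log_likelihoods, counts, eta):
    """Posterior over the latent intention of one trajectory: the CRP prior
    (counts for existing intentions, eta for a new one) times the
    maximum-entropy likelihood of the trajectory under each intention's
    reward, normalized. Computed in log space for numerical stability."""
    counts = np.asarray(counts, dtype=float)
    prior = np.append(counts, eta)            # unnormalized CRP prior
    log_post = np.log(prior) + np.asarray(log_likelihoods)
    log_post -= log_post.max()                # guard against underflow
    post = np.exp(log_post)
    return post / post.sum()
```

With equal likelihoods, the posterior simply reproduces the normalized CRP prior, which is a useful sanity check.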
4.0.2 M-step
Update of the parameter values θ with the gradient:

∂L/∂θ = Σ_{n=1}^{N} Σ_{k=1}^{K+1} ν_{n,k} (μ_{τ_n} − E[μ]_{θ_k})ᵀ ∂r_{θ_k}/∂θ   (8)

where E[μ]_{θ_k} is the expected SVF vector under the parameterized reward function r_{θ_k} [39].

When K approaches infinity, the EM algorithm is no longer tractable, since the number of mixture components with nonzero probabilities grows exponentially. As a result, after some iterations, the E-step is no longer available in closed form. We propose two solutions for the estimation of the reward parameters, inspired by the stochastic and Monte Carlo EM algorithms. Both proposed solutions are evaluated in depth and compared in Section 5.
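The deep reward network of this section, a shared base reward model feeding a growing set of linear intention-specific heads, can be sketched in plain numpy. The layer sizes, initialization, and class interface below are illustrative assumptions, not the paper's exact implementation (which uses five hidden layers of width 256):

```python
import numpy as np

rng = np.random.default_rng(0)

class RewardNetwork:
    """Sketch of the deep reward network: a shared base model mapping state
    features to reward features, plus linear intention-specific heads that
    are spawned on demand as the CRP opens new intentions."""

    def __init__(self, n_features, n_hidden=16):
        self.W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.heads = []            # one weight vector per discovered intention
        self.n_hidden = n_hidden

    def base(self, phi):
        # base reward model: state features -> shared reward features (ReLU)
        return np.maximum(0.0, phi @ self.W1 + self.b1)

    def add_intention(self):
        # spawn a new intention-specific head; returns its index
        self.heads.append(rng.normal(0.0, 0.1, self.n_hidden))
        return len(self.heads) - 1

    def reward(self, phi, k):
        # state reward under intention k: linear head on shared features
        return self.base(phi) @ self.heads[k]
```

Because all heads consume the same shared reward features, gradient updates for one intention also refine the representation used by every other intention.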
4.1 First solution with stochastic expectation maximization
Stochastic EM introduces a stochastic step (S-step) after the E-step that replaces the full expectation with a single sample [8]. Alg. 1 summarizes the first solution to multi-intention IRL via the stochastic EM algorithm when the number of intentions is no longer known. Given (6) and (7), first the posterior distribution over the latent intention vector z_n for trajectory τ_n is obtained. Then, the full expectation is replaced with a sample from the posterior distribution. Finally, the reward parameters are updated via (8).
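The S-step itself is tiny; a sketch (the function name and the one-hot return convention are ours):

```python
import numpy as np

def s_step(posterior, rng):
    """S-step of stochastic EM: replace the full posterior over a
    trajectory's latent intention with a single hard assignment sampled
    from it. The last entry of `posterior` is the new-intention probability."""
    posterior = np.asarray(posterior, dtype=float)
    z = rng.choice(len(posterior), p=posterior)
    one_hot = np.zeros(len(posterior))
    one_hot[z] = 1.0
    return z, one_hot
```

The sampled one-hot vector then plays the role of the responsibilities ν_{n,k} in the M-step gradient, so only the sampled intention's head receives an update for this trajectory.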
4.2 Second solution with Monte Carlo expectation maximization
The Monte Carlo EM algorithm is a modification of the EM algorithm in which the expectation in the E-step is computed numerically via Monte Carlo simulations [36]. As indicated, Alg. 1 relies on the full posterior distribution, which can be time-consuming to evaluate. Therefore, another solution for multi-intention IRL is presented in which the E-step is performed through a Metropolis-Hastings sampler (see Alg. 2 for the summary). First, a new intention assignment z′_n for trajectory τ_n is sampled from the prior distribution of (4); then z_n is set to z′_n with acceptance probability min(1, r), where (see Appendix C for the full derivation):
r = (Z_{θ_{z_n}} / Z_{θ_{z′_n}}) · exp(R_{θ_{z′_n}}(τ_n) − R_{θ_{z_n}}(τ_n))   (9)

with z_n the current and z′_n the proposed intention assignment, and Z_{θ_k} the partition function of intention k.
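A minimal sketch of one such Metropolis-Hastings update, assuming a function that returns a trajectory's log-likelihood under a given intention index (names and interface are ours; a proposed index beyond the existing intentions stands for a new intention):

```python
import numpy as np

def mh_step(z_old, log_lik, counts, eta, rng):
    """One Metropolis-Hastings update of a trajectory's intention assignment:
    propose from the CRP prior, then accept with probability
    min(1, likelihood ratio), computed in log space."""
    counts = np.asarray(counts, dtype=float)
    probs = np.append(counts, eta) / (counts.sum() + eta)   # CRP proposal
    z_new = rng.choice(len(probs), p=probs)
    log_ratio = log_lik(z_new) - log_lik(z_old)
    if np.log(rng.random()) < min(0.0, log_ratio):
        return z_new   # accepted
    return z_old       # rejected, keep the current assignment
```

Because the proposal is the prior, the acceptance ratio reduces to the likelihood ratio of Eq. (9); the sampler therefore never needs the full posterior over all intentions, which is what makes each iteration cheaper than the stochastic EM variant.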
5 Experimental Results
In this section, we evaluate the performance of our proposed methods through several experiments with three goals: 1) to show the advantages of our methods in comparison with the baselines in environments with both linear and nonlinear rewards, 2) to demonstrate the advantages of adaptively inferring the number of intentions rather than predefining a fixed number, and 3) to depict the strengths and weaknesses of our proposed algorithms with respect to each other.
5.1 Benchmarks
In order to compare the performance of the various models in depth, the experiments are conducted in three different environments: GridWorld, Multi-intention ObjectWorld, and Multi-intention BinaryWorld. Variants of all three environments have been widely employed in the IRL literature [18, 38].

GridWorld [9] is an environment with 64 states and four actions per state, with a 20% probability of moving randomly. The grid cells are partitioned into non-overlapping regions, and the feature function is defined by a binary indicator function for each region. Three reward functions are generated as linear combinations of state features, with reward weights sampled to have a nonzero value with probability 0.2. The main idea behind using this environment is to compare all the models in aspects other than their capability of handling linear/nonlinear reward functions.

Multi-intention ObjectWorld (M-ObjectWorld) is our extension of ObjectWorld [18] for multi-intention IRL. ObjectWorld is a grid of states with five actions per state, with a 30% chance of moving in a different random direction. Objects with two different inner and outer colors are randomly placed, and the binary state features are obtained based on the Euclidean distance to the nearest object with a specific inner or outer color. Unlike ObjectWorld, M-ObjectWorld has six different reward functions, each of which corresponds to one intention. The intentions are defined for each cell based on three rules: 1) within 3 cells of outer color one and within 2 cells of outer color two, 2) just within 3 cells of outer color one, and 3) everywhere else (see Table 2). Due to the large number of irrelevant features and the nonlinearity of the reward rules, the environment is challenging for methods that learn linear reward functions. Fig. 2 (top three) shows a zoom-in of M-ObjectWorld with three reward functions and the respective optimal policies.

Multi-intention BinaryWorld (M-BinaryWorld) is our extension of BinaryWorld [38] for multi-intention IRL. Similarly, BinaryWorld has five actions per state with a 30% chance of moving in a different random direction, but every state is randomly occupied with one of two colored objects. The feature vector for each state consequently consists of a binary vector encoding the color of each object in the neighborhood. Similar to M-ObjectWorld, six different intentions can be defined for each cell of M-BinaryWorld based on three rules: 1) four neighboring cells have color one, 2) five neighboring cells have color one, and 3) everything else (see Table 2). Since in M-BinaryWorld the reward depends on a higher representation of the basic features, the environment is arguably more challenging than the previous ones. Therefore, most of the experiments are carried out in this environment. Fig. 2 (bottom three) shows a zoom-in of M-BinaryWorld with three different reward functions and policies.

In order to assess the generalizability of the models, the experiments are also conducted on transferred environments. In transferred environments, the learned reward functions are re-evaluated on new randomized environments.
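The GridWorld feature construction described above (binary indicator features over non-overlapping square regions) can be sketched as follows; the grid and region sizes below are illustrative, since the exact region size used in the paper is not stated here:

```python
import numpy as np

def region_features(grid_size, region_size):
    """Binary indicator features for a GridWorld partitioned into
    non-overlapping square regions: feature j of a state is 1 iff the
    state's cell lies in region j."""
    n_per_side = grid_size // region_size
    n_regions = n_per_side ** 2
    phi = np.zeros((grid_size * grid_size, n_regions))
    for row in range(grid_size):
        for col in range(grid_size):
            s = row * grid_size + col                       # state index
            region = (row // region_size) * n_per_side + (col // region_size)
            phi[s, region] = 1.0
    return phi
```

Each state activates exactly one feature, so a linear reward over these features is piecewise constant on the regions, which is why GridWorld isolates the multi-intention aspect from the linear/nonlinear reward aspect.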
5.2 Models
In this study, we compare our methods with existing approaches that can handle IRL with multiple intentions and constrain the experiments to model-based methods. The following models are evaluated on the benchmarks:

EM-MLIRL(K), proposed by Babes et al. [5]. This method requires the number of experts’ intentions to be known. To study the influence of the setting of K on this method, we evaluate it for several values of K.

DPM-BIRL, a nonparametric multi-intention IRL method proposed by Choi and Kim [9].

SEM-MIIRL, our proposed solution based on stochastic EM.

MCEM-MIIRL, our proposed solution based on Monte Carlo EM.

EM-MIIRL, a simplified variant of our approach where the concentration parameter is zero and the number of intentions is fixed.
5.3 Metric
Following the same convention used in [9], the imitation performance is evaluated by the average expected value difference (EVD). The EVD measures the performance difference between the expert’s optimal policy and the optimal policy induced by the learned reward function. For the n-th demonstration, EVD_n = |V(π_n; R_n) − V(π̂_n; R_n)|, where π_n and R_n are the true policy and reward function for the n-th demonstration, respectively, and π̂_n is the optimal policy under the predicted reward function for that demonstration. In all experiments, a lower average EVD corresponds to better imitation performance.
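The EVD can be computed by evaluating both policies under the true reward via the Bellman equations; a sketch for a tabular MDP with deterministic policies and a state-only reward (the interfaces are ours, not the paper's):

```python
import numpy as np

def policy_value(T, R, policy, gamma=0.9):
    """Evaluate a deterministic policy under reward R by solving the linear
    Bellman equations V = R + gamma * T_pi V.
    T: (A, S, S) transition tensor; policy: (S,) action indices."""
    n_states = len(R)
    T_pi = T[policy, np.arange(n_states)]   # row s is T[policy[s], s, :]
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)

def evd(T, R_true, pi_true, pi_learned, b0, gamma=0.9):
    """Expected value difference: value lost, under the true reward, by
    following the policy induced by the learned reward instead of the
    expert's optimal policy, weighted by the start distribution b0."""
    return b0 @ (policy_value(T, R_true, pi_true, gamma)
                 - policy_value(T, R_true, pi_learned, gamma))
```

Note that both policies are scored against the true reward, so the EVD is zero whenever the learned reward induces the expert's policy, even if the recovered reward values themselves differ.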
5.4 Implementation details
In our experiments, we employed a fully connected neural network with five hidden layers of dimension 256 and rectified linear units for the base reward model, and a set of linear functions represents the intention-specific reward models. The reward network is trained for 200 epochs using Adam [17] with a fixed learning rate of 0.001. To ease the reproducibility of our work, the source code is shared with the community at https://github.com/tuemps/damiirl.

5.5 Results
Each experiment is repeated six times with different random environments, and the results are shown in the form of means (lines) and standard errors (shadings). The demonstration length is fixed to 40 timesteps for GridWorld and to 8 timesteps for both M-ObjectWorld and M-BinaryWorld.
Fig. 4 and Fig. 5 show the imitation performance of our SEM-MIIRL and MCEM-MIIRL in comparison with the two baselines, EM-MLIRL(K) and DPM-BIRL, for a varying number of demonstrations per reward function in the original and transferred environments, respectively. Each expert is assigned to one out of three reward functions (intentions A, B, and C in M-ObjectWorld and M-BinaryWorld), and the concentration parameter is set to one. The results clearly show that our methods achieve significantly lower average-EVD errors compared to the existing methods, especially in the nonlinear environments of M-ObjectWorld and M-BinaryWorld, with SEM-MIIRL slightly outperforming MCEM-MIIRL.
To highlight the importance of inferring the number of intentions, we compared the performance of our SEM-MIIRL and MCEM-MIIRL with two simplified variants, 2EM-MIIRL and 5EM-MIIRL, where the concentration parameter is set to zero and the number of intentions is fixed to 2 and 5, respectively. Fig. 6 shows the results of these comparisons for a number of true reward functions varying from one to six (from intentions {A} to {A, B, C, D, E, F}) in both the original and transferred M-BinaryWorld. The number of demonstrations per reward function is fixed for both SEM-MIIRL and MCEM-MIIRL. As depicted, overestimation and underestimation of the number of reward functions, as happens frequently in both 2EM-MIIRL and 5EM-MIIRL, deteriorate the imitation performance, while the adaptability of SEM-MIIRL and MCEM-MIIRL yields less sensitivity to changes in the number of true reward functions.

Further experiments were conducted to assess and compare MCEM-MIIRL and SEM-MIIRL in more depth. Fig. 7 depicts the effect of the concentration parameter on both the average EVD and the number of predicted intentions. The number of demonstrations per reward function is fixed and the intentions are {A, B, C}. As shown, the best value for the concentration parameter lies between 0.5 and 1: lower values lead to a higher average EVD and a lower number of predicted intentions, while higher values result in a higher average EVD and a higher number of predicted intentions for both MCEM-MIIRL and SEM-MIIRL.

The final experiment is devoted to the convergence behavior of MCEM-MIIRL and SEM-MIIRL. The number of demonstrations per reward function is again fixed, the intentions are {A, B, C}, and the concentration parameter is fixed. As shown in Fig. 8 (left image), the per-iteration execution time of MCEM-MIIRL is lower than that of SEM-MIIRL. The main reason is that SEM-MIIRL evaluates the posterior distribution over all latent intentions.
However, this extra operation guarantees faster convergence of SEM-MIIRL, making it overall more efficient than MCEM-MIIRL, as can be seen in Fig. 8 (right image).
6 Conclusions
We proposed an inverse reinforcement learning framework to recover complex reward functions by observing experts whose behaviors originate from an unknown number of intentions. We presented two algorithms that are able to consistently recover multiple, highly nonlinear reward functions, and whose benefits were pointed out through a set of experiments. For this, we extended two complex benchmarks for multi-intention IRL, on which our algorithms distinctly outperformed the baselines. We also demonstrated the importance of inferring, rather than underestimating or overestimating, the number of experts’ intentions.

Having shown the benefits of our approach in inferring the unknown number of experts’ intentions from a collection of demonstrations via model-based RL, we aim to extend the same approach to model-free environments by employing approximate RL methods.
Acknowledgments
This research has received funding from ECSEL JU in collaboration with the European Union’s 2020 Framework Programme and National Authorities, under grant agreement no. 783190.
References
[1] Abbeel, P., Coates, A., Quigley, M. & Ng, A. An application of reinforcement learning to aerobatic helicopter flight. Advances In Neural Information Processing Systems. pp. 1-8 (2007)
[2] Almingol, J., Montesano, L. & Lopes, M. Learning multiple behaviors from unlabeled demonstrations in a latent controller space. International Conference On Machine Learning. pp. 136-144 (2013)
[3] Almingol, J. & Montesano, L. Learning multiple behaviours using hierarchical clustering of rewards. 2015 IEEE/RSJ International Conference On Intelligent Robots And Systems (IROS). pp. 4608-4613 (2015)
[4] Andrieu, C., De Freitas, N., Doucet, A. & Jordan, M. An introduction to MCMC for machine learning. Machine Learning. 50, 5-43 (2003)
[5] Babes, M., Marivate, V., Subramanian, K. & Littman, M. Apprenticeship learning about multiple intentions. Proceedings Of The 28th International Conference On Machine Learning (ICML-11). pp. 897-904 (2011)
[6] Bishop, C. Pattern recognition and machine learning. (Springer, 2006)
[7] Blei, D. & Jordan, M. Variational methods for the Dirichlet process. Proceedings Of The Twenty-first International Conference On Machine Learning. pp. 12 (2004)
[8] Celeux, G. The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly. 2 pp. 73-82 (1985)
[9] Choi, J. & Kim, K. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. Advances In Neural Information Processing Systems. pp. 305-313 (2012)
[10] Dimitrakakis, C. & Rothkopf, C. Bayesian multitask inverse reinforcement learning. European Workshop On Reinforcement Learning. pp. 273-284 (2011)
[11] Echraibi, A., Flocon-Cholet, J., Gosselin, S. & Vaton, S. On the Variational Posterior of Dirichlet Process Deep Latent Gaussian Mixture Models. ArXiv Preprint ArXiv:2006.08993. (2020)
[12] Gershman, S. & Blei, D. A tutorial on Bayesian nonparametric models. Journal Of Mathematical Psychology. 56, 1-12 (2012)
[13] Hausman, K., Chebotar, Y., Schaal, S., Sukhatme, G. & Lim, J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. Advances In Neural Information Processing Systems. pp. 1235-1245 (2017)
[14] Hsiao, F., Kuo, J. & Sun, M. Learning a Multi-Modal Policy via Imitating Demonstrations with Mixed Behaviors. ArXiv Preprint ArXiv:1903.10304. (2019)
[15] Jaynes, E. Information theory and statistical mechanics. Physical Review. 106, 620 (1957)
[16] Jin, J., Petrich, L., Dehghan, M., Zhang, Z. & Jagersand, M. Robot eye-hand coordination learning by watching human demonstrations: a task function approximation approach. 2019 International Conference On Robotics And Automation (ICRA). pp. 6624-6630 (2019)
[17] Kingma, D. & Ba, J. Adam: A method for stochastic optimization. ArXiv Preprint ArXiv:1412.6980. (2014)
[18] Levine, S., Popovic, Z. & Koltun, V. Nonlinear inverse reinforcement learning with Gaussian processes. Advances In Neural Information Processing Systems. pp. 19-27 (2011)
[19] Li, Y., Schofield, E. & Gönen, M. A tutorial on Dirichlet process mixture modeling. Journal Of Mathematical Psychology. 91 pp. 128-144 (2019)
[20] Li, Y., Song, J. & Ermon, S. InfoGAIL: Interpretable imitation learning from visual demonstrations. Advances In Neural Information Processing Systems. pp. 3812-3822 (2017)
[21] Lin, J. & Zhang, Z. ACGAIL: Imitation learning about multiple intentions with auxiliary classifier GANs. Pacific Rim International Conference On Artificial Intelligence. pp. 321-334 (2018)
[22] Michini, B. & How, J. Bayesian nonparametric inverse reinforcement learning. Joint European Conference On Machine Learning And Knowledge Discovery In Databases. pp. 148-163 (2012)
[23] Nalisnick, E. & Smyth, P. Stick-Breaking Variational Autoencoders. 5th International Conference On Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. (2017)
[24] Neal, R. Markov chain sampling methods for Dirichlet process mixture models. Journal Of Computational And Graphical Statistics. 9, 249-265 (2000)
[25] Ng, A. & Russell, S. Algorithms for inverse reinforcement learning. ICML. 1 pp. 2 (2000)
[26] Nguyen, Q., Low, B. & Jaillet, P. Inverse reinforcement learning with locally consistent reward functions. Advances In Neural Information Processing Systems. pp. 1747-1755 (2015)
[27] Noothigattu, R., Yan, T. & Procaccia, A. Inverse Reinforcement Learning From Like-Minded Teachers. Manuscript. (2020)
[28] Odom, P. & Natarajan, S. Actively Interacting with Experts: A Probabilistic Logic Approach. Machine Learning And Knowledge Discovery In Databases. pp. 527-542 (2016)
[29] Papamarkou, T., Hinkle, J., Young, M. & Womble, D. Challenges in Bayesian inference via Markov chain Monte Carlo for neural networks. ArXiv Preprint ArXiv:1910.06539. (2019)
[30] Pavlov, D., Popescul, A., Pennock, D. & Ungar, L. Mixtures of conditional maximum entropy models. Proceedings Of The 20th International Conference On Machine Learning (ICML-03). pp. 584-591 (2003)
[31] Rajasekaran, S., Zhang, J. & Fu, J. Inverse Reinforce Learning with Nonparametric Behavior Clustering. ArXiv Preprint ArXiv:1712.05514. (2017)
[32] Ramachandran, D. & Amir, E. Bayesian Inverse Reinforcement Learning. IJCAI. 7 pp. 2586-2591 (2007)
[33] Russell, S. Learning agents for uncertain environments. Proceedings Of The Eleventh Annual Conference On Computational Learning Theory. pp. 101-103 (1998)
[34] Shkurti, F., Kakodkar, N. & Dudek, G. Model-Based Probabilistic Pursuit via Inverse Reinforcement Learning. 2018 IEEE International Conference On Robotics And Automation (ICRA). pp. 7804-7811 (2018)
[35] Sutton, R. & Barto, A. Reinforcement learning: An introduction. (MIT Press, 2018)
[36] Wei, G. & Tanner, M. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal Of The American Statistical Association. 85, 699-704 (1990)
[37] Wei, H., Chen, C., Liu, C., Zheng, G. & Li, Z. Learning to Simulate on Sparse Trajectory Data. Machine Learning And Knowledge Discovery In Databases: Applied Data Science Track. pp. 530-545 (2021)
[38] Wulfmeier, M., Ondruska, P. & Posner, I. Maximum entropy deep inverse reinforcement learning. ArXiv Preprint ArXiv:1507.04888. (2015)
[39] Ziebart, B., Maas, A., Bagnell, J. & Dey, A. Maximum Entropy Inverse Reinforcement Learning. Proceedings Of The 23rd National Conference On Artificial Intelligence - Volume 3. pp. 1433-1438 (2008)
Appendix A
We assume that we have N − 1 demonstrated trajectories with a set of known latent intention vectors z_{¬n} over K intentions. Then, we have a new demonstrated trajectory τ_n, and the task is to obtain the latent intention vector z_n, which can correspond to a new intention K + 1, and to update the reward parameters θ. We are willing to consider a growing/infinite number of intentions. In the case of K intentions, we define a categorical prior distribution over z_{1:N}:

P(z_{1:N} | π) = ∏_{k=1}^{K} π_k^{N_k}   (10)

where N_k is the number of trajectories with intention k and π = (π_1, …, π_K) is the vector of mixing coefficients with the prior distribution:

π ~ Dir(η/K, …, η/K)   (11)

where η is the concentration parameter. The main problematic variables are the mixing coefficients. We marginalize out π:

P(z_{1:N}) = ∫ P(z_{1:N} | π) p(π) dπ = (Γ(η) / Γ(N + η)) ∏_{k=1}^{K} Γ(N_k + η/K) / Γ(η/K)   (12)

Given that:

Γ(x + 1) = x Γ(x)   (13)

we can define the conditional prior over z_n as:

P(z_n = k | z_{¬n}) = (N_k^{¬n} + η/K) / (N − 1 + η)   (14)

where N_k^{¬n} is the number of trajectories with intention k excluding τ_n. By letting K → ∞, we reach:

P(z_n = k | z_{¬n}) = N_k^{¬n} / (N − 1 + η)   (15)

which is the prior probability of assigning the trajectory τ_n to an existing intention k. Since:

Σ_{k=1}^{K} N_k^{¬n} / (N − 1 + η) = (N − 1) / (N − 1 + η)   (16)

we define the remaining probability mass as the prior probability of assigning the trajectory τ_n to a new intention K + 1:

P(z_n = K + 1 | z_{¬n}) = η / (N − 1 + η)   (17)

Equations (15) and (17) are known as the Chinese Restaurant Process [19].
Appendix B
Given the predictive distribution for trajectory τ_n:
(18) 
the following optimization problem can be defined by employing the exchangeability property [12]:
(19) 
The parameters θ can be estimated via Expectation Maximization (EM) [6]. Differentiating the log-likelihood function with respect to θ yields:
(20) 
A standard trick in setting up the EM procedure is to introduce the posterior distribution over the latent intention vector z_n [6]:
(21) 
Now the term under the summation in (20) can be written as:
(22) 
Performing the differentiation of the second term in (22) yields:
(23) 
Therefore (20) results in:
(24) 
which is known as the M-step. The posterior distribution over the latent intention vector can be obtained as:
(25) 
(26) 
and for :
(27) 
which are known as the E-step.
Appendix C
The likelihood ratio for the trajectory τ_n is obtained as:

r = (Z_{θ_{z_n}} / Z_{θ_{z′_n}}) · exp(R_{θ_{z′_n}}(τ_n) − R_{θ_{z_n}}(τ_n))   (28)

with z_n the current and z′_n the proposed intention assignment, and Z_{θ_k} the partition function of intention k.