1 Introduction
In this paper, we provide theoretical properties of meta-learning in a suitable closed-loop control setting. Specifically, we consider a scenario in which there is a sequence of episodes, each of a finite duration. In each episode, the system to be controlled is unknown, different and drawn from a fixed set. The controller knows the cost function but has no other prior information; it has to learn to control on-the-fly and can leverage the experience from previous episodes to improve its learning during a new episode. For this setting we propose and study an online model-based meta-learning control algorithm. It has two levels of learning: an outer learner that continually learns a general model by adapting the model after every new episode of experience, and an inner learner that continually learns a model of the system during an episode by adapting the model proposed by the outer learner. The controller computes the control input for a particular time during an episode by optimizing the cost-to-go using the model estimate provided by the inner learner in place of the actual model in the transition dynamics. The role of the outer learner is to learn a general model so that the adaptations within an episode improve across the episodes, and thus the overall controller is a meta-learning controller.
We are particularly interested in providing online performance guarantees. We adopt the framework of online optimization to analyze the performance of our meta-learning algorithm. Our work draws on foundational results from Online Convex Optimization (OCO) (Shalev-Shwartz & Kakade, 2009), (Shalev-Shwartz et al., 2011). In the OCO framework, the learner encounters a sequence of convex loss functions which are unknown beforehand and may vary arbitrarily over time. The learner updates its estimate of the optimal solution at each time step based on the previous losses and incurs a loss for its updated estimate as given by the loss function for this time step. At the end of each step, either the loss function is revealed, a scenario referred to as full information feedback, or only the experienced loss is revealed, a scenario known as bandit feedback. The objective of the learner is to minimize the loss accumulated over time. The performance of an algorithm in OCO is assessed through the notion of regret, which is the difference between the accumulated loss under the algorithm and that achieved by the best decision in hindsight, which is a single fixed decision in the standard online convex optimization setting.
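The OCO protocol and the notion of regret described above can be illustrated with a minimal sketch. It is not the algorithm of this paper: it assumes scalar decisions in the interval [-1, 1], squared losses f_t(w) = (w - z_t)^2 with full information feedback, and projected online gradient descent with a step size proportional to 1/sqrt(t). The function names are illustrative.

```python
import math


def online_gradient_descent(targets, radius=1.0):
    """Projected online gradient descent on the losses f_t(w) = (w - z_t)^2.

    Assumed setting (for illustration only): scalar decisions in
    [-radius, radius] and full information feedback.
    """
    w = 0.0
    decisions = []
    for t, z in enumerate(targets, start=1):
        decisions.append(w)               # commit to w before f_t is revealed
        grad = 2.0 * (w - z)              # gradient of f_t at the committed decision
        eta = radius / math.sqrt(t)       # step size decaying as 1/sqrt(t)
        w = w - eta * grad                # gradient step
        w = max(-radius, min(radius, w))  # projection onto the feasible interval
    return decisions


def regret(decisions, targets, radius=1.0):
    """Accumulated loss minus the loss of the best fixed decision in hindsight."""
    incurred = sum((w - z) ** 2 for w, z in zip(decisions, targets))
    # For squared losses the best fixed decision is the mean of the targets,
    # clipped to the feasible interval.
    best = max(-radius, min(radius, sum(targets) / len(targets)))
    best_loss = sum((best - z) ** 2 for z in targets)
    return incurred - best_loss
```

For targets in [-1, 1], the standard analysis of projected online gradient descent bounds the regret of this sketch by a constant times the square root of the number of time steps, which is the sublinear growth that the regret criterion is designed to capture.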
Since there are two levels of learning in our meta-learning algorithm, we assess its performance through two notions of regret: (i) the regret for the performance within an episode and (ii) the average regret for the performance across the episodes. Here we measure the performance of the algorithm for an episode by (i) the controller cost for the episode and (ii) the cumulative violation of the control input constraints over the time steps of the episode. The purpose of using the average regret as a measure is to assess the ability of the meta-learner to improve the adaptations, and hence the performance, with more episodes of experience.
1.1 Related Work
Meta-learning research was pioneered by (Schmidhuber, 1987), (Naik & Mammone, 1992), and (Thrun & Pratt, 1998). Recently, there has been a renewed interest in meta-learning, and allied ideas like few-shot learning, motivated by the success of deep learning.
(Andrychowicz et al., 2016), (Li & Malik, 2016), (Ravi & Larochelle, 2016) explored meta-learning in the context of learning to optimize. (Finn et al., 2017) explored the case where the meta-learner uses ordinary gradient descent to update the learner (Model-Agnostic Meta-Learner (MAML)) and showed that this simplified approach can achieve comparable performance. The MAML approach has been subsequently improved and refined by many other works (Nichol & Schulman, 2018; Rajeswaran et al., 2019; Flennerhag et al., 2019). (Duan et al., 2016) and (Wang et al., 2016) both investigated meta-learning in reinforcement-learning domains using traditional Recurrent Neural Network (RNN) architectures, namely Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs). (Mishra et al., 2018) investigated a general meta-learner architecture that combines temporal convolutions and causal attention and showed that this model achieves improved performance.
In the OCO framework, under the full information feedback setting, it is established that the best possible regret scales as $O(\sqrt{T})$ (resp. $O(\log T)$) for convex (resp. strongly convex) loss functions, where $T$ is the number of time steps (Zinkevich, 2003; Hazan et al., 2006; Abernethy et al., 2009). These results have also been extended to constrained online convex optimization where it has been shown that the best regret scales as $O(T^{\max\{c, 1-c\}})$ for the cost and $O(T^{1-c/2})$ for constraint violation, where $c \in (0, 1)$ is a constant (Jenatton et al., 2016; Yuan & Lamperski, 2018).
There are some papers in the machine learning literature that provide online regret analysis for meta-learning algorithms. These works analyse gradient-based meta-learners because of the natural connection between gradient-based learning and online convex optimization.
(Finn et al., 2019) provide a regret bound for a gradient-based online meta-learner under certain smoothness assumptions. (Balcan et al., 2019) extend the OCO setting to a meta-learning setting and provide regret analysis for a gradient-based meta-learning algorithm. They show that the best regret scales as , where represents the number of time steps within an OCO procedure and represents the number of such procedures.
Significant advances have been made in recent years in providing convergence guarantees for standard learning methods in control problems. (Fazel et al., 2018) proved that policy-gradient-based learning converges asymptotically to the optimal policy for the Linear-Quadratic Regulator (LQR) problem. (Zhang et al., 2019) extended this result to the control problem. Recently, (Molybog & Lavaei, 2020) proved asymptotic convergence of a gradient-based meta-learner for the LQR problem. All of these works provide asymptotic convergence guarantees.
Recently, there has also been considerable interest in establishing online learning guarantees in standard control settings. (Dean et al., 2018) provide an algorithm for the Linear-Quadratic control problem with unknown dynamics that achieves a regret of $O(T^{2/3})$. More recently, (Cohen et al., 2019) improved on this result by providing an algorithm that achieves a regret of $O(\sqrt{T})$ for the same problem. (Agarwal et al., 2019a) consider the control problem for a general convex cost and a linear dynamic system with additive adversarial disturbance. They provide an online learning algorithm that achieves an $O(\sqrt{T})$ regret for the cost with respect to the best linear control policy. Agarwal et al. also showed in a subsequent work (Agarwal et al., 2019b) that a polylogarithmic regret is achievable for the same setting when the transition dynamics are known.
We emphasize that the regret analysis we provide is the first of its kind for an online meta-learning control algorithm. In contrast to the recent splendid work on online learning in control settings, our work considers the online meta-learning setting for a control problem with a general convex cost function for an episode and control input constraints. The key difference from similar works is that the regret analysis we provide is with respect to the best policy that satisfies the control input constraints at all times.
1.2 Our Contribution
In this work we propose a model-based meta-learning control algorithm for a general control setting and provide guarantees for its online performance. The key novelty of the algorithm we propose is how the control input is designed so that it is persistently exciting. We show that the proposed algorithm achieves (i) a regret for the controller cost that is for an individual episode of duration with respect to the best policy that satisfies the control input constraints, (ii) an average regret for the controller cost that varies as with the number of episodes , (iii) a regret for constraint violation that is for an episode of duration with respect to the best policy that satisfies the control input constraints and (iv) an average regret for constraint violation that varies as with the number of episodes . Hence we show that the worst regret for the learning within an episode continuously improves with experience of more episodes.
In Section 2 we outline the learning setting. In Section 3 we introduce and briefly discuss the online meta-learning control algorithm. In Section 4 we discuss the inner learner and provide an upper bound for the cumulative error in the model estimation for the inner learner during an episode. In Section 5 we discuss the outer learner and provide an upper bound for the average of the cumulative error in model estimation across the episodes. Finally, in Section 6 we characterize the regret for the controller’s cost and cumulative constraint violation within an episode and the respective averages across the episodes.
2 Problem Setup
2.1 Episodes
The learning setting comprises a sequence of episodes of duration , from to , and the system to be controlled in each episode is an unknown linear deterministic system. We denote the matrices of the system in episode by , and . We denote the state of the system and the control input to the system at time in episode by and . The initial condition is that . Hence the dynamics of the evolution of the state of the system is given by the equation
(1) 
We denote a general controller by . The control cost at time step is a function of the state and the control input generated by the controller and is denoted by . Hence the overall cost for the controller in episode is given by
(2) 
Additionally, it is required that the control input be constrained within a bounded convex polytope given by
(3) 
Assumption 1
The cost function is strictly convex and is Lipschitz continuous, i.e., there exists a constant such that
(4) 
2.2 System Parameterization
The system matrices and are parameterized by , , where the operation denotes the vectorial expansion of the columns of the input, and are constants. In particular we assume that
(5) 
that the parameter where is compact and known, and and are constant matrices of appropriate dimensions. Let . We assume the following.
Assumption 2
, for all .
2.3 Regret
We define the regret for the controller’s cost in a particular episode as the difference between the cost and the overall cost for the best policy that satisfies the control input constraints. Thus, the average regret for episodes is given by
(6) 
Here, the violation of the constraint incurs an additional cost that is proportional to
(7) 
where denotes the th component of a vector. The subscript is a shorthand notation for . This is also the regret for constraint violation with respect to the best policy that satisfies the control input constraints. Thus the average regret for constraint violation for episodes is given by
(8)
The objective is to design a suitable learning controller such that the average regret for both the controller’s cost and constraint violation are minimized.
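The constraint-violation measure above can be illustrated with a short sketch. It assumes, for illustration only, that the convex polytope is written as the set of inputs u satisfying F u <= g and that the subscript shorthand denotes the positive part max{., 0}, as is usual for constraint-violation regret. The names `F`, `g` and the helper functions are hypothetical and are not taken from the paper.

```python
def positive_part(value):
    """Positive part max{value, 0} of a scalar."""
    return max(value, 0.0)


def cumulative_constraint_violation(inputs, F, g):
    """Sum, over time steps and constraint rows, of the positive part of (F u_t - g)_k.

    `inputs` is a list of control vectors u_t (lists of floats); `F` is a list
    of rows and `g` a list of bounds describing the assumed polytope F u <= g.
    """
    total = 0.0
    for u in inputs:
        for row, bound in zip(F, g):
            lhs = sum(f * x for f, x in zip(row, u))  # k-th component of F u_t
            total += positive_part(lhs - bound)       # only violations contribute
    return total


def average_over_episodes(per_episode_values):
    """Average of a per-episode quantity (for example, regret) across episodes."""
    return sum(per_episode_values) / len(per_episode_values)
```

For the scalar box constraint |u| <= 1, encoded as F = [[1.0], [-1.0]] and g = [1.0, 1.0], the inputs 0.5, 1.5 and -2.0 contribute violations of 0, 0.5 and 1.0 respectively.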
3 Structure of the Meta-Learning Control Algorithm
In this section we propose a model-based meta-learning control algorithm for the learning setting described above. The overall meta-learning control architecture is shown in Fig. 1. The overall controller comprises an outer learner and an inner learner. The outer learner learns a general model parameter by continually adapting it following new episodes of experience. The inner learner learns an estimate of the parameter of the model of the system during an episode by continually adapting the suggestion by the outer learner as more observations of the state transitions are made. The outer learner learns the general model parameter such that the learning within an episode continuously improves with experience of more episodes.
We denote the outer learner by and the inner learner by . We denote the output of in episode by and the output of , the estimate of the model parameter for time in episode by . At time the controller computes a control input by solving an optimization problem whose objective is the cost-to-go for the remaining horizon of the episode using the parameter estimate as a substitute for the parameter of the system model in the transition dynamics.
Denote the state of the system at time by . Let , where is a sequence of control inputs from to . So the control input is computed by solving the following optimization:
(9) 
and the control input is set by . The first line of the optimization problem in Eq. (9) is the objective function which is equal to the cost-to-go for the remaining horizon of the episode. The second line is the state evolution constraint and the control input constraint for all time steps. We note that if the model estimate is exact the state sequence will be the optimal trajectory because the control input computed by solving Eq. (9) at time will be equal to the policy computed via dynamic programming.
The final control input is slightly different from the control input computed by solving Eq. (9). This is the second key part of the control design. The final control input is computed by applying a perturbation on the control input computed by solving Eq. (9). This perturbation is applied to ensure that the sequence of control inputs generated by the controller is persistently exciting. We formally define the persistence of excitation and discuss how the control input computed by solving Eq. (9) is modified later.
4 Inner Learner
In this section we discuss the algorithm and the structure of the inner learner and establish an upper bound for the cumulative error in the model parameter estimated by the inner learner during an episode. We adopt the framework of OCO discussed in Section 1 to design and analyse the inner learner.
Given a model parameter the prediction loss for time step is given by
(10) 
Because the state is measurable at this loss is computable for a parameter . We set this as the loss function for the inner learner for time step . The inner learner updates its overall loss function at time to by adding the loss function to the overall loss function of the previous time step, . Thus
(11) 
The function is well defined as an overall loss function for the inner learner because (i) it is measurable and (ii) is convex in the model parameters. The inner learner computes an estimate of the system’s model parameters at time by minimizing the updated loss function plus a regularizer , where the parameter is the output of the outer learner at the beginning of the current episode. This optimization specifies how the inner learner adapts the general model proposed by the outer learner. We discuss later how is updated by the outer learner. Thus the inner learner computes by solving the following optimization problem:
(12) 
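The regularized estimator can be illustrated with a minimal sketch. Since the exact loss and regularizer are not reproduced here, the sketch makes its own assumptions: a scalar system x_{t+1} = theta * x_t + u_t with a known input gain, a squared prediction loss, a quadratic regularizer lam * (theta - phi)^2 centered on the outer learner's suggestion phi, and an interval as the compact parameter set. Under these assumptions the minimizer has a closed form. All names are illustrative.

```python
def simulate(theta_true, inputs, x0=1.0):
    """Roll out the assumed scalar system x_{t+1} = theta_true * x_t + u_t."""
    states = [x0]
    for u in inputs:
        states.append(theta_true * states[-1] + u)
    return states


def inner_estimate(states, inputs, phi, lam, bounds=(-2.0, 2.0)):
    """Minimize sum_k (x_{k+1} - theta*x_k - u_k)^2 + lam*(theta - phi)^2 over an interval.

    `phi` plays the role of the outer learner's suggestion and `lam` > 0 is the
    regularization weight; both the loss and the regularizer are assumptions.
    """
    numerator = lam * phi
    denominator = lam
    for x, x_next, u in zip(states, states[1:], inputs):
        numerator += x * (x_next - u)   # correlation of the regressor with the target
        denominator += x * x            # energy of the regressor
    theta = numerator / denominator     # unconstrained minimizer of the quadratic
    lower, upper = bounds
    return min(max(theta, lower), upper)  # 1-D projection gives the constrained minimizer
```

With no observations the estimate equals phi, and as transitions accumulate the estimate moves from phi towards the parameter that explains the data, which is the sense in which the inner learner adapts the general model within an episode.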
Let . We define the cumulative loss for the inner learner till time within an episode by
(13) 
Hence the regret for the loss incurred by the inner learner is given by
(14) 
The final equality follows from the fact that . This definition of cumulative loss and regret coincides with the cumulative loss and regret for a learning algorithm in the OCO framework where is the overall loss function for . Hence we can apply standard results from OCO to derive an upper bound for . In the following lemma we establish the regret for the loss incurred by the inner learner.
Lemma 1
Consider the estimator given in Eq. (12). Suppose is the system model for the th episode, , . Then there exist such that
Please see the Appendix for the proof. The proof involves (i) showing that is convex and Lipschitz, and (ii) applying a standard result from online convex optimization for the regularized estimator of the type in Eq. (12) (Theorem 2.11, (Shalev-Shwartz et al., 2011)). Here we note that the regret can grow exponentially if the norm of the state grows exponentially.
4.1 Cumulative Error in Parameter Estimation
The cumulative error in the estimation of the model parameters for the duration of an episode is given by
(15) 
To extend the regret bound derived in Lemma 1 to an upper bound for the cumulative error in the estimation of the model parameters we require an additional condition on the state and control input sequence and of an episode. This condition is a slight variation of the standard persistence of excitation. Let
(16) 
Definition 1
We say that the sequence and satisfies persistence of excitation if there exists a such that
(17) 
, and , , where .
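The condition in Definition 1 is described as a slight variation of the standard persistence of excitation, so the following sketch illustrates only the standard windowed test and not the exact condition in Eq. (17). It assumes a scalar state and a scalar input, stacks them into the regressor z_k = (x_k, u_k), and checks that the Gram matrix of every window of regressors has its smallest eigenvalue bounded below. The names `window` and `gamma` are hypothetical.

```python
import math


def min_eigenvalue_2x2(a, b, c):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)


def is_persistently_exciting(states, inputs, window, gamma):
    """Standard windowed excitation test for scalar states and inputs.

    Returns True when every block of `window` consecutive regressors
    z_k = (x_k, u_k) has a Gram matrix sum_k z_k z_k^T whose smallest
    eigenvalue is at least `gamma`.
    """
    regressors = list(zip(states, inputs))
    if len(regressors) < window:
        return False
    for start in range(len(regressors) - window + 1):
        a = b = c = 0.0
        for x, u in regressors[start:start + window]:
            a += x * x   # entry (1, 1) of the Gram matrix
            b += x * u   # off-diagonal entry
            c += u * u   # entry (2, 2)
        if min_eigenvalue_2x2(a, b, c) < gamma:
            return False
    return True
```

A sequence in which the input is always proportional to the state fails the test for any positive threshold, because the Gram matrix then has rank one; this is the situation that a perturbation of the control input is meant to rule out.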
For provability, we modify the inner learner given in Eq. (12) as follows. Here, unlike Eq. (12), the estimation is updated only after every time steps as given below.
(18) 
So the estimate by the inner learner remains constant and is equal to for the duration to , where is the minimizer of the loss function for . Given the result in Lemma 1, and provided the persistence of excitation condition (Eq. (17)) is satisfied, the following result on the cumulative error can be established.
Theorem 1
Please see the Appendix for the proof. The use of a strongly convex regularizer in the optimization (Eq. (18)) of the learner’s strategy restricts the upper bound of the cumulative error in parameter estimation to the form that is a constant times , where s are the Lipschitz constants of the loss functions w.r.t , plus an additional error, the first term in the upper bound, that arises from the inclusion of this regularizer itself in the optimization of the learner’s strategy. We refer the reader to (Theorem 2.11, (Shalev-Shwartz et al., 2011)) for a more detailed discussion.
From Theorem 1 it follows that the best upper bound for the cumulative error in estimation is and this is achieved when . And so for this value of the parameter estimation improves with time during an episode. We formally establish later that this results in a sublinear growth of the regret for the controller’s cost and cumulative constraint violation during an episode. We note that the upper bound for the cumulative error in estimation presented above is a function of the output of the outer learner . Such a characterisation is useful because it allows us to characterize the effect of the meta-learning on the average regret for the overall learner and the controller’s cost across the episodes.
5 Outer Learner
Here we discuss the outer learner and formally establish how the meta-updates provided by the outer learner continually improve, with experience of more episodes, the average of the cumulative error in the estimation of the model parameters for an episode, where the average is calculated across the episodes experienced so far. As for the inner learner, we adopt the framework of OCO discussed in Section 1 to design and analyse the outer learner.
We set the loss function for the outer learner for the th episode as , which is the norm of the difference between the best parameter estimate at the end of episode and the parameter of the general model . We note that this is the same loss function used in (Balcan et al., 2019) for the outer learner. We denote the overall loss function for the outer learner at the end of episode which is the sum of the loss functions for the individual episodes till episode by . Thus
(20) 
The outer learner updates to after episode by minimizing . Thus
(21) 
From the regret bound derived in Theorem 1 it follows that this form of the update directly minimizes the sum of the regret bound for the inner learner across the episodes in hindsight. If is convex the solution to Eq. (21) is given by
(22) 
Such an update is called Follow the Average Leader (FAL). The FAL strategy is desirable as a meta-update for meta-learning because it is the equivalent of Follow the Leader (FTL) for a regular online learner, which achieves a logarithmic regret for a general sequence of strongly convex loss functions.
As an aside, the update by the FAL strategy can also be written as
(23) 
We point out that this version of the meta-update given in Eq. (23) is equivalent to the approximate gradient-based meta-learner proposed in (Nichol & Schulman, 2018) called Reptile.
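The two forms of the FAL meta-update can be sketched as follows, assuming (as the name suggests) that the update is the running average of the best parameter estimates from the episodes seen so far, and that the incremental form moves the current general parameter towards the newest estimate with a step of one over the number of episodes. The function names and the indexing convention are illustrative and may differ from Eq. (22) and Eq. (23).

```python
def fal_batch(best_params):
    """Average of the per-episode best parameter vectors seen so far."""
    count = len(best_params)
    dim = len(best_params[0])
    return [sum(p[d] for p in best_params) / count for d in range(dim)]


def fal_incremental(phi, new_param, episodes_seen):
    """Incremental form: phi <- phi + (new_param - phi) / episodes_seen.

    `episodes_seen` counts the episode that produced `new_param`, so the first
    call (episodes_seen = 1) simply replaces phi with new_param.
    """
    step = 1.0 / episodes_seen
    return [p + step * (q - p) for p, q in zip(phi, new_param)]


def run_incremental(best_params, phi_init):
    """Apply the incremental update episode by episode."""
    phi = list(phi_init)
    for episode, param in enumerate(best_params, start=1):
        phi = fal_incremental(phi, param, episode)
    return phi
```

Under these assumptions both forms produce the same general parameter, and the incremental form has the structure of a Reptile-style step: an interpolation between the current initialization and the parameters adapted on the latest episode.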
In the theorem we present next we show at what rate the meta-learning improves the average of the cumulative error in the model parameter estimation for an episode with more experience. We denote this average by :
(24) 
Theorem 2
The proof is similar to the proof for the average regret for the controller’s cost that we present later, and so we do not provide a proof for this result. Not surprisingly, the bound established here under the assumption that the state of the system remains bounded in all the episodes is similar to the bound established in (Balcan et al., 2019). The scaling has the factor because even after the meta-updates converge each episode can still incur a regret of . Most importantly, the scaling shows that the average cumulative error reduces at the rate of as increases, which is a result of the FAL strategy for the meta-update (Eq. (23)). This suggests that the worst regret for learning within an episode continuously improves with experience of more episodes.
6 Controller Performance
In this section we establish a bound for the regret for the overall cost incurred by the controller in an episode, the overall constraint violation in an episode, and the average regret for each of them over a finite number of episodes, leveraging the results derived in Section 4 and applying an analysis similar to that in Section 5. Through these results we show that the overall controller with the meta-learner is able to continuously improve its performance with experience of more episodes.
The control input calculation at time entails the following two steps: (i) computing an intermediate control input by solving the optimization in Eq. (9) and setting the solution for the control input for time as the intermediate control input (ii) perturbing the intermediate control input calculated as above so that the persistence of excitation condition is satisfied. This perturbation is needed because without the perturbation there is no guarantee that the control sequence computed by solving a sequence of optimizations given by Eq. (9) will generate a sequence , that satisfies the excitation condition in Eq. (17).
Let the final control input computed by the controller be given by . Denote the perturbation applied to by . Let . Then for the controller we propose is given by
(26) 
is the solution of Eq. (9) and is given by
(27) 
We prove as part of the proof of the theorem we present next that the sequence of control inputs is persistently exciting. Please see the proof for the detailed argument. The key idea here is to perturb just enough so that the persistence of excitation is satisfied while the additional cost of the perturbation does not make the regret for the controller cost and the cumulative constraint violation grow more than sublinearly w.r.t the horizon of an episode. We denote the Lipschitz constant of , which is the solution of Eq. (9), w.r.t the state and the model parameter used in the optimization by . We prove as part of the proof for the next theorem that such a constant exists.
We denote the total cost for a general controller in episode by and so
(28) 
We introduce the following definitions. Let
(29) 
If and are the state and control input sequence of an episode when the system model is known and is computed by solving Eq. (9) with itself as the parameter estimate then it follows that the sequence corresponds to the response to the best policy that satisfies the control input constraints and is the sequence computed by the best policy that satisfies the control input constraints. Hence the regret for the cost and the regret for the cumulative constraint violation are the regret w.r.t the best policy that satisfies the control input constraints. In the theorem we present next we provide an upper bound for the regret for the controller cost and cumulative constraint violation for an episode w.r.t this policy when the controller sets as the control input at time .
Theorem 3
Consider the sequence of model adaptations, , given by Eq. (18). Suppose the pair is reachable, , the control input for each time step is the solution of Eq. (9) and modified as in Eq. (26), , the control input and state sequences are given by and , and , are the control and state sequence when the system model is known, and when is the solution of Eq. (9) solved with itself as the parameter estimate. Then there exist constants such that
Please see the Appendix for the proof. The proof entails (i) showing that the regret is bounded above by plus a term that is , where the term arises because of the perturbation by , (ii) showing that the sequence of control inputs is persistently exciting and (iii) applying the bound for from Theorem 1, which requires the persistence of excitation to hold.
The condition is required for establishing our result. This condition can be interpreted as requiring that the closed loop system with the feedback control remains bounded for any bounded perturbation to . The first two terms arise from the fact that the cumulative deviation of the intermediate control inputs computed during an episode from the control sequence corresponding to the best control policy that satisfies the control input constraints is bounded by . This is because the magnitude of how much differs from for the same state is not more than the Lipschitz constant times the magnitude of the error in the model parameter estimate . The third term arises because of the perturbation , which is bounded by , applied on .
The key things to note in the above theorem are the constants and of the algorithm, whose scaling can be chosen so that the order of the regret is optimized. We note that the constants and appear in the numerator of one term and the denominator of another term. Hence there is a trade-off in determining their scaling. The following corollary states the scaling for these parameters that achieves the best regret for the control algorithm proposed in this work.
Corollary 1
Consider the setting in Thm. 3. The best regret is achieved when and for this parameter setting
Please see the Appendix for the proof. We observe that the best scaling that the proposed algorithm can achieve for the regret for the controller cost is and for constraint violation is . The additional factor of in the regret scaling for the controller’s cost and cumulative constraint violation compared to the regret for the inner learner loss arises because of the perturbation that is applied to . We emphasize that the control constraints are violated only because of the perturbation.
The characterization of the regret for the controller’s cost and cumulative constraint violation provided in Theorem 3 for an episode can be extended to characterize the effect of the meta-updates provided by the outer learner on the respective average regrets, as we did for the overall learner. Denote the average regret for the cost and the average regret for cumulative constraint violation across the episodes by and respectively. Then
(30) 
In the next theorem we establish a bound for the average regret of the meta-learning control algorithm when the meta-learner update is given by Eq. (23).
Theorem 4
Please see the Appendix for the proof. Similar to the result for the average cumulative error for model parameter estimation, here too we note that the average regret for the cost has a factor . This is because even after the meta-updates converge the regret within a particular episode can grow as . The result clearly establishes that the meta-updates provided by the outer learner result in an average regret for the controller’s cost and cumulative constraint violation that vary as with , suggesting that the worst regret for the performance in an episode continuously improves with experience of more episodes.
7 Conclusion
In this work we proposed a model-based meta-learning control algorithm for an iterative control setting, where in each iteration the system to be controlled is different and unknown and the control objective is a general convex cost function with general control input constraints. We emphasize that this is the first work that provides provable guarantees for the regret for the performance of an online meta-learning control algorithm in a suitable control setting. We proved that the proposed algorithm achieves regret for the controller cost for an episode of duration with respect to the best policy that satisfies the control input constraints and an average of this regret across the iterations that varies as with being the number of iterations. We also proved that the proposed algorithm achieves regret for an episode of duration with respect to the best policy that satisfies the control input constraints and an average of this regret across given iterations that varies as . Hence we established that the meta-learning control algorithm continually improves its performance within an episode with experience of more episodes.
References
 Abernethy et al. (2009) Abernethy, J., Agarwal, A., and Bartlett, P. L. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
 Agarwal et al. (2019a) Agarwal, N., Bullins, B., Hazan, E., Kakade, S., and Singh, K. Online control with adversarial disturbances. In International Conference on Machine Learning, pp. 111–119, 2019a.
 Agarwal et al. (2019b) Agarwal, N., Hazan, E., and Singh, K. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pp. 10175–10184, 2019b.
 Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp. 3981–3989, 2016.
 Balcan et al. (2019) Balcan, M.-F., Khodak, M., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, 2019.
 Cohen et al. (2019) Cohen, A., Koren, T., and Mansour, Y. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pp. 1300–1309, 2019.
 Dean et al. (2018) Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pp. 4188–4197, 2018.
 Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
 Fazel et al. (2018) Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pp. 1467–1476, 2018.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135, 2017.
 Finn et al. (2019) Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. In International Conference on Machine Learning, pp. 1920–1930, 2019.
 Flennerhag et al. (2019) Flennerhag, S., Rusu, A. A., Pascanu, R., Visin, F., Yin, H., and Hadsell, R. Meta-learning with warped gradient descent. In International Conference on Learning Representations, 2019.
 Green & Moore (1986) Green, M. and Moore, J. B. Persistence of excitation in linear systems. Systems & control letters, 7(5):351–360, 1986.

 Hazan et al. (2006) Hazan, E., Kalai, A., Kale, S., and Agarwal, A. Logarithmic regret algorithms for online convex optimization. In International Conference on Computational Learning Theory, pp. 499–513. Springer, 2006.
 Jenatton et al. (2016) Jenatton, R., Huang, J., and Archambeau, C. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pp. 402–411, 2016.
 Klatte & Kummer (1985) Klatte, D. and Kummer, B. Stability properties of infima and optimal solutions of parametric optimization problems. In Nondifferentiable Optimization: Motivations and Applications, pp. 215–229. Springer, 1985.
 Li & Malik (2016) Li, K. and Malik, J. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
 Mishra et al. (2018) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
 Molybog & Lavaei (2020) Molybog, I. and Lavaei, J. Global convergence of MAML for LQR. arXiv preprint arXiv:2006.00453, 2020.
 Moore (1983) Moore, J. Persistence of excitation in extended least squares. IEEE Transactions on Automatic Control, 28(1):60–68, 1983.

 Naik & Mammone (1992) Naik, D. K. and Mammone, R. J. Meta-neural networks that learn by learning. In [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, volume 1, pp. 437–442. IEEE, 1992.
 Nichol & Schulman (2018) Nichol, A. and Schulman, J. Reptile: a scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999, 2(3):4, 2018.
 Rajeswaran et al. (2019) Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pp. 113–124, 2019.
 Ravi & Larochelle (2016) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. 2016.
 Schmidhuber (1987) Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
 Shalev-Shwartz & Kakade (2009) Shalev-Shwartz, S. and Kakade, S. M. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, pp. 1457–1464, 2009.
 Shalev-Shwartz et al. (2011) Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
 Thrun & Pratt (1998) Thrun, S. and Pratt, L. Learning to learn: Introduction and overview. In Learning to Learn, pp. 3–17. Springer, 1998.
 Wang et al. (2016) Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
 Yuan & Lamperski (2018) Yuan, J. and Lamperski, A. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, pp. 6137–6146, 2018.
 Zhang et al. (2019) Zhang, K., Hu, B., and Basar, T. Policy optimization for linear control with robustness guarantee: Implicit regularization and global convergence. arXiv preprint arXiv:1910.09496, 2019.
 Zinkevich (2003) Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pp. 928–936, 2003.
Appendix A Proof of Lemma 1
For a sequence of loss functions , we define a strategy called follow the regularized leader (FTRL) as follows:
(31) 
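As a rough illustration of the FTRL strategy, the sketch below treats one simple special case. It is not the setting of this paper: the linear losses f_t(w) = <g_t, w>, the quadratic regularizer R(w) = ||w||^2 / (2*eta), the unconstrained domain, and all function and variable names are assumptions chosen for illustration. In this case the FTRL minimizer has the closed form w_t = -eta * (g_1 + ... + g_{t-1}), and the regret against any fixed u is known to be at most ||u||^2 / (2*eta) + eta * sum_t ||g_t||^2 (Shalev-Shwartz et al., 2011).

```python
import random


def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def ftrl_quadratic(gradients, eta):
    """Unconstrained FTRL on linear losses f_t(w) = <g_t, w>.

    Assumed regularizer (illustrative, not taken from the paper):
    R(w) = ||w||^2 / (2 * eta). The minimizer of
    sum_{s<t} <g_s, w> + R(w) is then w_t = -eta * sum_{s<t} g_s.
    Returns the list of iterates w_1, ..., w_T.
    """
    grad_sum = [0.0] * len(gradients[0])
    iterates = []
    for g in gradients:
        # Play the regularized leader computed from past losses only.
        iterates.append([-eta * s for s in grad_sum])
        # The loss for this round is then revealed and accumulated.
        grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
    return iterates


def regret(gradients, iterates, u):
    """Cumulative loss of the iterates minus that of the fixed point u."""
    return sum(dot(g, w) - dot(g, u) for g, w in zip(gradients, iterates))


def regret_bound(gradients, u, eta):
    """Known bound for this special case: ||u||^2/(2 eta) + eta * sum ||g_t||^2."""
    return dot(u, u) / (2 * eta) + eta * sum(dot(g, g) for g in gradients)


# Hypothetical demo data: random gradients in three dimensions.
random.seed(0)
demo_gradients = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
demo_u = [0.5, -0.3, 0.2]
demo_eta = 0.05
demo_iterates = ftrl_quadratic(demo_gradients, demo_eta)
demo_regret = regret(demo_gradients, demo_iterates, demo_u)
demo_bound = regret_bound(demo_gradients, demo_u, demo_eta)
print(demo_regret, demo_bound)
```

The step size eta trades off the two terms of the bound: a larger eta weakens the regularizer and lets the iterates move faster, at the cost of a larger gradient-dependent term.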
First, we present the following standard theorem from online convex optimization.
Lemma 2
Consider a sequence of loss functions , that are convex and Lipschitz w.r.t . Then the regret for FTRL, , is given by
(32) 
where
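For orientation, Theorem 2.11 of Shalev-Shwartz et al. (2011) can be stated generically as follows. The symbols are those of that reference and are an assumption here; they may differ from the notation used in this paper. Let $f_1, \dots, f_T$ be convex, with $f_t$ being $L_t$-Lipschitz with respect to a norm $\|\cdot\|$, let $L$ satisfy $\frac{1}{T}\sum_{t=1}^{T} L_t^2 \le L^2$, and let the regularizer $R$ be $\sigma$-strongly convex with respect to the same norm over the feasible set $S$. Then FTRL satisfies, for every $u \in S$:

```latex
\[
\mathrm{Regret}_T(u)
  \;=\; \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(u)
  \;\le\; R(u) - \min_{v \in S} R(v) + \frac{T L^2}{\sigma}.
\]
```

The bound separates into a regularization term, which depends only on the comparator $u$, and a stability term, which grows with the Lipschitz constants of the losses.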
See Theorem 2.11 of Shalev-Shwartz et al. (2011) for the proof. The final result is a straightforward application of Lemma 2. First, we note that the function , where is convex w.r.t. . Second,
(33) 
By applying the triangle inequality twice, then the Cauchy-Schwarz inequality, and using the fact that for a given matrix , we get
(34) 
where . Given that . This implies that the Lipschitz constant of is w.r.t .
Appendix B Proof of Theorem 1
Let the model parameter be and