Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. Their recursive update rules can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.
In engineering, for instance, many optimization problems are of this type: a mathematical model of the system is unavailable (the system may be too complex), but its behavior can still be optimized by adjusting certain parameters, evaluating the system's performance at given parameter values through experiments or simulations. Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.^{[1]} These methods are used often in statistics and machine learning, which typically need to handle noisy measurements of empirical data. They are also related to stochastic optimization methods and algorithms.
In a nutshell, stochastic approximation algorithms deal with a function of the form f(θ) = E_ξ[F(θ, ξ)], i.e. the expected value of a function F depending on a random variable ξ. The goal is to recover properties of the function f without evaluating it directly. Instead, stochastic approximation algorithms use random samples of F(θ, ξ) to efficiently approximate properties of f such as zeros or extrema.
The earliest, and prototypical, algorithms of this kind are the Robbins–Monro and Kiefer–Wolfowitz algorithms introduced respectively in 1951 and 1952.
Robbins–Monro algorithm
The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro,^{[2]} presented a methodology for solving a root-finding problem, where the function is represented as an expected value. Assume that we have a function M(θ), and a constant α, such that the equation M(θ) = α has a unique root at θ*. It is assumed that while we cannot directly observe the function M(θ), we can instead obtain measurements of the random variable N(θ), where E[N(θ)] = M(θ). The structure of the algorithm is to then generate iterates of the form:

θ_{n+1} = θ_n − a_n (N(θ_n) − α)
Here, a_1, a_2, … is a sequence of positive step sizes. Robbins and Monro proved^{[2], Theorem 2} that θ_n converges in L² (and hence also in probability) to θ*, and Blum^{[3]} later proved that the convergence actually holds with probability one, provided that:
 N(θ) is uniformly bounded,
 M(θ) is nondecreasing,
 M′(θ*) exists and is positive, and
 The sequence a_n satisfies the following requirements:

∑_{n=0}^∞ a_n = ∞ and ∑_{n=0}^∞ a_n² < ∞
A particular sequence of steps which satisfies these conditions, and was suggested by Robbins and Monro, has the form a_n = a/n, for a > 0. Other series are possible, but in order to average out the noise in N(θ), the above conditions must be met.
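To make the recursion concrete, here is a minimal Python sketch (not from the original paper) of the update θ_{n+1} = θ_n − a_n (N(θ_n) − α) with step sizes a_n = a/n; the linear measurement function and its noise level below are hypothetical choices for illustration.

```python
import random

def robbins_monro(measure, alpha, theta0, a=1.0, n_iter=100_000):
    """Iterate theta_{n+1} = theta_n - a_n * (N(theta_n) - alpha)
    with the classic Robbins-Monro step sizes a_n = a / n."""
    theta = theta0
    for n in range(1, n_iter + 1):
        theta -= (a / n) * (measure(theta) - alpha)
    return theta

# Hypothetical example: M(theta) = 2*theta - 3, observed with Gaussian noise.
# The unique root of M(theta) = 0 is theta* = 1.5.
random.seed(0)
N = lambda theta: 2 * theta - 3 + random.gauss(0, 1)
print(robbins_monro(N, alpha=0.0, theta0=0.0))  # close to 1.5
```

The steps a_n = a/n satisfy both summability conditions, so the iterate homes in on the root even though every single measurement is noisy.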
Complexity results
 If f(θ) is twice continuously differentiable and strongly convex, and the minimizer of f(θ) belongs to the interior of Θ, then the Robbins–Monro algorithm will achieve the asymptotically optimal convergence rate, with respect to the objective function, of E[f(θ_n)] − f* = O(1/n), where f* is the minimal value of f(θ) over θ ∈ Θ.^{[4]}^{[5]}
 Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin^{[6]} have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is O(1/√n). They have also proven that this rate cannot be improved.
Subsequent developments and Polyak–Ruppert averaging
While the Robbins–Monro algorithm is theoretically able to achieve O(1/n) under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is primarily because the algorithm is very sensitive to the choice of the step size sequence, and the supposed asymptotically optimal step size policy can be quite harmful in the beginning.^{[5]}^{[7]}
Chung^{[8]} (1954) and Fabian^{[9]} (1968) showed that the optimal convergence rate O(1/√n) would be achieved with a_n = ∇²f(θ*)^{−1}/n (or a_n = 1/(n M′(θ*))). Lai and Robbins^{[10]}^{[11]} designed adaptive procedures to estimate M′(θ*) such that θ_n has minimal asymptotic variance. However, the application of such optimal methods requires a great deal of a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak^{[12]} (1991) and Ruppert^{[13]} (1988) independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky^{[14]} also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps and averaging of the iterates. The algorithm would have the following structure:

θ_{n+1} = θ_n − a_n (N(θ_n) − α),  θ̄_n = (1/n) ∑_{i=0}^{n−1} θ_i

The convergence of θ̄_n to the unique root θ* relies on the condition
A1) a_n → 0, (a_n − a_{n+1})/a_n = o(a_n)
Therefore, the sequence a_n = n^{−α} with 0 < α < 1 satisfies this restriction, but α = 1 does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate O(1/√n), yet with a more robust step size policy.^{[14]} Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin^{[15]} for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate O(1/√n).
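As an illustrative sketch (with a hypothetical noisy linear measurement, not an example from the literature), Polyak–Ruppert averaging can be written in a few lines of Python: the iterate θ_{n+1} = θ_n − a_n (N(θ_n) − α) uses the longer steps a_n = n^{−2/3}, and the estimate returned is the running average of the iterates.

```python
import random

def polyak_ruppert(measure, alpha, theta0, exponent=2 / 3, n_iter=100_000):
    """Robbins-Monro iterate with longer steps a_n = n^(-exponent),
    0 < exponent < 1, combined with averaging of the iterates."""
    theta, avg = theta0, 0.0
    for n in range(1, n_iter + 1):
        theta -= n ** (-exponent) * (measure(theta) - alpha)
        avg += (theta - avg) / n  # running mean of theta_1, ..., theta_n
    return avg

# Hypothetical example: M(theta) = 2*theta - 3 observed with Gaussian noise;
# the root of M(theta) = 0 is theta* = 1.5.
random.seed(1)
N = lambda theta: 2 * theta - 3 + random.gauss(0, 1)
print(polyak_ruppert(N, alpha=0.0, theta0=0.0))  # close to 1.5
```

The raw iterate fluctuates more than a 1/n-step iterate would, but averaging smooths those fluctuations out, which is exactly what makes the step size policy robust.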
A more general result is given in Chapter 11 of Kushner and Yin^{[16]} by defining the interpolated time t_n = ∑_{i=0}^{n−1} a_i, the interpolated process θ^n(·) and the interpolated normalized process U^n(·) as

θ^n(t) = θ_{n+i}, U^n(t) = (θ_{n+i} − θ*)/√(a_{n+i}) for t ∈ [t_{n+i} − t_n, t_{n+i+1} − t_n), i ≥ 0.

Let the iterate average be Θ_n = (a_n/t) ∑_{i=n}^{n+t/a_n−1} θ_i and the associated normalized error be Û^n(t) = (√(a_n)/t) ∑_{i=n}^{n+t/a_n−1} (θ_i − θ*).
With assumption A1) and the following A2)

A2) There is a Hurwitz matrix A and a symmetric and positive-definite matrix Σ such that {U^n(·)} converges weakly to U(·), where U(·) is the stationary solution to

dU = A U dt + Σ^{1/2} dw,

where w(·) is a standard Wiener process,

satisfied, define V̄ = A^{−1} Σ (A^T)^{−1}. Then for each t,

Û^n(t) →_d N(0, V_t), where V_t = V̄/t + O(1/t²).
The success of the averaging idea stems from the time-scale separation between the original sequence and the averaged sequence, with the time scale of the former being faster.
Application in stochastic optimization
Suppose we want to solve the following stochastic optimization problem

g(θ*) = min_{θ∈Θ} E[Q(θ, X)],

where g(θ) = E[Q(θ, X)] is differentiable and convex; the problem is then equivalent to finding the root θ* of ∇g(θ) = 0.
Here H(θ, X) is an unbiased estimator of ∇g(θ). If X depends on θ, there is in general no natural way of generating a random outcome H(θ, X) that is an unbiased estimator of the gradient. In some special cases, when either IPA or likelihood ratio methods are applicable, one is able to obtain an unbiased gradient estimator H(θ, X). If X is viewed as some "fundamental" underlying random process that is generated independently of θ, and under some regularization conditions for derivative-integral interchange operations so that E[(∂/∂θ) Q(θ, X)] = ∇g(θ), then H(θ, X) = (∂/∂θ) Q(θ, X) gives the fundamental unbiased gradient estimate. However, for some applications we have to use finite-difference methods, in which H(θ, X) has a conditional expectation close to ∇g(θ) but not exactly equal to it.
We then define a recursion analogously to Newton's method in the deterministic algorithm:

θ_{n+1} = θ_n − ε_n H(θ_n, X_{n+1})
Convergence of the algorithm
The following result gives sufficient conditions for the algorithm to converge:^{[17]}

C1) ε_n ≥ 0, ∀ n ≥ 0.

C2) ∑_{n=0}^∞ ε_n = ∞

C3) ∑_{n=0}^∞ ε_n² < ∞

C4) |X_n| ≤ B for a fixed bound B.

C5) g(θ) is strictly convex, i.e.

inf_{δ ≤ |θ − θ*| ≤ 1/δ} ⟨θ − θ*, ∇g(θ)⟩ > 0 for every 0 < δ < 1.

Then θ_n converges to θ* almost surely.
Here are some intuitive explanations of these conditions. Suppose H(θ_n, X_{n+1}) is a uniformly bounded random variable. If C2) is not satisfied, i.e. ∑ ε_n < ∞, then

θ_n − θ_0 = −∑_{i=0}^{n−1} ε_i H(θ_i, X_{i+1})

is a bounded sequence, so the iteration cannot converge to θ* if the initial guess θ_0 is too far away from θ*.
Example (where the stochastic gradient method is appropriate)^{[7]}
Suppose Q(θ, X) = f(θ) + θᵀX, where f is differentiable and X ∈ ℝᵖ is a random variable independent of θ. Then g(θ) = E[Q(θ, X)] = f(θ) + θᵀE[X] depends on the mean of X, and the stochastic gradient method would be appropriate in this problem. We can choose H(θ, X) = ∇Q(θ, X) = ∇f(θ) + X.
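A Python sketch of this example (the concrete f, the dimension, and the distribution of X below are hypothetical illustration choices): with Q(θ, X) = f(θ) + θᵀX and H(θ, X) = ∇f(θ) + X, the recursion θ_{n+1} = θ_n − ε_n H(θ_n, X_{n+1}) with ε_n = 1/n satisfies conditions C1)–C3).

```python
import random

def grad_f(theta):
    # Hypothetical f(theta) = (theta - 4)^2 in one dimension, so grad f = 2*(theta - 4).
    # Since E[X] = 0 below, g(theta) = f(theta) + theta * E[X] is minimized at theta* = 4.
    return 2 * (theta - 4)

def stochastic_gradient(theta0, n_iter=100_000):
    """theta_{n+1} = theta_n - eps_n * H(theta_n, X_{n+1}),
    with H(theta, X) = grad_f(theta) + X and steps eps_n = 1/n."""
    theta = theta0
    for n in range(1, n_iter + 1):
        X = random.gauss(0, 1)  # X independent of theta, E[X] = 0
        theta -= (1 / n) * (grad_f(theta) + X)
    return theta

random.seed(2)
print(stochastic_gradient(0.0))  # close to 4
```

Each H(θ, X) is an unbiased but noisy gradient, and the decaying steps ε_n = 1/n average the noise away while still moving the iterate all the way to the minimizer.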
Kiefer–Wolfowitz algorithm
The Kiefer–Wolfowitz algorithm^{[18]} was introduced in 1952 by Jacob Wolfowitz and Jack Kiefer, and was motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function. Let M(x) be a function which has a maximum at the point θ. It is assumed that M(x) is unknown; however, certain observations N(x), where E[N(x)] = M(x), can be made at any point x. The structure of the algorithm follows a gradient-like method, with the iterates being generated as follows:

x_{n+1} = x_n + a_n · (N(x_n + c_n) − N(x_n − c_n)) / (2 c_n)
where N(x_n + c_n) and N(x_n − c_n) are independent, and the gradient of M(x) is approximated using finite differences. The sequence {c_n} specifies the sequence of finite-difference widths used for the gradient approximation, while the sequence {a_n} specifies a sequence of positive step sizes taken along that direction. Kiefer and Wolfowitz proved that, if M(x) satisfies certain regularity conditions, then x_n will converge to θ in probability as n → ∞, and later Blum^{[3]} in 1954 showed that x_n converges to θ almost surely, provided that:
 Var(N(x)) ≤ S < ∞ for all x.
 The function M(x) has a unique point of maximum (minimum) and is strongly concave (convex):
 The algorithm was first presented with the requirement that the function M(·) maintains strong global convexity (concavity) over the entire feasible space. Given that this condition is too restrictive to impose over the entire domain, Kiefer and Wolfowitz proposed that it is sufficient to impose the condition over a compact set C₀ ⊂ ℝᵈ which is known to include the optimal solution.
 The function M(x) satisfies the following regularity conditions:
 There exist β > 0 and B > 0 such that |x′ − θ| + |x″ − θ| < β ⟹ |M(x′) − M(x″)| < B |x′ − x″|.
 There exist ρ > 0 and R > 0 such that |x′ − x″| < ρ ⟹ |M(x′) − M(x″)| < R.
 For every δ > 0, there exists some π(δ) > 0 such that |z − θ| > δ ⟹ inf_{δ/2 > ε > 0} |M(z + ε) − M(z − ε)|/ε > π(δ).
 The selected sequences {a_n} and {c_n} must be infinite sequences of positive numbers such that:

c_n → 0 as n → ∞, ∑_{n=0}^∞ a_n = ∞, ∑_{n=0}^∞ a_n c_n < ∞, ∑_{n=0}^∞ a_n² c_n^{−2} < ∞
A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be a_n = 1/n and c_n = n^{−1/3}.
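A minimal Python sketch of the Kiefer–Wolfowitz recursion with the recommended gains a_n = 1/n and c_n = n^{−1/3}; the noisy quadratic objective below is a hypothetical example, not one from the original paper.

```python
import random

def kiefer_wolfowitz(measure, x0, n_iter=100_000):
    """Stochastically locate the maximum of M(x) = E[measure(x)] via
    x_{n+1} = x_n + a_n * (N(x_n + c_n) - N(x_n - c_n)) / (2 * c_n),
    with a_n = 1/n and c_n = n^(-1/3)."""
    x = x0
    for n in range(1, n_iter + 1):
        a_n, c_n = 1 / n, n ** (-1 / 3)
        x += a_n * (measure(x + c_n) - measure(x - c_n)) / (2 * c_n)
    return x

# Hypothetical example: M(x) = -(x - 2)^2 observed with small Gaussian noise;
# the maximum is at theta = 2.
random.seed(3)
N = lambda x: -(x - 2) ** 2 + random.gauss(0, 0.1)
print(kiefer_wolfowitz(N, x0=0.0))  # close to 2
```

Note the two noisy evaluations per iteration: the finite-difference quotient estimates the gradient, and the shrinking widths c_n trade bias against the amplified noise in that quotient.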
Subsequent developments and important issues
 The Kiefer–Wolfowitz algorithm requires that for each gradient computation, at least d + 1 different parameter values must be simulated for every iteration of the algorithm, where d is the dimension of the search space. This means that when d is large, the Kiefer–Wolfowitz algorithm will require substantial computational effort per iteration, leading to slow convergence.
 To address this problem, Spall proposed the use of simultaneous perturbations to estimate the gradient. This method would require only two simulations per iteration, regardless of the dimension d.^{[19]}
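A sketch of the simultaneous perturbation idea in Python (the gain constants and the test objective are hypothetical choices, not Spall's reference implementation): all d coordinates are perturbed at once by a random ±1 vector, so the gradient estimate costs two function evaluations regardless of d.

```python
import random

def spsa(measure, x0, n_iter=20_000):
    """Minimize E[measure(x)], x in R^d, using a simultaneous-perturbation
    gradient estimate: two evaluations per iteration regardless of d."""
    x = list(x0)
    d = len(x)
    for n in range(1, n_iter + 1):
        a_n = 0.1 / n ** 0.602  # commonly used gain-decay exponents
        c_n = 0.1 / n ** 0.101
        delta = [random.choice((-1.0, 1.0)) for _ in range(d)]  # Rademacher directions
        y_plus = measure([xi + c_n * di for xi, di in zip(x, delta)])
        y_minus = measure([xi - c_n * di for xi, di in zip(x, delta)])
        ghat = [(y_plus - y_minus) / (2 * c_n * di) for di in delta]
        x = [xi - a_n * gi for xi, gi in zip(x, ghat)]
    return x

# Hypothetical example: noisy quadratic bowl in d = 5 dimensions, minimum at (1, ..., 1).
random.seed(4)
f = lambda v: sum((vi - 1.0) ** 2 for vi in v) + random.gauss(0, 0.01)
print(spsa(f, [0.0] * 5))  # each coordinate close to 1
```

Per coordinate, the estimate is noisier than a full finite-difference gradient, but it is unbiased in expectation, and the saving of 2 versus 2d evaluations per iteration dominates in high dimension.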
 Among the conditions required for convergence, specifying a predetermined compact set that fulfills strong convexity (or concavity) and contains the unique solution can be difficult. In real-world applications, if the domain is quite large, these assumptions can be fairly restrictive and highly unrealistic.
Further developments
An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on.^{[20]}^{[21]} These methods are also applied in control theory, in which case the unknown function which we wish to optimize or find the zero of may vary in time. In this case, the step size should not converge to zero but should be chosen so as to track the function.^{[20]}^{, 2nd ed., chapter 3}
C. Johan Masreliez and R. Douglas Martin were the first to apply stochastic approximation to robust estimation.^{[22]}
The main tool for analyzing stochastic approximation algorithms (including the Robbins–Monro and the Kiefer–Wolfowitz algorithms) is a theorem by Aryeh Dvoretzky published in the proceedings of the third Berkeley symposium on mathematical statistics and probability, 1956.^{[23]}
See also
 Stochastic gradient descent
 Stochastic optimization
 Simultaneous perturbation stochastic approximation
References
 ^ Le Ny, Jerome. "Introduction to Stochastic Approximation Algorithms" (PDF). Polytechnique Montreal. Teaching Notes. Retrieved 16 November 2016.
 ^ ^{a} ^{b} Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.
 ^ ^{a} ^{b} Blum, Julius R. (1954-06-01). "Approximation Methods which Converge with Probability one". The Annals of Mathematical Statistics. 25 (2): 382–386. doi:10.1214/aoms/1177728794. ISSN 0003-4851.
 ^ Sacks, J. (1958). "Asymptotic Distribution of Stochastic Approximation Procedures". The Annals of Mathematical Statistics. 29 (2): 373–405. doi:10.1214/aoms/1177706619. JSTOR 2237335.
 ^ ^{a} ^{b} Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. (2009). "Robust Stochastic Approximation Approach to Stochastic Programming". SIAM Journal on Optimization. 19 (4): 1574. doi:10.1137/070704277.
 ^ Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, Wiley Intersci. Ser. Discrete Math 15, John Wiley, New York (1983).
 ^ ^{a} ^{b} Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control, J.C. Spall, John Wiley Hoboken, NJ, (2003).
 ^ Chung, K. L. (1954-09-01). "On a Stochastic Approximation Method". The Annals of Mathematical Statistics. 25 (3): 463–483. doi:10.1214/aoms/1177728716. ISSN 0003-4851.
 ^ Fabian, Vaclav (1968-08-01). "On Asymptotic Normality in Stochastic Approximation". The Annals of Mathematical Statistics. 39 (4): 1327–1332. doi:10.1214/aoms/1177698258. ISSN 0003-4851.
 ^ Lai, T. L.; Robbins, Herbert (1979-11-01). "Adaptive Design and Stochastic Approximation". The Annals of Statistics. 7 (6): 1196–1221. doi:10.1214/aos/1176344840. ISSN 0090-5364.
 ^ Lai, Tze Leung; Robbins, Herbert (1981-09-01). "Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes". Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete. 56 (3): 329–360. doi:10.1007/BF00536178. ISSN 0044-3719.
 ^ Polyak, B. T. (1990). "New stochastic approximation type procedures. (In Russian.)". 7 (7).
 ^ Ruppert, D. (1988). "Efficient estimators from a slowly converging Robbins–Monro process".
 ^ ^{a} ^{b} Polyak, B. T.; Juditsky, A. B. (1992). "Acceleration of Stochastic Approximation by Averaging". SIAM Journal on Control and Optimization. 30 (4): 838. doi:10.1137/0330046.
 ^ On Cesari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, Dokl. Akad. Nauk SSR 2939, (1978 (Russian)), Soviet Math. Dokl. 19 (1978 (English)).
 ^ Kushner, Harold; Yin, G. George (2003-07-17). Stochastic Approximation and Recursive Algorithms and Applications. Springer. ISBN 9780387008943. Retrieved 2016-05-16.
 ^ Bouleau, N.; Lepingle, D. (1994). Numerical Methods for stochastic Processes. New York: John Wiley. ISBN 9780471546412.
 ^ Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression Function". The Annals of Mathematical Statistics. 23 (3): 462. doi:10.1214/aoms/1177729392.
 ^ Spall, J. C. (2000). "Adaptive stochastic approximation by the simultaneous perturbation method". IEEE Transactions on Automatic Control. 45 (10): 1839–1853. doi:10.1109/TAC.2000.880982.
 ^ ^{a} ^{b} Kushner, H. J.; Yin, G. G. (1997). Stochastic Approximation Algorithms and Applications. doi:10.1007/9781489926968. ISBN 9781489926982.
 ^ Stochastic Approximation and Recursive Estimation, Mikhail Borisovich Nevel'son and Rafail Zalmanovich Has'minskiĭ, translated by Israel Program for Scientific Translations and B. Silver, Providence, RI: American Mathematical Society, 1973, 1976. ISBN 0821815970.
 ^ Martin, R.; Masreliez, C. (1975). "Robust estimation via stochastic approximation". IEEE Transactions on Information Theory. 21 (3): 263. doi:10.1109/TIT.1975.1055386.
 ^ Dvoretzky, Aryeh (1956-01-01). "On Stochastic Approximation". The Regents of the University of California.