State–action–reward–state–action

State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note^[1] with the name "Modified Connectionist Q-Learning" (MCQ-L). The alternative name SARSA, proposed by Rich Sutton, was only mentioned as a footnote.

This name reflects the fact that the main function for updating the Q-value depends on the current state of the agent "S₁", the action the agent chooses "A₁", the reward "R₂" the agent gets for choosing this action, the state "S₂" that the agent enters after taking that action, and finally the next action "A₂" the agent chooses in its new state. The acronym for the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ is SARSA.^[2] Some authors use a slightly different convention and write the quintuple $(S_t, A_t, R_t, S_{t+1}, A_{t+1})$, depending on which time step the reward is formally assigned. The rest of the article uses the former convention.

Algorithm

$$Q^{new}(S_t, A_t) \leftarrow (1-\alpha)\,Q(S_t, A_t) + \alpha\,\left[R_{t+1} + \gamma\,Q(S_{t+1}, A_{t+1})\right]$$

A SARSA agent interacts with the environment and updates the policy based on the actions it actually takes, hence this is known as an on-policy learning algorithm. The Q value for a state-action pair is updated by an error term, scaled by the learning rate α. Q values represent the expected reward received in the next time step for taking action a in state s, plus the discounted future reward estimated from the next state-action pair.
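The update rule above can be sketched for the tabular case as follows. This is a minimal illustration, not a reference implementation: the function name `sarsa_update`, the dictionary-based Q table, and the example states, actions, and parameter values are all assumptions made for the sketch.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Apply one tabular SARSA update to q in place and return the new value.

    q maps (state, action) pairs to estimated values (hypothetical layout).
    """
    # Target uses the action a_next that the agent actually chose in s_next.
    target = r + gamma * q[(s_next, a_next)]
    # Blend the old estimate with the target, weighted by the learning rate.
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return q[(s, a)]


# Unseen state-action pairs default to 0.0 (an assumed initial condition).
q = defaultdict(float)
q[("s1", "right")] = 2.0

# Quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) with made-up values.
new_value = sarsa_update(q, "s0", "right", 1.0, "s1", "right")
print(new_value)  # 0.5 * 0.0 + 0.5 * (1.0 + 0.9 * 2.0)
```

Because the target depends on `a_next`, the next action has to be selected before the update for the current step can be applied.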

Watkins's Q-learning updates an estimate of the optimal state-action value function $Q^{*}$ based on the maximum estimated value over the available actions. While SARSA learns the Q values associated with the policy it follows itself, Watkins's Q-learning learns the Q values associated with the optimal policy while following an exploration/exploitation policy.

Some optimizations of Watkins's Q-learning may be applied to SARSA.^[3]
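The difference between the two algorithms comes down to the bootstrap target. The sketch below compares the two targets for the same transition; the function names, the nested-dictionary Q table, and the numbers are hypothetical and chosen only for illustration.

```python
def sarsa_target(q, r, s_next, a_next, gamma):
    """On-policy target: bootstraps from the action actually taken next."""
    return r + gamma * q[s_next][a_next]


def q_learning_target(q, r, s_next, gamma):
    """Off-policy target: bootstraps from the highest-valued next action."""
    return r + gamma * max(q[s_next].values())


# Hypothetical Q table: q[state][action] -> estimated value.
q = {"s1": {"left": 0.0, "right": 4.0}}

# Suppose the behaviour policy explored and picked "left" in s1.
on_policy = sarsa_target(q, 1.0, "s1", "left", gamma=0.5)
off_policy = q_learning_target(q, 1.0, "s1", gamma=0.5)
print(on_policy, off_policy)
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.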

Hyperparameters

Learning rate (alpha)

The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information.
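The two extremes can be checked directly on the blending step of the update rule. The helper name `blend` and the numbers are assumptions for this sketch; `target` stands for the bracketed term $R_{t+1} + \gamma\,Q(S_{t+1}, A_{t+1})$.

```python
def blend(old_q, target, alpha):
    """Weighted average of the old estimate and the new target."""
    return (1 - alpha) * old_q + alpha * target


old_q, target = 2.0, 10.0

unchanged = blend(old_q, target, alpha=0.0)    # old estimate is kept as-is
overwritten = blend(old_q, target, alpha=1.0)  # old estimate is discarded
partial = blend(old_q, target, alpha=0.25)     # moves a quarter of the way
print(unchanged, overwritten, partial)
```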

Discount factor (gamma)

The discount factor determines the importance of future rewards. A discount factor of 0 makes the agent "opportunistic", or "myopic",^[4] by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the $Q$ values may diverge.
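To make the effect of the discount factor concrete, the following sketch computes the discounted sum of a short, made-up reward sequence for several values of gamma. The function name `discounted_return` and the reward values are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Small rewards now, a large reward three steps ahead (made-up values).
rewards = [1.0, 1.0, 1.0, 10.0]

myopic = discounted_return(rewards, gamma=0.0)       # sees only the first reward
balanced = discounted_return(rewards, gamma=0.5)     # later rewards are damped
undiscounted = discounted_return(rewards, gamma=1.0) # plain sum of all rewards
print(myopic, balanced, undiscounted)
```

With gamma equal to 1 the return is the plain sum of rewards, which grows without bound on a task that never terminates and keeps paying positive reward; this is one way to see why the $Q$ values may diverge.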

Initial conditions ($Q(S_0, A_0)$)

Since SARSA is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. A high initial value, also known as "optimistic initial conditions",^[5] can encourage exploration: no matter what action is taken, the update rule causes it to have a lower value than the untried alternatives, thus increasing their choice probability. In 2013 it was suggested that the first reward $r$ could be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of $Q$. This allows immediate learning in the case of fixed deterministic rewards. This resetting-of-initial-conditions (RIC) approach seems to be consistent with human behavior in repeated binary choice experiments.^[6]
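Both ideas can be illustrated on a toy task with a single state in which the episode ends after every action, so the target reduces to the reward alone. This is a sketch under those assumptions; the function `update`, the `reset_initial` flag, and the numbers are hypothetical, and the RIC branch is only one reading of the description above.

```python
def update(q, visited, action, reward, alpha=0.5, reset_initial=False):
    """One update in a single-state task where every action ends the episode."""
    if reset_initial and action not in visited:
        # RIC: the first observed reward replaces the initial value outright.
        q[action] = reward
    else:
        # Ordinary update; the next-state value is taken as 0 (terminal).
        q[action] = (1 - alpha) * q[action] + alpha * reward
    visited.add(action)


# Optimistic start: both initial values exceed any reward the task can pay.
q = {"a": 5.0, "b": 5.0}
update(q, set(), "a", reward=1.0)
greedy = max(q, key=q.get)  # the untried action now looks better
print(q, greedy)

# Same start with RIC: the estimate jumps straight to the first reward.
q_ric = {"a": 5.0, "b": 5.0}
update(q_ric, set(), "a", reward=1.0, reset_initial=True)
print(q_ric)
```

In the optimistic case the tried action drops below the untried one, so a greedy agent switches to the alternative; with RIC the estimate for a fixed deterministic reward is exact after a single trial.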

References

^ "Online Q-Learning using Connectionist Systems" by Rummery & Niranjan (1994)
^ Reinforcement Learning: An Introduction Richard S. Sutton and Andrew G. Barto (chapter 6.4)
^ Wiering, Marco; Schmidhuber, Jürgen (1998-10-01). "Fast Online Q(λ)" (PDF). Machine Learning. 33 (1): 105–115. doi:10.1023/A:1007562800292. ISSN 0885-6125. S2CID 8358530.
^ "Arguments against myopic training". Retrieved 17 May 2023.
^ "2.7 Optimistic Initial Values". incompleteideas.net. Retrieved 2018-02-28.
^ Shteingart, H; Neiman, T; Loewenstein, Y (May 2013). "The Role of First Impression in Operant Learning" (PDF). J Exp Psychol Gen. 142 (2): 476–88. doi:10.1037/a0029550. PMID 22924882.

This page was last edited on 13 December 2023, at 09:23

From Wikipedia, the free encyclopedia