SARSA

SARSA (State-Action-Reward-State-Action) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan (1994) in the technical note 'On-line Q-Learning Using Connectionist Systems', where the alternative name SARSA was only mentioned in a footnote.

This name reflects the fact that the main function for updating the Q-value depends on the current state of the agent "S1", the action the agent chooses "A1", the reward "R" the agent receives for choosing this action, the state "S2" the agent will be in after taking that action, and finally the next action "A2" the agent will choose in its new state. Taking each letter in the quintuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ yields the word SARSA.

Algorithm
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t) \right]$$
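The update rule above can be sketched as a short Python function. The Q-table layout (a dict of dicts) and the default values for the learning rate and discount factor are illustrative assumptions, not prescribed by the algorithm:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA temporal-difference update to a Q-table.

    Q is assumed to be a dict of dicts: Q[state][action] -> float.
    alpha is the learning rate, gamma the discount factor.
    """
    td_target = r + gamma * Q[s_next][a_next]   # bootstrap on the NEXT action actually chosen
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q
```

Note that the target uses $Q(s_{t+1}, a_{t+1})$ for the action the agent will actually take, which is what makes SARSA on-policy.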

A SARSA agent interacts with the environment and updates the policy based on the actions it actually takes; it is therefore known as an on-policy learning algorithm. As the update rule above shows, the Q-value for a state-action pair is adjusted by the temporal-difference error, scaled by the learning rate alpha. Q-values represent the possible reward received on the next time step for taking action a in state s, plus the discounted future reward received from the next state-action observation. SARSA was created as an alternative to the existing temporal-difference technique, Watkins's Q-learning, which updates the policy based on the maximum reward of the available actions. The difference may be summarised as follows: SARSA learns the Q-values associated with the policy it follows itself, while Watkins's Q-learning learns the Q-values associated with the exploitation policy while following an exploration/exploitation policy. For further information on the exploration/exploitation trade-off, see Reinforcement Learning.

Some optimisations of Watkins's Q-learning may also be applied to SARSA; for example, the paper 'Fast Online Q(λ)' (Wiering and Schmidhuber, 1998) describes the small differences needed for SARSA(λ) implementations as they arise.