The model adapts standard Q-learning: the assumption that the agent always takes the reward-maximizing action is replaced by a weighting scheme under which the agent may also take the reward-minimizing action. This pessimistic Q-learning model is used to model characteristics of anxious behavior.
\begin{equation} Q(s,a) = \sum_{s', r} p(s', r \mid s, a) \left( r + \gamma \left[ c \max_{a'} Q(s', a') + (1-c) \min_{a'} Q(s', a') \right] \right) \end{equation}
where p(s', r | s, a) is the probability of transitioning to the next state s' and receiving reward r after taking action a in state s; r is the immediate reward obtained by taking action a in state s; and gamma is the discount rate, which determines how much immediate rewards are valued over future rewards. If gamma is zero, the agent values only immediate rewards. Here c is the pessimism parameter. It takes values between 0 and 1: a value of 1 means the agent updates its beliefs on the assumption that it will always take reward-maximizing actions in the future, and a value of 0 means the agent believes it will always take reward-minimizing actions.
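As a quick check on the role of c, the two boundary cases can be written out directly from the equation above. With c = 1 the minimum term vanishes and the standard Bellman optimality equation for Q is recovered:
\begin{equation} Q(s,a) = \sum_{s', r} p(s', r \mid s, a) \left( r + \gamma \max_{a'} Q(s', a') \right) \end{equation}
With c = 0 the maximum term vanishes and every future state is valued by its worst action:
\begin{equation} Q(s,a) = \sum_{s', r} p(s', r \mid s, a) \left( r + \gamma \min_{a'} Q(s', a') \right) \end{equation}
Intermediate values of c interpolate linearly between these two cases.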
Q(s,a) is solved by value iteration.
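To make the value-iteration step concrete, here is a minimal NumPy sketch under some stated assumptions; it is not necessarily the implementation used for the model. The function name `pessimistic_q_iteration`, the array layout (`P[s, a, s']` for transition probabilities and `R[s, a, s']` for the expected reward on each transition), and the stopping tolerance are all hypothetical choices made for illustration.

```python
import numpy as np


def pessimistic_q_iteration(P, R, gamma=0.9, c=0.5, tol=1e-8, max_iter=10_000):
    """Iterate the pessimistic Bellman update until Q stops changing.

    Assumed (hypothetical) layout:
      P[s, a, t] : probability of moving from state s to state t under action a
      R[s, a, t] : expected reward received on that transition
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        # Value of each next state: weight c on its best action,
        # weight (1 - c) on its worst action.
        V = c * Q.max(axis=1) + (1.0 - c) * Q.min(axis=1)
        # Expectation over next states of reward plus discounted blended value.
        Q_new = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q


# Toy example (made up for illustration): one state, two actions,
# action 0 pays reward 1 and action 1 pays reward 0, both loop back.
P_toy = np.ones((1, 2, 1))
R_toy = np.array([[[1.0], [0.0]]])
print(pessimistic_q_iteration(P_toy, R_toy, gamma=0.5, c=0.5))
```

In this toy problem the blended state value satisfies V = c + gamma V, so V = c / (1 - gamma), which gives a simple analytic check on the iteration. For gamma below 1 the update is a contraction, because a convex combination of max and min is non-expansive, so the loop converges from any starting Q.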
Python