How to build Reinforcement Learning Agents using Q-Learning


May 28, 2019
Reinforcement Learning is a sub-field of Machine Learning where an agent or entity is introduced into an environment and learns how to navigate through it by trial and error. When you place an agent in a new environment, it will initially make mistakes as it has no idea what to do and when to do it. Naturally, over time, it will improve by continuously making mistakes and learning from them.
A few terms come up repeatedly here: State, Action and Reward. State refers to the particular condition of the environment at a given moment in time. Action refers to a specific move made in a particular state of the environment. When an action is performed in a state, the environment returns a Reward to the agent which indicates whether the move was good or bad. A positive reward means the agent made a good move, and a negative reward means it made a bad move.

To give it some context, let's think of a video game. A state could be a single frame taken from the video game screen. An action could be something like 'move left' and the reward could be +1 or -1 depending on the success of the move for that state.


Introducing the Q-Learning algorithm

In this article, we'll be covering the basics of one of the most famous and most easily implementable Reinforcement Learning algorithms: Q-Learning.

Q-Learning belongs to the family of Temporal-Difference learning algorithms, which also includes the closely related SARSA algorithm (short for State-Action-Reward-State-Action). The name SARSA describes the sequence of steps taken during the learning process in an environment.

Q-Learning makes use of a table called the Q-Table. It's just a fancy name for a lookup table that consists of columns for the actions and rows for every single state possible in the environment. For instance, the Q-Table could look like this for a mouse-in-a-maze game:
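As a rough sketch of that structure, here is how a Q-Table for a small mouse-in-a-maze game might be built in Python. The 3x3 maze layout and the four movement actions are illustrative assumptions, not part of the original example:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 maze: each (row, col) square is one state
states = [(row, col) for row in range(3) for col in range(3)]
actions = ['up', 'down', 'left', 'right']

# The Q-Table has one row per state and one column per action,
# with every value initialised to zero before learning starts
q_table = pd.DataFrame(np.zeros((len(states), len(actions))),
                       index=pd.Index(states), columns=actions)
print(q_table)
```

Each cell will eventually hold the learned value of taking that action from that state.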

Every environment needs a reward function that gives the reward depending on the move made. For example, taking reference to the table above, all the mouse trap squares could carry a negative reward of -5 while the cheese square could have a reward of +2. Whenever it reaches the flag, the environment resets and it tries again, this time much faster than before.
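Such a reward function can be sketched as a simple lookup. The trap and cheese positions below are made up for illustration; only the reward values (-5 for traps, +2 for cheese) come from the example above:

```python
# Hypothetical square positions in the maze
TRAP_SQUARES = {(0, 2), (2, 0)}   # mouse traps: -5
CHEESE_SQUARE = (1, 1)            # cheese: +2

def maze_reward(square):
    # Return the reward for landing on a given square
    if square in TRAP_SQUARES:
        return -5
    if square == CHEESE_SQUARE:
        return 2
    return 0

print(maze_reward((0, 2)))  # -5
print(maze_reward((1, 1)))  # 2
```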


Populating the table

Initially, the mouse has no idea what to do when introduced into the maze. In an effort to figure out what to do, it will first start by making random moves and seeing if the reward is positive or negative. If a move turns out well, it will make that move more often; if it turns out badly, it will learn to avoid it.

Given a state, an action and the reward function for the environment, we can use the Bellman equation to populate the Q-Table. The Bellman update looks like this:

    Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]

All the equation is doing is updating the entry of the Q-Table at index [s, a]. The new value is the current value plus an update term: the learning rate alpha multiplied by the sum of the reward for performing that action in that state and the discounted maximum Q-value of the next state, minus the current value.

The gamma value is called the discount factor. It is a weighting term that represents how much future rewards lose their value the further in time they are from the current state. Intuitively, it's like saying that a human would prefer getting $100 now rather than $100 tomorrow.
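As a quick numerical sketch of how the discount factor shrinks future rewards:

```python
gamma = 0.8  # discount factor

# A reward received n steps in the future is worth gamma**n today.
# Four rewards of +1, one per future step:
rewards = [1, 1, 1, 1]
discounted = sum((gamma ** n) * r for n, r in enumerate(rewards))
print(discounted)  # roughly 2.952, i.e. 1 + 0.8 + 0.64 + 0.512
```

The further away a reward is, the less it contributes to the value of the current state.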

The term max_a' Q(s', a') is the maximum possible Q-value at the next state s', which the environment returns after action a is performed.


Q-Learning, a step-by-step approach

The simplest sequence of steps in the Q-Learning algorithm is as follows:
  1. Initialise Q-Table
  2. Choose a random action from possible actions
  3. Perform the random action
  4. Measure Reward
  5. Update the Q-Table
Over time, the values in the table are optimised (this means your agent will start performing better and better every time it's reintroduced into the environment).


To Exploit and Earn or to Explore and Learn?

In Reinforcement Learning, there's a trade-off between Exploitation - the agent simply uses what it already knows and does the same thing repetitively - and Exploration - the agent has no clue what to do and takes random actions.

When you have an agent that knows the best way to do something, it'll exploit its knowledge and use the same approach every time it's reintroduced into the environment. Pure exploitation can be bad because the agent never tries other strategies that may be more lucrative or rewarding.

Conversely, when you have an agent that is newly introduced into the environment with zero knowledge of what's going on, you ideally want it to explore the environment, make random moves and collect the rewards.

An optimally-functioning agent does a bit of both: it makes moves both randomly and by relying on past learning experiences. To instil this behaviour into the agent, we use the Epsilon-Greedy Policy. By setting a hyper-parameter called epsilon, we can control the agent's behaviour when navigating through environments.

Every time the agent is about to make a move, generate a random number between 0 and 1. If the number is greater than epsilon, take the best-known action from the table; this is Exploitation. If not, take a random action; this is Exploration.
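A minimal sketch of that policy, assuming q_values is the row of the Q-Table for the current state:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random):
    # With probability epsilon, explore: pick any action at random
    if rng.uniform() < epsilon:
        return int(rng.randint(len(q_values)))
    # Otherwise exploit: pick the best-known action
    return int(np.argmax(q_values))

# With epsilon=0 the agent always exploits the best-known action
print(epsilon_greedy(np.array([0.0, 5.0, 1.0]), epsilon=0.0))  # 1
```

With epsilon at 0.1, the agent explores on roughly 10% of moves and exploits on the rest.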

This simple rule lets us converge on an agent that can navigate the environment optimally.


Performing Q-Learning in Python

Now that you have an understanding of what's going on, let's take a look at some code. Here, we'll create an agent that learns how to navigate from the starting square to the end square along a horizontal path that looks something like this:


import numpy as np
import pandas as pd

def get_table(states, actions):
    # One row per state, one column per action, all initialised to zero
    table = np.zeros([len(states), len(actions)])
    table = pd.DataFrame(table, columns=actions)
    return table

gamma = 0.8
alpha = 0.6
epsilon = 0.1

states = list(range(10))
states[-1] = 'terminal'
actions = ['left', 'right']

q_table = get_table(states, actions)
Now that we have the table, let's create helper functions to get the reward and render the environment:

def get_reward(action, state):
    # The agent has reached the final state; the episode is over
    if state == 'terminal':
        return None, 'terminal'
    if action == 0:
        # Moving left costs a small penalty; the agent cannot
        # move past square 0, so it stays put at the left edge
        return -1, max(state - 1, 0)
    # action == 1: moving right earns a small reward;
    # the last square (index 9) is the goal and ends the episode
    if state + 1 >= 9:
        return 1, 'terminal'
    return 1, state + 1

def render(state):
    possible_states = ['_' for i in range(10)]
    possible_states[state] = 'X'
    string = ''
    for i in range(len(possible_states)):
        string += '{:^2}'.format(possible_states[i])
    return string
Now, let's dive into the main event loop:

total_rewards = []
for i in range(1000): # For 1000 episodes / generations / iterations
    current_state = 0
    current_rewards = []
    steps = 0
    while True:
        message = "Step {:<3}".format(steps)

        # Using the Epsilon Greedy Policy
        random = np.random.uniform()
        if random < epsilon: # random move (Exploration)
            action = np.random.choice(len(actions))
        else: # Take the best-known action from the table (Exploitation)
            action = int(np.argmax(q_table.iloc[current_state]))

        reward, next_state = get_reward(action, current_state)
        if reward is None:
            break
        current_rewards.append(reward)

        # Bellman update
        current = q_table.iloc[current_state, action]
        if next_state == 'terminal':
            future = 0.0 # no future value once the episode ends
        else:
            future = gamma * q_table.iloc[next_state].max()
        q_table.iloc[current_state, action] = current + (alpha * (reward + future - current))

        message += '{}'.format(render(current_state))
        print (message, end="\r", flush=True)

        current_state = next_state
        steps += 1

        if next_state == 'terminal':
            print ('Goal reached. Steps Taken: {}'.format(steps))
            break
    total_rewards.append(sum(current_rewards))
With that, when the final table is printed to the terminal, you can observe that the values at the squares that are major milestones for the agent are higher than those that penalise the agent. Here are some stats from my training process:



Ending note

Q-Learning is the most basic form that Reinforcement Learning can take. The concepts used here are the basis of more complex Reinforcement Learning algorithms such as Proximal Policy Optimisation, Deep Q-Learning (Q-Learning with Neural Networks), Actor-Critic models and much more.

Hopefully, this article serves to introduce you to the basics of Reinforcement Learning and some of the terms used in the field. In principle, we can apply Q-Learning and related algorithms to any environment that is episodic in nature (a task that can come to an end).

Do check out the code once again and run through the steps. Again, you can find it on this GitHub repo here.