A Beginner’s Guide to Q-Learning


Model-Free Reinforcement Learning


Have you ever scolded or punished your dog for something wrong it did? Or have you ever trained a pet and rewarded it for every command it performed correctly? If you are a pet owner, your answer is probably ‘Yes’. You may have noticed that if you do this consistently from a young age, its wrongful behaviour reduces day by day; in the same way, it learns from its mistakes and trains itself well.

As humans, we have experienced the same. Can you remember how, in primary school, our teachers rewarded us with stars when we did our schoolwork properly? :D

This is exactly what happens in Reinforcement Learning (RL).

Reinforcement Learning is one of the most beautiful branches of Artificial Intelligence.

The objective of RL is to maximize the reward of an agent by taking a series of actions in response to a dynamic environment.


Reinforcement Learning is the science of making optimal decisions from experience. Breaking it down, the process of Reinforcement Learning involves these simple steps (a toy sketch of the loop follows the list):

  1. Observation of the environment
  2. Deciding how to act using some strategy
  3. Acting accordingly
  4. Receiving a reward or penalty
  5. Learning from the experiences and refining our strategy
  6. Iterate until an optimal strategy is found
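
To make those steps concrete, here is a purely illustrative toy in Python: a random walker on a short line, not the grid example used later in this article. Every name and number in it is an assumption made up for this sketch.

```python
import random

# A toy stand-in, just to make the loop concrete: the agent walks along
# positions 0..5 and finishes when it reaches position 5. The agent here
# acts randomly; a real RL agent would refine its strategy in step 5.
def reset():
    return 0

def step(state, action):                      # action is +1 (right) or -1 (left)
    next_state = max(0, min(5, state + action))
    reward = 1 if next_state == 5 else 0      # reward or penalty from the environment
    done = next_state == 5
    return next_state, reward, done

for episode in range(3):                      # 6. iterate episode after episode
    state = reset()                           # 1. observe the environment
    done = False
    while not done:
        action = random.choice([-1, 1])       # 2. decide how to act (here: randomly)
        state, reward, done = step(state, action)   # 3-4. act and receive reward/penalty
        # 5. learn from the experience and refine the strategy (omitted in this toy)
    print(f"episode {episode} reached the goal")
```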

There are two main types of RL algorithms: model-based and model-free.

A model-free algorithm estimates the optimal policy without using or estimating the dynamics (transition and reward functions) of the environment, whereas a model-based algorithm uses the transition function (and the reward function) to estimate the optimal policy.

Q-learning is a model-free reinforcement learning algorithm.

Q-learning is a value-based learning algorithm. Value-based algorithms update the value function based on an equation (in particular, the Bellman equation). The other type, policy-based algorithms, estimates the value function with a greedy policy obtained from the last policy improvement.

Q-learning is an off-policy learner, which means it learns the value of the optimal policy independently of the agent’s actions. An on-policy learner, on the other hand, learns the value of the policy being carried out by the agent, including the exploration steps, and it finds a policy that is optimal given the exploration inherent in that policy.

What’s this ‘Q’?

The ‘Q’ in Q-learning stands for quality. Quality here represents how useful a given action is in gaining some future reward.

Q-learning Definition

  • Q*(s,a) is the expected value (cumulative discounted reward) of doing a in state s and then following the optimal policy.
  • Q-learning uses Temporal Differences (TD) to estimate the value of Q*(s,a). Temporal difference learning means the agent learns from the environment through episodes, with no prior knowledge of the environment.
  • The agent maintains a table of Q[S, A], where S is the set of states and A is the set of actions.
  • Q[s, a] represents its current estimate of Q*(s,a).
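
In standard notation, with discount factor $\gamma$ and per-step reward $r_t$, the first point can be written as:

$$Q^*(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \text{optimal policy thereafter}\right]$$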

Q-learning Simple Example

In this section, Q-learning is explained with the help of a demo.

Let’s say an agent has to move from a starting point to an ending point along a path that has obstacles. The agent needs to reach the target by the shortest possible path without hitting the obstacles, while staying within the boundary formed by the obstacles. For convenience, I have modelled this as a customized grid environment, shown below.

[Figure: the customized grid environment with start point, obstacles and goal]

Introducing the Q-Table

The Q-Table is the data structure used to store the maximum expected future reward for each action at each state. Basically, this table guides us to the best action at each state. The Q-learning algorithm is used to learn each value of the Q-Table.

Q-function

The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).
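
In standard notation, with learning rate $\alpha$ and discount factor $\gamma$, the update applied to the table is:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

where $r$ is the immediate reward and $s'$ is the resulting state.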


Q-learning Algorithm Process

[Figure: the Q-learning algorithm process]

Step 1: Initialize the Q-Table

First, the Q-Table has to be built. It has n columns, where n = the number of actions, and m rows, where m = the number of states.

In our example, the actions are Go Left, Go Right, Go Up and Go Down (n = 4), and the states are Start, Idle, Correct Path, Wrong Path and End (m = 5). First, let’s initialize all the values to 0.

[Figure: the initial Q-Table with all values set to 0]
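
A rough sketch of this step in Python (using NumPy; the row and column ordering is an arbitrary choice made for this sketch) could be:

```python
import numpy as np

# State and action labels from the example (the ordering is just a choice).
states = ["Start", "Idle", "Correct Path", "Wrong Path", "End"]    # m = 5 rows
actions = ["Go Left", "Go Right", "Go Up", "Go Down"]              # n = 4 columns

# Step 1: build the m x n Q-Table and initialize every value to 0.
q_table = np.zeros((len(states), len(actions)))
print(q_table)
```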

Step 2: Choose an Action

Step 3: Perform an Action

The combination of steps 2 and 3 is performed for an arbitrary amount of time. These steps run until training is stopped, or until the training loop terminates as defined in the code.

First, an action (a) in the state (s) is chosen based on the Q-Table. Note that, as mentioned earlier, when the episode initially starts, every Q-value should be 0.

Then, the Q-value for the chosen state-action pair is updated using the Bellman equation stated above.

The epsilon-greedy strategy comes into play here. In the beginning, the epsilon rate is high, so the agent explores the environment and chooses actions randomly. This makes sense, since the agent does not yet know anything about the environment. As the agent explores the environment, the epsilon rate decreases and the agent starts to exploit the environment.

During the process of exploration, the agent progressively becomes more confident in estimating the Q-values.
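
A minimal sketch of epsilon-greedy action selection with a decaying epsilon might look like the following; the starting value, minimum and decay rate are arbitrary choices for illustration, not values from this article.

```python
import numpy as np

def choose_action(q_table, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the best-known action."""
    if np.random.random() < epsilon:
        return np.random.randint(q_table.shape[1])   # explore: random action
    return int(np.argmax(q_table[state]))            # exploit: best action so far

epsilon = 1.0            # start fully exploratory (assumed value)
epsilon_min = 0.05       # always keep a little exploration (assumed value)
epsilon_decay = 0.995    # shrink epsilon after each episode (assumed value)

q_table = np.zeros((5, 4))                            # the table from Step 1
action = choose_action(q_table, state=0, epsilon=epsilon)

epsilon = max(epsilon_min, epsilon * epsilon_decay)   # applied once per episode
```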

In our example, when training starts, the agent is completely unaware of the environment. So let’s say it takes a random action in the ‘right’ direction.

[Figure: the agent takes a random action to the right]

We can now update the Q-values for being at the start and moving right using the Bellman equation.

[Figure: the Q-value update for this move, computed with the Bellman equation]
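
In code, that single update is one line of arithmetic. All the numbers below (learning rate, discount factor, reward, state indices) are assumed for illustration only.

```python
import numpy as np

q_table = np.zeros((5, 4))                  # the freshly initialized table from Step 1
alpha, gamma = 0.1, 0.9                     # assumed learning rate and discount factor

state, action, next_state = 0, 1, 2         # illustrative indices: Start, "Go Right", Correct Path
reward = 1                                  # assumed reward for moving a step closer to the goal

# Bellman update: move Q(s, a) towards r + gamma * max_a' Q(s', a').
q_table[state, action] += alpha * (
    reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
)
print(q_table[state, action])               # 0.1 after this single update (0 + 0.1 * 1)
```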

Step 4: Measure Reward

Now we have taken an action and observed an outcome and reward.

Step 5: Evaluate

We need to update the function Q(s,a).

This process is repeated again and again until learning is stopped. In this way the Q-Table is updated and the value function Q is maximized. Here, Q(state, action) returns the expected future reward of taking that action at that state.
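
Putting steps 2 to 5 together, here is a hedged sketch of the whole loop. To keep it runnable end to end, it uses a tiny one-dimensional stand-in environment rather than the grid example, and all hyperparameters and rewards are assumptions made for this sketch.

```python
import numpy as np

# A tiny stand-in environment: 5 positions in a row, start at 0, goal at 4;
# trying to step off the row plays the role of hitting an obstacle.
n_states, n_actions = 5, 2                     # actions: 0 = left, 1 = right
goal = 4

def step(state, action):
    next_state = state + (1 if action == 1 else -1)
    if next_state < 0 or next_state > goal:    # off the row: "obstacle"
        return state, -1, False
    if next_state == goal:                     # reached the goal
        return next_state, 1, True
    return next_state, 0, False                # ordinary step

q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9                        # assumed learning rate / discount factor
epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.99

for episode in range(500):
    state = 0
    for _ in range(100):                       # cap the episode length for safety
        # Step 2: choose an action (epsilon-greedy).
        if np.random.random() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(q_table[state]))
        # Steps 3-4: perform the action and measure the reward.
        next_state, reward, done = step(state, action)
        # Step 5: evaluate, i.e. update Q(s, a) with the Bellman equation.
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
        if done:
            break
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print(np.round(q_table, 2))                    # the learned Q-Table
```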

A Beginners Guide to Q-Learning (12)

In the example, I have set the rewarding scheme as follows.

  • Reward when the agent moves a step closer to the goal = +1
  • Reward when the agent hits an obstacle = -1
  • Reward when the agent stays idle = 0

Initially, we explore the agent’s environment and update the Q-Table. When the Q-Table is ready, the agent starts to exploit the environment and takes better actions. The final Q-Table can look like the following (for example).
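
Exploiting the table simply means taking the action with the highest Q-value in each state. Here is a sketch of reading the greedy policy out of a final Q-Table; the numbers are hypothetical values made up purely to illustrate the read-out.

```python
import numpy as np

states = ["Start", "Idle", "Correct Path", "Wrong Path", "End"]
actions = ["Go Left", "Go Right", "Go Up", "Go Down"]

# Hypothetical final Q-values, made up purely to illustrate the read-out.
q_table = np.array([
    [0.1, 0.9, 0.2, 0.0],    # Start
    [0.0, 0.5, 0.1, 0.1],    # Idle
    [0.2, 0.3, 0.8, 0.0],    # Correct Path
    [0.0, 0.1, 0.0, 0.0],    # Wrong Path
    [0.0, 0.0, 0.0, 0.0],    # End (terminal, no action needed)
])

# Greedy policy: in every state, pick the action with the highest Q-value.
for s, name in enumerate(states):
    print(f"{name}: {actions[int(np.argmax(q_table[s]))]}")
```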

[Figure: an example final Q-Table after training]

The following outcome shows the agent’s shortest path towards the goal after training.

[Figure: the agent’s shortest path to the goal after training]

Please drop me a mail to get the Python implementation code for the concept explained here.
