AI 101 - Class Recordings
Recording Class 8
Video Transcription
So we're going to be going over your homework from last time. Are there any questions to begin with? I'll just go over it. Question 1: vectorize the following documents using bag of words. We have document 1, "my friends like to fish, they do not like gardening," and document 2, "fish are food, not friends." If we remember how bag of words works, we build a vocabulary and record a frequency for each word in it. Our vocabulary here is just all of the words used, with each word appearing only once, even words that are used multiple times like "like" — the repetition shows up in the bag-of-words frequencies instead. So we go through document 1: my, friends, like, to, fish, they, do, not — and we've already added "like," so we don't put it in the vocabulary again — then gardening, and then the words from document 2 that haven't been used in document 1 are also added. It's fine if your answer doesn't look exactly like this, because the order of the vocabulary could be different, but you should have the same number of 1s, 2s, and 0s for document 1 and document 2; just the order might be switched. Reading it off, this means that in document 1 we use the word "my" once, the word "like" twice, and the word "are" zero times. The same methodology applies to document 2. Question 2: calculate the cosine similarity between the vectors representing the documents to measure their similarity. We have our vector for document 1 and our vector for document 2. We first take the dot product of the vectors, and then we find the magnitudes of the vectors. I just want to show the equation for cosine similarity again: the dot product divided by the product of the magnitudes. Remember how we learned the dot product with CNNs — we multiply the components that correspond to each other, so we multiply 1 and 2, 2 and 3, 0 and 1, and add those together, which gives us 8. Then for the magnitudes, which we also put in the slides in a clear format, we square the individual components, sum them, and take the square root. We do that for document 1 and the same thing for document 2. Then to find the cosine similarity, we take the dot product and divide it by the product of the two magnitudes, which gets us around 0.956. That sounds about right, because if you remember, cosine similarity can only be between negative 1 and 1. Does anyone have any questions? Could I explain question 1 again? Sure. Remember that for bag of words, the vocabulary comes from the documents themselves, so we would never have a word in the vocabulary that isn't used in document 1 or document 2. That's just how the vocabulary is made, and then we record the frequency of each vocabulary word in each of our documents. Like I said, the 2 in the third position corresponds to "like," because "like" is the third word in our vocabulary, and we can see we're using it twice in document 1. It isn't repeated in the vocabulary, because once we've added the word "like" we don't add it again. Does that make sense? Okay. Does anyone else have any questions? If not, we can get started on today's slides. Sounds good, let me just get those ready.
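As a reference for the homework walkthrough above, here is a minimal sketch of the bag-of-words vectorization and the cosine-similarity calculation in Python. The question-2 vectors [1, 2, 0] and [2, 3, 1] are assumptions inferred from the numbers read out in class (dot product 8, similarity ≈ 0.956); the exact vectors on the homework slide may be ordered differently.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Build a shared vocabulary of unique words and count each word per document."""
    vocab = []
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:      # each unique word is added only once
                vocab.append(word)
    return vocab, [[Counter(doc.lower().split())[w] for w in vocab] for doc in docs]

def cosine_similarity(v1, v2):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(a * b for a, b in zip(v1, v2))
    mag1 = math.sqrt(sum(a * a for a in v1))
    mag2 = math.sqrt(sum(b * b for b in v2))
    return dot / (mag1 * mag2)

doc1 = "my friends like to fish they do not like gardening"
doc2 = "fish are food not friends"
vocab, (vec1, vec2) = bag_of_words([doc1, doc2])
print(vocab)  # ['my', 'friends', 'like', 'to', 'fish', 'they', 'do', 'not', 'gardening', 'are', 'food']
print(vec1)   # [1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0]  -> 'like' appears twice, 'are' zero times
print(vec2)   # [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1]

# Question 2 (assumed vectors): dot product 8, cosine similarity ~0.956
print(round(cosine_similarity([1, 2, 0], [2, 3, 1]), 3))  # 0.956
```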
Okay, so today, for our last workshop, we have reinforcement learning, which is really interesting; I'm really excited for you guys to learn this. We've seen this slide multiple times before. We've learned about supervised learning and unsupervised learning, and those two types of algorithms are pretty similar in how they work: there's an input, there's a black box where we don't fully understand what's going on in the system, and then we have our output. Reinforcement learning is different — we have to know what's going on in the system, and we're gonna talk a little more about this later. As an analogy for reinforcement learning, just to understand what's going on, think about how you would train a pet. You give them treats; those are their rewards every time they take an action you want them to take, such as sitting, coming, or not barking — or I guess that would be in the case of a dog, but you could do similar things for other pets. So what is reinforcement learning? Reinforcement learning mirrors the trial and error process that humans follow in order to learn and acquire knowledge. Here we have our reinforcement learning loop — you'll learn these terms later in the slides, so don't worry. Basically, the agent uses trial and error to learn a strategy that maximizes a reward in a selected environment. Here's the technical version, and we're gonna be using vocabulary you guys don't know yet but that we'll talk about later, so it's fine if you don't fully understand: there's an agent, and it observes an input, and from there it performs an action. It goes to the next state, which generates the next reward, and this process happens continuously, and the agent learns a policy that maximizes these rewards. So what are reinforcement learning's applications? Reinforcement learning can be used in autonomous driving; natural language processing, such as question answering and summarization; industry automation, so a lot of robotics; healthcare; and also stock market trading. We're gonna be following the snake game. I'm sure you guys have played this game before — if not, you can just find it on Google, it's really accessible. It's a very simple game where the snake increases its size each time it eats an apple, the points go up every time it eats an apple, and the snake dies if it hits a boundary or if it hits itself. We're just gonna be following the snake game in our examples today to help us understand what reinforcement learning is. So what makes up reinforcement learning? There are states, actions, and rewards. But first, before we talk about those, we have to talk about the agent. The agent is something that interacts with the environment by executing actions and getting rewards. In our snake game example, our agent would be the snake itself, and most of the time the agent is trying to solve a problem in a more or less efficient way. Now, what are states? States are the representation of the current situation of the task — the various situations the agent can be in within its environment. The environment is everything outside of the agent. So the environment in the snake game wouldn't change, but a state would: the state corresponds to where the agent is in comparison to, say, the walls, the apple, and also its own tail. Every time the agent moves, the state changes.
But the environment itself in the snake game would stay the same — we still have the same grid, the same size of the grid, and everything like that. In this case, the state would include where the snake currently is, where the apple currently is, and the possible positions for movement. So if there was a wall, the snake couldn't move there — or it could, but that's all part of the state. Then we have our actions. Actions are things that an agent can do in the environment, and they can follow rules: if the agent can only move up or down, those will be the only possible actions. In our snake game, our snake in this case can only move up, down, or forward, so there are only three possible actions for the snake. Here we just have those highlighted in green, and we'll have them highlighted in green in the other slides too, just as a reference for the possible actions the snake could take. And rewards: rewards are a value we obtain from the environment periodically, with the purpose of telling our agent how good a job it did. For these actions we've assigned a reward of zero — we're not getting an apple, so there's no positive reward, but we're also not dying, which is why we've set them at zero. We can change this value if we want. In this example, every action except the one that gets an apple has a reward of zero, and the apple itself is one. We can adjust the reward for every action, using equations that we will learn later today. So just as an overview: we have our state, which is where the agent is in comparison to its surroundings; we have our actions, so the snake could move up, down, left, or right; and then we have our reward, so say we take a step and eat an apple, we would get ten points, but if we take a step and die, we'd get negative ten points. Just a note on rewards: it's possible to have negative and positive rewards, and the reward depends on the current state and action. So like we said before, we can take a step and eat an apple and set that as ten, take a step and die and set that at negative ten, and then take a step where nothing happens — remember, we had this at zero before, but we could set it at negative one to speed up training, though that's optional. And last but not least, what is a policy? A policy is a function that outputs actions given the state of the environment, so agents and policies are closely tied together: an agent will do well if we have a good policy. Our goal is to have a really great policy so that our agent does well — so that, let's say, our snake does well in the snake game and eats tons of apples. So now, just for you guys to put in the chat: in the example of training a pet, determine what the possible rewards could be, and please select all that apply. I'm getting a lot of the same answer. Okay, I will say this is somewhat of a trick question if you weren't paying attention in the past slides. Does that change any of your answers? Okay, we have a different answer. You know, Naomi, looking at the question again, I think I'm actually gonna change my answer as well. Really? Yeah, I guess it's semantics and how people interpret it. Okay, so I'll just show the original answer for this question: these are all rewards.
Because in reinforcement learning, rewards can be negative, they can be equivalent to zero, and they can also be positive. "Do nothing" would just be the equivalent of the reward being zero; yelling at the animal could be a negative reward; and giving the animal a treat or petting the animal could be a positive reward. Now, Ms. Haripriya, you can share which one you think. That's correct — I thought we had only A, C, B selected, and I was gonna say B is also a reward, right? A reward of zero, or a negative reward, is also a reward. So yeah, this is correct. Okay, great. It looks like someone got it right before I changed the slide. Good job, guys. It was kind of a trick question, I will say, but I like harder questions because they make me learn. So, what's the policy to find an optimal path? The policy depends on states and actions, and the machine learns when the snake gets a reward for reaching the apple. So how are we going to find the best path for our snake, the agent, to eat an apple in the least number of steps? The idea is to learn a policy for the optimal path. In situation one, as we follow the snake, the snake moves down one, then forward one, then forward again, and then it eats the apple on its fourth move, or fourth action. In situation two, our snake moves forward, then forward again, then down, and then it eats the apple on its fourth action again. These would both be the most optimal path, which is great, because we want our snake to eat the apple in the least number of steps. But let's look at this situation now: our snake moves down one, then down again, then forward, forward again, then up, and it eats the apple. We're still getting that reward, but how do we tell the reinforcement learning agent, or the model, "hey, this is not exactly what we want"? We want it to be faster, even though it's still getting the apple. How would we let it know that this — primarily the steps between two and three — was a bad move? In our past example it would get the same reward of zero, or if we're doing the negative-one example, each move without the apple would be negative one. So yeah, this is not the most optimal path. To do this, we first have to learn how to represent states. States are every single orientation of the game board for the apple and the snake, and there are a lot of states, especially if we have a bigger game board. So how we will represent states today is: the relative position of the apple to the head, and is there a snake cell near the head? A would be the x-axis and y-axis position of the apple in comparison to the snake head. We can see we have to go one right and then two down, and that's why our first two values are one and two. Then, for "is there a snake cell near the head," we would put a one in left, top, right, or bottom. We can see there is a snake cell near the head, just to the left of the head, and that's why we have our one here. Now we have given you the states for one and four, which I can go over — just make sure to ask any questions if you have any, because then you guys will put in the chat what the state representations for two and three are. For state one, see, we are two to the left of the apple, so this would just be a negative two, and then we're one above, so we go down one.
And then for state four, the apple — well, we're eating the apple, so there's no apple around us, so we put zero, zero, and we have a snake cell above us, so the top gets a one and everything else is zero. So put in the chat what you guys think the state representations of two and three would be, and I'll give you a few minutes for this. I just got here — the Wi-Fi is really bad — so can you explain to me what that is? Yeah, sure. We've introduced a lot of new vocabulary today, so just make sure to watch the recording to fully understand everything. But here we are getting the state representation for the position of our snake, which I explained on the last slide. The position of our snake head in comparison to the apple gives the first two positions in our state: our snake is two to the left, so that'd be negative two — if it was two to the right, we'd put a positive two — and then we're going down one, so that's a positive one. And then we have no snake cell near the head, so the rest is zero. But in case four, where there is a snake cell near the head, it's on top of the head, so we put the one in position four, because that's just the position that corresponds to top — left would be position three. So you can put in the chat what you think the states would be for positions two and three. Oh yeah, there's a quick question about how the value should be positive or negative for the relative position of the apple to the head. Here we just have a little diagram of what our positive Y value and our positive X value should be, and you can just reference this — we'll always have it wherever we have our states. Here we have X going to the right; this would be positive. So in this case, with our snake head going to the left, that would be negative. And the reason why down is positive is because here our Y is going down, so that's positive — but if our Y had an up arrow instead, then we would put down as a negative value. This is just how it is in our case, and it isn't exactly how it would be for every given example, if that makes sense. And the reason it looks a little weird — in math class, with Cartesian coordinates, positive X always points to the right and positive Y points upwards — is because at the core of this problem we're dealing with a game, basically the game board, and in computer graphics terminology that's called a canvas. The way canvases always work is that the origin, zero comma zero, where X is zero and Y is zero, is always in the upper left-hand corner. That's why, in this case, since we're dealing with that game board, the computer canvas, positive X points to the right and positive Y points downwards. Looks like a lot of you guys were able to get this, so I'll move on to the answers. From what I saw in the chat, I think almost all of you were right. So yeah, two is negative one, one, and then zero for the rest, because there's no snake cell near the head — this is just because we're one to the left and then one above.
And then here we have zero, one, and then the rest are zeros, because again we don't have a snake cell near the head, and we're right above the apple, so we'd have to go down one. Okay. So now we're going to have some slightly more complex examples — just a little different from what we've done before, although one of them is a past example. I want you guys to determine the states: the relative position of the apple to the head, and is there a snake cell near the head? You can put these in the chat. Okay, I've been getting a lot of good answers, so I'll just go over these. Our first state is 1, 2, 1, 0, 0, 0. Notice it's the same as one of the ones from the slide where we introduced states. The reason it's 1, 2, 1, 0, 0, 0 is that we're going right one, so that's a positive one, and then from here we're going down two to reach our apple. And then we also have a snake cell to the left of the head, so that's a one in that position. And then for this case, it's 0, 0 because we don't have an apple relative to the position of the head — we just ate the apple, so we don't have an apple anymore. And then we have our snake cell to the left of the head and also below it, so that's why we have ones in those positions. Does that make sense for everyone? Let's see. So just going over it again — maybe this is the more confusing example, since we haven't had an example where we've eaten the apple before. Because there's no apple, that's why the first two positions are 0, 0. And then we need to see: is there a snake cell near the head? Remember, the snake cell is just anything that's blue in the four possible positions it could be in — left, top, right, and bottom. We can see two of these are blue, so this is our snake, and it's to the left and to the bottom of our head. That's why we have 1, 0, 0, 1 for the representation of "is there a snake cell near our head?" Is that making more sense? Yeah, I can go over the first one again. Here we're going right one, so that's why we have our one here on the X axis, and then from there, once we're in this block, we're going down two to reach our apple, so that's why we have a two here. And I think you guys understand the snake cell near the head now — we have one cell that's blue to the left of our snake. Yeah, I have some questions about the A and S parts of the state. If we're determining the relative position of the apple to the head, we've labeled that as A. So A concerns this part, the relative position of the apple to the head, and we have it as an X axis and a Y axis, which is why we have positive and negative values, since we could be going left to the apple or right to the apple. It's not the same for "is there a snake cell near the head?" That part is S, and it concerns left, top, right, and bottom: is there a snake cell to the left of the head, on top, to the right, or below? These are just where the snake body is in comparison to the head. Is that making more sense? Yeah, so compared to math class, the positive Y axis is flipped — this is what Ms. Haripriya was saying before. On a computer screen, the origin is centered on this corner, if everyone can see where my mouse is, so this would be positive X and this would be positive Y. We're just relating that to our game board. Okay.
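To make the state encoding described above concrete, here is a minimal sketch assuming (x, y) board coordinates with positive x pointing right and positive y pointing down, as in the class's canvas convention. The helper names (head, apple, snake_cells) are illustrative assumptions; the class only defines the six-value state itself: (apple x-offset, apple y-offset, snake-left, snake-top, snake-right, snake-bottom).

```python
def encode_state(head, apple, snake_cells):
    """Six-value state: relative apple position plus adjacent snake-body flags.

    head, apple: (x, y) cells; positive x is right, positive y is down (canvas convention).
    snake_cells: set of (x, y) cells occupied by the snake body.
    apple is None when the apple was just eaten, giving 0, 0 for the first two values.
    """
    if apple is None:
        ax, ay = 0, 0
    else:
        ax = apple[0] - head[0]   # steps right to the apple (negative = left)
        ay = apple[1] - head[1]   # steps down to the apple (negative = up)

    x, y = head
    left   = 1 if (x - 1, y) in snake_cells else 0
    top    = 1 if (x, y - 1) in snake_cells else 0
    right  = 1 if (x + 1, y) in snake_cells else 0
    bottom = 1 if (x, y + 1) in snake_cells else 0
    return (ax, ay, left, top, right, bottom)

# Class example: apple one to the right and two down, snake body just left of the head.
print(encode_state(head=(3, 3), apple=(4, 5), snake_cells={(2, 3)}))  # (1, 2, 1, 0, 0, 0)
```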
Seems like everyone's good now. Oh, I see — this might be a question about the game itself, so you guys can try the game after class if you want. If there's an apple near the edge of the game board — say this is a barrier — then wouldn't the snake eat the apple but then get hit by the barrier? It's possible in the snake game to be in one of these squares next to a wall — say an apple was there — without dying from the boundary; you only die if you hit the boundary head on. So what our snake could do is turn really fast here, or if we go down and there's an apple here, we could really quickly go there and then turn back, and if this is also a boundary, we'd have to turn here really quick. So that's how that works with the boundary — a fun game you guys can try after class. So now we also have a problem. If you were thinking, "well, how do we assign these rewards — do we just do it the same for every example?" the answer is really no, because we usually only know the reward of being in the goal state, not of any of the intermediate steps. We still need this reward, but we would also like to rate the individual actions to find a better policy. This just means: we know we don't want to go up, just from us knowing the game, because the apple's down here — but in these states, as we previously had them, all of these rewards would be zero, so the snake could easily choose to go up, even though it would take longer to get the apple. Therefore, we'd like to distribute the reward backwards. This is how we could say, hey, the reward for going forward or right is higher than the reward for going up. The answer to this that we'll be talking about today is Q-learning. Q-learning is a reinforcement learning model that learns an optimal policy that maximizes the expected reward over all the steps. And so there are Q-values: given a state and an action, what's the expected total reward for taking that action? So Q-values are our expected total reward for a given state and action, and Q-learning uses Q-values to improve the behavior and learning of the agent at each step. Here is our equation for the Q-value. Remember, the Q-value is the expected cumulative reward from taking a specific action in a specific state. So Q(s, a) is just our Q-value for the current iteration; R(s, a) is our reward for that certain action in that specific state; our gamma here is the discount factor, which helps make infinite sums finite — I'll talk about that right after explaining this last part; and max Q(s', a') is the best Q-value from our previous iteration. So, going over what our discount factor is: you would know this if you've played the snake game, but you can think about another game too — I'll just use the snake game as an example. You could play the snake game for a really, really long time if you're really good at it. How would a reinforcement learning model, if it's going backwards, prioritize the apples that are in gameplay now versus apples that will come into gameplay later? Our discount factor helps with that — it's like, oh, the snake's hungry now, we want to eat the apple now. That's what our discount factor helps with. And then our best Q-value of the previous iteration, that's our max Q(s', a'). And this is just an overview of Q-learning again; the update equation is written out below for reference.
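For reference, the Q-value update described above can be written out as a single equation. This is the simplified one-step form used in class (textbook Q-learning also includes a learning rate, which this class omits):

Q(s, a) = R(s, a) + \gamma \, \max_{a'} Q(s', a')

where s is the current state, a is the action taken, R(s, a) is the immediate reward, s' is the next state, and \gamma is the discount factor (0.5 throughout this class's examples).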
And then how this works is: we get a state, and then we execute the best action given the state, using the Q-table. Let's say it gives us right; then we receive a reward — our reward could be zero — and then we go to the next state, and then we update our Q-value, or Q-table, because, remember, our Q-values change for every state and action, and with every action we take, our state changes and the possible actions also change. And to talk a little more about Q-learning, Q-values, and Q-tables, Ms. Haripriya will be going over this with you guys. Okay, for some reason I was unable to unmute myself. Okay, so the whole idea is you want to try to determine a policy for how we can get to this apple the quickest. You start off with the Q-table, and the Q-table has the different states. Now, you already know how to interpret each of these states, right? These states are actually a depiction of each of these diagrams — they were kind of made to make it easier for you. For example, the negative two, one: two steps to the left, one step down, and there's no snake body that will interfere, so it's zero, zero, zero, zero. So each of these states is one of these diagrams. Whenever you start solving these Q-values and getting these expected intermediate rewards, you always essentially go backwards. You always start off at game over, where there are no more steps left. When there are no more steps left, what reward can you get? You can't get any reward — you're done with the game, it's game over. So we start off with game over: the snake has already eaten the apple, there's nothing else to do. Obviously there are different levels in the snake game, but we're gonna act like once the snake eats the apple, game over, you won, that's it. So the Q-values here are the expected reward for each of these actions once we are in this state, and this state is nothing but a depiction of this diagram: you can't go up, down, left, or right, so the Q-value here is essentially just zero. Now, what if you had one step left, meaning you move one step and you can eat the apple, or you can do something else? This is what that state looks like: it's one step above the apple, there's no snake body, and hence it's all zero — so zero, one, zero, zero, zero, zero. Now we have to use this equation. If the snake goes up, it's gonna hit the wall and die, and our reward for that is negative ten, so we're gonna say negative ten. Left and right, there's nothing going on there, so we're just gonna say our reward is zero if that's the only step we take. However, if we go down, the snake eats the apple, and your reward if you eat the apple is ten. Plus, this is gamma, the discount factor — we had already said our discount factor is 0.5 — times the maximum value of the next Q-values, which we calculated in the previous iteration. This is the confusing part, because technically this s prime is the next state, but the next state is what we calculated on the previous page. So here, what is the maximum Q-value? Well, they're all zeros — we said there were no more steps left — and hence our maximum Q-value is zero, since they were all zeros in the previous iteration. So this will be 10 plus 0.5 times 0, that is 10. So our Q-value for down is 10.
For up, it's negative 10, because you'll die, and left and right are 0. Why is there a 1 in this state, for this one? That's how far it is from the apple: 0 on the x side, but if you go one unit down, that's where you see the apple, and there are no other snake body parts it can run into, so it's all 0, 0, 0, 0. Does that make sense? About other game modes — I'm not entirely sure what you mean, do you mean different levels or something? Also another question: what is the 0.5 times 0, where does that come from? Yeah, so the 0.5 times 0 is coming from this equation. The reward is 10 for specifically going down — you're eating that apple, so you get a reward of 10. Plus your gamma, the discount factor, which has already been provided to you; that's something that whoever's writing the problem will provide. If you are yourself writing reinforcement learning code or whatever, then you yourself are gonna come up with some discount factor — it can be 0.1, 0.9, something between 0 and 1. In our case, we're just gonna go with 0.5, so that's where the gamma is coming from. The times 0 is because we take the maximum value of all the Q-values for the next step — and by next step, I mean the next step on the game board. If you go down, after that there is no next step, because we say game over, and we already did that in the previous iteration. Sorry, now my thing is annotating, let me go back here. Okay, there — oops. So here you can see that the Q-values are all zero, so the maximum of all those zeros is zero, and that's why this is zero. Does that make sense, the 0.5 times 0, where that's coming from? Okay. How do you know it's 10, or is it a guess or hypothetical? Yeah, so that's a good question. The 10 is coming from the fact that when it goes down, it's gonna eat that apple, and the reward we have set for the apple is plus 10. The reality about reinforcement learning is that you get to define your own states and you get to define your own rewards. It might not exactly match up with the game, but it should match up with the spirit of the game, meaning that if you're gonna run into a wall and die, obviously that should be a negative reward, and if you're gonna eat an apple and win, obviously that should be a positive reward — but exactly what number that should be is up to you. When you play the snake game and you go to these different squares, you're not getting any points or anything like that in the game. But maybe, to get to the apple faster, if you are developing a reinforcement learning model, you might say, hey, each time it takes one of these steps, we're gonna give it a small negative reward — but a negative reward nonetheless — so that it goes directly to the apple instead of taking a bunch of steps just to get there. So these are all your own definitions; we just defined it as 10 or negative 10 or whatever, and the same for the discount factor. These are all our own definitions, and they might not exactly correspond to the point values of the game. I don't understand how you got 0.5 — I understand the zero. Yeah, 0.5 is just the discount factor; we defined it over here. Go back — sorry, it's very slow right now. I think it's on one of the pages; if not, we can add it in. Oh, maybe it got deleted, let me see in my copy. Or I can just add it in here. Oh, Ms. Haripriya, are you looking for the place where we defined the discount factor?
Yeah, yeah, the 0.5 — does it say? I just added it in; I think it might have been deleted. Oh, here we go — yeah, we did it on a later page. Okay, so I've added it to the slides: just take the gamma to be 0.5 throughout this whole problem. So that's where it's coming from; we just defined the discount factor. Now, the question is, what's the importance of the discount factor? Naomi kind of touched on it: it helps you get to this apple faster. The idea of the discount factor comes from economics, where basically money that you have today is worth more than that same money tomorrow. Like $200 — if you have it today, that's worth more than it is tomorrow, because of inflation and a lot of different reasons. So the idea is you want to get as much reward, as many points, today, meaning as soon as possible. That's why you have this discount factor, which, again, is always a fractional value between 0 and 1, and basically you're saying, hey, tomorrow this thing is worth less to me, or in the next step this is going to be worth less to me. So in the competition, will we have a set value, or will we get to choose the value? And can you plug values into the equation, please? So I don't want to talk too much about what we're going to do in the competition, but I would say whatever we have will be standardized, because obviously all of you could come up with your own values and then you'd all have different answers to the questions. So we'll make sure it is standardized and everyone gets the same sort of answer — we would be giving you the discount factor. On the test, you wouldn't have to define your own, only because we all want to get to the same answer for the purposes of a competition. But when you are creating your own reinforcement learning model, you get to pick all of this stuff. If you look at the snake game, you might define a state differently; we just defined it like this because we thought it would be easier. Because think about it: you could define a state by looking at each of these grid squares and seeing if it has a snake, an apple, or nothing in it. Then you would essentially have this huge table of numbers — if there are like 64 squares in your grid, you're going to have a massive table of 64 numbers, which say 1 if a square has a snake, maybe 2 if it has an apple, and 0 if there's nothing. That's going to be massive. Instead, we just created a state that has six values: what's the location of the head in relation to the apple, and is there a snake body next to the snake head? So that state is our own definition, and our rewards are our own definition. It has to relate to the game, but at the end of the day, the way the game gives you points might not be the exact same way you define the rewards. The discount factor, again, as long as it's between 0 and 1, you should be good — but that's also something you would define if you were creating this reinforcement learning model. In this case, we have defined it for you. Do you get to pick the discount factor? Yeah, as long as you're consistent, you can pick the discount factor — though obviously, for something like a test, we'll give it to you. Here, we are giving it to you that gamma is 0.5. Any other questions so far? The discount factor affects the Q-values. It affects the Q-values, which are basically an expected reward.
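As a small worked illustration of the discounting idea above, using the values assumed in this class (\gamma = 0.5, apple reward 10): a reward received k steps in the future contributes \gamma^k times its value to today's Q-value, so

0.5^0 \cdot 10 = 10, \qquad 0.5^1 \cdot 10 = 5, \qquad 0.5^2 \cdot 10 = 2.5,

which is exactly the 10, 5, 2.5 pattern that shows up along the optimal path in the walkthrough that follows.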
How do you do this equation — how do you plug in? So yeah, we can go back over this equation. This equation is nothing but the reward, plus your discount factor — which for this problem we've said is 0.5 — times the maximum of the Q-values of the next step. Now, the next step has actually been calculated in the previous iteration, because we're going backwards: we start with game over and then we go backwards from there. And we saw on the previous slide — I don't know why this keeps getting stuck today — that the Q-values were all 0, right? Because it's game over: you can't go anywhere, you can't do anything, it's 0. So the maximum of those Q-values is 0, and hence we have 0 over here. And the gamma — we're giving it to you that the discount factor is 0.5. And so that's how we're getting the value for down. Now, we calculated the same thing for up, left, and right, but we didn't really go through the whole equation. You could have, but the reward here is that you're dead, so that's negative 10, and for left and right, nothing much is going on. You could still have done negative 10 plus 0.5 times 0 here; for here, 0 plus 0.5 times 0; and here, 0 plus 0.5 times 0 — but that was all overkill for us at this point. So does the computer choose the greatest-reward action? Yeah, that's the idea. The AI bot is always trying to get toward the maximum reward, and you'll see it at the end: the idea is we're trying to define a policy that says, hey, if it's in a given state, this is the action it should take, because that's how it's going to get the highest rewards — that's how it's going to get to the apple the fastest. Are these the answers to the equation? So yes, the up, down, left, and right values are all individual answers to this equation. Here I've just shown you how to calculate a Q-value for down, but there's a Q up, Q down, Q left, and Q right. Let's go to the next one, because it'll make even more sense. Now, what about when you're two steps before the snake has eaten the apple? Again, we're going to take the discount factor as 0.5. If the snake goes up, you know the reward is going to be negative 10 — you're dead, that's it. However, for this example I am not doing the calculations for down, just because it'll get overly complicated and we haven't calculated all the tables properly, so I'm only sticking to the ones that are super easy and accurate. So Q up is negative 10. Now, for Q left: there's no apple here, so our reward to go there is 0, plus the gamma, 0.5, times the Q-value of the next step, from the previous iteration. So we go back: what were the different Q-values that we saw? Well, we saw negative 10, 10, 0, and 0, right? So what we do is take the maximum of negative 10, 10, 0, 0 — these were the four values we saw in the previous slide of the Q iteration. What's the maximum of those four numbers? It's obviously 10. So then you say 0 plus 0.5 times 10, that's 5, and so when you go to the left, your Q-value is 5. When you go to the right, again, we'll assume that our Q-value is 0, because there just aren't enough steps to get to the apple, so it's 0. Okay. Then you do this one more time, and you keep on doing this, essentially. So now you're at this state. In the last one we saw, negative 10, 5, and 0 were our different Q-values.
So if you go left here, the maximum of those four values — negative 10, 0, 5, 0 — is 5. So it's 5 times our discount factor, 0.5, and there is no reward, because there's no apple here if you go left. So it'll be 0 plus 0.5 times 5, so our value for left is 2.5. For up, again, you're dead, so that's negative 10. For right, if you go, you're dead, so that's negative 10. What are s prime and a? S prime is the next state — the Q-value always corresponds to a state and an action. You can see that in this table: for each state and action pair, you have a different Q-value. S prime just means the next state, which means we're just looking at the Q-values from the previous iteration and taking the maximum of those. Finally — and this is a made-up table — the idea is you do this for many, many iterations, and you have this table. What you're going to see is that for each state and each action, for each state-action pair, you have a Q-value, and you look at the maximum for each of them. So for this state, for example, the one that looks like this, your left might be 2.5. For this state, the negative 1, 1, 0, 0, 0, 0, your left might be 5. For the 0, 1, 0, 0, 0, 0 state, it would be 10 for down — that's the maximum. And for this one, basically all of these are the maximum; there's nothing left to do. So what you see is the maximum for each state, and that's the action that you are going to take. This is basically telling you the best policy: if you are in this state, it's great to take a left; if you're in this state, it's great to take a left; if you're in this state, it's great to take a down; and if you're in this state, obviously game over, you're done. That's what Q-learning is in a nutshell. Obviously this is a simplified version of all the calculations a computer would do, and there are many different ways of doing it, but at a high level, what we're doing is playing the game backwards and trying to estimate the reward we can get as we move throughout the game board. Does that make sense on a high level? Because we won't have homework on this — next week we won't have time to go over homework, we're going to have the competition. So does this make sense? Because if it does, then I want to quickly go on to the Gini index, since a lot of you wanted review on it. Any last questions on this? Okay. Very quickly, the Gini index — and the competition is online, same time next week, your regular class time, 10:30 CST to 11:30 CST. How did I get 2.5? Yeah, these are all made-up numbers in this table; the idea is I just want you to see what the maximum is, because that's the action you take from that given state. These are all made up. Yeah, the competition is online. If you're watching this recording at a later time and you're not someone who comes to class regularly, unfortunately, for the competition you do have to come to class — there is no alternative test date or anything like that. So please do come, and please be on time; we're going to get started very quickly, and then in the last few minutes we'll be reviewing the answers quickly and going over prizes and things like that. The test will be taken on a Google Form, so hopefully it will automatically grade your answers and we should get winners pretty soon — but if not, we'll see, and we'll try to get them to you very soon after the competition. Okay, very quickly, I have two more minutes.
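Here is a minimal sketch, in Python, of the backward Q-value calculation just walked through, using the values defined in class (reward +10 for eating the apple, -10 for dying, 0 otherwise, gamma = 0.5). The dictionaries and the set of actions kept at each step are simplifying assumptions that only cover the three states along the path discussed above, not a full Q-table.

```python
GAMMA = 0.5  # discount factor defined in class

def q_value(reward, next_qs):
    """One-step update: Q(s, a) = R(s, a) + gamma * max over next Q-values."""
    return reward + GAMMA * max(next_qs)

# Iteration 0: game over (apple eaten) -- no further reward is possible.
q_game_over = {"up": 0, "down": 0, "left": 0, "right": 0}

# One step before the apple (state 0, 1, 0, 0, 0, 0):
# up hits the wall (-10), down eats the apple (+10), left/right do nothing (0).
q_one_step = {
    "up":    q_value(-10, q_game_over.values()),    # -10 + 0.5 * 0 = -10
    "down":  q_value(+10, q_game_over.values()),    #  10 + 0.5 * 0 =  10
    "left":  q_value(0,   q_game_over.values()),
    "right": q_value(0,   q_game_over.values()),
}

# Two steps before: going left leads to the previous state, so its Q-value
# backs up the maximum of the previous iteration (10).
q_two_steps = {
    "up":    -10,                                   # hits the wall
    "left":  q_value(0, q_one_step.values()),       # 0 + 0.5 * 10 = 5
    "right": 0,                                     # not enough steps to reach the apple
}

# Three steps before: left backs up the discounted value once more.
q_three_steps = {
    "up":    -10,
    "left":  q_value(0, q_two_steps.values()),      # 0 + 0.5 * 5 = 2.5
    "right": -10,
}

# The policy picks, for each state, the action with the highest Q-value.
for name, table in [("one step", q_one_step), ("two steps", q_two_steps), ("three steps", q_three_steps)]:
    best = max(table, key=table.get)
    print(name, "->", best, table[best])   # one step -> down 10.0, then left 5.0, left 2.5
```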
I remember talking about the competition at the beginning — did I say 45 minutes or something? Naomi, if you can check on that; it was in the first Google Slides. But meanwhile, let me just quickly go over the Gini index. For the Gini index, the big thing is that before you divide on any node, you look at all the different rows, and then you divide them based on the target column — what you're trying to predict, which is loan approval. So you see how many rows you have approved and denied, and then you always need the total, and then you plug into this equation. This equation is nothing but 1 minus the quantity (approved over total) squared plus (denied over total) squared, and that gives you the Gini index. The ideal Gini impurity is 0; that means everything has the same classification, and you're all good. For two classes, the range is always 0 to 0.5. So far, so good? Any questions? Does that make sense? We will have proctors, I believe — MathKinguru is going to provide them, but I will talk with the CEO once before. The Google Form will not remind you, but we will verbally remind you guys for sure. Does that make sense for the Gini index? And yes, you can use notes on the test. Remember, I am a strong, firm believer in creating a cheat sheet. You are not allowed to use the computer, ChatGPT, the browser, or the PowerPoint. So anything you want from the PowerPoint, write it down, print it out, or do whatever — you can use a calculator, but you're not allowed to use the computer at the time of the test. You're also not allowed to use parents or anyone else, any other human beings, or anything like that. Okay, so that is pretty much the end of our time here. If anyone has any last-minute questions — otherwise, we'll see you next week. The slides, again: print them or take notes. Actually, I wouldn't even print all the slides; that's not environmentally friendly, and it's not going to help you. Go through the slides, review them, make a cheat sheet — that's much better. But you won't be able to use the slides online, just because you're not allowed to use any internet other than, obviously, the Google Form to answer. Okay, good luck — we'll see you at the competition next week. Yeah, you can't go on multiple devices. Thanks, Naomi. Yeah, a 45-minute competition at the beginning, so we'll get started; make sure to be on time and everything. Thank you. Bye.
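For the Gini index review at the end of class, here is a minimal sketch of the two-class calculation described (the "approved" and "denied" names follow the loan-approval example mentioned in the transcript):

```python
def gini_index(approved, denied):
    """Gini impurity for a two-class node: 1 - ((approved/total)^2 + (denied/total)^2)."""
    total = approved + denied
    return 1 - ((approved / total) ** 2 + (denied / total) ** 2)

print(gini_index(8, 0))  # 0.0 -> pure node: everything has the same classification
print(gini_index(5, 5))  # 0.5 -> maximum impurity for two classes
```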
Video Summary
The session began with a review of homework on bag-of-words and cosine similarity, covering how to vectorize documents for frequency analysis. Bag-of-words involves creating a vocabulary of unique words and counting their occurrences in each document. Cosine similarity was then explained through a calculation between document vectors using dot products and magnitudes.

The main focus was on reinforcement learning (RL), introduced as learning through trial and error to maximize reward. RL concepts such as agents, states, actions, and rewards were explained using a snake game analogy. An agent (the snake) interacts with the environment to eat apples while avoiding walls. States represent the game environment and agent position, actions define the possible agent movements, and rewards indicate the gain from eating apples versus hitting walls.

Q-learning was introduced as a way of updating Q-values (expected rewards) to find optimal policies. By working backwards from a goal state, Q-values help determine the actions that maximize cumulative reward. State-action pairs map to Q-values, with higher values signifying preferred actions.

Finally, the Gini index for measuring decision tree impurity was briefly recapped. Students were reminded of the upcoming online competition, which must be completed without external help, and encouraged to prepare notes in advance.
Keywords
bag-of-words
cosine similarity
vectorizing documents
reinforcement learning
Q-learning
agents and actions
snake game analogy
Gini index
decision tree impurity