AI 101 - Class Recordings
Recording Class 2
Video Transcription
An explanation of impurity. I did see a lot of the homework answers, so thank you to everyone who submitted. If you haven't already, you can always submit it next week or in the following weeks, but make sure you review the homework; otherwise the competition at the end is going to be very difficult. I haven't had quite everyone submit, so please do make sure you submit moving forward. A lot of you did a great job, and we got a lot of hundred percents. There were some issues, especially with questions two and three, and we're going to talk a lot about that today. But first I want to explain impurity in a slightly different way. Hopefully most of you got it, but in case you were confused last week, that's why we're reviewing it.

I'm giving an imaginary example of whether an individual qualifies for a scholarship, and we're going off GPA and whether they participate in extracurriculars. You see this table; we saw similar ones last time. You have the GPA, whether the person participates, and then whether they qualify for the scholarship. The N stands for no and the Y stands for yes. Now, before we think about decision trees, we first need to see the distribution of who qualifies for the scholarship. Does every single person qualify? No: you see some people qualifying and some not. If we just look at that third column, qualifies for scholarship, we see there are two Y's and three N's. They're not all the same classification, and that's why we need a decision tree to make those classifications for us: what sorts of features do people need to qualify for a scholarship? The fact that they're not all yeses or all no's shows impurity. Have you heard about gold having impurity? You won't always get 24-carat gold; you'll get gold that has some silver or copper or nickel or other metals. Impurity here is the same idea: the rows don't all have the same classification; there's a mixture of classifications.

Here's the chart I made. If they're all yeses, the impurity is zero, and if they're all no's, the impurity is again zero. GI stands for Gini index. Not genie like Aladdin and the genie; I saw some of you write that in your assignment, and that was hilarious, but it's Gini, G-I-N-I. The third example, with a combination of yeses and no's, shows there is some impurity, and the impurity is hence greater than zero. Does that make sense so far? Are we all on the same page? Nods, thumbs up, whatever you'd like. Okay, cool.

So the next question: we have to make this decision tree, but how are we making it? There are two ways we can go about it: either we split based on GPA (oh, I like the check mark; you can use a check mark if you're understanding, I saw Amanda do that, that's awesome) or based on participates in extracurriculars. And I think the question you had for me at the end last time was: what's the next step?
How do we know which feature to split on? That's going to be the topic of today's lecture. You have these two options, so let's first pick one: GPA greater than four. I recreated the table here. It's the same table from the previous page, except I dropped the participates-in-extracurriculars column, just because I don't want that clutter right now; it's a little confusing. I kept GPA and qualifies for scholarship, because GPA is what we're going to split on, and that's all that matters for now.

If we split on GPA greater than four, our decision tree has two branches: either the person's GPA is not greater than four (hence less than four), or it is greater than four. So we have these two options, no or yes. What I did is make the same table twice and filter it down by whether the person's GPA was greater than four or not. If you look at all these GPAs, the only GPA less than four on this chart is 2. Then I filtered the list again to retain only the GPAs greater than four, and you have 4.2, 4.6, 4.3, and 5. So far so good? Thumbs up? Okay, awesome.

Now we look at these two charts and start filling things in. If the GPA is not greater than four, are people qualifying for a scholarship? There's only one person in our dataset. I see no's, awesome, so I put my N over here. What is the impurity of this branch? Yeah, I see some interpretive dancing: zero, right. It's not impure, because there's only one row and it's all one class. Now we look at the other branch, where GPA is greater than four, and we see two people who are qualifying for a scholarship and two who aren't. So do we have impurity here? Is it greater than zero? Yes. In fact, I'm seeing one over two in the chat, and that's true: the Gini index of this branch is 0.5, which is awesome.

So that's how this impurity stuff works, and we can do the exact same thing for whether a person participates in extracurriculars. I take just the extracurriculars column and the qualifies-for-scholarship column, and I filter down to the rows where the person doesn't participate and the rows where they do. For doesn't participate in extracurriculars, we see that both people don't qualify for a scholarship. So what's our impurity here? Zero, right, because both of them are the same class; it's not impure. But when you get to the yes branch, we have one no and two yeses for qualifying, and hence our Gini impurity is greater than zero: there is some impurity.

Okay, so we have created these two charts, one for GPA greater than four and one for whether the person participates in extracurriculars. The question still remains: which feature do you split on? That's not something we covered last week, other than a formula we briefly went over, but we're going to talk more about it today. What is the total impurity?
We have been calculating impurities for specific branches, but we still don't know the total impurity for the feature, and that's going to be the topic of today's discussion. On a high level, though, which feature do you split on? If I gave you the impurity of one feature and the impurity of the other, say c and d, which feature do you split on? Yeah, I've seen people message me the right answer: it's the feature with the lower Gini index. That is correct. I've seen a whole distribution of answers, but it's the one with the lower Gini impurity. You always want to decrease your impurity, because at the end of the day, when you create the decision tree, you want all the yeses in one section and all the no's in another. Does that make sense? Are we all on the same page? Awesome.

Okay, so now let's go over the homework; hopefully this is making a little more sense. Let me open up the homework on my end and create a new page. The first question: you are given a dataset containing information about whether customers of an e-commerce platform purchase a product, based on their age and browsing history. The data is classified into two categories, will buy and will not buy, and the first question is to calculate the Gini index before we split. So which column do I have to look at for this question? We have age, browsing history, and purchase decision. Yeah: the column with will buy and will not buy. Exactly, purchase decision, and I'm seeing that in the chat as well, which is fantastic. This one is fairly easy; you just look at the last column, and I think most of you got it correct or close to correct.

So first, how many will-buys do we have in this column? Yeah, four, good. How many will-not-buys? Also four. And what's the total? Yeah, the total number of rows here is eight. Perfect. Now, to calculate the Gini index, our equation is one minus the sum of the squared proportions in our table. What's the proportion of will buy? Four over eight, exactly: four will-buys out of eight total. Now I do the same thing for will not buy; remember from last week that the big fancy sigma symbol is nothing but an instruction to add everything up. The proportion of will not buy is also four over eight. So the Gini index is 1 - ((4/8)^2 + (4/8)^2). If you calculate it all out (you might be faster, but I want to walk you through the steps), that equals 1 - 1/2, which is a half. Makes sense? Does this first question make sense to everyone? Ask now or forever hold your peace.
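For anyone who wants to check their numbers in code, here is a minimal sketch of that calculation in Python; the function name and the label strings are illustrative, not from the course materials:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

# The purchase-decision column: four "will buy" and four "will not buy" rows.
decisions = ["will buy"] * 4 + ["will not buy"] * 4
print(gini_index(decisions))  # 0.5, matching 1 - ((4/8)^2 + (4/8)^2)
```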
Awesome, cool. Looks like we all understand it. Again, if you don't, there's a questions thread in Google Classroom; I know some of you used it this week, which I was really happy to see. If you have questions at any point during the week, feel free to ask. I'm not going to give you the answer outright, but I'll try to lead you to it as much as possible.

Oh, I see a no. Can you message me in the chat about which part you didn't understand? And while you message me... oh, you subtracted one minus one and got zero. So this part actually gives us a half, not a one: you square the half first, so (1/2)^2 = 1/4, then multiply by two to get 1/2, and the answer is 1 - 1/2 = 1/2, not 1 - 1 = 0. Does that make sense? Awesome, cool, we all got it. Perfect. And don't be shy, just ask questions. I used to ask a lot of questions when I was in college, more than other people, and that's fine; we all have different learning styles and different learning speeds, and that's totally cool.

Okay, question two. Here is a tricky question, and I'm glad I asked it, because it was initially going to be a competition question, and I think we would have failed miserably at it; most of us got it wrong. If you got it right, that was good. When you multiply two times a half you get one, but according to the rules of PEMDAS (do you guys use PEMDAS? Right, yeah, okay, because I learned it a different way) the exponent comes before multiplication: you don't multiply two and one-half first, you square the half first and then multiply by two. By the way, this is not a math test where you don't get to use a calculator; you do get to use one. So if you're unclear about some of the steps, just use a calculator. It's totally fine; that's allowed for the competition.

Okay, cool, next question: the range of Gini impurity for three classes. A lot of you gave me the answer for two classes. What was our Gini impurity range for two classes? Zero to 0.5, exactly. But I actually asked for three classes. So let me first show an example with two classes, because what I want you to understand is how we are calculating this maximum; I think that's the part that isn't clear. For the minimum, we know it's always going to be zero, because that means every row has the same classification. We talked about that earlier: if everything is the same class, it's not impure, so the impurity is zero, but if there are different classifications, there's some level of impurity.

Okay, so, two classes. Let me quickly run through this example (this is not the homework question yet). Say we have a total of four rows in our dataset and only two classes, yes or no: yes, the loan will get approved, or no, the loan will not get approved. So what are the different combinations?
You could have four yeses and zero no's. You could have three yeses and one no, two yeses and two no's, one yes and three no's, or zero yeses and four no's. So far with me? I'm just writing out the full distribution: with four rows, these are the different combinations of classifications you can have. Are we good so far? Perfect.

Okay, so now we calculate the Gini index for each of them, and I'm just going to quickly write it out; I don't want to spend too much class time on this. Just like we did on the last page, we calculate the Gini indices, and you can follow along by watching how I'm getting the proportions. So far so good; this is just the calculated Gini index for each of these combinations. Awesome, I'm hearing people say, oh, this makes so much more sense. Yeah, that's why I wanted to teach it a different way. Usually in college the professors just leave the proof up to you, but I would feel really bad if I did that, so I'm going to go through it with you.

So what is the range we're seeing for two classes? Yeah, Rohan: zero to 0.5. And we're seeing that there's a symmetry here: we go from zero to 0.375 to 0.5, and then back down. So far so good? We get how the Gini index for two classes is being calculated. (Oh, and no, that column is not the age; it's the Gini index, the impurity we're calculating for the different combinations of four rows.) I'm also seeing people derive equations for the Gini impurity for n classes, which is great. I don't want to cover that in class because it might appear in the competition, so I'll leave that proof to you.
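A small sketch of that enumeration, assuming the same four-row, two-class loan example; the counts-based helper is my own framing of the formula from class:

```python
def gini_from_counts(counts):
    """Gini impurity from raw class counts: 1 minus the sum of squared proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Every yes/no combination of four rows: 4Y/0N, 3Y/1N, 2Y/2N, 1Y/3N, 0Y/4N.
for yeses in range(4, -1, -1):
    nos = 4 - yeses
    print(f"{yeses}Y/{nos}N -> Gini = {gini_from_counts([yeses, nos]):.3f}")
# Prints 0.000, 0.375, 0.500, 0.375, 0.000: symmetric, peaking at the even split.
```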
But yeah, Dia, did you have a question? "Last time you said we could just count how many different types there were and then calculate it. So wouldn't it be eight?" What would be eight, the Gini index? "No, the range for the three classes." Can you talk me through how you're getting eight? "Because there are three different types of decisions, and then there are eight different decisions, eight different options. So wouldn't the range for the Gini impurity be eight?" So the range, at the very loosest, is always zero to one, but depending on the number of classes it gets narrower. Think about the formula for the Gini index: it's one minus the summation of all the squared proportions, so one has to be the maximum; it can only go from zero to one. "I thought you said the Gini impurity range was zero to 0.5." Yeah, it is; 0.5 is the stricter bound when we're talking about only two classes. My point is that with more classes it can be more than 0.5, but it really can't be more than one. "Oh, okay. Okay, cool. Thanks."

So now, three classes. "Does that mean more is less?" "If you have a smaller number, like zero, that means it's not impure, and something closer to one would be more impure, I guess." I'm not sure about the more-is-less part, but yes, there is something kind of confusing about this, which I agree with. So let's talk about three classes, but first, for two classes, where do we get our maximum? What do we notice about where this maximum lands? What is true about the distribution here? I see 0.5, but let's make it more generic, so we can apply the same idea to multiple classes. Yeah, the proportions are equal. Oh, sorry, Dhanush, go ahead. "I was going to say equal distribution for every class, and that the maximum Gini index is one over the number of classes for that reason." I'm not sure about the second part, but the first part is correct: we get our maximum Gini when the rows are evenly split among the classes.

So for three classes (instead of yes or no, say we have class A, class B, and class C), what would an even split look like? And I'm talking in terms of proportions, fractions, not exact numbers. Here it was half and half; what would it be for three classes? Yeah: one-third, one-third, one-third. Now let's plug and chug into our Gini equation; let me copy it over here so I'm not flipping pages every five seconds. We get 1 - ((1/3)^2 + (1/3)^2 + (1/3)^2), all our different proportions summed, which is nothing but 1 - 3*(1/3)^2. Because of PEMDAS, we apply the exponent first. What is (1/3) squared? In the chat, or interpretive dancing works. Yeah, 1/9. One minus three times 1/9 equals what? What's the final answer? Two-thirds, exactly. It's not 1/3, it's 2/3. Does that make sense? (Oh, and to the less-is-more comment someone made: yeah, in a way, because of the one-minus in the formula.) So it's two-thirds.
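For reference, the even-split pattern generalizes; a short derivation consistent with the numbers above (k = 2 giving 0.5 and k = 3 giving 2/3):

```latex
% With k classes split evenly, every proportion is 1/k, so
G_{\max} = 1 - \sum_{i=1}^{k} \left(\frac{1}{k}\right)^{2}
         = 1 - k \cdot \frac{1}{k^{2}}
         = 1 - \frac{1}{k}
% k = 2 gives 1/2, and k = 3 gives 2/3, so the three-class range is [0, 2/3].
```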
Perfect, we got question number two. Now onto question number three (so many tabs open). As I said, we haven't really talked about the splitting part yet, which we'll cover today, but we have two features, feature A and feature B, and let's say splitting on feature A gives us one impurity and feature B gives us another. My first question: some of you said this is a trick question, because 0.6 is not a possible Gini index. Is that true, or can 0.6 be a possible Gini index? I see some nos, some yeses. For a second, when I saw some of your answers, I thought, oh my goodness, did I mess up my question? But no: if you have three or more classes, 0.6 is fine, and I never said how many classes we have. For three classes, for example, the maximum is 0.66, so 0.6 is valid. However, because of your answers, I did get some ideas for potential competition questions with trick numbers, so thank you for that.

Okay, so we have 0.3 and 0.6. Which feature do we split on, and why? (Aro, by the way, your question is correct. I'm not going to announce it, though, because it might be a possible test question.) Yeah: feature A, because it results in the lower Gini impurity. We always want to go toward a lower Gini impurity. Remember, if all the rows in a node are the same class, the Gini impurity is zero, and that's what I'm striving for: I want them all to have the same classification. So you always pick the option that gets you as close to zero as possible. Why lower and not higher? Because if a node ends up with a mix of Y's and N's, that's not really the awesomest thing ever: when you create your decision tree, with feature A and feature B, you want everything that ends up in a leaf to have the same classification. Otherwise your decision tree is useless and can't really predict very well. Does that make sense? Think about gold again: you always aim for 24K, or as close to 24K as possible, because it's worth more with fewer impurities. You always want the one with fewer impurities. I would say sleep on it a little bit, think about the examples; it's a bit of mental acrobatics, but it will make sense that you should always go for the lower one, because you don't want too much of a mixture of no's and yeses. You want as much as possible to be all no's or all yeses; that's what we're trying to do with the decision tree. It doesn't mean that on the first level we get a clean split where these are all no's and these are all yeses; that's why our decision tree has multiple levels. But as much as possible, we want to reduce that impurity, and if feature A does that better, we're going to start with feature A.

(Rohan, the answer to the second question is the range zero to two-thirds; it's not just zero, it's a range. "Is the answer as simple as which decimal numbers?" I'm not going to give away any competition questions.) I'm just seeing the comments: you want the decision tree to sort the data into exactly the classes. Yeah, let me read Daria's answer out loud, in case it makes more sense to people. Daria wrote: you want the decision tree to be able to sort the data into exactly the classes; you don't want some combination of decisions to classify one thing as yes and have the same decisions classify something else as no. Does that make sense? I love that explanation.

Okay, cool, awesome. So let's get started with some new topics. What we're going to do is continue our example from last week, with the same equations and table, but the question I'm asking this week is a little bit different. We already calculated the Gini impurity before splitting at the root node: from last week, our Gini impurity was 0.469 before we split. That's what we had calculated just by looking at the loan-approval column. (What's the answer? Yeah, it's 0.66, not 0.6, and in fact two-thirds is the best answer because it's exact.) So, 0.469 before we split; it's just last week's answer. Now my question to you is: if the first feature we split on is annual income, what's our Gini index for the split, our Gini split?
Let me backtrack for just a second. Remember the example I started class with: we had to pick which feature we want to split on, and obviously we split on the one that gives us the lower Gini index. That was the homework question. And that's where we're at: even though I know the Gini indices for each of these branches, I don't know the complete Gini split, the Gini index that corresponds to the feature as a whole. That's exactly what we're trying to calculate for this problem, except now for annual income.

So first: in annual income, how many categories do we have? Three, awesome: high, medium, and low. We're going to break it down into high, medium, and low, and we can ignore the credit score column for now (yeah, it's connected, so the tool won't like it, but it's okay). For each category, we want to count how many approved, how many denied, and the total. If you want to start writing in the chat, write those three numbers for each category.

Okay: high is two approved and one denied, for a total of three. For medium (now we look at those rows), how many approved and denied in total? One approved and two denied, for a total of three. For low? Yeah, low is only denied, so we'll say zero approved and two denied, for a total of two. So far so good, thumbs up? Yeah, this is just counting the table rows. Perfect. Now I'm going to erase this, just because I need a little space to operate.

Here's what's going to be helpful: the formula for the Gini split. The Gini split is the sum, over the branches, of the weight of the node times the Gini index of the node. But first, before we calculate the Gini split, we calculate the Gini index, whose formula I have already written a bazillion times. We need the Gini indices for each branch, so let's start with high. How do I calculate the Gini index? Don't give me the final answer; give it to me in algebraic notation. It will be one minus what? What are the proportions for approved and denied, which we then square? Yeah, for approved it's (2/3)^2, because that's the formula, and for denied it's (1/3)^2. I'm not going to solve it out right now for the sake of time, but that gives us about 0.444, rounding to three decimal places.

So that's our Gini index for high. For medium, we do the exact same thing: one minus, and what's our proportion for approved? Yeah, (1/3)^2, plus (2/3)^2 for denied. So again, our Gini index is 0.444. Then I do the same thing with low: (0/2)^2 for approved plus (2/2)^2 for denied, and the Gini index here is 0. Why is it zero? Yeah, RL: because all of the rows are denied. Exactly, it's all one class, so there is zero impurity.
"Do we have to use decimals or fractions?" Do whatever the problem states; for now, use whatever you wish, I'm not a stickler. In the competition, we'll write down what you should use, just so the checking happens correctly. Okay, cool, so far so good: we've calculated the individual Gini indices. Now we calculate the Gini split, so I'm going to erase the Gini index formula to make space. The Gini split is the weight of each node times the Gini index of that node, summed up. The way we calculate a weight is to take the branch's total and divide by the complete total. How many rows are there in total in our chart? Eight, right, so our complete total is eight. What's the weight for high? Three over eight: the branch total over the complete total. What about medium? Also three over eight. And low? Two over eight, right.

Cool. So now what I want to do is multiply the weight and the Gini index for each branch and sum it up: 3/8 times 0.444, plus 3/8 times 0.444, plus 2/8 times 0. That gives me 0.333. So that is our Gini split; we finally have the Gini split for annual income. Does that make sense? Okay, cool.

My next question: does the split based on annual income reduce the impurity of the dataset? Well, before we split on anything, our Gini was 0.469; now it's 0.333. Does it reduce it? Yes, right. So this is not a bad feature to split on. However, in real life, you would do this whole process for annual income, then do the same process for credit score, see which one reduces the impurity more, and that is the feature you split on. Does that make sense? Now do we understand how decision trees work? It's a little more complex than we thought. We have been drawing decision trees since our youth, but it becomes much harder when you're working with hundreds and thousands of rows; this is the kind of calculation the computer is doing.
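Putting the whole worked example into one compact Python sketch; the (approved, denied) counts are the ones tallied in class, while the function names are mine:

```python
def gini_from_counts(counts):
    """Gini impurity of one node from its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(branches):
    """Weighted Gini: sum of (branch size / all rows) * branch Gini."""
    all_rows = sum(sum(counts) for counts in branches)
    return sum(sum(counts) / all_rows * gini_from_counts(counts)
               for counts in branches)

# (approved, denied) per annual-income branch: high, medium, low.
income_branches = [(2, 1), (1, 2), (0, 2)]
print(round(gini_split(income_branches), 3))  # 0.333, down from 0.469 pre-split
```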
Okay, cool. So if we're all set with that, let me stop screen sharing, and let's go on to our next topic. This is going to be really exciting. So, what is machine learning? That is the topic of today. Machine learning is the acquisition of knowledge or skills through experience, study, or by being taught. It's the kind of learning a machine does, and for a machine to learn, it works just like humans: whenever you're learning something, like a new word, you have to hear that word ten or twenty times, and then you know how to apply it. It's the same way for computers. You feed in a lot of data, hundreds or thousands of examples, and the machine finally starts to understand the underlying patterns and can start using that knowledge to predict or do other things. Machine learning is all about learning things as we go, and you will see the difference in the coming weeks.

Having said that, people use AI and machine learning interchangeably. People who aren't computer scientists usually call everything AI, whereas computer scientists like myself usually use the term machine learning, so depending on what circles you're in, you might use different words. Computer vision, which we talked about, is all about images and vision, and technically it's one of the fields in machine learning; we talked about medicine, identifying anomalies from X-ray photos, and autonomous vehicles. Then natural language processing, which is all stuff we're going to talk about later, covers anything related to words: speech recognition, email filtering (whether something is spam or not), ChatGPT and other generation tools, stock market trades. You are very familiar with all these applications.

"Isn't machine learning letting the AI view the world itself through images?" It's not necessarily images; it can be text, it can be anything. The difference between machine learning and AI is that machine learning is specifically focused on learning, so decision trees might not fall under machine learning because they're more rule-based. We'll talk about that in future weeks.

You're familiar with this slide already: we have the three main types of algorithms in machine learning, and our focus is still supervised learning algorithms, so we're working with data and labels. Here our task is: we want the computer to predict, when you show it an image, hey, is this a triangle? So what do you do? Well, you have to feed in positive and negative examples. What does that mean? Positive examples are images of triangles, and you don't only give it the data, the image of the triangle; you also give it the label that yes, it is a triangle. You're teaching the computer this; it's learning. Then you also include other shapes to demonstrate, hey, this is not a triangle: you give it hexagons and squares and octagons and circles and say, hey, this is negative one, it's not a triangle. You feed hundreds and thousands of these examples into the computer, and that's called model training. That's how, when you show it an unseen image of another triangle, the computer can say, yeah, I know it's a triangle, because it can find those patterns from all the images it has ingested: a triangle is something that has these three corners.

"Is machine learning inputting loads of data to teach AI to analyze it?" Yes, that is definitely one way to think about it.
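A tiny sketch of what those labeled training examples might look like as data; the filenames are placeholders, and only the plus-one/minus-one labeling convention comes from the lecture:

```python
# Supervised learning pairs each input with its label. For "is this a triangle?",
# triangles get the label +1 and every other shape gets -1.
training_examples = [
    ("triangle_001.png", +1),
    ("triangle_002.png", +1),
    ("hexagon_001.png", -1),
    ("square_001.png", -1),
    ("circle_001.png", -1),
]

positives = [image for image, label in training_examples if label == +1]
print(len(positives), "positive examples")  # 2
```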
So why do we need machine learning? Decision trees and the other AI algorithms (we've only talked about one of a bunch) are fairly nice, but machine learning is far more complex, and that's why it's better. Why do we need this complexity? Because of something called non-linearity. You may or may not have talked about this in your math class, and if you haven't, that's totally fine. Non-linearity is the property of a system where the outputs are not linearly proportional to the inputs. Linear just means line; it's what you see here on the left. If the independent variable increases, the dependent variable also increases (or maybe one increases while the other decreases), but at some constant rate. In a non-linear system, it's not just constantly going up or constantly going down; it goes up and down and up. And that's the way our real world works: if you have done science experiments, you realize that the world is much more complex.

As it turns out, linearity is interpretable. Do you remember this vocab word from last week? What was interpretable? Feel free to put it in the chat. Yeah: we can easily understand and explain it. If I show you a decision tree, you can explain to me why this kid got a scholarship and why that one didn't. In machine learning, that's much more difficult, because now we're working with a lot of non-linearity. It's not as easily interpretable, and that's one of the drawbacks of machine learning: even though it works really well, we sometimes cannot explain why it's working. And again and again I'm mentioning this: real-world problems aren't linear. There is this non-linearity component, and that's why machine learning works.

We also call machine learning models a black box. A black box is something where you really don't know what's inside. The idea is that it works: you put in inputs, you get some output, and that works for machine learning, but you really don't know how the model is working; there's too much complex math. This is an active field of research, trying to understand why the models are saying what they're saying, but the fact is that none of us knows to this day exactly how it's all working. There's a lot of algebra, calculus, statistics, and probability involved, and so many variables that it's hard to keep track of. That's why it's called a black box, and that's why it's not interpretable.

Actually, Naomi, if you want to take over; hopefully I didn't talk through your slides. Okay, instead of a poll, let's just put the answers in the chat. Determine whether the following inputs and outputs are linear or not. One way to think about this is to check whether the change is constant: if you subtract each of the inputs (five minus three, seven minus five, nine minus seven) and likewise each of the outputs, is the rate of change always the same? Awesome, I'm seeing you all say it's non-linear, and that is correct. So because of this non-linear relationship, we could use a machine learning model to find the function that maps the inputs to the outputs.
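A quick sketch of that constant-rate check; the inputs match the differences read out in class (3, 5, 7, 9), but the outputs are made-up stand-ins, since the slide's actual numbers aren't in the transcript:

```python
inputs = [3, 5, 7, 9]
outputs = [10, 14, 25, 31]  # illustrative values only

# Linear means equal input steps always produce the same output step.
input_steps = [inputs[i + 1] - inputs[i] for i in range(len(inputs) - 1)]
output_steps = [outputs[i + 1] - outputs[i] for i in range(len(outputs) - 1)]
rates = [out / inp for inp, out in zip(input_steps, output_steps)]

print("linear" if len(set(rates)) == 1 else "nonlinear")  # nonlinear here
```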
Does all this make sense? Okay, Naomi, feel free to take it from here; let me give you access. Yeah, sure thing. So while Ms. Haripri is giving me access... oh, perfect. Let's talk about features. You've kind of heard this word before, but features are an individual measurable property. In the example on the right, there's a dataset, and the question is: will a student pass their exam? The features are whether they studied and the number of questions they studied. And notice that the exam-passed column is not a feature; we're going to talk about what that is later on. Oops, sorry, there we go.

So we input data into this machine learning model, but what does this data look like? Types of features are modalities, and there can be multiple modes, or multi-modality, meaning the features can be of different types within the same dataset. These can be in the form of numbers, categories, images, time series, and more. For example, a dataset describing pictures together with the pictures themselves would combine images and text.

So now let's talk about labels. Labels are the output that forms some sort of identification, and they can be numbers or categories. In the example before, will a student pass their exam, passing the exam is the label. We can't input the label into the machine learning model as something it can use to figure out whether they pass the exam, because then we're just handing it the answer. We need the features as the input that will help the model figure out whether the student passed.

So now let's move on to an example. If we want our model to predict a person's salary, what would be the features of this dataset? You can put it in the chat, and I apologize, I won't give you too much time, since we're running a little late. Okay. Oh, and I apologize, it's not showing up. Okay, let's move on. The features of this dataset would be education and job. Let's go through it: salary is the label, what the model is trying to predict, so it obviously can't be a feature. The person's education and job will help the machine learning model figure out, or estimate, the person's salary. And the person's name is more of an identifying factor in this dataset; it wouldn't help our model figure out the salary. We'll talk a little more later about what the model cares about, or what we want it to care about, and how we make sure it isn't learning things from features that aren't important, because the model wouldn't know that on its own.
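A small sketch of that features-versus-label split; the row values are invented, and only the column roles (name as identifier, education and job as features, salary as label) come from the slide:

```python
# Hypothetical rows shaped like the salary slide.
rows = [
    {"name": "Avery", "education": "BS", "job": "engineer", "salary": 95000},
    {"name": "Blake", "education": "HS", "job": "cashier", "salary": 30000},
]

features = [(row["education"], row["job"]) for row in rows]  # model inputs
labels = [row["salary"] for row in rows]                     # prediction target
# "name" is dropped entirely: it identifies the person but carries no signal.
```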
So now we're going to talk about vectors. But nope, not that Vector: mathematical vectors. A vector is a quantity that has magnitude and direction, commonly depicted by an arrow. If you haven't heard of vectors before, that's okay: just think of where the bed in your bedroom is in comparison to where your door is. That distance has both a direction and a magnitude, or length, so it would be defined as a vector. In machine learning models, we represent our features as column vectors. In the example on the right, the graph shows the column vector (3, 2): it goes right three and up two, and notice the arrow signifying the direction.

But how will we represent something categorical, like a color or a shape, as a number? Remember that vectors are what we represent our features as, so if a feature is something like a color or a shape, how would we represent that numerically? To answer this, we're going to go through a quick little scenario: the machine learning model should classify different fruits; that's our goal for this dataset. The dataset has fruit name, fruit color, and fruit diameter. The fruit name is watermelon or apple, the fruit color is green or red, and the fruit diameter is already a number, so we don't really have to worry about that one here.

The answer to this question is one-hot vectors. We use zeros and ones to mark the existence of these variables, in this case color, and these vectors are called one-hot vectors. The watermelon being green means a one instead of a zero in the green slot, and the apple is red, so we put a one in the red row describing that feature. And of course, neither of them is blue, so both of those entries are zero. Note that this works because the colors here are discrete, not continuous: our colors are green and red. If we had an orange fruit, we would use another color representation, with decimals instead of just ones and zeros, but we're going to use ones and zeros in this case, just so it's easier to understand. Just note that this might not always be the case.

So how do we represent the full feature vector in this situation? The first row is the fruit diameter; you can see the difference between the watermelon and the apple. The three rows below that are our one-hot vector indicating color. The second row of our feature vector indicates whether something is green, and we can tell the watermelon is green because it has the one in that spot and everything else is zero. The apple's color rows read zero, one, zero: not green, yes red, not blue.
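A minimal sketch of building those feature vectors; the diameters are made-up numbers, while the color categories and the diameter-then-one-hot layout follow the slide:

```python
COLORS = ["green", "red", "blue"]  # the discrete color categories

def one_hot(value, categories):
    """One-hot vector: 1 in the slot matching `value`, 0 everywhere else."""
    return [1 if category == value else 0 for category in categories]

def feature_vector(diameter, color):
    # Diameter stays numeric; the one-hot color block is stacked below it.
    return [diameter] + one_hot(color, COLORS)

print(feature_vector(10.0, "green"))  # watermelon -> [10.0, 1, 0, 0]
print(feature_vector(3.0, "red"))     # apple      -> [3.0, 0, 1, 0]
```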
So now let's do another quick question: what feature vectors can be potential representations of this data? Check all that apply, so if you think multiple would work, put that too. In this situation, we're trying to predict the candy name, so candy name is the label; don't worry about representing that as a vector, just the color and the diameter. I see some answers right now. Okay, I'm going to quickly move on. Everyone who responded is partially correct.

So the answers are A and C. I saw a lot of A's in the chat. A is the one most like our last example: the first row is the diameter, one inch and 0.5 inches, and the bottom three rows indicate color using a one-hot vector. For A, the first of those rows indicates blue, then green, and the last one red. And here's a quick check you can run in your mind: if two items are different colors, their one-hot vectors can't have ones in the same row. That's why B is incorrect: looking at the bottom three rows, B would say that the gumball and the M&M are the same color, since they have ones in the same row. We know this is not true, since they obviously are different colors, blue and green. C is also correct; the diameter and the color are just in different spots, with the diameter in the last row and the one-hot color vector in the first three. And D is incorrect because there is nothing indicating color for the gumball; its one-hot color entries are all zeros.

Okay, Ms. Haripri, do you want to introduce the homework? And if you have any questions, feel free to put them in the chat. I know these are all new topics, and it's totally okay to have questions. Awesome, so yes, I will be assigning the homework here. It's the end of class, but basically it will cover the Gini indices from last week, the material we covered today, and the machine learning model and feature vectors. So yeah, good luck, and if you have any questions, feel free to mention them in Google Classroom. That's it for today, and great work, guys.
Video Summary
The video covers impurity and machine learning, introducing impurity in data classification, decision trees, and machine learning basics. The instructor emphasizes the importance of the homework and of understanding impurity, the variability in data classification, using examples about scholarship eligibility based on GPA and extracurricular activities. A decision tree is used to show how to categorize data based on features that affect the classification, such as GPA.

The class also covers machine learning terms like features (input data attributes) and labels (outcomes the model predicts) and explains why reducing impurity in data improves the model's decision-making. One-hot vectors are described as a way to represent categorical data numerically within machine learning datasets. The video wraps up with an introduction to the homework assignment, which focuses on these concepts. Key points include understanding impurity in data, decision tree mechanics, the use of one-hot encoding for categorical data, and the broad goal of machine learning: predicting outcomes effectively from data features.
Keywords
impurity
machine learning
decision trees
data classification
one-hot encoding
features
labels
categorical data
homework assignments