AI 101 - Class Recordings
Recording Class 7
Video Transcription
Okay, we're going to start going over the homework. Let me share my screen. Okay, perfect. And to those of you who live on the West Coast and were impacted by the windstorm, the bomb cyclone, thank you so much for coming. And if you weren't able to make it, for those watching the recording, that's okay. I had to go to my dad's office to get internet. So, going through the homework: question one, using the table below, calculate the mean and standard deviation of the entire data set for grade in classes. Let's move this bar down and look at the key. These are the average grades in classes, as percentages. To calculate the mean, we add all of these values up and divide by how many values there are, so our mean should be 70. This is our equation for finding the standard deviation: we take the difference between each value and the mean, square it, sum those differences over the number of values we have, and then take the square root. Oh, yes, do you have a question? "When I looked online, most of the standard deviation formulas showed n minus one as the denominator. That left me a bit confused." Naomi, sorry, if I could chime in for a bit. So there are two formulas for standard deviation, and thanks for asking this question. You won't be tested on this, but basically, if you have the entire population and you're calculating the standard deviation of the entire population, then you use n. If you don't have the entire population, but only a sample of the data points, then you use n minus one. That helps make your standard deviation more accurate, to account for the fact that you don't actually have the entire data set, just a sample of it. For example, say you wanted the standard deviation of the weight of every person in the United States. There are something like 300 million people in the United States, and we might not want to calculate the standard deviation across all of them; we might do it across maybe 300 people, and then we would use the divided-by-n-minus-one formula. But for our class, always just divide by n. Always assume, basically, that you have all the data points that exist that are relevant to the problem. Great, thank you. Okay, so using this formula, we should have gotten 21.533. Now, Part B: calculate the standard deviation of the feature "number of classes taken" being less than six. Remember, we did an example like this in class last week. Looking at our data set, we take all the data points where the number of classes is less than six; they're already separated here in the key. So these are our data points, and we take the standard deviation of these values. To do that, we first find the mean, which we calculate the same way we calculated it in Part A, and then we take the standard deviation using the same formula. That gives us a standard deviation of 12.675, and remember, we have three data points that satisfy this condition. Then we do the same for the opposite: number of classes taken greater than or equal to six. Those would be these data points, and we do the same thing. To calculate the standard deviation of the whole feature, we multiply each standard deviation we found by its weight. We have three data points here, so our weight is three over six, and the same for number of classes taken greater than or equal to six: three over six. We add these together, giving us 9.679. Sound good? Okay. Now Part C: calculate the standard deviation for the feature "number of extracurriculars the student participates in" being less than five. Same concept. These are all the data points where the number of extracurriculars is less than five; we take the mean and then the standard deviation using the formulas above, and our number of rows is four. Then we take the data points where the number of extracurriculars is greater than or equal to five; we only have two, and we find the mean and standard deviation. Then we multiply each standard deviation by its weight: four over six for the first, since we have four data points, and two over six for the second, since we have two. And that gives us the standard deviation for this feature, 19.964. So what feature should we split on? Remember, what it's asking is which of these features has the smaller standard deviation. We have 19.964 for extracurriculars and 9.679 for classes taken, so the second one is smaller, which is why we choose the feature "number of classes taken" being less than six.
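For anyone who wants to check this kind of calculation on their own, here is a minimal Python sketch of the same steps. The numbers are made up (they are not the actual homework table); the point is the population-versus-sample denominator and the weighted combination used to pick a split feature.

```python
import numpy as np

# Hypothetical grade values -- not the homework table.
grades = np.array([55, 90, 80, 60, 75, 60])

mean = grades.mean()                 # sum of the values divided by n
pop_std = grades.std(ddof=0)         # divide by n (what we use in this class)
sample_std = grades.std(ddof=1)      # divide by n - 1 (when you only have a sample)
print(mean, pop_std, sample_std)

# Weighted standard deviation for a candidate split, e.g. "number of classes taken < 6".
lt_6 = np.array([55, 60, 60])        # hypothetical rows where classes taken < 6
ge_6 = np.array([90, 80, 75])        # hypothetical rows where classes taken >= 6
n = len(lt_6) + len(ge_6)

weighted_std = len(lt_6) / n * lt_6.std(ddof=0) + len(ge_6) / n * ge_6.std(ddof=0)
print(weighted_std)                  # compare this across features; split on the smallest
```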
Okay, and then, let's see, what's this problem asking? So now, hopefully you had some fun looking through edges2cats, the Pix2Pix demo. Unfortunately, I was supposed to do a demo, but my internet won't allow it. I hope you had fun looking at that, and I hope skimming through the paper gave you some good ML sense and you maybe learned something. If not, that's okay. Does anyone have any questions? Okay, perfect. I'm going to stop sharing while I get the presentation ready. Okay, and Richard, I'm going to try to give you remote control right now, just so that we don't have to do any switching. So, today's topic is NLP, ChatGPT, and RAG. NLP stands for Natural Language Processing, I'm sure you all know what ChatGPT is (we're going to learn later that it's a large language model), and RAG is Retrieval Augmented Generation. So, what is natural language processing? Natural language processing is a field that deals with the analysis and synthesis of natural language, and many NLP applications involve machine learning. These applications include author identification and sentiment analysis; you might remember doing something similar in one of the homeworks on sentiment analysis, where I think we were trying to figure out whether it was a classification or regression problem. More applications of NLP include machine translation, like Google Translate or maybe another app that's better; chatbots, including ChatGPT and a lot more out there; text summarization; and named entity recognition, which is identifying and classifying entities like names, dates, and locations in text. So, now we're going to talk about bag of words. Bag of words is a model, or technique, that represents text in vector format as a collection of words.
So it represents language as an unstructured collection of words, and the word order is not preserved. Just imagine you have a bag full of words, and each word is put in there according to its frequency. Maybe the word "of" or "a" appears a lot more, so there are more of them in the bag, and if you pull out a word at random, the probability of pulling out a higher-frequency word is greater. So the model looks at the frequency of words in each document. As an example, say we have document one, "I love love NLP," and document two, "NLP is amazing." The vocabulary of all the words is: I, love, NLP, is, amazing. The vector just counts how many times each word is used. For document one, "I love love NLP," we use the word "I" once, "love" twice, and "NLP" once, and the other two counts are zero. Note that the order of the words doesn't matter; the model only cares about frequency. So for you all: for document two, "NLP is amazing," try to find the bag-of-words vector. This is our vocabulary. Okay, looks like a lot of people are getting it. Nice. The answer is (0, 0, 1, 1, 1), since we don't use the words "I" and "love" in document two, but we do use "NLP," "is," and "amazing" once each. Okay, so now we're going to talk about what a large language model is. A large language model is an AI model that can understand and generate large amounts of text. These models learn the probabilities of word sequences and can be used for text generation, translation, summarization, and more. Language models are common in machine learning. The way these models are structured, they try to predict, given a word or set of words, the next word, and there's a given uncertainty for that next word. If I say, "hello, how are," you'd probably assume the next word is "you," and a well-trained language model would predict "you" as well. And what would it say after that if we keep going? Maybe "how was your day?" But we're not sure, and that uncertainty compounds after every predicted word. What large language models change is that the uncertainty decreases a lot: we have much higher certainty just because we're training on a lot more data, and the prediction capabilities have increased. Because large language models are trained on so, so much data, that's why they're called large language models, and they have billions of what are called parameters. Remember, in neural networks we talked about weights; weights in large language models just translate to parameters. So if a model has one billion weights, in large-language-model terms it has one billion parameters. Models often have the number of parameters in their name, so a model called Llama 3B has three billion parameters. And talking about all these billions of parameters: GPT-3.5 has 175 billion parameters, and GPT-4 is reported to have around 1.5 trillion. These models are really big, and generally that correlates with how capable they are. Now, the input to a large language model must be converted to numbers, since a model can only understand numbers.
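As a concrete illustration of one simple way of turning text into numbers, here is a small sketch of the bag-of-words vectors from the example above. One caveat: the column order just follows however the vocabulary happens to be sorted, so it may differ from the ordering on the slide.

```python
from collections import Counter

documents = ["I love love NLP", "NLP is amazing"]

# Vocabulary = every distinct word that appears in any document.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]   # frequency of each vocabulary word

for doc in documents:
    print(doc, "->", dict(zip(vocabulary, bag_of_words(doc))))
```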
And I'm going to let Richard take it from here to explain how we can do this. All right, so let's start with what embeddings are. Because machine learning models can only do mathematical calculations with numerical data, when we want to deal with language we need a way to represent words as numbers. Words need to be converted into numerical representations called embeddings to allow models to understand and work with them in a meaningful way. Words are represented by vectors, which capture semantic meaning, where words with similar meanings have vectors close to each other. The embeddings encode relationships between words, and with them you can do some vector math to see those relationships. For example, if you subtract "man" from "king" and then add "woman," the result should be the vector for "queen." Cosine similarity is a similarity metric that measures the similarity of two vectors through the cosine of their angle. It ranges between 1 and negative 1, with 1 meaning identical and negative 1 meaning opposite. To calculate cosine similarity, you first need the magnitudes of the individual vectors. The magnitude of a vector is the square root of the sum of its components squared. For example, the vector (1, 2) has a magnitude of the square root of 1 squared plus 2 squared, which is root 5. Next, try to find the magnitude of the vector (3, 4); use the formula for 2D vectors on the slide, and put your answer in the chat. All right, a lot of you are getting the answer, and it's correct: the magnitude is 5, calculated from the square root of 3 squared plus 4 squared. Once you have the magnitudes of the individual vectors, you multiply them. For the vectors (1, 2) and (3, 4), the magnitudes are root 5 and 5, so their product is 5 root 5. Next, you calculate the dot product of the vectors; in this example, it's 1 times 3 plus 2 times 4, which equals 11. Finally, you divide the dot product by the product of the magnitudes. For the vectors (1, 2) and (3, 4), that's 11 over 5 root 5, which is about 0.984. And since this is close to 1, it indicates that the vectors are very similar. So vectors like these are created by machine learning models like LLMs, but Word2Vec is another model that can create word embeddings. One way it does this is through continuous bag of words, where the model tries to predict a missing word in a sentence using the context of the sentence. For the sentence "I love AI and math," it can try to predict "AI" from the words "I," "love," "and," and "math." Each word in Word2Vec is mapped to a vector where every dimension corresponds to a learned feature or relationship. The embeddings are derived from the weights of the model's hidden layers during training; these weights encode the statistical patterns the model learns about word co-occurrence and context. Embeddings can also help reduce the dimensionality of words by identifying patterns and features, which not only captures semantic relationships but can also speed up training time for models. Word2Vec and LLMs both create embeddings, but LLM embeddings are more context-specific. For example, the word "bank" in "I went to the bank to deposit money" refers to a financial institution, while in "I can swim to the opposite bank" it refers to a river bank. LLMs can differentiate between these meanings based on the context, but Word2Vec will handle both meanings the same way.
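Here is the cosine-similarity arithmetic from a moment ago, with the (1, 2) and (3, 4) example, written out as a short Python check.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))       # 1*3 + 2*4 = 11
    mag_a = math.sqrt(sum(x * x for x in a))     # sqrt(5)
    mag_b = math.sqrt(sum(x * x for x in b))     # 5
    return dot / (mag_a * mag_b)

print(cosine_similarity([1, 2], [3, 4]))   # ~0.984, i.e. the vectors point almost the same way
```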
ChatGPT, as you all probably know, is a large language model. It's trained in three main stages: the first is supervised learning, where it's fed a lot of prompt-and-answer pairs. Then it goes through reward-model training, which ranks outputs for prompts to better refine the quality of the responses. And then reinforcement learning is used to optimize the model to generate the best responses. Now, prompt engineering involves designing and optimizing prompts to obtain the desired output from language models. Effective prompts can guide the model's response, making it more accurate and relevant to the task. Effective techniques include providing context, specifying the output format, and asking for step-by-step reasoning. All right. So imagine this scenario: you are hired by a digital encyclopedia company to create a chatbot that answers questions based on its articles. What are some ways you could do that? Just think about it for a moment. Well, one option might be to just ask ChatGPT the questions. But this might not work, since ChatGPT might not know the answer and can hallucinate. Another option might be to feed in all of the encyclopedia articles and then ask ChatGPT the questions, but this won't work due to ChatGPT's token limits. You could try to fine-tune ChatGPT for your task, but that would require a lot of training data, way too many question-and-answer pairs to be practical. So one answer is to make use of embeddings. RAG, or Retrieval Augmented Generation, is an approach that combines information retrieval with generative models. It lets a model fetch relevant documents or facts from a dataset and use them to generate responses. This ensures that a model doesn't rely solely on its training data and can also include up-to-date, context-specific information. It's especially useful for tasks that require accurate and context-specific answers. So, for the previous scenario, one way you could do it is by converting each encyclopedia article into embeddings and storing them in some kind of database. Then you take the user's question and convert it into an embedding. With those, you compare the embedding of the question against the articles and find which article is most similar to the question using cosine similarity. Then you can input that most relevant article into ChatGPT along with the question. Since you're only inputting that one article, you can probably get around the token limit, and this should ensure a more accurate and relevant response. Next is homework, which I'll turn back to Naomi. Okay, does anyone have any questions on what we learned today? We can go over some things, since we have some extra time. Everyone good on embeddings? Yeah, we can go back. What do you want to go back on? Okay, so I'm going to go back to the chat. "Does Google AI use RAG?" I'm not sure. Mr. Bagheerat, do you know if Google AI uses RAG? Yes, great question, and yes, it does. So what it's doing — and this is also true for some of the newer versions of ChatGPT that search the internet, or something like Perplexity or Claude — is that first there's some sort of search engine searching the internet for relevant websites. Then all of those websites, all the text in those documents, gets converted into vectors. What we saw with documents being converted to vectors, that's the same thing for web pages, PDFs, and the transcripts of videos, things like that.
And so we take those vectors, we compare all of them to the query, and we pull the most relevant pages. So actually, maybe I can demonstrate this now. I have Perplexity, which I think, because it cites its sources, might demonstrate a little more clearly what's going on. Let me share my screen. Let's see: "How does RAG work?" Let me just write this. So you can see that it pulled up these sources. What it did is basically say to Google, hey, can you search up "how does RAG work?", and then from all the pages Google returned, it says these are the five most relevant ones, and it tries to summarize those for me by reading through all the data. You can see that sometimes it's able to tell very clearly, hey, I got this bullet point from these two webpages, so it cites those two webpages specifically. Sometimes it's a little harder for it, because this bullet point, for example, may have come from all five webpages, so it's not really clear which one to cite, and it just doesn't cite anything. But you can see how this can be helpful if you want to create a chatbot that is actually capable of searching through documents, citing things, and staying up to date with the most relevant information. RAG is super cool for those purposes. It's also useful if you're at a company: you can't always train a bot on the latest manuals and documentation and guides and internal information, but you can give it access to a database of vectors, search through those vectors, and figure out which ones are the most relevant. So if I said, hey, I want to know how much I was paid this month, it could search through all the pay stubs, figure out that the user is me, Bagheerat, so whose pay stub it should look for, pull up my pay stub, and answer that question. Does that make sense? Cool. Yeah, I'm seeing a lot of questions asking for a recap on embeddings and how they relate to RAG. So I can share my screen; Mr. Bagheerat or Richard, do you want to go over that? Sure, sure. Okay, let's go back a few slides. So I see something starting about embeddings. Does the cosine similarity make sense for everyone, or is it mainly how embeddings relate to RAG? I think that's the main question, so maybe a little further forward. And there were some questions about Word2Vec, so maybe here. Yeah, I can read out the question really quick: "Can you give a quick recap on embeddings, and how do embeddings relate to RAG? And for the embeddings, are the vectors bag-of-words type, or is there a different type of vector?" Sorry, so the question is basically what type of embedding RAG uses, is that right? Yes, I think that's what they're asking. Okay, sure. So first, maybe I can go over Word2Vec again. The way Word2Vec works is basically this: let's say we have maybe only 10 words being used in our entire vocabulary set, so every document is just a combination of those 10 words in some order, and maybe some of those words are repeated. What you can do is take a sentence like "I love AI and math," and for the word "AI," you ask, what's the context? There's something called a context window: those are just the words before and after a given word, and usually a context window of five is pretty good.
So we'll take the two words before the word "AI" and the two words after it, plus "AI" itself, so our entire context window plus the word becomes "I love AI and math." What you can do is draw out a vector and put a one for "I," a one for "love," a one for "and," and a one for "math" — a one for every word that's present in that context. If you do this for every single word throughout each document, you end up with a vector for every single occurrence of each word. Now, a word like "AI" will probably show up many times, so we'll have many such vectors, and what we can do is just average all of those vectors. Then you have one vector for each individual word. This is a little more in depth than the slides went, and more than we'll test you on, but hopefully you can see what the slides were saying; I'm just expanding a little on how predicting the missing word in a sentence works. We're using the context to define the word itself. Does that make sense? We only care about "I," "love," "and," and "math" — those are the four words we care about to define "AI." There's actually a very famous phrase that a word is known by the company it keeps. It comes from these philosophical debates; you could trace it back to Ludwig Wittgenstein in the early 20th century, who basically said, look, all these philosophical problems are just word games: all you have to do is define each word precisely, and then philosophical questions like "what is the meaning of life?" become trivial. So people studying natural language processing in the 1990s, 2000s, and 2010s said, okay, why don't we just take each word and look at the surrounding words — that's the context — and use that to define each word? Once we have a vector for each word, we average it over every occurrence: every time the word "AI" occurs, or every time the word "math" occurs, we take all of those vectors and average them, and then you're given one big vector. And then we saw that an embedding of, say, length five can be converted to an embedding of length two. Can you go a few slides to the right, Naomi, please? I think it's a couple of slides. Yes, this one, yeah. So we took this embedding and we just projected it onto a smaller plane. I'm trying to think of a good way to explain exactly what it means to project something. Let's say you're looking at a globe, and you trace a path along the globe. Now say you take that same path and put it on a 2D map — that's a projection. You took a 3D curve and projected it onto a 2D map. In the same way, you can take a 5D vector and project it onto this plane, the XY plane, which is just two dimensions. So that's maybe a little more in depth, and I hope I didn't lose anyone, but the idea is that the way an embedding works for Word2Vec is you take the surrounding words, the context, and use that as a kind of definition for the word itself. You convert it to a vector, but the vector might become really large: if we have five words in our vocabulary set, the vector would be 5D.
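To make that concrete, here is a toy, count-based sketch of the idea just described: build a context vector for every occurrence of a word, then average them. This is only an approximation for illustration; real Word2Vec learns its vectors as neural-network weights rather than by counting.

```python
from collections import defaultdict
import numpy as np

sentences = [["i", "love", "ai", "and", "math"],
             ["i", "love", "math"]]

vocabulary = sorted({w for s in sentences for w in s})   # 5 words -> 5-D vectors
index = {w: i for i, w in enumerate(vocabulary)}
window = 2                                               # two words on each side

context_vectors = defaultdict(list)
for sentence in sentences:
    for pos, word in enumerate(sentence):
        vec = np.zeros(len(vocabulary))
        context = sentence[max(0, pos - window):pos] + sentence[pos + 1:pos + 1 + window]
        for ctx_word in context:
            vec[index[ctx_word]] += 1                    # mark the surrounding words
        context_vectors[word].append(vec)

# Average every context vector seen for a word -> that word's toy "embedding".
embeddings = {w: np.mean(vs, axis=0) for w, vs in context_vectors.items()}
print(vocabulary)
print(embeddings["ai"])
```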
If we have a thousand words in our vocabulary set, it would be a thousand dimensions, so we project it onto a smaller-dimensional space. We might make the embedding 2D, which just gives us two numbers and makes our vector smaller and simpler to work with. It also helps account for things like synonyms: synonyms get projected to similar places on the plane. And the analogy we saw, where king is to man as queen is to woman — that analogy is able to be solved essentially by the embeddings, partly because there are fewer dimensions. So I got a question: how is the data kept if it's projected into fewer dimensions? That's a good question. It's actually not kept completely, but that's actually kind of a good thing. Let's say I'm taking a picture of my friend in front of a colorful neon sign, with all sorts of LEDs and stuff behind them. If I have a high-resolution camera and I'm capturing every pixel in full detail, there's almost too much data in that picture and it gets noisy — it gets harder to focus on what matters. If I were to take a lower-resolution picture and just focus in on my friend, now it's much easier to see what I actually wanted to show, which is my friend; I don't really care about the sign or all the colors, and there's less noise. This is exactly the same principle: if you make something lower resolution, you get rid of the noise, and basically you're getting rid of overfitting, if you remember that word from earlier classes. So if you make a 5D vector 2D, that's maybe better. Of course, if you get rid of too many dimensions — if we take a thousand-dimensional vector and project it onto a 2D plane — maybe that's not great: now you're underfitting and not seeing enough, and all the words are basically going to crowd around the same place. But if you have too many dimensions, you'd be overfitting and not getting clear results, because there are just too many dimensions and your words are kind of floating out in all these different spaces and directions. Does that make sense to people? I'm seeing some nods, so I'm hoping that's a yes.
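As a rough illustration of "projecting onto a smaller-dimensional space," here is a sketch that squashes some made-up 5-D word vectors down to 2-D with PCA. PCA is just one standard way to do this kind of projection, used here for illustration; it isn't how Word2Vec or LLMs produce their embeddings, and the numbers are invented.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 5-D embeddings (one row per word), purely for illustration.
words = ["king", "queen", "math", "ai"]
embeddings_5d = np.array([
    [0.9, 0.1, 0.3, 0.7, 0.2],
    [0.8, 0.2, 0.3, 0.6, 0.1],
    [0.1, 0.9, 0.8, 0.2, 0.7],
    [0.2, 0.8, 0.9, 0.1, 0.6],
])

embeddings_2d = PCA(n_components=2).fit_transform(embeddings_5d)
for word, point in zip(words, embeddings_2d):
    print(word, point)   # similar words should land near each other on the 2-D plane
```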
And now, the other question people had — Naomi, you said people were asking what kind of embeddings RAG uses. I think we mentioned on one of these slides that LLMs use their own embeddings, which are different from Word2Vec, but we didn't want to get into the details because it gets very complicated how transformers work: they use this concept called attention, which involves many, many different matrices. So just to keep it simple, I'll say that LLMs are somewhat similar to Word2Vec in that you have embeddings and you're using the surrounding words — there's something called a mask — but the key difference is the one described right here: you have different embeddings for different contexts. And just like we described previously, you always have to convert words to tokens; we convert words to tokens in LLMs as well. When ChatGPT takes in a message, all of those words are given a token ID, they're converted into those numbers, and the numbers are fed into the LLM. Then it goes through all these layers of a neural network — and if you can go back a few slides, Naomi, please, to where we were showing the hidden layer. Yeah, right here. This hidden layer is actually a really good way to explain how it works in LLMs. The second-to-last layer in an LLM is called the hidden layer. Normally, as a user, you're not interacting with that, because you're only looking at the output — the words generated from the last layer — but the second-to-last layer is where the embedding is created. It's a different kind of vector than you'll see in Word2Vec: it uses different math, and Word2Vec is trained on one data set while an LLM may be trained on a completely different data set. But the key idea is the same: it's a vector representation of a word. Does that make sense? Because of the way transformers work, they also use the previous words — the context — to generate the embedding for the word and for the query as a whole. LLMs have a sort of memory; in fact, they were inspired by an earlier kind of network that actually had the word "memory" in its name. So there is a difference in the way embeddings work in Word2Vec compared to LLMs: the idea of an embedding is the same, but the way the embeddings are computed for LLMs is completely different. That's why we saw on that slide that "bank" has two different definitions depending on context; LLMs will do a much better job at differentiating between the two, because they look at the entire context of the query when you feed it in. The general idea is the same, though, which is why we demonstrated it with Word2Vec — that's a lot simpler to understand. As for the actual question, which was what kind of embeddings RAG uses: it typically uses embeddings generated from an LLM, though it's theoretically possible to use embeddings from something like Word2Vec — the whole system would just function a bit differently. For LLMs, the embedding is generated in that hidden layer, and we use that hidden layer's embeddings to figure out which documents are the most relevant for RAG. Does that make sense? Have I lost everybody? Hopefully that's a yes. If there are any more questions, I'm happy to answer them.
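For anyone curious what "getting an embedding out of a model" looks like in practice, here is a minimal sketch using the sentence-transformers package (assuming it's installed; the model name is just a common example, not something from the slides). The model maps whole sentences to vectors, so the two "bank" sentences get embeddings that reflect their different contexts, and cosine similarity can then compare them against a query.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example pretrained embedding model

sentences = [
    "I went to the bank to deposit money.",
    "I can swim to the opposite bank of the river.",
]
query = "Where can I open a savings account?"

sentence_embeddings = model.encode(sentences)
query_embedding = model.encode(query)

# Higher cosine similarity = more related; the deposit sentence should score higher here.
print(util.cos_sim(query_embedding, sentence_embeddings))
```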
But I also have the Pix2Pix demo working on my laptop, if people want to see that. Oh yeah, I think that'd be great, yeah. So I can just click this random button, and it has a bunch of pre-designed cats that people have made. Some of these are cute. Some of these are horrifying. Actually, the person who created this website wrote that some of the pictures look especially creepy because it's easier to notice when an animal looks wrong, especially around the eyes, and the auto-detected edges are not very good — in many cases they did not detect the cat's eyes, making the training a bit worse. So while training, it didn't pick up the cat's eyes very well. But if anyone wants to see my beautiful design skills, I'm happy to draw something for you all. Does everyone know where the Mona Lisa is located? Do people know? Oh, someone said it's located where I am — that's very sweet of you to say, but no. Yes, a couple of people are saying the Louvre. It's actually very close to me. Oh, really? I live in France, and it's just a one- to two-hour train ride. Wow, that's very cool. Okay. I was actually practicing drawing a cat before this, and it was much better, I promise you; I don't know what happened here. That's a tail, that's four legs. And I'm sorry for giving you all horrific nightmares as I click process. Yeah, so a couple of people, yes, thank you, it's the Louvre. Oh yeah, and the Mona Lisa looks at you as you walk around the room. That's right, yes. So my artwork is actually located at the loo — not the Louvre in Paris, a different "loo." If people know the other definition of the word loo, I hope it's clear why my artwork is located there, if you can see my beautiful pictures. Let's see if I can draw a better Cheshire-cat-like smile. Have people read Alice in Wonderland? You know how the Cheshire cat disappears and leaves only the smile behind, and then eventually the smile disappears too? I think this is more horrifying than that. Yeah, so the person who created this website basically built it using the Pix2Pix model. And you can draw shoes too — maybe this will be better. I guess this is a heel, I don't know. Yeah, it's not too bad, it's cool. And then let's draw a handbag. Okay, we have an actual question in the chat: "Does the cosine similarity formula simplify to give the cosine of the angle between the vectors?" Yes, that's exactly it. Basically, what the cosine similarity formula is — if that's a little hard for you to remember — yeah, can you pull up the slides? Thank you. Yes, this is a good slide. The dot product of A and B is basically going to give you the magnitude of A times the magnitude of B times the cosine of the angle between A and B. So, by the way, do people understand what cosine is? Somewhat, yes, no? I'll just define it like this: the cosine of zero is one, and the cosine of 90 degrees is zero. So if two words are synonyms and they're basically located on top of each other, their cosine similarity should be very high, because there's almost no difference between their angles. If the angle between the two vectors is zero, then the cosine similarity between those two words should be one or close to one. But if two words are very different and they're located perpendicular to each other, the cosine of the angle between those two vectors will be zero. And the cosine of 180 degrees, by the way, is equal to negative one. But actually, that's something interesting to think about.
If two words are located in completely opposite directions, are they similar or different? Does anyone have a thought about that? "They're exact opposites," someone said. Do people agree with that? Some people are saying yes, some are saying no. So this is actually interesting. You might think that two words located exactly opposite from each other, in opposite directions, are opposites of each other in meaning. And that might be possible in some embedding systems: maybe one axis is just a negation, so, say, everything below the x-axis is negative and everything above the x-axis is positive, or something like that. A negative y value might then be a word like "bad," and a positive y value might be a word like "good." But that wouldn't actually be a very good embedding system. Generally speaking, we have maybe hundreds of dimensions, and if you compact that down to, say, two dimensions just to make it easier to visualize — two dimensions to visualize a hundred may not be the best idea, but it might be good enough if we just want to plot points in 2D — then one of those dimensions is not going to be reserved to mean just "good or bad." It'll mean fifty different things, because the meaning of those hundred dimensions has now been compacted and combined into two dimensions. To some extent, think of it like a factorization. If I have a bunch of numbers, like two, three, five, and seven, those are four different numbers, but if I multiply them all together, I get the number 210, which is a single number — much easier to store. If I want to factorize it and pull out the original four factors, I can. I can't necessarily do that with two-dimensional vectors; I can't necessarily pull out the original hundred dimensions. But the point is that, to some extent, some of the semantic meaning contained within those hundred dimensions is still buried within these two dimensions; it might be mixed between the two dimensions, or it might be entirely contained within one dimension but combined with other things. The point of this whole digression is that if two words are almost exactly the same and only differ in one respect — maybe "good" and "bad" are exactly the same word except they're opposites of each other, or "old" and "young" both describe age but are opposites in terms of whether they mean a large age or a small age — those words are going to be basically identical across all the other dimensions. It's just one dimension on which they are opposites. So they will actually be located very close to each other on the plane. Does that make sense? So "king" and "prince," for example, or "king" and "queen" — it depends on whether you mean opposites based on time or based on gender — will be located very close to each other. And as you can see on the slide, negative one is perfect dissimilarity.
So if "king" is in one direction, then in the completely opposite direction you're going to have a word that's completely different from "king" — maybe, I don't know, the word "moon" or something like that. So perfect dissimilarity is kind of hard to predict — what will you see there? — because it's basically going to be the opposite in many different dimensions. Anyway, that's a digression off the original question, which was: does the cosine similarity formula simplify to give the cosine of the angle between the vectors? The main idea is yes. We divide by the magnitudes because if two things are pointing in the same direction but one is a larger vector than the other, that shouldn't make a difference — we don't care about the size of the vectors, just the direction. So what this formula gives us is specifically the cosine of the angle between the vectors. If the angle is small — if it's zero — then your cosine will be one and you have perfectly similar vectors. If the angle is 90 degrees, those words aren't really similar to each other, and that will give you zero. And if the cosine is negative one, that's perfect dissimilarity — that might be something completely different. It doesn't have to be, though; it could be exactly what people were saying, that they're exact opposites. It just depends on the embedding system being used and how we projected it down to two dimensions. Are there any more questions? Someone asked: "So is this used by RAG to retrieve the most relevant document, like in the encyclopedia example?" Yes, it is. The way it's used is this: we create an embedding for each of the documents and store them ahead of time if we can. If we already have the documents available to us, it's easier to just calculate all the embeddings once and keep them stored. Then, for the embedding generated from the query, we compare it to each of the embeddings we have stored and ask which ones have the greatest cosine similarity. If a document has the greatest cosine similarity, it probably points in the same direction as our query, which means they're probably related — so we should probably look at those documents and feed them into the LLM in order to answer the query itself. So it's a couple of steps. Basically: you take the original query; you generate an embedding; you compare it to all the embeddings of the documents you have at hand, using cosine similarity as a metric (you could use other metrics, but cosine similarity is very quick to calculate and the easiest way to tell whether two things are similar); after that, you pick maybe the top five or top ten documents; you feed them all into the LLM along with the original query; and then you return the answer. Does that make sense?
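Here is a minimal sketch of that retrieval loop. The embed and generate_answer functions are hypothetical placeholders standing in for whatever embedding model and LLM you'd actually call; the point is the flow: embed the documents once, embed the query, rank by cosine similarity, and hand the top documents plus the query to the model.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(text):
    # Hypothetical placeholder: call your embedding model of choice here.
    raise NotImplementedError

def generate_answer(prompt):
    # Hypothetical placeholder: call your LLM of choice here.
    raise NotImplementedError

def answer_with_rag(query, documents, top_k=3):
    # In practice the document embeddings would be computed once and stored.
    doc_embeddings = [embed(doc) for doc in documents]
    query_embedding = embed(query)

    # Rank every document by how similar its embedding is to the query embedding.
    scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
    ranked = sorted(zip(scores, documents), reverse=True)
    context = "\n\n".join(doc for _, doc in ranked[:top_k])

    # Feed the most relevant documents to the LLM together with the original question.
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    return generate_answer(prompt)
```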
And some LLM systems do interesting things on top of this. These could be hard-coded pre-processing rules, or, depending on the system, another AI model used to pre-process the query. For example, say I ask ChatGPT, hey, recommend a playlist for me. ChatGPT now has a concept of memory, so what it could do is try to figure out what kind of music I like: it might go through previous question-and-answer sessions we've had and figure out which ones are the most relevant, and then add that to the query. Now it has the previous memory of, hey, these are the music genres this person likes, plus my question, and it can do some retrieval augmented generation: go out on the internet and ask, okay, what are the latest songs that exist? Maybe I said my favorite genre is rock and roll, so it'll look up the latest rock and roll songs, maybe find ten or twenty different websites, narrow it down to five, and then generate a playlist from those for me. So in that example, retrieval augmented generation was used in multiple steps: one was just to build a better query based on our previous Q&A sessions, and then, with that query combined with the memory ChatGPT had of our sessions, it goes out on the internet and uses RAG to figure out which documents are the most relevant. Do people have any other questions? It doesn't have to be just from this session; it could be from previous sessions as well. If there are no other questions, I can also talk about a related paper that my linguistics teacher at Stanford wrote. The way it works is basically that it uses linguistics knowledge to figure out how to insert missing information into a question. So if I ask something like, "what are house prices similar to where I live?" — which, first of all, isn't even grammatically clean — it will go through the question and insert words like "I," and then ask, okay, where does he live? It'll use RAG to figure out where I live, insert that into the question, and append that information to the entire query. Then it'll go through the internet and figure out the housing prices — maybe, if I come from the Chicago area, it'll look that up and return that information to me. Basically, there are all these ways you can make ChatGPT and similar LLM queries more powerful. It's not just about how you structure the prompt; it's about what other information you feed into it. How do you make the prompt more accurate? How do you add all the information needed to make LLMs more capable of answering the question, using the pre-trained knowledge they have, the general world model they have, and all the other information they're given at hand? It's easy for them to retrieve the relevant information from the documents you give them; the question is how you give them documents or information that make them accurate. That's the whole purpose of RAG. This is probably going to be a very important topic in your lifetimes, over the next ten or twenty years, because people are constantly looking into how we can better improve human-computer interaction, or human interaction with AI.
Anything that can make chatbots or similar models more helpful, more powerful, right? These kinds of techniques are always going to be interesting. So that's pretty much it for our class. If anyone has any questions, please let us know. We're happy to stay after, or you can post on the classroom. We'll make sure to post the homework in the slides. Thank you all for coming. Thank you, everyone. We'll see you next week.
Video Summary
The session focused on reviewing homework, explaining statistical concepts like mean and standard deviation, and exploring topics in machine learning and natural language processing (NLP). Initially, the instructor explained how to calculate the mean and standard deviation for different data sets and clarified when to use different formulas for populations versus samples. The session then covered concepts related to NLP and machine learning, emphasizing the use of embeddings and large language models (LLMs) like ChatGPT. Embeddings convert words into numerical vectors, which the models use to understand semantic relationships. Cosine similarity, a metric for comparing vector similarity, was explained. Retrieval Augmented Generation (RAG) was introduced as a method combining document retrieval with generation to ensure responses are contextually relevant and accurate. The session included a demonstration of the Pix2Pix model, an image-to-image translation tool. Various aspects of embeddings, including their use in Word2Vec and the differences with LLM embeddings, were discussed. The instructor answered questions about NLP applications, AI chatbots, and the technical details behind RAG and embeddings, stressing their importance in improving AI interactions.
Keywords
mean
standard deviation
machine learning
natural language processing
embeddings
large language models
cosine similarity
retrieval augmented generation
Pix2Pix model