Large language models (LLMs) have taken the world by storm, and for good reason. They possess the remarkable ability to comprehend and generate human-like text, opening up a horizon of new applications.
But have you ever wondered how these models achieve such a feat? Well, they are first trained on massive text datasets in a process known as pre-training, gaining a solid grasp of grammar, facts, and reasoning.
Next comes fine-tuning, where they specialize in particular tasks or domains, honing their skills for more niche applications.
And let's not forget the one that makes prompt engineering possible: in-context learning, which lets models adapt their responses on the fly based on the specific queries or prompts they are given, producing more tailored and flexible output.
Curious about how these processes work and the differences between them? Let’s take a deeper look.
Pre-training is a foundational step in the LLM training process, where the model gains a general understanding of language by exposure to vast amounts of text data.
Think of it as the model's introduction to the world of words, phrases, and ideas. During this phase, the LLM learns grammar rules, linguistic patterns, factual information, and reasoning abilities.
To achieve this, the model is fed an extensive dataset containing diverse texts from books, articles, websites, and more. Take the example of GPT-3, with its 175 billion parameters: it ingested around 570 GB of text data during pre-training. This is like reading hundreds of thousands of books in multiple languages, giving the model a rich tapestry of language to draw from.
Here's an overview of the characteristics of pre-training LLMs.
Pre-training relies on unsupervised learning: the model is immersed in a vast sea of text data without any right or wrong answers for guidance. It's a bit like throwing someone into a language immersion program where they learn through sheer exposure and context, absorbing the intricacies of language over time, without explicit instructions.
Since there are no labeled right or wrong answers, one common technique used to give the model a learning structure is masked language modeling.
Imagine the model as a language detective trying to decipher a sentence. It's presented with sentences where certain words are intentionally masked or missing. The model has to deduce what those missing words could be based on the context. Then it's given the correct answer and analyzes how far off it was to improve its ability to predict.
This process helps the model understand how words relate to one another and how they fit within the bigger picture of a sentence.
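To see masked language modeling in action, here's a minimal sketch using the Hugging Face transformers library (the bert-base-uncased checkpoint and the example sentence are illustrative choices, not details from this article):

```python
from transformers import pipeline

# Load a model that was pre-trained with masked language modeling
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model deduces the hidden word from the surrounding context
for guess in fill_mask("The chef tasted the soup and added more [MASK]."):
    print(f"{guess['token_str']!r} ({guess['score']:.0%})")
```

Words that fit the context, such as "salt," should rank near the top, precisely because the model has learned how words fit within the bigger picture of a sentence.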
Think of the Transformer architecture as a sophisticated web of connections between words. It's like the model's brain wiring that enables it to capture relationships between words, even if they're far apart.
Models built on the Transformer architecture don't simply predict the next word from the immediate sequence of words before it; they first pick out the relevant words from the entirety of the preceding text. The architecture's attention mechanism allows the model to consider the entire context and prioritize the relevant parts, capturing nuances that are crucial for understanding the meaning.
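To ground this, here's a minimal sketch of the scaled dot-product attention at the heart of the Transformer, simplified to a single attention head with NumPy standing in for a real deep learning framework:

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention over one sequence (single head)."""
    # Score every word's query against every other word's key,
    # regardless of how far apart the words are in the sequence
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # Softmax turns scores into weights that sum to 1,
    # prioritizing the most relevant parts of the context
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each word's output is a relevance-weighted blend of all words' values
    return weights @ values

# Toy example: 5 "words", each represented by an 8-dimensional vector
x = np.random.randn(5, 8)
print(attention(x, x, x).shape)  # (5, 8)
```

The key property is in the first comment: the score matrix compares every position with every other position, which is how relationships between distant words get captured.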
Let's explore the world of pre-trained models and how they allow for a range of language tasks:
Think of pre-trained models as the ultimate storytellers. They can whip up engaging narratives, generate creative poetry, and provide responses that feel remarkably human.
Businesses are using this ability for virtual assistants that guide customers through troubleshooting or chatbots that offer real-time support (although that often involves fine-tuning). For instance, the AI-powered chatbot 'Woebot' provides mental health support by engaging users in conversations akin to a therapeutic session.
Imagine having a multilingual friend who can instantly translate conversations. Pre-trained models — exposed to a wide range of languages — can do just that. Companies like Airbnb have leveraged this to enhance user experience by automatically translating host reviews and messages into different languages. It's like having a universal translator in your pocket!
Consider a tool that can gauge the emotions expressed in text, a task known as sentiment analysis. Pre-training alone isn't enough here; you'd need to fine-tune the pre-trained model on sentiment-labeled data to do this reliably.
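As a quick illustration, here's a minimal sketch using the Hugging Face transformers library to load a model that has already been fine-tuned on sentiment-labeled data (the default checkpoint is the library's choice, not something specified here):

```python
from transformers import pipeline

# Loads a checkpoint that has already been fine-tuned on sentiment-labeled
# data (by default, a DistilBERT model trained on the SST-2 dataset)
classifier = pipeline("sentiment-analysis")

print(classifier("The checkout process was fast and painless."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```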
If pre-training is like the primary and secondary education where the LLM gets its general language skills, fine-tuning is like on-the-job training. It hones the model's abilities for particular tasks or domains, transforming it from a language learner into a task-specific expert.
For instance, scientists might fine-tune a model with a dataset of medical texts, making it exceptional at understanding medical jargon and answering health-related questions.
Similarly, you can train a model on legal documents, turning it into a whizz at summarizing contracts and legal discussions.
As you can guess, fine-tuning typically requires a smaller, specialized dataset compared to the massive one used in pre-training. But this small dataset makes a big impact, often resulting in significant improvements in task performance.
The following are three major characteristics of fine-tuning LLMs:
Fine-tuning employs a strategy known as transfer learning. The model takes the understanding it gained during pre-training (e.g. learning grammar and syntax) and tailors it to the specific task at hand. This accelerates learning and makes the model more efficient in tackling new challenges.
Imagine the LLM as a student studying for an exam. Fine-tuning involves providing the model with task-specific study material. For instance, if it's learning to categorize news articles, it's given a dataset of labeled articles. This targeted information equips the model with the domain expertise needed to excel in that task.
As the model processes task-specific data, it calculates the difference between its predictions and the actual outcomes. A loss function measures this error, and from it the model computes gradients that indicate how each parameter should be adjusted. Optimization techniques such as gradient descent then use this gradient information to iteratively fine-tune the model's parameters.
This minimizes prediction errors and enhances the LLM's task-specific expertise.
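In code, that loop looks something like the following sketch, where a stand-in linear classifier and random tensors replace a real LLM and dataset purely to show the mechanics:

```python
import torch
from torch import nn, optim

# A stand-in linear classifier replaces a real LLM; the data is random,
# purely to show the mechanics of the fine-tuning loop
model = nn.Linear(768, 3)                 # 768-dim features -> 3 task labels
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(32, 768)           # placeholder task-specific inputs
labels = torch.randint(0, 3, (32,))       # placeholder correct outcomes

for step in range(100):
    logits = model(features)              # the model's predictions
    loss = loss_fn(logits, labels)        # gap between predictions and labels
    loss.backward()                       # gradients: how to nudge each parameter
    optimizer.step()                      # apply the adjustments
    optimizer.zero_grad()                 # reset for the next iteration
```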
Fine-tuned LLMs have a wide range of applications, including:
Support issue prioritization
Fraud detection
Blog writing
Lead qualification
Text classification
Question answering
And more
Let’s talk about two of these in more detail.
Fine-tuning makes it easy for language models to classify text. For example, you can fine-tune a language model to categorize customer reviews into "Positive," "Neutral," or "Negative". This way, you can quickly know how customers feel about your products.
Similarly, in the news industry, a model can be fine-tuned to classify articles as "Sports," "Politics," or "Technology," helping news outlets organize and deliver content more effectively.
A real-life example is Gmail, which employs text classification to filter emails into categories like "Primary," "Social," and "Promotions," making it easier for you to manage your inbox.
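To make the training data concrete, here's a hedged sketch of how you might assemble a small review-classification dataset in the JSONL chat format that several fine-tuning APIs accept (the examples are invented, and the exact schema varies by provider):

```python
import json

# Invented examples for illustration; a real dataset would have many more
examples = [
    {"text": "Arrived two weeks late and the box was crushed.", "label": "Negative"},
    {"text": "Does what it says. No complaints.", "label": "Neutral"},
    {"text": "Best purchase I've made all year!", "label": "Positive"},
]

# Write one chat-formatted training example per line (JSONL)
with open("reviews.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "user", "content": f"Classify this review: {ex['text']}"},
            {"role": "assistant", "content": ex["label"]},
        ]}
        f.write(json.dumps(record) + "\n")
```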
Picture a virtual expert that can give you instant answers to your questions. Fine-tuning can create this expert, which can be especially useful for technical industries.
For example, a hospital could fine-tune a model on medical question-answer pairs, making it adept at answering health-related inquiries. If you ask the model about the ideal blood pressure range, it can draw on that training to provide a precise answer.
This concept extends to customer support. A model fine-tuned on support queries can help users troubleshoot issues effectively. An interesting example in this regard is H&M, whose AI chatbot chats with customers to learn their style, and also lets them buy and share products from the company.
In-context learning refers to a method of guiding the model's behavior based on the specific context given to it during prompt formulation.
Unlike fine-tuning, in-context learning doesn't require altering the model's parameters or training the model on a specific dataset.
Rather, you provide the model with a prompt or set of instructions within the interaction itself to influence its responses. The model uses this prompt to condition its output, generating a response that is appropriate to the context specified.
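For example, here's a minimal few-shot prompt using the OpenAI Python client; the model name and the examples are assumptions for illustration, and any capable chat model works the same way:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The "training" lives entirely in the prompt: a task description plus a
# few worked examples, with no change to the model's parameters
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "Classify each review as Positive, Neutral, or Negative."},
        {"role": "user", "content": "Review: Shipping was slow but the product is great."},
        {"role": "assistant", "content": "Neutral"},
        {"role": "user", "content": "Review: Stopped working after two days."},
        {"role": "assistant", "content": "Negative"},
        {"role": "user", "content": "Review: Exceeded my expectations in every way."},
    ],
)
print(response.choices[0].message.content)  # expected: "Positive"
```

Swap in different examples and the same model behaves differently, with no retraining step in between.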
Here are five key differences between in-context learning and fine-tuning of LLMs to make this more clear:
Methodology: In-context learning relies on carefully designed prompts to guide the model, while fine-tuning modifies the model's parameters through additional training. To change the output with in-context learning, you modify the single prompt; to change it with fine-tuning, you add, edit, or remove any number of training examples in the dataset.
Flexibility: In-context learning is more flexible and can be done on the fly for various tasks without requiring a retraining process. Fine-tuning, however, specializes the model for specific tasks at the cost of this flexibility. That makes in-context learning great for prototyping and fine-tuning ideal for "baking in" the expected behavior for long-term use.
Resource requirements: In-context learning doesn't require computational resources beyond those needed for running inference, while fine-tuning requires additional compute and data to retrain the model. However, the fine-tuned model typically needs fewer resources at inference time, because it can use fewer tokens or run on a lighter model with fewer parameters. This matters because training is a one-time event, whereas inference incurs ongoing costs.
Expertise needed: Fine-tuning usually requires a deeper understanding of machine learning and the specific problem domain, while in-context learning can often be carried out with a well-crafted prompt. However, modern software tools for fine-tuning make it much more accessible to people without technical backgrounds.
Data: Fine-tuning requires a labeled dataset that is representative of the task at hand, whereas in-context learning does not. This makes fine-tuning suitable for an optimization that follows the validation of an idea with an engineered prompt, which can also help generate the initial fine-tuning dataset.
Note: In-context learning and fine-tuning are not mutually exclusive—a fine-tuned model can also include an instructional prompt, which can be useful for smaller datasets.
When it comes to in-context learning with LLMs, the use cases are very similar to those of fine-tuning, but the behavior is shaped per interaction rather than baked into the model.
Let’s check two examples:
In-context learning takes dialogue systems to the next level. Imagine a virtual assistant that not only understands your questions but also responds contextually. For instance, Capital One's virtual assistant, Eno, utilizes in-context learning to engage in natural customer conversations. It helps users manage their accounts, track expenses, and even get financial advice, making banking interactions more user-friendly and efficient.
Think of in-context learning as an advanced text predictor. It's like the auto-suggestions on your phone's keyboard, but smarter.
So, imagine writing an email and an AI model suggests the next words in a way that fits the context perfectly. This can be a boon for content creators, helping them write coherent and engaging pieces faster.
Take the example of Grammarly, the popular writing assistant that leverages in-context learning to offer suggestions that align with the user's writing style. It predicts words and phrases based on context, aiding writers in crafting coherent content.
As you might have realized, some of the most powerful applications of AI emerge once it’s fine-tuned. And that’s exactly what Entry Point helps you do (without any coding knowledge). Curious to learn how? Try Entry Point for free today!