Fine-tuning
Training a pre-existing model on a specific dataset to improve its performance on a narrow task or domain.
Fine-tuning is the process of taking a pre-trained language model and training it further on a specific dataset to improve its performance for a particular task or domain. The base model has broad general knowledge from its original training, and fine-tuning specializes it. It is like the difference between hiring a general consultant and hiring a specialist who has spent years in your exact industry.
Why it matters: while prompt engineering (including few-shot) can handle many tasks, some use cases require consistent, domain-specific behavior that is hard to achieve through prompting alone. Fine-tuning embeds knowledge and patterns directly into the model's weights, resulting in more reliable, faster, and often cheaper outputs (because you need shorter prompts). For companies processing thousands of similar requests per day, fine-tuning can significantly improve both quality and cost efficiency.
When to fine-tune vs. when to prompt: prompt engineering is the right starting point for almost every use case. Move to fine-tuning when: you need extremely consistent output format or style, you have a narrow task with clear right/wrong answers, you need to reduce latency or cost (fine-tuned models need fewer prompt tokens), or your task requires deep domain knowledge that few-shot examples cannot cover. Fine-tuning is not a substitute for good prompts. The best results come from fine-tuning plus good prompting.
How to do it: prepare a training dataset of input-output examples (typically 50-500 examples for basic fine-tuning, more for complex tasks). Format them according to the provider's requirements. OpenAI, Anthropic, Google, and Cohere all offer fine-tuning APIs. Open-source options (using Hugging Face's Trainer, QLoRA, or Axolotl) let you fine-tune on your own infrastructure. The process typically takes minutes to hours depending on dataset size and model size.
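As a concrete illustration of the dataset-preparation step, the sketch below writes examples in OpenAI's chat-style JSONL format (one JSON object per line, each with a `messages` list). Other providers use similar but not identical schemas, so check the docs for whichever API you target; the example content here is invented.

```python
import json

# Each fine-tuning example pairs an input with the ideal output.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support assistant for Acme Corp."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Go to Settings > Security and choose 'Reset password'. You'll get an email link within a minute."},
        ]
    },
    # ...typically 50-500 such examples for basic fine-tuning
]

with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Quick sanity check: every line must parse as JSON and contain
# at least a user turn and an assistant turn.
with open("training_data.jsonl") as f:
    for line in f:
        record = json.loads(line)
        roles = [m["role"] for m in record["messages"]]
        assert "user" in roles and "assistant" in roles
```

Validating the file locally before uploading catches formatting errors early, since a single malformed line can fail the whole training job.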
Cost and considerations: fine-tuning has upfront costs (compute time for training) and ongoing costs (hosting the fine-tuned model, which may be slightly more expensive than the base model). Start with a small training set, evaluate on a test set, and iterate. Keep a diverse, high-quality training set. Bad training data produces a model that is confidently wrong.
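The "evaluate on a test set" step is simple to get right with a seeded holdout split. This is a minimal sketch using synthetic placeholder data; the function and field names are illustrative, not from any provider's SDK.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Hold out a slice of examples for evaluating the fine-tuned model.

    Score the model only on the held-out set; evaluating on data it was
    trained on gives misleadingly high numbers.
    """
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cutoff = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

# Placeholder dataset standing in for real ticket/response pairs.
dataset = [{"input": f"ticket {i}", "output": f"reply {i}"} for i in range(100)]
train_set, test_set = train_test_split(dataset)
# 80 examples go to training, 20 are held out for evaluation.
```

Keeping the seed fixed means every training iteration is evaluated against the same held-out examples, so score changes reflect the model, not the split.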
Common mistakes: fine-tuning when prompt engineering would suffice (over-engineering). Training on too little data (50 examples is the minimum for most platforms, but 200+ is better). Not holding out a test set to evaluate the fine-tuned model. Training on biased or unrepresentative data. Expecting fine-tuning to add factual knowledge the model does not have (fine-tuning adjusts behavior and style, RAG adds external knowledge).
Practical example: a customer support team processes 500+ tickets per day. They fine-tune a model on 300 examples of tickets paired with ideal responses, categorized by issue type. The fine-tuned model drafts responses that match their brand voice, follow their escalation protocols, and include relevant knowledge base links. Agent response time drops from 8 minutes to 3 minutes per ticket (the agent reviews and sends the draft), and customer satisfaction scores improve because responses are more consistent.
Related terms
Large Language Model. A neural network trained on massive text data that can generate, summarize, and reason about language.
Few-shot prompting. Providing a small number of examples in a prompt to guide an LLM's output format and behavior.
Prompt engineering. The practice of crafting inputs to an LLM to reliably produce desired outputs, including system prompts and few-shot examples.
Embeddings. Numerical vector representations of text that capture semantic meaning, used for search, clustering, and recommendations.