If you've been anywhere near tech lately, you've heard "AI" applied to everything from chatbots to spreadsheet formulas. The word means almost nothing at this point. So let's reset.
Here's what AI actually looks like right now, stripped of the marketing.
The thing that changed everything: LLMs
The biggest shift is large language models, or LLMs. Claude, GPT, Gemini, Llama. These are programs that learned to understand and generate language by studying massive amounts of text. Think of them as autocomplete on steroids. They predict the most likely next piece of text given everything they've read so far, and they're shockingly good at it.
Before LLMs, getting a computer to understand a sentence like "I got charged twice and I'm not happy about it" took months of specialized engineering. Now you describe what you want in plain English and get a useful answer in under a second.
LLMs are good at understanding and generating text, classifying and summarizing information, writing code, following complex instructions, and working across dozens of languages.
They're not good at precise math (they guess based on patterns instead of actually computing), staying consistent across runs (same question, slightly different answers), or interacting with the outside world on their own. An LLM by itself is just a text-in, text-out function. It doesn't browse the web or run code unless you give it tools to do that.
Tokens and context windows
LLMs don't read words. They read tokens. A token is a piece of text that the model treats as a single unit. Sometimes it's a whole word, sometimes it's part of one. For example, the word "understanding" might get split into "under" and "standing" as two separate tokens. Short common words like "the" or "is" are usually one token each. A rough rule of thumb: 1 token is about 3/4 of a word, so 1,000 tokens is roughly 750 words.
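That rule of thumb is easy to sketch in code. This is only a ballpark heuristic, not a real tokenizer (real tokenizers split text into learned subword units, so actual counts vary by model):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~0.75 words-per-token rule of thumb.

    Real tokenizers split on learned subword units ("understanding" might
    become "under" + "standing"), so treat this as a ballpark only.
    """
    words = len(text.split())
    return round(words / 0.75)  # ~4/3 tokens per word

print(estimate_tokens("the quick brown fox"))  # 4 words -> ~5 tokens
```

For anything where the count actually matters (billing, context limits), use the provider's own tokenizer instead of a heuristic.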
Everything you send to an LLM (your message, the system prompt, any documents you include) gets converted into tokens. The model processes those tokens and generates new ones as its response.
The context window is how many tokens the model can handle at once, both input and output combined. Think of it as the model's working memory. If your conversation gets longer than the window, older content gets dropped. This is why long conversations sometimes feel like the model "forgot" what you said earlier.
A year ago, most models maxed out at around 200,000 tokens (roughly 500 pages of text). Now the top models handle 1 million tokens, and Meta's Llama 4 Scout pushes to 10 million. That's roughly 25,000 pages in a single prompt. There are also features like Anthropic's context compaction that auto-summarize earlier parts of a conversation to stretch sessions even further.
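The "older content gets dropped" behavior can be sketched as a simple trimming loop. This is a toy illustration using the words-per-token heuristic; real systems count tokens with the provider's tokenizer, and features like context compaction summarize old turns instead of discarding them outright:

```python
CONTEXT_WINDOW = 200_000  # tokens; varies by model

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~0.75 words per token.
    return round(len(text.split()) / 0.75)

def trim_to_window(messages: list[str], window: int = CONTEXT_WINDOW) -> list[str]:
    """Keep the most recent messages that fit in the window, dropping the oldest."""
    kept, total = [], 0
    for msg in reversed(messages):        # walk newest-first
        cost = estimate_tokens(msg)
        if total + cost > window:
            break                         # everything older is dropped
        kept.append(msg)
        total += cost
    return list(reversed(kept))           # restore chronological order
```

This is also why the model seems to "forget": the dropped messages are simply never sent to it again.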
Embeddings: turning meaning into numbers
Earlier we talked about tokens, which are how text gets chopped up into pieces. Embeddings are the next step: they capture what those pieces actually mean.
An embedding takes a chunk of text (a sentence, a paragraph, a whole document) and converts it into a list of numbers called a vector. Something like [0.12, -0.87, 0.45, ...], except with hundreds or thousands of numbers. These numbers represent the meaning of the text in a way that math can work with.
The key property is that similar meanings produce similar numbers. So "how do I cancel my subscription" and "I want to stop paying" would end up very close together in this number space, even though the words are completely different.
This is what makes semantic search possible. Instead of matching keywords (which breaks when people phrase things differently), you compare meanings. You'll see why this matters in the next section.
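"Close together in this number space" is usually measured with cosine similarity. The three-number vectors below are hand-written toys purely for illustration; real embeddings have hundreds or thousands of dimensions and come from an embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones come from a model, not by hand).
cancel_subscription = [0.9, 0.1, 0.2]
stop_paying         = [0.8, 0.2, 0.3]
pizza_recipe        = [0.1, 0.9, 0.1]

print(cosine_similarity(cancel_subscription, stop_paying))   # high, ~0.98
print(cosine_similarity(cancel_subscription, pizza_recipe))  # low,  ~0.24
```

The phrasings share no words, but their vectors point in nearly the same direction, which is exactly what keyword matching misses.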
RAG: giving LLMs access to your data
Here's a problem. You build a chatbot for your company and a customer asks "what's your refund policy?" The LLM has no idea. It was trained on public internet data, not your internal docs. It'll either make something up or say it doesn't know.
Retrieval-Augmented Generation (RAG) solves this. The idea is simple: before you ask the LLM a question, you find the relevant information from your own documents and paste it into the prompt. The LLM doesn't "know" your data. You're just showing it the right bits at the right time.
Say a customer asks about refunds. Your system searches your help docs, finds the page about refund policies, and builds a prompt like: "Based on the following policy document, answer the customer's question..." The LLM reads that context and gives an accurate answer grounded in your actual policy.
That's it. Search first, then ask. This is how most "chat with your docs" and "ask your knowledge base" products work. No fine-tuning, no retraining, no magic. Just search plus prompting.
Here's how it works step by step:
- Take all your documents and split them into smaller chunks (paragraphs or sections). This is just cutting up text, nothing fancy
- Convert each chunk into an embedding so you can search by meaning, not just keywords
- Store those embeddings in a database designed for this kind of search (Pinecone, Weaviate, Qdrant, and Milvus are popular options)
- When a user asks a question, convert it into an embedding too and find the chunks with the closest meaning
- Put those chunks in the prompt alongside the question
- The LLM reads the context and responds
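The steps above can be sketched end to end. To keep this self-contained, the "embedding" here is a stand-in bag-of-words counter; a real system would call an embedding model and store vectors in a vector database, and the final step would send the prompt to an LLM:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count. A real system would call
    an embedding model here; word counts only match shared words, not meaning."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-3: chunk the docs and index their embeddings.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Our office is open Monday through Friday.",
    "Shipping takes 5 to 7 business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 4: embed the question and retrieve the closest chunk.
question = "Can I get a refund within 30 days?"
q_vec = embed(question)
best_chunk, _ = max(index, key=lambda pair: similarity(q_vec, pair[1]))

# Step 5: put the retrieved chunk in the prompt. Step 6 would send it to the LLM.
prompt = (
    "Based on the following document, answer the question.\n\n"
    f"Document: {best_chunk}\n\nQuestion: {question}"
)
print(prompt)
```

Search first, then ask: the only "AI" in this sketch is the embedding step, and everything else is ordinary plumbing.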
This basic pattern works well for most use cases. Once you need more sophistication, there are advanced approaches (like having the LLM reason about what to search for, or combining meaning-based search with keyword matching), but the fundamentals above will get you far.
"Can I train it on my data?"
This is one of the most common questions people ask, and it's almost always the wrong question. What they actually want is for the AI to know things specific to their business. There are three ways to do that, and most people jump to the hardest one first.
Prompting is the simplest. You just tell the model what it needs to know at the time you ask the question. "You are a customer support agent for Acme Corp. Here's our refund policy: [paste policy]. Now answer the customer's question." The model reads your policy, understands it, and responds based on it.
This sounds too simple to work well, but it really does. You can give the model a persona, rules to follow, examples of good and bad responses, and reference documents, all in the same prompt. It costs nothing to set up, you can change it instantly, and for the vast majority of use cases, this is all you need.
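Under the hood, prompting like this is just string assembly. The company name, rules, and policy text below are placeholders; in practice they come from your own docs:

```python
def build_support_prompt(policy: str, question: str) -> str:
    """Assemble a prompt with a persona, rules, and reference material."""
    return "\n\n".join([
        "You are a customer support agent for Acme Corp.",
        "Rules: be concise, cite the policy, never promise exceptions.",
        f"Refund policy:\n{policy}",
        f"Customer question: {question}",
    ])

prompt = build_support_prompt(
    policy="Refunds are available within 30 days of purchase.",
    question="I bought this last week. Can I get my money back?",
)
print(prompt)
```

Changing the model's behavior is just editing these strings, which is why prompting is so much cheaper to iterate on than anything involving training.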
RAG (which we covered earlier) is the next step. Instead of pasting documents into the prompt by hand, you build a system that automatically finds the right documents based on the user's question and puts them in the prompt. This is how you handle large amounts of data that won't fit in a single prompt. The model still doesn't "know" your data permanently. It reads the relevant bits fresh every time.
Fine-tuning is what people usually mean when they say "train it on my data," but it's rarely what they actually need. Fine-tuning takes a base model and retrains it on thousands of your own examples (pairs of inputs and desired outputs). The model's internal behavior changes permanently. After fine-tuning, it doesn't need your data in the prompt anymore because it's baked into how the model works.
This sounds powerful, and it is, but the tradeoffs are real. Fine-tuning is expensive (both the compute and the data preparation), takes days or weeks to get right, requires ongoing maintenance as your data changes, and can actually make the model worse at things outside your specific use case. It also doesn't add new factual knowledge reliably. If you fine-tune a model on your company docs, it might learn your writing style and terminology, but it won't reliably memorize specific facts like your pricing or return policy. For factual accuracy, RAG is almost always better.
So when does fine-tuning actually make sense? When you need the model to consistently follow a very specific output format, adopt a particular writing voice, or handle a specialized domain (like medical or legal language) in a way that prompting alone can't achieve. These cases exist, but they're rarer than people think.
The practical order is: start with prompting, add RAG if you have lots of documents, and only consider fine-tuning when you've genuinely exhausted the first two.
Agents: LLMs that can do things
An LLM by itself just reads text and writes text. An AI agent takes this further by giving the LLM the ability to use tools. Instead of just generating a response, it can decide to search the web, run code, call an API, read a file, or take other actions. It plans, acts, observes the result, and decides what to do next.
For example, if you ask an agent "find out who our top 5 competitors are and compare their pricing to ours," it might search the web, visit several pricing pages, read your own pricing doc, and write a summary. You didn't spell out those steps. The agent figured them out.
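The plan-act-observe loop can be sketched as follows. Everything here is a stub: the tools return canned strings, and `fake_model` stands in for the LLM's decision-making (a real agent sends the conversation to an LLM and parses which tool it chose):

```python
# Stub tools: real ones would hit a search API, the filesystem, etc.
def search_web(query: str) -> str:
    return f"(stub) search results for: {query}"

def read_file(path: str) -> str:
    return f"(stub) contents of {path}"

TOOLS = {"search_web": search_web, "read_file": read_file}

def fake_model(task: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM's decision. Returns (action, argument)."""
    if not observations:
        return ("search_web", task)            # first: look things up
    if len(observations) == 1:
        return ("read_file", "pricing.md")     # then: check our own docs
    return ("answer", f"summary based on {len(observations)} observations")

def run_agent(task: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        action, arg = fake_model(task, observations)
        if action == "answer":
            return arg                            # the model decides it's done
        observations.append(TOOLS[action](arg))   # act, then observe the result
    return "gave up after max_steps"

print(run_agent("compare competitor pricing"))
```

The loop itself is trivially simple; all the intelligence lives in the model's choice of the next action, which is also why agents are hard to control.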
The major frameworks for building agents right now:
- Claude Agent SDK powers Claude Code internally. Best for long-running, sophisticated single-agent tasks
- OpenAI Agents SDK makes multi-agent handoffs simple, but locks you into OpenAI
- LangGraph is the production standard for complex workflows
- CrewAI is best for quick prototyping with teams of agents working together
Agents can even use computers now. Anthropic's computer use API lets agents click buttons, fill forms, and navigate software. OpenAI's Operator handles browser tasks like booking flights. This used to be a research demo. Now it's a real product.
All these tools connect to external services through something called the Model Context Protocol (MCP). Think of it like USB for AI: you build a connector once and any model can use it. It's backed by Anthropic, OpenAI, Google, Microsoft, and AWS, and has become the standard way to give LLMs access to tools and data.
All that said, agents are still slow, expensive, and hard to control. They work best for open-ended research and investigation. For most tasks, a simple pipeline with an LLM call in the middle is the better choice. We wrote more about this in You Probably Don't Need an AI Agent.
Multimodal: beyond text
Everything we've talked about so far has been mostly about text. But AI models have been steadily getting better at seeing, hearing, and creating visual content too. By now, multimodal is the default, not a special feature.
You can drop a screenshot, a PDF, a chart, or a photo into Claude, GPT, or Gemini and just ask questions about it. No preprocessing, no special setup. Companies are already using this in production: customers photograph a broken router and the AI diagnoses the issue from the image, manufacturers use cameras to catch defects on production lines, hospitals test systems where nurses show a patient's condition on video and the AI helps with triage.
On the audio side, OpenAI's voice models can hold a real-time spoken conversation with latency that feels close to a natural phone call. Gemini can do live speech-to-speech translation while preserving the speaker's tone and pitch. Claude doesn't have native audio yet, which is one of its bigger gaps.
For generation, AI can now create both images and video. GPT Image 1.5 is built directly into the language model, so you can refine images through conversation. Midjourney v7 is the most artistic. Google's Imagen 4 is the most photorealistic. On the video side, Sora 2 and Veo 3 produce cinematic footage with synchronized audio at up to 4K resolution. Video generation has gone from "interesting demo" to genuinely useful in about a year.
The big technical shift is that newer models are trained on text, images, audio, and video all together from the start, rather than bolting separate models together after the fact. This means they can reason across what they see, hear, and read at the same time, which is why things like conversational image editing and real-time video Q&A actually work well now.
Running AI locally
Not everything needs an API call. Some models are "open weight," meaning anyone can download and run them on their own hardware. The big names right now are DeepSeek V3.2 (which rivals GPT-5 on reasoning benchmarks), Llama 4, Qwen 3.5, and Mistral 3. But here's the catch: the full-size models that compete with cloud APIs have hundreds of billions of parameters. You're not running those on your laptop. What people actually run locally are smaller versions, typically 7 to 32 billion parameters, either distilled from the larger models or designed for efficiency from the start.
The most popular tools are Ollama (pull-and-run from the command line, like Docker for models) and LM Studio (a desktop app with a nice GUI and model browser). Both are built on llama.cpp under the hood, so performance is similar.
What hardware do you actually need? The hard constraint is memory. The model has to fit in your GPU's video memory (VRAM), or in unified memory on Apple Silicon Macs. A technique called quantization helps here: it compresses the model's weights from 16-bit to 4-bit precision, cutting memory use by roughly 75% with only a small hit to quality. A 7B parameter model goes from about 14GB down to 4GB after quantization, and a 70B model shrinks from 140GB to around 40GB.
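Those memory figures fall straight out of parameters times bytes per parameter. A quick sketch of the arithmetic (weights only; KV-cache and runtime overhead add a few GB on top, which is why the prose rounds 35GB up to "around 40GB"):

```python
def model_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory: parameter count x bytes per parameter.
    Ignores KV-cache and runtime overhead."""
    bytes_per_param = bits / 8
    return params_billion * 1e9 * bytes_per_param / 1e9  # gigabytes

print(model_memory_gb(7, 16))   # 14.0 GB at 16-bit
print(model_memory_gb(7, 4))    # 3.5 GB at 4-bit (~4GB with overhead)
print(model_memory_gb(70, 16))  # 140.0 GB at 16-bit
print(model_memory_gb(70, 4))   # 35.0 GB at 4-bit (~40GB with overhead)
```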
On an NVIDIA GPU, an RTX 4090 (24GB VRAM) comfortably runs models up to 32 billion parameters at 4-bit quantization, producing responses at 100+ tokens per second on a 7B model. That's faster than you can read. An RTX 3060 with 12GB handles 7B models well. For 70B models, you'd need the new RTX 5090 (32GB) or a pair of high-end cards.
Macs have a surprising advantage here. Because Apple Silicon shares memory between the CPU and GPU, a MacBook Pro with 48GB of unified memory can load a 32B quantized model that would need a dedicated GPU on a PC. An M4 Max with 64 to 128GB can even handle 70B models, though at slower speeds (15 to 30 tokens per second). Apple's MLX framework is optimized for this and runs about 20 to 30% faster than generic tools on Mac.
Running on CPU alone (no GPU) is possible but painful. You'll get usable speed only with tiny models (3B parameters or less). Anything larger crawls at a few tokens per second.
How does local compare to cloud? The 7 to 14B models most people run locally are solid for code completion, summarization, and simple Q&A, roughly matching what the largest cloud models could do two years ago. But they fall noticeably short on complex reasoning, long documents, and creative writing compared to current cloud models like Claude Opus or GPT-5. Local AI makes the most sense when you care about privacy (data never leaves your machine), need to work offline, want to avoid per-token API costs on high-volume tasks, or just want to experiment without a credit card.
AI coding tools
This is probably the clearest AI success story right now. The tools have split into three tiers:
Autonomous agents that plan, write, and verify code across entire projects: Claude Code, OpenAI Codex, and AWS Kiro. These handle complex multi-file changes on their own.
IDE-integrated agents that work alongside you in your editor: Cursor (smoothest experience), Windsurf, and Google's Antigravity. These feel like a really good pair programmer.
Inline assistants for suggestions and quick help: GitHub Copilot, which has the broadest adoption and now includes an agent mode too.
Every tool is racing toward fully autonomous coding. The practical reality is that they work best when experienced developers use them for acceleration, not as replacements.
The current models at a glance
Now that you know what tokens, context windows, and agents mean, here's where the major models stand:
| Model | Provider | Context Window | What stands out |
|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 1M tokens | Multiple agents working together, context compaction |
| Claude Sonnet 4.6 | Anthropic | 1M tokens | Strong balance of coding and general use |
| GPT-5.4 Thinking | OpenAI | 1M tokens | Built-in computer use, strong at math |
| Gemini 3.1 Pro | Google | 1M tokens | Adaptive reasoning, built-in fact checking |
| Llama 4 Maverick | Meta | 1M tokens | Open weight, handles text and images natively |
| Llama 4 Scout | Meta | 10M tokens | Largest context window of any model |
Pricing has dropped roughly 80% year over year. The cheapest options (Claude Haiku, GPT-5 mini) run at around $0.25 per million input tokens. The most powerful models cost a few dollars per million tokens.
What's real and what's noise
The stuff that's actually delivering value right now is more boring than the headlines suggest. AI coding tools are making experienced developers noticeably faster. RAG-based search and Q&A over company data works well when scoped properly. Tightly defined automation (routing support tickets, processing documents, checking compliance) is saving real time and money. And for individual productivity, using an LLM for writing, research, and analysis with a human checking the output is genuinely useful.
The overhyped stuff is equally predictable. "AGI by next year" is still being said, still not happening. The dream of fully autonomous AI employees replacing entire teams hasn't materialized. And despite billions poured into enterprise AI initiatives, 90 to 95% of organizations are seeing little to no measurable financial return. A lot of AI companies are operating at razor-thin margins compared to traditional software, and the gap between valuations and reality is getting harder to ignore.
What's actually interesting to keep an eye on is quieter. MCP is making it easier to connect AI to pretty much anything. New model architectures are bringing the cost of running large models way down. The way models reason through hard problems is improving in ways that feel qualitatively different from just making them bigger. And local AI running on your own hardware is getting surprisingly good as models become more efficient.
Where to start if you're building
- Pick an API. Claude or GPT. Both are good. Cost is low enough that it shouldn't be the deciding factor.
- Build something small. A classifier, a summarizer, a Q&A bot over a document. Something you can finish in a weekend.
- Learn prompting well. This is still the highest-leverage skill. A good prompt replaces thousands of lines of code.
- Add RAG when you need it. When your app needs to answer questions about specific data, add a retrieval step before the LLM call.
- Avoid agents until you actually need them. They add complexity. Start with deterministic workflows and add autonomy only where it's genuinely required.
The AI landscape is simpler than it looks once you cut through the noise. Most of what matters fits in this post.
