LLM · 15 min read

What Even Is AI Right Now

A no-hype look at where AI actually stands in 2026. What's real, what's noise, and what matters if you're building things.

Ram Bakthavachalam

If you've been anywhere near tech lately, you've heard "AI" used for everything from chatbots to spreadsheet formulas. The word has lost its meaning at this point. So let's start over.

Here's what AI actually looks like right now, without the marketing.

The big thing: LLMs

The biggest change is large language models. LLMs. Claude, GPT, Gemini, Llama. These are programs that learned to understand and generate language by processing massive amounts of text. They predict the most likely next piece of text based on everything they've seen, and they're very good at it.

Before LLMs, if you wanted a computer to understand a sentence like "I got charged twice and I'm not happy about it," that took months of specialized engineering. Now you describe what you want in plain English and get a useful answer in under a second.

LLMs are good at understanding and generating text, classifying and summarizing information, writing code, following complex instructions, and working across dozens of languages.

They're not good at precise math (they guess patterns instead of actually computing), staying consistent across runs (same question, slightly different answers), or doing anything in the real world on their own. An LLM by itself is just text in, text out. It doesn't browse the web or run code unless you give it tools to do that.

Tokens and context windows

LLMs don't read words. They read tokens. A token is a piece of text the model treats as one unit. Sometimes it's a full word, sometimes it's part of a word. "Understanding" might get split into "under" and "standing" as two tokens. Short words like "the" or "is" are usually one token. A rough rule: 1 token is about 3/4 of a word. 1,000 tokens is about 750 words.
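If you want to see tokenization yourself, here's a quick check using OpenAI's tiktoken library; other providers use their own tokenizers, so exact counts vary by model.

```python
# Inspect how a sentence splits into tokens (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

text = "I got charged twice and I'm not happy about it."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])  # the actual pieces the model sees
```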

Everything you send to an LLM (your message, the system prompt, any documents) gets converted to tokens. The model processes them and generates new tokens as its response.

The context window is how many tokens the model can handle at once, input and output combined. Think of it as working memory. If your conversation gets too long, older parts get dropped. This is why long conversations sometimes feel like the model forgot what you said earlier.

A year ago, most models handled around 200,000 tokens (roughly 500 pages of text). Now the top models handle 1 million tokens, and Meta's Llama 4 Scout goes up to 10 million. That's roughly 25,000 pages in one prompt. There are also features like Anthropic's context compaction that summarize earlier parts of a conversation to make sessions last longer.

Embeddings: turning meaning into numbers

Earlier we covered tokens, the pieces text gets split into. Embeddings go further: they capture what those pieces mean.

An embedding takes a piece of text (a sentence, paragraph, or document) and turns it into a list of numbers. Something like [0.12, -0.87, 0.45, ...], but with hundreds or thousands of numbers. These numbers represent the meaning in a way that math can work with.

The important thing is that similar meanings produce similar numbers. "How do I cancel my subscription" and "I want to stop paying" end up very close together in this number space, even though the words are completely different.
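Here's a minimal sketch of that comparison. It assumes OpenAI's embeddings endpoint and the openai Python package, but any embedding model works the same way in principle.

```python
# Embed two differently-worded sentences and measure how close they are.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I cancel my subscription", "I want to stop paying"],
)
a, b = (np.array(d.embedding) for d in resp.data)

# Cosine similarity: 1.0 means same direction, near 0 means unrelated.
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(similarity), 3))  # close phrasings score high
```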

This is what makes semantic search possible. Instead of matching keywords (which fails when people phrase things differently), you compare meanings. This matters for the next section.

RAG: giving LLMs your data

Here's a problem. You build a chatbot for your company and a customer asks "what's your refund policy?" The LLM doesn't know. It was trained on public internet data, not your internal docs. It will either make something up or say it doesn't know.

RAG (Retrieval-Augmented Generation) fixes this. The idea is simple: before asking the LLM a question, find the relevant information from your documents and put it in the prompt. The LLM doesn't "know" your data. You're just showing it the right pieces at the right time.

A customer asks about refunds. Your system searches your help docs, finds the refund policy page, and puts it in the prompt: "Based on this policy document, answer the customer's question..." The LLM reads it and gives an answer based on your actual policy.

That's it. Search first, then ask. This is how most "chat with your docs" products work. No fine-tuning, no retraining. Just search plus prompting.

Here's how it works, step by step:

  1. Split all your documents into smaller chunks (paragraphs or sections)
  2. Convert each chunk into an embedding so you can search by meaning
  3. Store those embeddings in a database built for this (Pinecone, Weaviate, Qdrant, Milvus are popular ones)
  4. When a user asks a question, convert it into an embedding too and find the closest chunks
  5. Put those chunks in the prompt with the question
  6. The LLM reads the context and responds

This basic approach works well for most cases. There are more advanced methods (like having the LLM decide what to search for, or mixing meaning-based search with keyword search), but the basics above will get you far.
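To make those steps concrete, here's a stripped-down sketch of the whole loop. It uses a plain Python list in place of a real vector database and assumes OpenAI's embeddings and chat APIs; the model names are placeholders for whatever you actually use.

```python
# Minimal RAG: embed chunks, find the closest one, put it in the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(d.embedding) for d in resp.data]

# Steps 1-3: split docs into chunks and embed them (pre-chunked here).
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Support is open Monday to Friday, 9am to 5pm.",
]
chunk_vectors = embed(chunks)

# Step 4: embed the question and find the closest chunk by cosine similarity.
question = "Can I get my money back?"
q = embed([question])[0]
scores = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in chunk_vectors]
best = chunks[int(np.argmax(scores))]

# Steps 5-6: put the retrieved chunk in the prompt and ask the model.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Based on this policy: {best}\n\nAnswer the question: {question}",
    }],
)
print(answer.choices[0].message.content)
```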

"Can I train it on my data?"

This is one of the most common questions, and it's almost always the wrong one. What people actually want is for the AI to know things about their business. There are three ways to do that, and most people go for the hardest option first.

Prompting is the simplest. You tell the model what it needs to know when you ask the question. "You are a customer support agent for Acme Corp. Here's our refund policy: [paste policy]. Answer the customer's question." The model reads your policy and responds based on it.

This sounds too simple, but it works. You can give the model a role, rules, examples of good and bad responses, and reference documents, all in one prompt. It costs nothing, you can change it right away, and for most use cases, this is all you need.
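Here's roughly what that looks like in code, using Anthropic's Python SDK; the model ID and policy text are placeholders for your own.

```python
# Prompting: give the model a role and the facts it needs, in one call.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

policy = "Refunds are available within 30 days of purchase with a receipt."

message = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: use your provider's current model ID
    max_tokens=500,
    system=(
        "You are a customer support agent for Acme Corp. "
        f"Here is our refund policy: {policy} "
        "Answer only from this policy."
    ),
    messages=[{"role": "user", "content": "I bought this two weeks ago. Can I return it?"}],
)
print(message.content[0].text)
```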

RAG (covered above) is the next level. Instead of pasting documents into the prompt by hand, you build a system that finds the right documents based on the question and puts them in the prompt automatically. This is how you handle lots of data that won't fit in a single prompt. The model still doesn't "know" your data. It reads the relevant parts fresh every time.

Fine-tuning is what people usually mean when they say "train it on my data," but they rarely need it. Fine-tuning takes a base model and trains it on thousands of your examples (input and expected output pairs). The model's behavior changes. After fine-tuning, it doesn't need your data in the prompt because the behavior is built in.

This sounds great, but the tradeoffs are real. Fine-tuning is expensive (compute and data preparation), takes days or weeks to get right, needs maintenance as your data changes, and can make the model worse at things outside your specific use case. It also doesn't add factual knowledge well. If you fine-tune on your company docs, it might learn your writing style, but it won't reliably remember your pricing or return policy. For facts, RAG works better.

When does fine-tuning make sense? When you need the model to always follow a specific output format, use a particular writing voice, or handle a specialized domain (like medical or legal language) in a way that prompting can't do. These cases exist, but they're less common than people think.

The order is: start with prompting, add RAG if you have lots of documents, and only try fine-tuning when the first two are not enough.

Agents: LLMs that can do things

An LLM by itself just reads text and writes text. An AI agent gives the LLM tools. Instead of just generating a response, it can search the web, run code, call an API, read a file, or do other actions. It plans, acts, looks at the result, and decides what to do next.

If you ask an agent "find our top 5 competitors and compare their pricing to ours," it might search the web, visit pricing pages, read your pricing doc, and write a summary. You didn't give it those steps. It figured them out.
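Under the hood, most agents are a loop: ask the model, run whatever tool it requests, feed the result back, repeat until it answers. Here's a stripped-down version of that loop using OpenAI's tool-calling API, with a stubbed search tool for illustration.

```python
# The basic agent loop: model picks a tool, we run it, results go back in.
import json
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> str:
    return f"(stub) top results for: {query}"  # swap in a real search API

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for current information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Find our top competitors' pricing."}]
while True:
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:       # no tool requested: the agent is done
        print(msg.content)
        break
    messages.append(msg)         # keep the tool request in the transcript
    for call in msg.tool_calls:  # run each requested tool and return the result
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_web(**args),
        })
```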

The main frameworks for building agents right now:

  • Claude Agent SDK powers Claude Code. Good for long-running single-agent tasks
  • OpenAI Agents SDK makes multi-agent handoffs easy, but ties you to OpenAI
  • LangGraph is the production standard for complex workflows
  • CrewAI is good for quick prototyping with teams of agents

Agents can use computers now too. Anthropic's computer use API lets agents click buttons, fill forms, and navigate software. OpenAI's Operator handles browser tasks like booking flights. This used to be just a research demo. Now it's a real product.

These tools connect to external services through the Model Context Protocol (MCP). Think of it as a standard connector: you build it once and any model can use it. It's backed by Anthropic, OpenAI, Google, Microsoft, and AWS, and it has become the standard way to give LLMs access to tools and data.
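For a sense of what that looks like, here's a minimal MCP server using the official Python SDK's FastMCP helper (pip install mcp); the tool itself is a stub.

```python
# A tiny MCP server: any MCP-aware client can discover and call this tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("refund-lookup")

@mcp.tool()
def get_refund_policy(region: str) -> str:
    """Return the refund policy for a region (stubbed for illustration)."""
    return f"Refunds in {region}: 30 days with receipt."

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```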

That said, agents are still slow, expensive, and hard to control. They work best for open-ended research and investigation. For most tasks, a simple pipeline with an LLM call in the middle is the better choice.

Multimodal: beyond text

Everything above has been about text. But AI models can also work with images, audio, and video now. Multimodal is the default, not a special feature.

You can drop a screenshot, a PDF, a chart, or a photo into Claude, GPT, or Gemini and ask questions about it. No extra setup. Companies use this in production: customers photograph a broken router and the AI diagnoses the problem, factories use cameras to catch defects, hospitals test systems where nurses show patient conditions on video and the AI helps with triage.

For audio, OpenAI's voice models can have a real-time conversation that feels close to a phone call. Gemini can do live speech translation while keeping the speaker's tone. Claude doesn't have native audio yet, which is a gap.

For generation, AI can create images and video. GPT Image 1.5 is built into the language model, so you can refine images through conversation. Midjourney v7 is the most artistic. Google's Imagen 4 is the most photorealistic. For video, Sora 2 and Veo 3 produce high-quality footage with audio at up to 4K. Video generation went from demo to actually useful in about a year.

The technical shift is that newer models are trained on text, images, audio, and video all together from the start, instead of connecting separate models after the fact. They can reason across what they see, hear, and read at the same time. That's why things like image editing through conversation and real-time video Q&A work well now.

Running AI locally

Not everything needs an API call. Some models are "open weight," meaning you can download and run them on your own hardware. The big names now are DeepSeek V3.2 (which is close to GPT-5 on reasoning benchmarks), Llama 4, Qwen 3.5, and Mistral 3. But the full-size models that compete with cloud APIs have hundreds of billions of parameters. You can't run those on your laptop. What people actually run locally are smaller versions, usually 7 to 32 billion parameters.

The most popular tools are Ollama (command line, like Docker for models) and LM Studio (desktop app with a GUI and model browser). Both use llama.cpp, so performance is similar.
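Once you've pulled a model (for example `ollama pull llama3.1`), Ollama exposes a local HTTP API on port 11434, so calling it looks much like calling a cloud provider. A minimal sketch, assuming the requests package:

```python
# Chat with a locally running model through Ollama's HTTP API.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",  # any model you've pulled locally
        "messages": [{"role": "user", "content": "Summarize RAG in one sentence."}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```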

Hardware requirements: The model has to fit in your GPU memory (VRAM), or in unified memory on Apple Silicon Macs. Quantization helps: it compresses model weights from 16-bit to 4-bit, cutting memory by about 75% with a small quality loss. A 7B model goes from about 14GB to 4GB after quantization. A 70B model goes from 140GB to about 40GB.
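The arithmetic behind those numbers is simple: weight memory is roughly parameter count times bits per weight, divided by 8 to get bytes. A quick back-of-the-envelope check (real usage adds overhead for activations and the KV cache, which this ignores):

```python
# Estimate weight memory for a model at a given precision.
def weight_memory_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9  # bytes -> GB

for params in (7, 70):
    print(f"{params}B @ 16-bit: {weight_memory_gb(params, 16):.0f} GB, "
          f"@ 4-bit: {weight_memory_gb(params, 4):.1f} GB")
# 7B: 14 GB -> 3.5 GB; 70B: 140 GB -> 35 GB (plus overhead in practice)
```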

On NVIDIA, an RTX 4090 (24GB VRAM) runs models up to 32 billion parameters at 4-bit, producing 100+ tokens per second on a 7B model. That's faster than you can read. An RTX 3060 with 12GB handles 7B models fine. For 70B models, you need the RTX 5090 (32GB) or multiple cards.

Macs have an advantage here. Apple Silicon shares memory between CPU and GPU, so a MacBook Pro with 48GB unified memory can load a 32B quantized model that would need a dedicated GPU on a PC. An M4 Max with 64 to 128GB can handle 70B models, but slower (15 to 30 tokens per second). Apple's MLX framework is optimized for this.

Running on CPU alone (no GPU) is possible but very slow. You'll only get usable speed with tiny models (3B or less).

Local vs cloud: The 7 to 14B models people run locally are solid for code completion, summarization, and simple Q&A. About as good as the largest cloud models were two years ago. But they are noticeably worse than current models like Claude Opus or GPT-5 on complex reasoning, long documents, and creative writing. Local AI makes sense when you care about privacy (data stays on your machine), need to work offline, want to avoid API costs on high-volume tasks, or want to experiment without paying.

AI coding tools

This is probably where AI is delivering the most value right now. The tools fall into three groups:

Autonomous agents that plan, write, and verify code across full projects: Claude Code, OpenAI Codex, and AWS Kiro. These handle complex multi-file changes on their own.

IDE-integrated agents that work alongside you in your editor: Cursor, Windsurf, and Google's Antigravity. These feel like a good pair programmer.

Inline assistants for suggestions and quick help: GitHub Copilot, which has the widest adoption and now includes agent mode too.

All of them are moving toward fully autonomous coding. The practical reality is they work best when experienced developers use them to go faster, not as replacements for knowing what you're doing.

The current models

Now that you know what tokens, context windows, and agents are, here's where the major models stand:

Model | Provider | Context Window | What stands out
Claude Opus 4.6 | Anthropic | 1M tokens | Multiple agents working together, context compaction
Claude Sonnet 4.6 | Anthropic | 1M tokens | Good balance of coding and general use
GPT-5.4 Thinking | OpenAI | 1M tokens | Built-in computer use, strong at math
Gemini 3.1 Pro | Google | 1M tokens | Adaptive reasoning, built-in fact checking
Llama 4 Maverick | Meta | 1M tokens | Open weight, handles text and images
Llama 4 Scout | Meta | 10M tokens | Largest context window of any model

Pricing has dropped about 80% year over year. The cheapest options (Claude Haiku, GPT-5 mini) cost around $0.25 per million input tokens. The most powerful models cost a few dollars per million.
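To put that in perspective, here's rough arithmetic with made-up but plausible volumes:

```python
# Daily input cost at the cheap end ($0.25 per million input tokens).
requests_per_day = 10_000
tokens_per_request = 1_000  # prompt plus retrieved context, assumed
daily_tokens = requests_per_day * tokens_per_request
print(f"${daily_tokens / 1e6 * 0.25:.2f} per day")  # -> $2.50, before output tokens
```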

What's real and what's noise

The things that actually deliver value right now are not exciting. AI coding tools make experienced developers faster. RAG-based search and Q&A over company data works when you scope it properly. Defined automation (routing tickets, processing documents, checking compliance) saves real time and money. And for personal productivity, using an LLM for writing, research, and analysis while checking its output is genuinely useful.

The overhyped things are predictable. "AGI by next year" keeps getting said, keeps not happening. Fully autonomous AI employees replacing teams hasn't happened. Most enterprise AI projects I've seen or heard about are still figuring out where the real value is. There's a lot of spending, but it's not always clear what's coming back.

What's worth watching is quieter. MCP is making it easier to connect AI to anything. New architectures are reducing the cost of running large models. The way models work through hard problems is improving in ways that feel different from just making models bigger. And local AI on your own hardware is getting better as models become more efficient.

Where to start if you're building

  1. Pick an API. Claude or GPT. Both are good. Cost is low enough that it shouldn't be the deciding factor
  2. Build something small. A classifier, a summarizer, a Q&A bot over a document. Something you can finish in a weekend
  3. Learn prompting well. This is still the most useful skill. A good prompt replaces thousands of lines of code
  4. Add RAG when you need it. When your app needs to answer questions about specific data, add a search step before the LLM call
  5. Avoid agents until you actually need them. They add complexity. Start with simple workflows and add autonomy only where you really need it

The AI landscape is simpler than it looks once you ignore the noise. Most of what matters fits in this post.