AI & ML March 21, 2026

How Do LLM APIs Work?

A 6-minute read

You don't need to train your own AI model. Companies let you access theirs through a simple API call. Here's how that actually works.

In 2023, a startup wanted to add AI-powered customer support to their product. They had two options: train their own language model from scratch (cost: millions of dollars, timeline: years), or call an API that let them use an existing model (cost: a few hundred dollars per month, timeline: days). They chose the API. This is now the standard way companies add AI capabilities to their products, and it works surprisingly similarly to how web pages load.

The short answer

An LLM API is a web service that lets developers send text to a large language model and receive generated text in response. The provider hosts the model on their servers, handles all the complex computing, and charges you based on how much you use. You make HTTP requests over the internet, passing your prompt and parameters, and get back the model’s output. This abstracts away all the hardware and ML expertise needed to run a language model yourself.

The full picture

What actually happens when you call an API

When you use an LLM API, you’re making a request to a remote server that hosts the language model. The flow goes like this:

Your application prepares a request containing your prompt, the model you want to use, and settings like temperature or maximum length. This request gets sent over HTTPS to the provider’s servers.

The provider’s system receives your request, feeds your prompt into their model, runs the prediction (this is the computationally expensive part that happens on their GPUs), and generates a response.

The response comes back to your application as text, which you can then display to users, process further, or store.

This happens in seconds, sometimes fractions of a second, depending on the request size and model complexity.
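The request described above can be sketched as a JSON payload. This is an illustration, not any one provider’s exact schema: the field names follow the common chat-style convention but differ between providers, and the model name is a placeholder.

```python
import json

# Hypothetical request payload; exact field names vary by provider.
payload = {
    "model": "example-model-name",  # which hosted model to run
    "messages": [
        {"role": "user", "content": "Summarize our refund policy."}
    ],
    "temperature": 0.2,   # low randomness for support-style answers
    "max_tokens": 300,    # cap response length (and cost)
}

# The body is serialized to JSON and sent over HTTPS, with an
# Authorization header carrying your API key.
body = json.dumps(payload)
```

Your HTTP client then POSTs `body` to the provider’s endpoint and parses the JSON response it gets back.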

Major LLM API providers

The market for LLM APIs is dominated by a few major players, each with their own strengths.

OpenAI’s API powers ChatGPT and offers models like GPT-4 and GPT-4o. Their API was one of the first and remains widely used, known for strong performance across general tasks.

Anthropic provides the Claude models through their API, with a reputation for thoughtful, careful responses and strong reasoning capabilities. Their models often produce more cautious, nuanced outputs.

Google offers the Gemini API with models that excel at multimodal tasks, handling text, images, and other inputs together. Their large context windows make them suitable for analyzing long documents.

Meta provides access to open-source models like Llama through various platforms. These can be run locally or hosted yourself, offering more control at the cost of convenience.

Key parameters you can control

When calling an LLM API, you can adjust several parameters to change how the model behaves.

Temperature controls randomness. A low temperature (like 0.1) makes outputs more predictable and focused, while a high temperature (like 0.9) introduces more creativity and variation. For factual questions, you want low temperature. For creative writing, higher temperature works better.
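Under the hood, temperature rescales the model’s raw token scores (logits) before they are converted into probabilities. A minimal sketch of that softmax-with-temperature step, using toy numbers:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores into probabilities.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)
hot = softmax_with_temperature(logits, 0.9)
# At 0.1, nearly all probability lands on the top token;
# at 0.9, the alternatives keep a meaningful share.
```

This is why low temperature gives repeatable, focused answers: the sampling step almost always picks the single most likely token.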

Max tokens limits how long the response can be. This prevents runaway responses and helps you manage costs, since you’re billed by the token.

Top-p (nucleus sampling) is an alternative to temperature that controls which tokens the model can consider. Lower values restrict the model to more likely choices.
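A toy sketch of how nucleus sampling narrows the candidate set; real implementations do this over the model’s full vocabulary, but the logic is the same:

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; the model then samples only from this set."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append(idx)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution
# top_p=0.9 keeps tokens 0, 1, 2 (0.5 + 0.3 + 0.15 >= 0.9);
# top_p=0.5 keeps only token 0.
```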

The system prompt lets you give the model instructions about how to behave. You can tell it to be more formal, answer as a specific character, or follow particular guidelines.

Authentication and security

Access to LLM APIs requires an API key, which you generate through the provider’s developer dashboard. This key authenticates your requests and ties them to your account for billing purposes.

Best practices for API key security include storing keys in environment variables rather than hardcoding them in your source code, rotating keys periodically, and setting usage limits if the provider supports them to prevent runaway costs if something goes wrong.
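A minimal sketch of the environment-variable approach. The variable name `LLM_API_KEY` is just an example; each provider documents its own conventional name.

```python
import os

def load_api_key(var_name="LLM_API_KEY"):
    """Read the API key from the environment instead of hardcoding it,
    and fail fast with a clear message if it's missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running."
        )
    return key
```

Because the key lives outside your source code, it never ends up in version control, and you can rotate it without redeploying.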

Some providers offer scoped keys with limited permissions; using these instead of your master key limits the damage if a key ever leaks.

Rate limits and quotas

Providers impose rate limits on how many requests you can make per minute or per day. These limits vary by pricing tier, with free tiers having strict limits and paid plans offering higher throughput.

If you need to make many requests, you’ll need to implement queuing and retry logic in your application. Most providers return error codes when you hit limits, and exponential backoff (waiting longer after each failed attempt) is the standard approach.
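A bare-bones sketch of exponential backoff. Here a `RuntimeError` stands in for whatever rate-limit error (typically an HTTP 429) your client library raises:

```python
import time

def call_with_backoff(make_request, max_retries=5, base_delay=1.0):
    """Retry a rate-limited call, doubling the wait after each failure.
    `make_request` is any function that raises on a rate-limit error."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except RuntimeError:  # stand-in for a 429 rate-limit error
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Many providers also return a header suggesting how long to wait; when available, honoring that value beats a blind doubling schedule.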

Pricing models explained

Most LLM APIs price based on token usage, with separate rates for input tokens (your prompt) and output tokens (the model’s response). A typical pricing structure might charge $3 per million input tokens and $15 per million output tokens for a capable model.

More capable models command higher rates, while smaller or open-source models can be dramatically cheaper, sometimes free for certain use cases.

Understanding token counting matters for budgeting. A rough rule of thumb is that 750 words equals about 1,000 tokens, though this varies depending on the text.
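Putting the example rates above ($3 and $15 per million tokens) together with the 750-words-per-1,000-tokens rule gives a quick back-of-envelope estimator:

```python
def estimate_cost(prompt_words, response_words,
                  input_rate=3.00, output_rate=15.00):
    """Rough cost estimate in dollars, using the rule of thumb that
    750 words is about 1,000 tokens. Rates are per million tokens."""
    def tokens(words):
        return words * 1000 / 750
    return (tokens(prompt_words) * input_rate
            + tokens(response_words) * output_rate) / 1_000_000

# A 1,500-word prompt (~2,000 tokens) with a 300-word reply (~400 tokens):
# 2,000 * $3/M + 400 * $15/M = $0.006 + $0.006 = $0.012 per call
```

Note the asymmetry: output tokens cost five times as much here, so a verbose model response costs more than a long prompt of the same size.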

Why it matters

LLM APIs have democratized access to powerful AI. Before these APIs existed, only large tech companies with massive compute budgets could afford to deploy language models. Now any developer with an API key can build AI-powered features.

This has accelerated AI adoption across industries. Startups now compete with established companies on AI features. Individual developers can build products that would have required an entire research team just two years ago.

The API model also separates concerns. Model providers focus on improving the underlying technology. Application developers focus on user experience and product logic. Neither needs to be expert in the other’s domain.

Understanding how these APIs work helps you make better architectural decisions, debug issues when they arise, and optimize costs by reducing unnecessary token usage.

Common misconceptions

“Calling an API means my data goes to a public model.”

Your prompts do go to the provider’s servers when you use their API. Most providers have options to opt out of having your data used for training. Enterprise plans often include additional data isolation. If data privacy is critical, look for these options or consider running models locally.

“More expensive models are always better.”

Not true. For many tasks, cheaper models perform nearly as well. A smaller model fine-tuned for a specific task can outperform a larger general-purpose model at that particular job. Benchmark your actual use case against different models rather than assuming the most expensive one is best.

“The API is just a direct connection to the model.”

There’s significant infrastructure between your request and the model. Providers handle load balancing, fallback to backup systems if a model is overloaded, content filtering, logging, and more. The “API” is really a managed service, not a wire directly into a neural network.

Key terms

API (Application Programming Interface): A set of rules that lets two software applications communicate with each other. LLM APIs let your code send text to a remote model and get responses back.

API Key: A unique identifier that authenticates your requests to an LLM service. Keep it secret and secure.

Tokens: The basic units a language model processes. Roughly, 1 token equals about 3/4 of an English word. You’re billed per token.

Temperature: A parameter that controls how random or creative the model’s output is. Lower values mean more predictable outputs.

Rate Limit: The maximum number of requests you can make to an API within a certain time period.

Context Window: The maximum amount of text the model can consider when generating a response, measured in tokens.

System Prompt: Instructions you provide to the model that shape how it behaves across all interactions.