Build an AI Chatbot with One API Key: From Zero to Production in 30 Minutes

LemonData · February 26, 2026

This tutorial builds a small but production-ready chatbot service with FastAPI, SSE streaming, conversation memory, and model switching. The goal is not to ship a toy demo. The goal is to get to a backend you can actually put behind a product surface and iterate on safely.

If you have already pointed one OpenAI-compatible SDK at LemonData, this article picks up from there. If you have not done the base URL swap yet, read the migration guide first. If your main concern is request shaping and backoff under load, pair this guide with the AI API rate limiting guide.

What We Are Building

The finished service has six moving parts:

  1. A synchronous /chat endpoint for smoke tests.
  2. A streaming /chat/stream endpoint for the real UI.
  3. Conversation state keyed by conversation_id.
  4. A model allowlist so the frontend cannot request arbitrary IDs.
  5. Error handling that does not collapse on the first 429.
  6. A clear path from in-memory prototype to Redis or PostgreSQL.

That is enough to power a support bot, an internal assistant, or the first version of an embedded chat widget.

Install the Minimum Stack

pip install fastapi uvicorn openai pydantic redis

You can omit redis for the first pass, but it is useful to wire the import in now so the upgrade path is obvious.

Step 1: Start With a Small, Boring Chat Endpoint

The fastest way to get lost in chatbot work is to start with websockets, tool use, and agent orchestration before the basic request path is stable. Start with one small endpoint that proves your key, base URL, and model routing are correct.

from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()

client = OpenAI(
    api_key="sk-lemon-xxx",  # replace with your key; load it from an environment variable in production
    base_url="https://api.lemondata.cc/v1"
)

class ChatRequest(BaseModel):
    message: str
    model: str = "gpt-4.1-mini"
    conversation_id: str | None = None

@app.post("/chat")
def chat(req: ChatRequest):  # plain def: FastAPI runs sync handlers in a threadpool, so the blocking SDK call does not stall the event loop
    response = client.chat.completions.create(
        model=req.model,
        messages=[{"role": "user", "content": req.message}]
    )
    return {"reply": response.choices[0].message.content}

Run one smoke test. If this fails, do not keep layering features on top.

Step 2: Add Streaming Because Users Feel Latency Before They Measure It

Most chatbot products feel slow not because the model is slow, but because the UI stays blank until the full response arrives. SSE is enough for many chat products and has a lower operational burden than websockets.

from fastapi.responses import StreamingResponse

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    def generate():
        stream = client.chat.completions.create(
            model=req.model,
            messages=[{"role": "user", "content": req.message}],
            stream=True
        )
        for chunk in stream:
            delta = chunk.choices[0].delta
            if delta.content:
                # caveat: content containing newlines will split the SSE frame;
                # JSON-encode the payload if your model output is multi-line
                yield f"data: {delta.content}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

On the frontend, the simplest browser-side client is still good enough:

async function sendMessage(payload) {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value, { stream: true });
    console.log(chunk);
  }
}

If your product already uses a browser client and standard HTTP, SSE keeps the architecture simpler.
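If you also consume the stream from Python, for tests or a CLI client, the framing is easy to parse. A minimal sketch (parse_sse is a hypothetical helper matching the data: framing the endpoint above emits):

```python
def parse_sse(chunk: str) -> list[str]:
    """Split a raw SSE chunk into its data payloads."""
    payloads = []
    for line in chunk.split("\n"):
        if line.startswith("data: "):
            payloads.append(line[len("data: "):])
    return payloads
```

Feed it each decoded chunk and stop when a payload equals [DONE].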

Step 3: Move Conversation State Out of the Request Body

The first chatbot demo usually keeps the full transcript in the browser and sends it on every turn. That works for prototypes. It becomes messy the moment you need retries, resumable sessions, or server-side tooling.

An in-memory store is fine to start:

from collections import defaultdict
import uuid

conversations: dict[str, list] = defaultdict(list)
SYSTEM_PROMPT = "You are a helpful assistant. Be concise and direct."

def build_messages(conv_id: str, user_msg: str) -> list:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    history = conversations[conv_id][-20:]  # cap context at the last 20 messages
    messages.extend(history)
    messages.append({"role": "user", "content": user_msg})
    conversations[conv_id].append({"role": "user", "content": user_msg})  # persist the new turn
    return messages

The upgrade path to Redis is mostly storage plumbing:

import json
import redis

redis_client = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

def load_history(conv_id: str) -> list:
    raw = redis_client.get(f"chat:{conv_id}")
    return json.loads(raw) if raw else []

def save_history(conv_id: str, history: list) -> None:
    redis_client.setex(f"chat:{conv_id}", 60 * 60 * 24, json.dumps(history))

Use Redis if conversations need TTL, resumability, or multi-instance deployment. Use PostgreSQL if the transcript itself is product data.
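Whichever store you pick, the handler-side logic stays the same: load, append, save. A minimal sketch, using a plain dict in place of the Redis client so it runs standalone (record_turn is a hypothetical helper name, not part of any SDK):

```python
import json

# A plain dict stands in for the Redis client so the sketch runs standalone;
# swap in redis_client.get / redis_client.setex for production.
_store: dict[str, str] = {}

def load_history(conv_id: str) -> list:
    raw = _store.get(f"chat:{conv_id}")
    return json.loads(raw) if raw else []

def save_history(conv_id: str, history: list) -> None:
    _store[f"chat:{conv_id}"] = json.dumps(history)

def record_turn(conv_id: str, user_msg: str, assistant_msg: str) -> None:
    """Load the transcript, append one completed turn, save it back."""
    history = load_history(conv_id)
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": assistant_msg})
    save_history(conv_id, history)
```

Because the storage calls are isolated behind load_history and save_history, swapping dict for Redis or PostgreSQL never touches the chat handler.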

Step 4: Treat Errors as Product Behavior, Not Just Exceptions

If your chatbot is customer-facing, the failure path matters as much as the happy path. A user does not care whether the failure came from rate limiting, balance, or a model outage. They care whether the UI freezes.

from openai import APIConnectionError, APIError, RateLimitError

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    conv_id = req.conversation_id or str(uuid.uuid4())
    messages = build_messages(conv_id, req.message)

    def generate():
        full_response = []
        try:
            stream = client.chat.completions.create(
                model=req.model,
                messages=messages,
                stream=True
            )
            for chunk in stream:
                delta = chunk.choices[0].delta
                if delta.content:
                    full_response.append(delta.content)
                    yield f"data: {delta.content}\n\n"
        except RateLimitError:
            yield "data: [ERROR] The model is busy. Please retry in a few seconds.\n\n"
        except APIConnectionError:
            yield "data: [ERROR] Temporary network issue. Please retry.\n\n"
        except APIError as error:
            yield f"data: [ERROR] {error.message}\n\n"
        else:
            conversations[conv_id].append(
                {"role": "assistant", "content": "".join(full_response)}
            )
        finally:
            yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"X-Conversation-ID": conv_id}
    )

If you are serving meaningful load, you should also shape requests before they reach the upstream. The detailed patterns are in the rate limiting guide, but the short version is: use bounded retries, use jitter, and never fall back to a blanket except Exception with a fixed one-second sleep.
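A bounded-retry helper with full jitter fits in a few lines. This is a sketch, not an SDK feature; call_with_retry and the retry_on exception tuple are names assumed here:

```python
import random
import time

def call_with_retry(make_request, retry_on=(ConnectionError,),
                    max_attempts=3, base_delay=0.5):
    """Retry a callable with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # bounded: give up after the last attempt
            # full jitter: sleep a random fraction of the exponential backoff
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

In the chatbot, retry_on would hold the SDK's transient errors (connection and rate-limit classes), while 4xx client errors fail fast.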

Step 5: Model Switching Needs an Allowlist, Not a Free Text Box

One API key can reach hundreds of models. That does not mean your UI should expose hundreds of models. The backend should publish a small allowlist matched to your use case.

AVAILABLE_MODELS = {
    "fast": "gpt-4.1-mini",
    "balanced": "claude-sonnet-4-6",
    "reasoning": "o3",
    "budget": "deepseek-chat",
}

@app.get("/models")
async def list_models():
    return {"models": AVAILABLE_MODELS}

This does three useful things:

  • it stops the frontend from requesting invalid or deprecated model IDs
  • it lets you remap a tier later without redeploying every client
  • it gives you one place to enforce cost controls
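Enforcement is then a single lookup in the request path. A sketch (resolve_model is a hypothetical helper; whether you reject unknown tiers or fall back to a default is a product decision):

```python
AVAILABLE_MODELS = {
    "fast": "gpt-4.1-mini",
    "balanced": "claude-sonnet-4-6",
    "reasoning": "o3",
    "budget": "deepseek-chat",
}

def resolve_model(tier: str) -> str:
    """Map a frontend tier name to a concrete model ID, or fail loudly."""
    try:
        return AVAILABLE_MODELS[tier]
    except KeyError:
        raise ValueError(f"unknown model tier: {tier!r}")
```

The frontend only ever sends tier names like "fast"; concrete model IDs stay a backend concern you can remap without a client release.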

If your team is still deciding which providers to standardize on, the pricing comparison and OpenRouter vs LemonData comparison are the two pages worth reading before you lock the allowlist.

Step 6: Add the Production Edges Before Traffic Arrives

A chatbot backend becomes production-grade when the surrounding edges are handled, not when the core chat call is clever.

The checklist is short:

  • add request IDs so you can connect frontend failures to backend logs
  • cap per-user concurrency and request size
  • trim long histories before they explode your token budget
  • log model, latency, input size, and finish reason
  • separate user-visible error messages from internal error detail
  • test one alternate model so you know fallback works before the first outage
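Most of these edges are small. Request IDs, for example, need little more than a contextvar plus a logging filter. A sketch (the FastAPI middleware that calls new_request_id once per request is assumed, not shown):

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the ID of the request currently being served by this task.
request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Inject the current request ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True

def new_request_id() -> str:
    """Generate and activate a fresh ID; call once per incoming request."""
    rid = uuid.uuid4().hex[:12]
    request_id.set(rid)
    return rid
```

Return the same ID in a response header and every frontend failure report becomes a one-line log search.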

History trimming can stay simple:

def trim_history(messages: list, max_tokens: int = 8000) -> list:
    system = messages[0]
    history = messages[1:]
    total_chars = len(system["content"])
    trimmed = []

    for msg in reversed(history):
        msg_chars = len(msg["content"])
        if total_chars + msg_chars > max_tokens * 4:  # ~4 chars per token heuristic
            break
        trimmed.insert(0, msg)
        total_chars += msg_chars

    return [system] + trimmed

The point is not token-perfect accounting. The point is stopping obvious context blowups.
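If you later want tighter budgets, isolate the chars-per-token heuristic into one function so it can be replaced by a real tokenizer without touching the trimming logic (estimate_tokens is a hypothetical helper):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; good enough to prevent context blowups."""
    return max(1, round(len(text) / chars_per_token))
```

Swapping in an exact tokenizer later then changes one function body, not the trimming loop.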

From Demo to Product

Once this backend is stable, the next upgrade is rarely “more AI.” It is usually boring infrastructure:

  • auth so one user cannot read another user's conversation
  • persistence so sessions survive deploys
  • rate limiting so one noisy user cannot burn your quota
  • billing or usage attribution if the chatbot is customer-facing
  • background summarization if conversations need long-term memory

That is why a unified gateway helps. Once you have the base URL migration behind you, model changes stop being a platform rewrite and become configuration.

Smoke Test

uvicorn main:app --reload --port 8000

curl -N -X POST http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello!", "model": "gpt-4.1-mini"}'

If you can stream one turn, preserve one conversation, and return a clean error on a forced failure, you have the right foundation.

Cost Estimate

Create an API key at LemonData, point your OpenAI SDK at https://api.lemondata.cc/v1, and you can ship the first production version of your chatbot without managing separate provider accounts.

Model               Daily Cost   Monthly Cost
GPT-4.1-mini        ~$2.40       ~$72
GPT-4.1             ~$12.00      ~$360
Claude Sonnet 4.6   ~$18.00      ~$540
DeepSeek V3         ~$1.68       ~$50

Using GPT-4.1-mini for most conversations and upgrading to Claude Sonnet 4.6 only when users request it keeps costs under $100/month for most applications.


Get your API key: lemondata.cc provides 300+ models through one endpoint. $1 free credit to start building.
