Overview
The Llama Discord Bot started as an experiment in latency-sensitive LLM inference. The objective was to deliver a conversational assistant that could keep pace with a crowded Discord server while retaining multi-user context.

Architecture Highlights
- Inference Gateway – Streams tokens from a self-hosted Llama 3 model using Hugging Face Transformers with accelerated attention kernels.
- Session Orchestrator – Tracks per-channel dialogue state and triages simultaneous requests using Redis streams and worker pools (sketched after the code below).
- Observability – Exports interaction metrics via OpenTelemetry and surfaces them in Grafana dashboards for prompt-level debugging (see the metrics sketch below).
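
The streaming handler below is the core of the Inference Gateway: it loads the channel's session state, assembles the prompt, and yields tokens as the model generates them, mirroring each one to the live websocket.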
```python
from collections.abc import AsyncIterator
from discord import Message

# session_store, build_prompt, llama, and websocket are module-level collaborators.
async def stream_reply(message: Message) -> AsyncIterator[str]:
    # Load the channel's accumulated dialogue state before building the prompt.
    chat_state = await session_store.fetch(message.channel.id)
    prompt = build_prompt(chat_state, message.content)
    # Yield tokens as the model emits them, mirroring each to the live websocket.
    async for token in llama.generate(prompt, temperature=0.6):
        yield token
        await websocket.send(token)
```
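
The Session Orchestrator maps to a producer/consumer pattern over Redis streams: incoming messages are enqueued as stream entries, and a pool of workers claims, processes, and acknowledges them. Below is a minimal sketch of that pattern using redis-py's asyncio client; the stream and group names, the four-worker pool size, and the handle_request stand-in are assumptions, not the bot's production code.

```python
# Rough sketch of Redis-streams triage, assuming redis-py's asyncio client.
# Stream/group names, the pool size, and handle_request are illustrative only.
import asyncio

import redis.asyncio as redis
from redis.exceptions import ResponseError

STREAM, GROUP = "chat:requests", "inference-workers"

async def handle_request(fields: dict) -> None:
    # Stand-in for the real inference call (stream_reply in the gateway).
    print("processing", fields)

async def enqueue(r: redis.Redis, channel_id: int, content: str) -> None:
    # Producer: every incoming Discord message becomes one stream entry.
    await r.xadd(STREAM, {"channel_id": channel_id, "content": content})

async def worker(r: redis.Redis, name: str) -> None:
    # Consumer: each worker claims entries for the group, processes them,
    # and acknowledges so they are not redelivered to another worker.
    while True:
        batches = await r.xreadgroup(GROUP, name, {STREAM: ">"}, count=1, block=5000)
        for _stream, messages in batches:
            for msg_id, fields in messages:
                await handle_request(fields)
                await r.xack(STREAM, GROUP, msg_id)

async def main() -> None:
    r = redis.from_url("redis://localhost:6379", decode_responses=True)
    try:
        # mkstream=True creates the stream on first run.
        await r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
    except ResponseError:
        pass  # consumer group already exists
    await asyncio.gather(*(worker(r, f"worker-{i}") for i in range(4)))

if __name__ == "__main__":
    asyncio.run(main())
```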
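
For the Observability layer, interaction metrics can be recorded through the OpenTelemetry API and exported to whatever backend feeds the Grafana dashboards. The sketch below shows the general shape only; the meter and tracer names, attribute keys, and the latency histogram are assumptions, and exporter configuration is omitted.

```python
# Sketch of per-request instrumentation via the OpenTelemetry API.
# Names, attributes, and the histogram are illustrative; exporter setup
# (e.g. OTLP feeding Grafana) is configured elsewhere and omitted here.
from opentelemetry import metrics, trace

tracer = trace.get_tracer("llama-discord-bot")
meter = metrics.get_meter("llama-discord-bot")

first_token_ms = meter.create_histogram(
    "first_token_latency_ms",
    unit="ms",
    description="Time from request receipt to first streamed token",
)

def record_generation(channel_id: int, template_name: str, started: float, first_token_at: float) -> None:
    # One span per generation keeps prompt-level debugging possible in the trace view.
    with tracer.start_as_current_span("llama.generate") as span:
        span.set_attribute("discord.channel_id", channel_id)
        span.set_attribute("prompt.template", template_name)
        first_token_ms.record((first_token_at - started) * 1000.0, {"template": template_name})
```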
Outcomes
- Sub-2s average first-token latency for chats of 1–4 simultaneous users.
- Conversation continuity across channels thanks to explicit session handoffs.
- Fast iteration loop: new prompt templates can be deployed without bot downtime (see the template sketch below).
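
One way the no-downtime template rollout could work is to resolve the active template from an external store on every request instead of importing it at startup. The sketch below assumes templates live in Redis; the key scheme, the render_prompt helper, and the fallback text are hypothetical.

```python
# Sketch: resolve the active prompt template per request so edits in Redis
# take effect immediately. Key names and the fallback text are assumptions.
import redis.asyncio as redis

FALLBACK = "You are a helpful assistant.\n{history}\nUser: {message}\nAssistant:"

async def load_template(r: redis.Redis, name: str = "default") -> str:
    # Assumes a client created with decode_responses=True.
    raw = await r.get(f"prompt_template:{name}")
    return raw or FALLBACK

async def render_prompt(r: redis.Redis, history: str, message: str) -> str:
    # Simplified stand-in for the bot's build_prompt helper.
    template = await load_template(r)
    return template.format(history=history, message=message)
```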
Next Steps
- Introduce fine-tuned adapters for support-style conversations.
- Expand the moderation layer with automated abuse detection before messages reach inference.