Conversational AI · Python · Hugging Face

Llama Discord Bot

Built a real-time Discord companion powered by Llama 3, orchestrating streaming inference, context management, and simultaneous multi-user conversations.

[Screenshot: the Llama Discord Bot interface]

Overview

The Llama Discord Bot started as an experiment in latency-sensitive LLM inference. The objective was to deliver a conversational assistant that could keep pace with a crowded Discord server while retaining multi-user context.

[Screenshot: an example Discord conversation]

Architecture Highlights

  • Inference Gateway – Streams tokens from a self-hosted Llama 3 model using Hugging Face Transformers with accelerated attention kernels; see the model-side streaming sketch below.
  • Session Orchestrator – Tracks per-channel dialogue state and triages simultaneous requests using Redis streams and worker pools; see the queueing sketch below.
  • Observability – Exports interaction metrics through OpenTelemetry to Grafana dashboards for prompt-level debugging; see the metrics sketch below.
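
The reply path ties these pieces together: restore the channel's session, build the prompt, and stream tokens back as the model emits them (session_store, build_prompt, llama, and websocket are helpers wired up elsewhere in the bot):
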
from typing import AsyncIterator

from discord import Message

async def stream_reply(message: Message) -> AsyncIterator[str]:
    # Restore this channel's dialogue state before prompting.
    chat_state = await session_store.fetch(message.channel.id)
    prompt = build_prompt(chat_state, message.content)

    # Yield tokens as the model produces them, mirroring each one
    # to the websocket so the UI can render the reply live.
    async for token in llama.generate(prompt, temperature=0.6):
        yield token
        await websocket.send(token)
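
On the model side, streaming with Hugging Face Transformers typically goes through a TextIteratorStreamer. A minimal sketch, assuming a stock Meta-Llama-3-8B-Instruct checkpoint and PyTorch SDPA attention; the checkpoint name and generation settings are illustrative, not taken from the project:

from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Hypothetical checkpoint; the write-up only says "self-hosted Llama 3".
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",  # PyTorch's fused attention kernel
)

def stream_tokens(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so it runs in a thread while the streamer
    # yields decoded text chunks as they are produced.
    Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 512,
                "temperature": 0.6, "do_sample": True},
    ).start()
    yield from streamer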
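
The Session Orchestrator's triage can be sketched with redis-py's asyncio client: requests land on a stream, and a consumer group fans them out across the worker pool. Stream, group, and worker names here are hypothetical; the write-up specifies only Redis streams and worker pools:

import redis.asyncio as redis
from redis.exceptions import ResponseError

r = redis.Redis(decode_responses=True)
STREAM = "chat:requests"     # hypothetical stream and group names
GROUP = "inference-workers"

async def enqueue_request(channel_id: int, content: str) -> None:
    # Producer: the Discord event handler appends each request to the stream.
    await r.xadd(STREAM, {"channel_id": channel_id, "content": content})

async def worker(name: str) -> None:
    # Consumer: each pool worker claims one pending request at a time.
    try:
        await r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except ResponseError:
        pass  # consumer group already exists
    while True:
        entries = await r.xreadgroup(GROUP, name, {STREAM: ">"}, count=1, block=5000)
        for _stream, messages in entries:
            for msg_id, fields in messages:
                # ... generate and stream the reply for fields["content"] ...
                await r.xack(STREAM, GROUP, msg_id)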
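
For the observability layer, recording a per-reply metric through the OpenTelemetry SDK might look like the following. The meter and histogram names are assumptions, and the console exporter stands in for whichever exporter actually feeds Grafana:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Console exporter as a stand-in; production would export to a
# collector that Grafana reads from.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("llama-discord-bot")  # hypothetical meter name
first_token_latency = meter.create_histogram(
    "first_token_latency_ms",
    unit="ms",
    description="Delay before the first streamed token",
)

# Record once per reply, e.g. from stream_reply.
first_token_latency.record(812, attributes={"channel_id": "123"})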

Outcomes

  • Sub-2s average first-token latency in chats with 1–4 simultaneous users.
  • Conversation continuity across channels thanks to explicit session handoffs.
  • Fast iteration loop: new prompt templates deploy without bot downtime (one approach sketched below).
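
One way to get that zero-downtime behavior, assuming templates live as plain files on disk (the write-up doesn't say how they're stored): re-read a template only when its mtime changes, so deploying a new one is just a file copy.

from pathlib import Path

TEMPLATE_PATH = Path("prompts/system.txt")  # hypothetical location
_cache: tuple[float, str] | None = None

def load_template() -> str:
    # Re-read the file only when it changes on disk, so replacing it
    # takes effect on the next message without restarting the bot.
    global _cache
    mtime = TEMPLATE_PATH.stat().st_mtime
    if _cache is None or _cache[0] != mtime:
        _cache = (mtime, TEMPLATE_PATH.read_text())
    return _cache[1]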

Next Steps

  • Introduce fine-tuned adapters for support-style conversations.
  • Expand the moderation layer with automated abuse detection before messages reach inference.