Building a Terminal Chatbot with LLaMA 3
Software development
19 July 2025

Building a Terminal Chatbot with LLaMA 3

Previously we configured Ollama to run LLaMA 3 on a Google VM and connected to it through an SSH tunnel. In this article, we'll build a markdown-rendering chatbot in the terminal using Python and the generation endpoint we set up.

1. Create a project folder

mkdir llama_project
cd llama_project

2. Create and activate virtual environment

python3 -m venv venv
source venv/bin/activate

3. Create requirements.txt to list the project's dependencies

touch requirements.txt
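
The project only needs three third-party packages, all of which we import later in ollama_client.py and main.py: requests for the HTTP calls, python-dotenv for loading the .env file, and rich for terminal rendering. A minimal requirements.txt (versions left unpinned here; pin them if you want reproducible installs) looks like this:

requests
python-dotenv
rich

Install them into the virtual environment:

pip install -r requirements.txt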

4. Create .env

I'm pulling llama3:8b for this article. We'll be connecting to the Ollama server on the previously created Google VM via an SSH tunnel.

OLLAMA_MODEL_NAME=llama3:8b
OLLAMA_GENERATION_ENDPOINT=http://localhost:11434/api/chat
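
The endpoint above points at localhost:11434, so it assumes the SSH tunnel from the previous article is open, forwarding that local port to Ollama on the VM. A command along these lines keeps the tunnel up while you chat (your username and VM address will differ):

ssh -N -L 11434:localhost:11434 your-user@your-vm-ip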

5. Create the client ollama_client.py

This script connects to the Ollama server running llama3:8b on the Google VM, using environment variables for configuration. It defines an ask_ollama_streaming function that sends a chat message payload to the Ollama API and streams the response back in real time.

Key features:

  • Loads API endpoint and model name from .env variables
  • Sends messages via POST request with streaming enabled
  • Yields streamed chunks of content from the response
  • Includes basic error handling and line-by-line JSON parsing

import os  # For accessing environment variables
import json  # For parsing JSON responses
import requests  # For sending HTTP requests to Ollama endpoint
from dotenv import load_dotenv  # To load environment variables from a .env file
from typing import Generator  # For typing the streaming generator function

# Load environment variables from a .env file into the environment
load_dotenv()

# Fetch the Ollama generation endpoint and model name from environment variables
OLLAMA_URL = os.getenv("OLLAMA_GENERATION_ENDPOINT")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL_NAME", "llama3:8b")  # Default to "llama3:8b" if not set

# Function to send messages to the Ollama model and stream back the response
def ask_ollama_streaming(messages: list) -> Generator[str, None, None]:
    try:
        # Send a POST request to the Ollama endpoint with the model and messages
        response = requests.post(
            OLLAMA_URL,
            headers={"Content-Type": "application/json"},
            json={
                "model": OLLAMA_MODEL,
                "messages": messages,
                "stream": True  # Ollama streams by default; set it explicitly for clarity
            },
            stream=True  # Let requests stream the response instead of buffering it
        )
        response.raise_for_status()  # Raise exception for HTTP errors

        # Iterate over streamed lines in the response
        for line in response.iter_lines(decode_unicode=True):
            if not line.strip():
                continue  # Skip empty lines

            try:
                # Parse the streamed line as JSON
                chunk = json.loads(line)
                # Safely extract the assistant's content from the response chunk
                content_piece = chunk.get("message", {}).get("content", "")
                if content_piece:
                    yield content_piece  # Yield each piece of content for streaming display
            except json.JSONDecodeError as e:
                print(f"\n[Parse Error]: {e}")  # Skip lines that are not valid JSON

    except Exception as e:
        print(f"\n[Error]: {e}")  # Handle connection or request errors

6. Create the main script main.py

This script provides a terminal-based LLaMA 3 chatbot interface, powered by Ollama and styled with the rich library for a polished user experience. It connects to the Ollama server through ask_ollama_streaming (configured via the .env file in ollama_client.py) and streams responses in real time.

Key features:

  • Styled CLI chat interface using rich.console, Markdown, and Live components
  • Real-time streaming of model responses from Ollama with typing effect
  • Persistent conversation context (chat_history) for multi-turn dialogue
  • Displays response time for each interaction
  • Gracefully handles user interruptions and errors

To start chatting, run python main.py and type your questions. Type 'exit' (or 'quit') to leave.

from ollama_client import ask_ollama_streaming  # Import the streaming function from ollama_client.py
from rich.console import Console  # For pretty printing in the terminal
from rich.markdown import Markdown  # To render markdown-style responses
from rich.panel import Panel  # To display stylized boxes
from rich.live import Live  # To dynamically update terminal output (used for streaming effect)
import time  # ⏱️ Used to measure response time

# Create a rich console for stylized output
console = Console()

def main():
    # Print an intro panel when the app starts
    console.print(Panel("🤖 [bold cyan]LLaMA 3 Chatbot[/bold cyan]\nType [bold]'exit'[/bold] to quit.", expand=False))

    # Initialize the chat history with a system message to guide the model's behavior
    chat_history = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

    # Loop for continuous conversation
    while True:
        try:
            # Prompt user input in green
            user_input = console.input("\n[bold green]🧠 You:[/bold green] ")

            # Exit condition if user types "exit" or "quit"
            if user_input.lower() in {"exit", "quit"}:
                console.print("\n[bold yellow]👋 Goodbye![/bold yellow]")
                break

            # Add user message to chat history
            chat_history.append({"role": "user", "content": user_input})

            response_text = ""  # To accumulate streamed response
            console.print("\n[bold cyan]🤖 LLaMA:[/bold cyan]")

            start_time = time.time()  # ⏱️ Start timer for performance logging

            # Stream response and update live markdown display
            with Live(Markdown(""), refresh_per_second=10, console=console) as live:
                for chunk in ask_ollama_streaming(chat_history):
                    response_text += chunk  # Accumulate the streaming chunks
                    live.update(Markdown(response_text))  # Update terminal output in real time

            end_time = time.time()  # ⏱️ End timer
            elapsed = end_time - start_time

            # Add model response to chat history to maintain context
            chat_history.append({"role": "assistant", "content": response_text})

            # Display time taken for the response
            console.print(f"\n[dim]⏱️ Responded in {elapsed:.2f} seconds[/dim]")

        except KeyboardInterrupt:
            # Handle Ctrl+C gracefully
            console.print("\n[bold yellow]👋 Interrupted by user. Exiting.[/bold yellow]")
            break
        except Exception as e:
            # Catch and print any other error that might occur
            console.print(f"\n[bold red]❗ Error:[/bold red] {str(e)}")

# Run the chatbot if this file is executed directly
if __name__ == "__main__":
    main()
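
At this point the project folder should look roughly like this (venv contents omitted):

llama_project/
├── venv/
├── .env
├── requirements.txt
├── ollama_client.py
└── main.py

With the SSH tunnel open, python main.py drops you into the chat loop, renders the model's markdown as it streams in, and prints the elapsed time after each answer.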