
Building a Terminal Chatbot with LLaMA 3
Previously we configured Ollama to run LLaMA 3 on a Google VM and connected to it through an SSH tunnel. In this article, we'll build a markdown-powered chatbot that runs in your terminal, using Python and the generation endpoint we created.
1. Create a project folder
mkdir llama_project
cd llama_project
2. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
3. Create requirements.txt to list all related dependencies
touch requirements.txt
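Based on the imports used in the scripts below, the project needs requests, python-dotenv, and rich (pin exact versions if you want reproducible installs). A minimal requirements.txt could look like this:
requests
python-dotenv
rich
Then install everything into the virtual environment with pip install -r requirements.txt.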
4. Create .env
I'm pulling llama3:8b for this article. We'll be connecting to the Ollama server on the previously created Google VM via an SSH tunnel.
OLLAMA_MODEL_NAME=llama3:8b
OLLAMA_GENERATION_ENDPOINT=http://localhost:11434/api/chat
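The localhost endpoint assumes the SSH tunnel from the previous article is already running, forwarding the VM's Ollama port (11434) to your machine. As a reminder, the tunnel command looks roughly like this (your username and VM address will differ):
ssh -N -L 11434:localhost:11434 your_user@your_vm_external_ip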
5. Create the client ollama_client.py
This script connects to an Ollama server running llama3:8b on a Google VM, using environment variables for configuration. It defines an ask_ollama_streaming function that sends a chat message payload to the Ollama API and streams the response in real time.
Key features:
- Loads the API endpoint and model name from .env variables
- Sends messages via a POST request with streaming enabled
- Yields streamed chunks of content from the response
- Includes basic error handling and line-by-line JSON parsing
import os # For accessing environment variables
import json # For parsing JSON responses
import requests # For sending HTTP requests to Ollama endpoint
from dotenv import load_dotenv # To load environment variables from a .env file
from typing import Generator # For typing the streaming generator function
# Load environment variables from a .env file into the environment
load_dotenv()
# Fetch the Ollama generation endpoint and model name from environment variables
OLLAMA_URL = os.getenv("OLLAMA_GENERATION_ENDPOINT")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL_NAME", "llama3:8b") # Default to "llama3:8b" if not set
# Function to send messages to the Ollama model and stream back the response
def ask_ollama_streaming(messages: list) -> Generator[str, None, None]:
    try:
        # Send a POST request to the Ollama endpoint with the model and messages
        response = requests.post(
            OLLAMA_URL,
            headers={"Content-Type": "application/json"},
            json={
                "model": OLLAMA_MODEL,
                "messages": messages
            },
            stream=True  # Enable streaming response
        )
        response.raise_for_status()  # Raise exception for HTTP errors

        # Iterate over streamed lines in the response
        for line in response.iter_lines(decode_unicode=True):
            if not line.strip():
                continue  # Skip empty lines
            try:
                # Parse the streamed line as JSON
                chunk = json.loads(line)
                # Safely extract the assistant's content from the response chunk
                content_piece = chunk.get("message", {}).get("content", "")
                if content_piece:
                    yield content_piece  # Yield each piece of content for streaming display
            except Exception as e:
                print(f"\n[Parse Error]: {e}")  # Handle JSON parsing errors gracefully
    except Exception as e:
        print(f"\n[Error]: {e}")  # Handle connection or request errors
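Before wiring up the chat interface, you can sanity-check the client with a quick one-off call. This snippet is just for testing (the file name test_client.py is arbitrary) and assumes the SSH tunnel is up:
# test_client.py — quick check that the streaming client works
from ollama_client import ask_ollama_streaming

messages = [{"role": "user", "content": "Say hello in one sentence."}]
for piece in ask_ollama_streaming(messages):
    print(piece, end="", flush=True)  # Print chunks as they arrive
print()
If you see the model's reply appear word by word, the client and the tunnel are working.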
6. Create the main script main.py
This script provides a terminal-based LLaMA 3 chatbot interface, powered by Ollama and styled with the rich library for a polished user experience. It connects to the local or remote Ollama server (via ask_ollama_streaming from ollama_client.py) and streams responses in real time.
Key features:
- Styled CLI chat interface using rich.console, Markdown, and Live components
- Real-time streaming of model responses from Ollama with a typing effect
- Persistent conversation context (chat_history) for multi-turn dialogue
- Displays the response time for each interaction
- Gracefully handles user interruptions and errors
To start chatting, run the script with python main.py and type your questions. Type 'exit' to quit.
from ollama_client import ask_ollama_streaming  # Import the streaming function from ollama_client.py
from rich.console import Console  # For pretty printing in the terminal
from rich.markdown import Markdown  # To render markdown-style responses
from rich.panel import Panel  # To display stylized boxes
from rich.live import Live  # To dynamically update terminal output (used for streaming effect)
import time  # ⏱️ Used to measure response time

# Create a rich console for stylized output
console = Console()

def main():
    # Print an intro panel when the app starts
    console.print(Panel("🤖 [bold cyan]LLaMA 3 Chatbot[/bold cyan]\nType [bold]'exit'[/bold] to quit.", expand=False))

    # Initialize the chat history with a system message to guide the model's behavior
    chat_history = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

    # Loop for continuous conversation
    while True:
        try:
            # Prompt user input in green
            user_input = console.input("\n[bold green]🧠 You:[/bold green] ")

            # Exit condition if user types "exit" or "quit"
            if user_input.lower() in {"exit", "quit"}:
                console.print("\n[bold yellow]👋 Goodbye![/bold yellow]")
                break

            # Add user message to chat history
            chat_history.append({"role": "user", "content": user_input})

            response_text = ""  # To accumulate streamed response
            console.print("\n[bold cyan]🤖 LLaMA:[/bold cyan]")
            start_time = time.time()  # ⏱️ Start timer for performance logging

            # Stream response and update live markdown display
            with Live(Markdown(""), refresh_per_second=10, console=console) as live:
                for chunk in ask_ollama_streaming(chat_history):
                    response_text += chunk  # Accumulate the streaming chunks
                    live.update(Markdown(response_text))  # Update terminal output in real time

            end_time = time.time()  # ⏱️ End timer
            elapsed = end_time - start_time

            # Add model response to chat history to maintain context
            chat_history.append({"role": "assistant", "content": response_text})

            # Display time taken for the response
            console.print(f"\n[dim]⏱️ Responded in {elapsed:.2f} seconds[/dim]")

        except KeyboardInterrupt:
            # Handle Ctrl+C gracefully
            console.print("\n[bold yellow]👋 Interrupted by user. Exiting.[/bold yellow]")
            break
        except Exception as e:
            # Catch and print any other error that might occur
            console.print(f"\n[bold red]❗ Error:[/bold red] {str(e)}")

# Run the chatbot if this file is executed directly
if __name__ == "__main__":
    main()