
Building a Terminal Chatbot with LLaMA 3
Previously we configured Ollama to run LLaMA 3 on a Google VM and connected to it through an SSH tunnel. In this article, we'll build a markdown-powered chatbot that runs in your terminal, using Python and the generation endpoint we created.
1. Create a project folder
mkdir llama_project
cd llama_project
2. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
3. Create requirements.txt to list all related dependencies
touch requirements.txt
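Based on the imports used in the scripts below, the project needs requests, python-dotenv, and rich (pin exact versions if you want reproducible installs). A minimal requirements.txt could look like this:
requests
python-dotenv
rich
Then install everything into the virtual environment with pip install -r requirements.txt.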
4. Create .env
I'm pulling llama3:8b for this article. We'll be connecting to the Ollama server on the previously created Google VM via an SSH tunnel.
OLLAMA_MODEL_NAME=llama3:8b
OLLAMA_GENERATION_ENDPOINT=http://localhost:11434/api/chat
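The localhost endpoint assumes the SSH tunnel from the previous article is already running, forwarding the VM's Ollama port (11434) to your machine. As a reminder, the tunnel command looks roughly like this (your username and VM address will differ):
ssh -N -L 11434:localhost:11434 your_user@your_vm_external_ip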
5. Create the client ollama_client.py
This script connects to an Ollama server running llama3:8b on a Google VM, using environment variables for configuration. It defines an ask_ollama_streaming function that sends a chat message payload to the Ollama API and streams the response in real time.
Key features:
- Loads the API endpoint and model name from .env variables
- Sends messages via a POST request with streaming enabled
- Yields streamed chunks of content from the response
- Includes basic error handling and line-by-line JSON parsing
import os # For accessing environment variables
import json # For parsing JSON responses
import requests # For sending HTTP requests to Ollama endpoint
from dotenv import load_dotenv # To load environment variables from a .env file
from typing import Generator # For typing the streaming generator function
# Load environment variables from a .env file into the environment
load_dotenv()
# Fetch the Ollama generation endpoint and model name from environment variables
OLLAMA_URL = os.getenv("OLLAMA_GENERATION_ENDPOINT")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL_NAME", "llama3:8b") # Default to "llama3:8b" if not set
# Function to send messages to the Ollama model and stream back the response
def ask_ollama_streaming(messages: list) -> Generator[str, None, None]:
    try:
        # Send a POST request to the Ollama endpoint with the model and messages
        response = requests.post(
            OLLAMA_URL,
            headers={"Content-Type": "application/json"},
            json={
                "model": OLLAMA_MODEL,
                "messages": messages
            },
            stream=True  # Enable streaming response
        )
        response.raise_for_status()  # Raise exception for HTTP errors

        # Iterate over streamed lines in the response
        for line in response.iter_lines(decode_unicode=True):
            if not line.strip():
                continue  # Skip empty lines
            try:
                # Parse the streamed line as JSON
                chunk = json.loads(line)
                # Safely extract the assistant's content from the response chunk
                content_piece = chunk.get("message", {}).get("content", "")
                if content_piece:
                    yield content_piece  # Yield each piece of content for streaming display
            except Exception as e:
                print(f"\n[Parse Error]: {e}")  # Handle JSON parsing errors gracefully
    except Exception as e:
        print(f"\n[Error]: {e}")  # Handle connection or request errors
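Before wiring up the chat interface, you can sanity-check the client with a quick one-off call. This snippet is just for testing (the file name test_client.py is arbitrary) and assumes the SSH tunnel is up:
# test_client.py — quick check that the streaming client works
from ollama_client import ask_ollama_streaming

messages = [{"role": "user", "content": "Say hello in one sentence."}]
for piece in ask_ollama_streaming(messages):
    print(piece, end="", flush=True)  # Print chunks as they arrive
print()
If you see the model's reply appear word by word, the client and the tunnel are working.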
6. Create the main script main.py
This script provides a terminal-based LLaMA 3 chatbot interface, powered by Ollama and styled with the rich library for a polished user experience. It connects to the local or remote Ollama server (via ask_ollama_streaming from ollama_client.py) and streams responses in real time.
Key features:
- Styled CLI chat interface using rich.console, Markdown, and Live components
- Real-time streaming of model responses from Ollama with a typing effect
- Persistent conversation context (chat_history) for multi-turn dialogue
- Displays the response time for each interaction
- Gracefully handles user interruptions and errors
To start chatting, run the script with python main.py and type your questions. Type 'exit' to quit.
from ollama_client import ask_ollama_streaming  # Import the streaming function from ollama_client.py
from rich.console import Console  # For pretty printing in the terminal
from rich.markdown import Markdown  # To render markdown-style responses
from rich.panel import Panel  # To display stylized boxes
from rich.live import Live  # To dynamically update terminal output (used for streaming effect)
import time  # ⏱️ Used to measure response time

# Create a rich console for stylized output
console = Console()

def main():
    # Print an intro panel when the app starts
    console.print(Panel("🤖 [bold cyan]LLaMA 3 Chatbot[/bold cyan]\nType [bold]'exit'[/bold] to quit.", expand=False))

    # Initialize the chat history with a system message to guide the model's behavior
    chat_history = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

    # Loop for continuous conversation
    while True:
        try:
            # Prompt user input in green
            user_input = console.input("\n[bold green]🧠 You:[/bold green] ")

            # Exit condition if user types "exit" or "quit"
            if user_input.lower() in {"exit", "quit"}:
                console.print("\n[bold yellow]👋 Goodbye![/bold yellow]")
                break

            # Add user message to chat history
            chat_history.append({"role": "user", "content": user_input})

            response_text = ""  # To accumulate streamed response
            console.print("\n[bold cyan]🤖 LLaMA:[/bold cyan]")
            start_time = time.time()  # ⏱️ Start timer for performance logging

            # Stream response and update live markdown display
            with Live(Markdown(""), refresh_per_second=10, console=console) as live:
                for chunk in ask_ollama_streaming(chat_history):
                    response_text += chunk  # Accumulate the streaming chunks
                    live.update(Markdown(response_text))  # Update terminal output in real time

            end_time = time.time()  # ⏱️ End timer
            elapsed = end_time - start_time

            # Add model response to chat history to maintain context
            chat_history.append({"role": "assistant", "content": response_text})

            # Display time taken for the response
            console.print(f"\n[dim]⏱️ Responded in {elapsed:.2f} seconds[/dim]")

        except KeyboardInterrupt:
            # Handle Ctrl+C gracefully
            console.print("\n[bold yellow]👋 Interrupted by user. Exiting.[/bold yellow]")
            break
        except Exception as e:
            # Catch and print any other error that might occur
            console.print(f"\n[bold red]❗ Error:[/bold red] {str(e)}")

# Run the chatbot if this file is executed directly
if __name__ == "__main__":
    main()