Monday, September 2, 2024

Building LLM Applications with OpenAI, Vector DBs and LangChain

Introduction: The Modern LLM Application Stack

Large Language Models (LLMs) have changed how we build AI applications. In this guide, we'll combine OpenAI's models with a vector database (Chroma) and LangChain to build a question-answering system over your own documents. We'll cover everything from basic setup to production concerns such as error handling, rate limiting, caching, and monitoring.

Setting Up the Environment

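First, let's install the required dependencies (ratelimit and prometheus-client are only needed for the production sections later on):

pip install openai langchain chromadb python-dotenv ratelimit prometheus-client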

Core Components

1. OpenAI Integration

Let's start with the OpenAI client, loading the API key from a .env file rather than hard-coding it:

from dotenv import load_dotenv
import os
from openai import OpenAI

# Load OPENAI_API_KEY (and any other settings) from a local .env file
load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def get_completion(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send a single user message and return the model's reply text."""
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7
    )
    return response.choices[0].message.content
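The messages list above holds a single user turn. If you also want to steer the model with a system message, a small helper keeps the list construction in one place. This is a minimal sketch: build_messages is a hypothetical name that is not part of the OpenAI SDK, and it only prepares the list that would be passed as messages.

```python
from typing import Dict, List, Optional

def build_messages(prompt: str, system: Optional[str] = None) -> List[Dict[str, str]]:
    """Return a chat message list, with the system message first if one is given.

    Hypothetical helper for illustration; it does not call the API.
    """
    messages: List[Dict[str, str]] = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return messages
```

The result can be passed to client.chat.completions.create in place of the hand-built list in get_completion.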

2. Vector Database Setup

We'll use Chroma as our vector store. Here's how to set it up:

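import chromadb

# Named chroma_client so it does not overwrite the OpenAI `client` created above.
# PersistentClient is the Chroma 0.4+ API; those releases no longer accept the
# older Settings(chroma_db_impl="duckdb+parquet", ...) configuration.
chroma_client = chromadb.PersistentClient(path="db")

# get_or_create_collection avoids an error when the script is run a second time
collection = chroma_client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)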
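To make the "hnsw:space": "cosine" setting concrete, here is a rough sketch of what cosine distance measures, written in plain Python. It is a brute-force illustration only: Chroma uses an approximate HNSW index rather than scanning every vector, and the names cosine_distance and nearest are our own, not Chroma functions.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query: Sequence[float],
            vectors: Dict[str, Sequence[float]],
            k: int = 2) -> List[Tuple[str, float]]:
    """Return the k ids closest to the query, smallest distance first."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

Because cosine distance ignores vector length, [1.0, 0.0] and [2.0, 0.0] are at distance 0 from each other, which is why it is a common choice for text embeddings.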

3. LangChain Integration

LangChain helps orchestrate the interaction between OpenAI and our vector store:

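# These import paths match older LangChain releases; newer ones move these
# classes to the langchain_openai, langchain_community and
# langchain_text_splitters packages.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
# Note: this rebinds the name OpenAI to LangChain's LLM wrapper. The OpenAI
# client object created earlier keeps working.
from langchain.llms import OpenAI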

embeddings = OpenAIEmbeddings()
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

def create_knowledge_base(documents):
    texts = text_splitter.split_documents(documents)
    vectorstore = Chroma.from_documents(
        documents=texts,
        embedding=embeddings,
        persist_directory="db"
    )
    return vectorstore
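It helps to see what chunk_size and chunk_overlap actually do to a text. The sketch below is a simplified fixed-window chunker written for illustration. It is not how CharacterTextSplitter works internally (that class first splits on a separator, "\n\n" by default, and then merges pieces), and chunk_text is a hypothetical name.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.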

Building the Application

Let's combine these components to create a question-answering system:

def create_qa_chain(vectorstore):
    qa_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(),
        return_source_documents=True
    )
    return qa_chain

def query_documents(qa_chain, query: str):
    response = qa_chain({"query": query})
    return {
        "answer": response["result"],
        "sources": [doc.page_content for doc in response["source_documents"]]
    }
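The "stuff" chain type places all retrieved passages into a single prompt alongside the question. The sketch below shows the idea in plain Python. The prompt wording and the name build_stuff_prompt are our own assumptions for illustration, not LangChain's actual template.

```python
from typing import List

def build_stuff_prompt(question: str, passages: List[str]) -> str:
    """Concatenate retrieved passages and the question into one prompt string."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Since every passage goes into one prompt, this approach only works while the retrieved text fits in the model's context window.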

Production Best Practices

1. Error Handling

Always implement robust error handling:

from typing import Optional, Dict, Any
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def safe_query(qa_chain, query: str) -> Optional[Dict[str, Any]]:
    try:
        return query_documents(qa_chain, query)
    except Exception as e:
        logger.error(f"Error processing query: {str(e)}")
        return None
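Returning None on the first failure is safe but gives up quickly on transient problems such as timeouts. One common refinement is to retry with exponential backoff. Here is a minimal sketch, assuming the wrapped call is safe to repeat; retry_with_backoff and its parameters are hypothetical names, and the sleep argument exists only so the delay can be replaced in tests.

```python
import logging
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)

def retry_with_backoff(func: Callable[[], T],
                       max_attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call func, retrying on any exception with delays of 1x, 2x, 4x... base_delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                return None
            # Exponential backoff plus a little jitter to avoid synchronized retries
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            logger.warning("Attempt %d failed (%s); retrying in %.2fs", attempt, exc, delay)
            sleep(delay)
    return None
```

It could be used as retry_with_backoff(lambda: query_documents(qa_chain, query)).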

2. Rate Limiting and Caching

Implement rate limiting and caching to optimize API usage:

from functools import lru_cache
from ratelimit import limits, sleep_and_retry

ONE_MINUTE = 60
MAX_CALLS_PER_MINUTE = 60

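# lru_cache goes outermost so cache hits return before the rate limiter runs
# and do not count against the per-minute limit
@lru_cache(maxsize=1000)
@sleep_and_retry
@limits(calls=MAX_CALLS_PER_MINUTE, period=ONE_MINUTE)
def cached_completion(prompt: str) -> str:
    return get_completion(prompt)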

3. Environment Configuration

Use environment variables for configuration:

# .env
OPENAI_API_KEY=your-api-key
EMBEDDING_MODEL=text-embedding-ada-002
COMPLETION_MODEL=gpt-3.5-turbo
MAX_TOKENS=500
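These values still have to be read and validated somewhere in the code. One way to do that is a small typed config object that fails fast when the API key is missing. This is a sketch under our own naming (AppConfig and load_config are hypothetical), with defaults copied from the .env file above. It assumes load_dotenv() has already been called.

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class AppConfig:
    openai_api_key: str
    embedding_model: str
    completion_model: str
    max_tokens: int

def load_config(env: Mapping[str, str] = os.environ) -> AppConfig:
    """Build an AppConfig from environment variables, failing fast on a missing key."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return AppConfig(
        openai_api_key=api_key,
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-ada-002"),
        completion_model=env.get("COMPLETION_MODEL", "gpt-3.5-turbo"),
        max_tokens=int(env.get("MAX_TOKENS", "500")),  # env values are always strings
    )
```

Passing the mapping as a parameter makes the function easy to test with a plain dict instead of the real environment.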

Deployment Considerations

  1. Scalability: Use async operations for better performance. safe_query is synchronous, so each call runs in a worker thread via asyncio.to_thread (Python 3.9+):
import asyncio
from typing import List

async def async_process_queries(qa_chain, queries: List[str]):
    tasks = [asyncio.to_thread(safe_query, qa_chain, query) for query in queries]
    return await asyncio.gather(*tasks)
  2. Monitoring: Implement proper logging and monitoring:
import prometheus_client as prom

query_latency = prom.Histogram('query_latency_seconds', 'Time spent processing queries')
query_counter = prom.Counter('queries_total', 'Total number of queries processed')
  3. Cost Management: Track token usage:
def estimate_tokens(text: str) -> int:
    return round(len(text.split()) * 1.3)  # Rough estimate: ~1.3 tokens per word

def track_usage(prompt: str, response: str):
    prompt_tokens = estimate_tokens(prompt)
    response_tokens = estimate_tokens(response)
    logger.info(f"Usage - Prompt: {prompt_tokens}, Response: {response_tokens}")
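When many queries arrive at once, launching them all concurrently can run straight into API rate limits. A semaphore is one way to cap how many are in flight. The sketch below assumes a synchronous query function that takes a single string; run_bounded and max_concurrency are hypothetical names, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
from typing import Any, Callable, List

async def run_bounded(func: Callable[[str], Any],
                      queries: List[str],
                      max_concurrency: int = 5) -> List[Any]:
    """Run func(query) in worker threads, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(query: str) -> Any:
        async with semaphore:  # wait here if too many calls are already running
            return await asyncio.to_thread(func, query)

    # gather returns results in the same order as the input queries
    return await asyncio.gather(*(run_one(q) for q in queries))
```

For example, asyncio.run(run_bounded(lambda q: safe_query(qa_chain, q), queries)) would process a batch while keeping at most five requests active.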

Conclusion

Building an LLM application means wiring together a model API, an embedding-backed vector store, and an orchestration layer, then hardening the result for production. By following the patterns above (error handling, rate limiting, caching, and monitoring), you can build applications on OpenAI's models that stay scalable and reliable.

Remember to:

  • Always handle API keys securely
  • Implement proper error handling and monitoring
  • Consider rate limits and costs
  • Cache responses when possible
  • Use async operations for better performance

The complete code for this tutorial is available on GitHub.