LlamaIndex Query Engine - OpenInference

This example demonstrates how to create and instrument a LlamaIndex query engine with OpenInference tracing.

Prerequisites

Python 3.9+
OpenAI API key
Phoenix or another OpenTelemetry collector

Installation

Install dependencies

pip install llama-index llama-index-core llama-index-llms-openai \
  llama-index-embeddings-openai \
  python-dotenv \
  openinference-instrumentation-llama-index \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp

Set environment variables

export OPENAI_API_KEY="your-api-key"
export COLLECTOR_ENDPOINT="http://localhost:6006/v1/traces"
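
Before starting the application, it can help to fail fast when one of these variables is missing. The following is a minimal sketch, assuming the two variable names used above; the helper missing_env_vars is hypothetical and not part of any library.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("OPENAI_API_KEY", "COLLECTOR_ENDPOINT")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example = missing_env_vars(env={"OPENAI_API_KEY": "your-api-key"})
print(example)
```

In an application entry point you might call missing_env_vars() with no arguments and exit with an error message if the returned list is not empty.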

Instrumentation Setup

First, create an instrumentation module named instrument.py (the later examples import it with "from instrument import instrument"):

import os

from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from openinference.semconv.resource import ResourceAttributes
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import SimpleSpanProcessor


def instrument():
    collector_endpoint = os.getenv("COLLECTOR_ENDPOINT")
    resource = Resource(attributes={ResourceAttributes.PROJECT_NAME: "llama-index-chat"})
    tracer_provider = trace_sdk.TracerProvider(resource=resource)
    span_exporter = OTLPSpanExporter(endpoint=collector_endpoint)
    span_processor = SimpleSpanProcessor(span_exporter=span_exporter)
    tracer_provider.add_span_processor(span_processor=span_processor)
    trace_api.set_tracer_provider(tracer_provider=tracer_provider)
    LlamaIndexInstrumentor().instrument()
    print("🔭 OpenInference instrumentation enabled.")
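
If instrument() can be reached from more than one entry point, it may be worth making the setup idempotent, since the global tracer provider is meant to be set only once per process. The following is a minimal sketch of a run-once guard. The helper run_once is hypothetical; in practice you would pass it the instrument function defined above.

```python
import threading

_lock = threading.Lock()
_done = False


def run_once(setup):
    """Call setup() on the first invocation only; return True if it ran."""
    global _done
    with _lock:
        if _done:
            return False
        setup()
        _done = True
        return True


# Demonstration with a stand-in setup function instead of instrument().
calls = []
first = run_once(lambda: calls.append("setup"))
second = run_once(lambda: calls.append("setup"))
print(first, second, calls)
```

The lock only guards against concurrent calls within one process. Each worker process still needs its own setup.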

Complete Query Engine Example

import os
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Load environment and instrument
load_dotenv()
from instrument import instrument
instrument()

# Configure LLM and embeddings
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine(similarity_top_k=3)

# Query the engine
response = query_engine.query("What is the main topic of these documents?")
print(response)
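
To make the similarity_top_k=3 setting concrete, here is an illustrative sketch of top-k retrieval, assuming cosine similarity as the scoring function. It is not LlamaIndex's implementation. The vectors and chunk ids are made up, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_id, score) pairs for the k most similar chunks."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical two-dimensional "embeddings" for four chunks.
chunks = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [-1.0, 0.0],
}
results = top_k([1.0, 0.0], chunks, k=3)
print([cid for cid, _ in results])
```

Only the k highest-scoring chunks are passed to the LLM as context, so a larger k gives the model more material at the cost of a longer prompt.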

Chat Engine Example

LlamaIndex also provides chat engines for conversational applications. This example reuses the index built in the query engine example above:

from llama_index.core.memory import ChatMemoryBuffer

# Create chat engine with memory
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a helpful assistant with access to a knowledge base. "
        "Always ground your answers in the provided context."
    ),
)

# Have a conversation
response1 = chat_engine.chat("Tell me about the key concepts.")
print(response1)

response2 = chat_engine.chat("Can you elaborate on the first point?")
print(response2)
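
The token_limit argument bounds how much history is sent back to the model. The following sketch illustrates the general idea of a token-limited buffer that keeps only the most recent messages that fit the budget. It is an assumption-laden illustration, not ChatMemoryBuffer's actual code: the class TokenLimitedBuffer is hypothetical, and whitespace-separated words stand in for real tokens.

```python
class TokenLimitedBuffer:
    """Keep the most recent messages whose combined size fits a token budget."""

    def __init__(self, token_limit):
        self.token_limit = token_limit
        self._messages = []

    @staticmethod
    def count_tokens(text):
        # Assumption: word count approximates token count.
        return len(text.split())

    def put(self, role, content):
        self._messages.append((role, content))

    def get(self):
        """Return the newest messages that fit within token_limit, oldest first."""
        kept, used = [], 0
        for role, content in reversed(self._messages):
            cost = self.count_tokens(content)
            if used + cost > self.token_limit:
                break
            kept.append((role, content))
            used += cost
        kept.reverse()
        return kept


buffer = TokenLimitedBuffer(token_limit=7)
buffer.put("user", "Tell me about the key concepts.")
buffer.put("assistant", "There are three main ideas.")
buffer.put("user", "Elaborate please.")
history = buffer.get()
print(history)
```

With a budget of 7 words, the first user message no longer fits, so only the last two messages are returned.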

Streaming Responses

# Create streaming query engine
query_engine = index.as_query_engine(streaming=True)

# Stream the response
streaming_response = query_engine.query("Explain the main ideas.")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()
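
Because response_gen is a generator, it can be consumed only once. If you also need the complete text, for example for logging, one option is to collect tokens while printing them. A minimal sketch, where consume_stream and fake_response_gen are hypothetical helpers and the fake generator stands in for streaming_response.response_gen:

```python
def consume_stream(token_iter, on_token=None):
    """Iterate a token stream once, forwarding each token and returning the full text."""
    parts = []
    for token in token_iter:
        if on_token is not None:
            on_token(token)
        parts.append(token)
    return "".join(parts)


def fake_response_gen():
    # Stand-in for streaming_response.response_gen.
    yield from ["Main ", "ideas: ", "A, B."]


full_text = consume_stream(
    fake_response_gen(),
    on_token=lambda t: print(t, end="", flush=True),
)
print()
```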

Key Features

Automatic Tracing

LlamaIndex instrumentation captures:

Query execution: Full query pipeline from input to output
Retrieval: Document retrieval with similarity scores
LLM calls: All calls to language models
Embeddings: Embedding generation for queries and documents
Node processing: Document chunking and indexing

Resource Attributes

Use resource attributes to organize projects:

resource = Resource(attributes={
    ResourceAttributes.PROJECT_NAME: "my-app",
})

Memory and Context

The instrumentation tracks:

Conversation history in chat engines
Context window management
Memory buffer operations

Production Setup

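For production deployments, you can expose the engines through a FastAPI service (this additionally requires pip install fastapi uvicorn):

import uvicorn
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

from instrument import instrument

# Initialize instrumentation before app creation
instrument()

# Build the index and engines once at startup, as in the examples above
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
# Note: this single chat engine shares its conversation memory across all requests
chat_engine = index.as_chat_engine(chat_mode="context")

app = FastAPI(title="LlamaIndex API")

# Plain "def" endpoints run in FastAPI's threadpool, so the blocking
# LlamaIndex calls do not stall the event loop.
@app.post("/query")
def query_endpoint(question: str):
    response = query_engine.query(question)
    return {"answer": str(response)}

@app.post("/chat/stream")
def chat_stream_endpoint(message: str):
    streaming_response = chat_engine.stream_chat(message)

    def generate():
        for token in streaming_response.response_gen:
            yield token

    return StreamingResponse(generate(), media_type="text/plain")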

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
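
Because question is declared as a plain str parameter, FastAPI reads it from the query string rather than from a JSON body. The following client sketch uses only the standard library and assumes the server runs at http://localhost:8000, as configured above. The helpers build_query_url and post_question are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_query_url(base_url, question):
    """Build the /query URL with the question as a URL-encoded query parameter."""
    return f"{base_url.rstrip('/')}/query?{urlencode({'question': question})}"


def post_question(base_url, question, timeout=30):
    """POST the question and return the 'answer' field of the JSON response."""
    request = Request(build_query_url(base_url, question), method="POST")
    with urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["answer"]


url = build_query_url("http://localhost:8000", "What is the main topic?")
print(url)
```

Calling post_question("http://localhost:8000", "What is the main topic?") requires the server to be running, so it is not executed here.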
