Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for extending the capabilities of large language models (LLMs). By combining the generative strengths of LLMs with the factual grounding of retrieval systems, RAG addresses one of the most persistent challenges of LLMs: hallucination.
In this tutorial, we will build a complete RAG system using:
- FAISS (Facebook AI Similarity Search) as our vector database
- Sentence Transformers for high-quality embeddings
- An open-source LLM from Hugging Face (we will use a lightweight, CPU-compatible model)
- A custom knowledge base that we will create
By the end of this tutorial, you will have a working RAG system that can answer questions from your documents with improved accuracy and relevance. This approach is valuable for building domain-specific assistants, customer support systems, or any application where grounding LLM responses in specific documents matters.
Let's get started.
Step 1: Setting Up Our Environment
First, we need to install all the necessary libraries. For this tutorial, we will use Google Colab.
# Install required packages
!pip install -q transformers==4.34.0
!pip install -q sentence-transformers==2.2.2
!pip install -q faiss-cpu==1.7.4
!pip install -q accelerate==0.23.0
!pip install -q einops==0.7.0
!pip install -q langchain==0.0.312
!pip install -q langchain_community
!pip install -q pypdf==3.15.1
Let's also check whether we have access to a GPU, which will speed up model inference:
import torch
# Check if GPU is available
print(f"GPU available: torch.cuda.is_available()")
if torch.cuda.is_available():
print(f"GPU name: torch.cuda.get_device_name(0)")
else:
print("Running on CPU. We'll use a CPU-compatible model.")
Step 2: Creating Our Knowledge Base
For this tutorial, we will create a simple knowledge base about AI concepts. In a real-world scenario, you would import PDF documents, web pages, or databases (see the PDF-loading sketch at the end of this step).
import os
import tempfile
# Create a temporary directory for our documents
docs_dir = tempfile.mkdtemp()
print(f"Created temporary directory at docs_dir")
# Create sample documents about AI concepts
documents = {
"vector_databases.txt": """
Vector databases are specialized database systems designed to store, manage, and search vector embeddings efficiently.
They are crucial for machine learning applications, particularly those involving natural language processing and image recognition.
Key features of vector databases include:
1. Fast similarity search using algorithms like HNSW, IVF, or exact search
2. Support for various distance metrics (cosine, euclidean, dot product)
3. Scalability for handling billions of vectors
4. Often support for metadata filtering alongside vector search
Popular vector databases include FAISS (Facebook AI Similarity Search), Pinecone, Weaviate, Milvus, and Chroma.
FAISS specifically was developed by Facebook AI Research and is an open-source library for efficient similarity search.
""",
"embeddings.txt": """
Embeddings are dense vector representations of data in a continuous vector space.
They capture semantic meaning and relationships between entities by positioning similar items closer together in the vector space.
Types of embeddings include:
1. Word embeddings (Word2Vec, GloVe)
2. Sentence embeddings (Universal Sentence Encoder, SBERT)
3. Document embeddings
4. Image embeddings
5. Audio embeddings
Embeddings are created through various techniques, including neural networks trained on specific tasks.
Modern embedding models like those from OpenAI, Cohere, or Sentence Transformers can capture nuanced semantic relationships.
The dimensionality of embeddings typically ranges from 100 to 1536 dimensions, with higher dimensions often capturing more information but requiring more storage and computation.
""",
"rag_systems.txt": """
Retrieval-Augmented Generation (RAG) is an AI architecture that combines information retrieval with text generation.
The RAG process typically works as follows:
1. User query is converted into an embedding vector
2. Similar documents or passages are retrieved from a knowledge base using vector similarity
3. Retrieved content is provided as context to the language model
4. The language model generates a response informed by both its parameters and the retrieved information
Benefits of RAG include:
1. Reduced hallucination compared to pure generative approaches
2. Up-to-date information without model retraining
3. Attribution of information sources
4. Lower computation costs than increasing model size
RAG systems can be enhanced through techniques like reranking, query reformulation, and hybrid search approaches.
"""
# Write documents to files
for filename, content in documents.items():
    with open(os.path.join(docs_dir, filename), 'w') as f:
        f.write(content)
print(f"Created {len(documents)} documents in {docs_dir}")
Step 3: Loading and Processing Documents
Now, let's load these documents and process them for our RAG system:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize a list to store our documents
all_documents = []
# Load each text file
for filename in documents.keys():
    file_path = os.path.join(docs_dir, filename)
    loader = TextLoader(file_path)
    loaded_docs = loader.load()
    all_documents.extend(loaded_docs)
print(f"Loaded {len(all_documents)} documents")
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)
document_chunks = text_splitter.split_documents(all_documents)
print(f"Created len(document_chunks) document chunks")
# Let's look at a sample chunk
print("nSample chunk content:")
print(document_chunks[0].page_content)
print(f"Source: document_chunks[0].metadata")
Step 4: Creating Embeddings
Now, let's convert our document chunks into vector embeddings:
from sentence_transformers import SentenceTransformer
import numpy as np
# Initialize the embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2" # A good balance of speed and quality
embedding_model = SentenceTransformer(model_name)
print(f"Loaded embedding model: model_name")
print(f"Embedding dimension: embedding_model.get_sentence_embedding_dimension()")
# Create embeddings for all document chunks
texts = [doc.page_content for doc in document_chunks]
embeddings = embedding_model.encode(texts)
print(f"Created len(embeddings) embeddings with shape embeddings.shape")
Step 5: Building the FAISS Index
Now we will build our FAISS index from these embeddings:
import faiss
# Get the dimensionality of our embeddings
dimension = embeddings.shape[1]
# Create a FAISS index - we'll use a simple Flat L2 index for demonstration
# For larger datasets, consider using indexes like IVF or HNSW for better performance
index = faiss.IndexFlatL2(dimension) # L2 is Euclidean distance
# Add our vectors to the index
index.add(embeddings.astype(np.float32)) # FAISS requires float32
print(f"Created FAISS index with index.ntotal vectors")
# Create a mapping from index position to document chunk for retrieval
index_to_doc_chunk = i: doc for i, doc in enumerate(document_chunks)
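The flat index above performs an exact search over every vector, which is fine for a few hundred chunks. For larger corpora, the comment in the code mentions IVF and HNSW indexes; here is a minimal, hedged sketch of an IVF alternative. The nlist and nprobe values are illustrative assumptions, not part of the tutorial's code.
# IVF partitions the vectors into nlist clusters and searches only the nprobe
# closest clusters at query time, trading a little recall for speed.
nlist = 4  # number of clusters; kept small because our corpus is tiny
quantizer = faiss.IndexFlatL2(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)
ivf_index.train(embeddings.astype(np.float32))  # IVF indexes must be trained before adding vectors
ivf_index.add(embeddings.astype(np.float32))
ivf_index.nprobe = 2  # number of clusters to scan per query
print(f"IVF index contains {ivf_index.ntotal} vectors")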
Step 6: Loading the Language Model
Now let's load an open-source language model from Hugging Face. We will use a small model that works well on CPU:
from transformers import AutoTokenizer, AutoModelForCausalLM
# We'll use a smaller model that works on CPU
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # Use float32 for CPU compatibility
    device_map="auto"  # Will use CPU if GPU is not available
)
print(f"Successfully loaded {model_id}")
Step 7: Creating Our RAG Pipeline
Let's create a function that connects retrieval and generation:
def rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    """
    Generate a response using the RAG pattern.

    Args:
        query: The user's question
        index: FAISS index
        embedding_model: Model to create embeddings
        llm_model: Language model for generation
        llm_tokenizer: Tokenizer for the language model
        index_to_doc_map: Mapping from index positions to document chunks
        top_k: Number of documents to retrieve
    Returns:
        response: The generated response
        sources: The source documents used
    """
    # Step 1: Convert query to embedding
    query_embedding = embedding_model.encode([query])
    query_embedding = query_embedding.astype(np.float32)  # Convert to float32 for FAISS

    # Step 2: Search for similar documents
    distances, indices = index.search(query_embedding, top_k)

    # Step 3: Retrieve the actual document chunks
    retrieved_docs = [index_to_doc_map[idx] for idx in indices[0]]

    # Create context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Step 4: Create prompt for the LLM (TinyLlama format)
    prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."
Context:
{context}
<|user|>
{query}
<|assistant|>"""

    # Step 5: Generate response from LLM
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)
    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }

    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the assistant's response (remove the prompt)
    response = generated_text.split("<|assistant|>")[-1].strip()

    # Return both the response and the sources
    sources = [(doc.page_content, doc.metadata) for doc in retrieved_docs]
    return response, sources
Step 8: Testing Our RAG System
Let’s test our system with some questions:
# Define some test questions
test_questions = [
"What is FAISS and what is it used for?",
"How do embeddings capture semantic meaning?",
"What are the benefits of RAG systems?",
"How does vector search work?"
]
# Test our RAG pipeline
for question in test_questions:
    print(f"\n\n{'='*50}")
    print(f"Question: {question}")
    print(f"{'='*50}\n")
    response, sources = rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_to_doc_chunk,
        top_k=2  # Retrieve top 2 most relevant chunks
    )
    print(f"Response: {response}\n")
    print("Sources:")
    for i, (content, metadata) in enumerate(sources):
        print(f"\nSource {i+1}:")
        print(f"Metadata: {metadata}")
        print(f"Content snippet: {content[:100]}...")
Output:
Step 9: Evaluating Our RAG System
Let's implement a simple evaluation function to assess the performance of our RAG system:
def evaluate_rag_response(question, response, retrieved_sources, ground_truth_sources=None):
    """
    Simple evaluation of RAG response quality

    Args:
        question: The query
        response: Generated response
        retrieved_sources: Sources used for generation
        ground_truth_sources: (Optional) Known correct sources
    Returns:
        evaluation metrics
    """
    # Basic metrics
    response_length = len(response.split())
    num_sources = len(retrieved_sources)

    # Simple relevance score - we'd use better methods in production
    source_relevance = []
    for content, _ in retrieved_sources:
        # Count overlapping words between question and source
        q_words = set(question.lower().split())
        s_words = set(content.lower().split())
        overlap = len(q_words.intersection(s_words))
        source_relevance.append(overlap / len(q_words) if q_words else 0)

    avg_relevance = sum(source_relevance) / len(source_relevance) if source_relevance else 0

    return {
        "response_length": response_length,
        "num_sources": num_sources,
        "source_relevance_scores": source_relevance,
        "avg_relevance": avg_relevance
    }
# Evaluate one of our previous responses
question = test_questions[0]
response, sources = rag_response(
    query=question,
    index=index,
    embedding_model=embedding_model,
    llm_model=model,
    llm_tokenizer=tokenizer,
    index_to_doc_map=index_to_doc_chunk,
    top_k=2
)
# Run evaluation
eval_results = evaluate_rag_response(question, response, sources)
print(f"nEvaluation results for question: 'question'")
for metric, value in eval_results.items():
print(f"metric: value")
Step 10: Advanced RAG Technique – Query Expansion
Let's implement query expansion to improve retrieval:
# Here's the implementation of the expand_query function:
def expand_query(original_query, llm_model, llm_tokenizer):
    """
    Generate multiple search queries from an original query to improve retrieval

    Args:
        original_query: The user's original question
        llm_model: The language model for generating variations
        llm_tokenizer: Tokenizer for the language model
    Returns:
        List of query variations including the original
    """
    # Create a prompt for query expansion
    prompt = f"""<|system|>
You are a helpful assistant. Generate two alternative versions of the given search query.
The goal is to create variations that might help retrieve relevant information.
Only list the alternative queries, one per line. Do not include any explanations, numbering, or other text.
<|user|>
Generate alternative versions of this search query: "{original_query}"
<|assistant|>"""

    # Generate variations
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the generated variations
    response_part = generated_text.split("<|assistant|>")[-1].strip()

    # Split response by lines to get individual variations
    variations = [line.strip() for line in response_part.split('\n') if line.strip()]

    # Ensure we have at least some variations
    if not variations:
        variations = [original_query]

    # Add the original query and return the list with duplicates removed
    all_queries = [original_query] + variations
    return list(dict.fromkeys(all_queries))  # Remove duplicates while preserving order
Step 11: Evaluating and Improving Retrieval with expand_query
Let's test our expand_query function and use the expanded queries to enhance retrieval:
# Example usage of expand_query function
test_query = "How does FAISS help with vector search?"
# Generate query variations
expanded_queries = expand_query(
    original_query=test_query,
    llm_model=model,
    llm_tokenizer=tokenizer
)
print(f"Original Query: {test_query}")
print("Expanded Queries:")
for i, query in enumerate(expanded_queries):
    print(f"  {i+1}. {query}")
# Enhanced RAG with query expansion
all_retrieved_docs = []
all_scores = {}

# Retrieve documents for each query variation
for query in expanded_queries:
    # Get query embedding
    query_embedding = embedding_model.encode([query]).astype(np.float32)

    # Search in FAISS index
    distances, indices = index.search(query_embedding, 3)

    # Track document scores across queries (using 1/(1+distance) as score)
    for idx, dist in zip(indices[0], distances[0]):
        score = 1.0 / (1.0 + dist)
        if idx in all_scores:
            # Take max score if document retrieved by multiple query variations
            all_scores[idx] = max(all_scores[idx], score)
        else:
            all_scores[idx] = score

# Get top documents based on scores
top_indices = sorted(all_scores.keys(), key=lambda idx: all_scores[idx], reverse=True)[:3]
expanded_retrieved_docs = [index_to_doc_chunk[idx] for idx in top_indices]

print("\nRetrieved documents using query expansion:")
for i, doc in enumerate(expanded_retrieved_docs):
    print(f"\nResult {i+1}:")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content snippet: {doc.page_content[:150]}...")
# Now use these documents with the LLM to generate a response
context = "nn".join([doc.page_content for doc in expanded_retrieved_docs])
# Create prompt for the LLM
prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."
Context:
context
<|user|>
test_query
<|assistant|>"""
# Generate response
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
output = model.generate(
input_ids=input_ids,
max_new_tokens=256,
temperature=0.7,
top_p=0.95,
do_sample=True
)
# Extract response
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
response = generated_text.split("<|assistant|>")[-1].strip()
print("nFinal RAG Response with Query Expansion:")
print(response)
Output:
FAISS can handle a wide range of vector types including text, image, and audio, and can be integrated with popular machine learning frameworks such as TensorFlow, PyTorch, and scikit-learn.
Conclusion
In this tutorial, we built a complete RAG system using FAISS as our vector database and an open-source LLM. We implemented document processing, embedding generation, and vector search, integrated these components into a working pipeline, and then improved retrieval quality with query expansion.
To take this further, you could consider:
- Implementing query reranking with cross-encoders (a hedged sketch follows this list)
- Creating a web interface with Gradio or Streamlit
- Adding metadata filtering capabilities
- Experimenting with different embedding models
- Scaling the solution with a more efficient FAISS index (HNSW, IVF)
- Fine-tuning the LLM on your domain-specific data
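As an illustration of the first item, here is a minimal reranking sketch using the CrossEncoder class from sentence-transformers. The model name cross-encoder/ms-marco-MiniLM-L-6-v2 is a common choice but an assumption on our part; it is not used elsewhere in this tutorial.
from sentence_transformers import CrossEncoder
# A small cross-encoder that scores (query, passage) pairs jointly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query, docs, top_k=3):
    """Re-order FAISS candidates by cross-encoder relevance scores."""
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)  # higher score = more relevant
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
# Usage: over-retrieve candidates with FAISS, then let the cross-encoder re-order them
query = "How does vector search work?"
q_emb = embedding_model.encode([query]).astype(np.float32)
k = min(6, index.ntotal)  # don't ask FAISS for more vectors than it holds
_, idxs = index.search(q_emb, k)
candidates = [index_to_doc_chunk[i] for i in idxs[0]]
top_docs = rerank(query, candidates)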
Useful resources:
The accompanying Colab notebook contains the full code for this tutorial.
Asjad is a consulting intern at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.