Developer Guide to Getting Started with AI/ML: PDF to ChatBot

Working with models, embeddings, and a vector database on your local machine

Shashwat Srivastava

Introduction:

In this blog, we will cover common machine learning terminology and build a simple application that lets users ask questions about a PDF on their local machine, for free. I have tested this on a résumé PDF, and it seems to provide good results. While fine-tuning for better performance is possible, it is beyond the scope of this blog.
_______________________________________________________________

What are Models?
In the context of Artificial Intelligence (AI) and Machine Learning (ML), “models” refer to mathematical representations or algorithms trained on data to perform specific tasks. These models learn patterns and relationships within the data and use this knowledge to make predictions, classifications, or decisions.

What are Embeddings?
Embeddings, in simple terms, are compact numerical representations of data. They take complex data (like words, sentences, or images) and translate it into a list of numbers (a vector) that captures the key features and relationships within the data. This makes it easier for machine learning models to understand and work with the data.
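
To make this concrete, here is a toy illustration of embeddings (these particular numbers are invented, and real embedding vectors have hundreds of dimensions rather than four):

embedding_of_cat = [0.21, -0.44, 0.90, 0.13]
embedding_of_kitten = [0.19, -0.40, 0.88, 0.15]  # close to "cat" in vector space
embedding_of_car = [-0.75, 0.62, 0.05, -0.33]    # far from "cat"

Texts with similar meanings end up close together in this space, which is what makes similarity search possible.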

What are Embedding Models?
Embedding models are tools that transform complex data (like words, sentences, or images) into simpler numerical forms called embeddings.

What are vector databases?
Vector databases are specialized databases designed to store, index, and query high-dimensional vectors efficiently. These vectors, often generated by machine learning models, represent data like text, images, or audio in a numerical format. Vector databases are optimized for tasks involving similarity searches and nearest-neighbour queries, which are common in applications like recommendation systems, image retrieval, and natural language processing. We store our embeddings in a vector database.
Examples of vector databases: ChromaDB, Pinecone, ADX, FAISS…
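
Under the hood, a similarity search boils down to comparing vectors. Here is a minimal sketch of a cosine-similarity comparison, the kind of nearest-neighbour scoring a vector database performs internally; the vectors reuse the toy values from above and are purely illustrative:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = [0.20, -0.42, 0.89, 0.14]
print(cosine_similarity(query_vec, [0.19, -0.40, 0.88, 0.15]))   # high score: similar
print(cosine_similarity(query_vec, [-0.75, 0.62, 0.05, -0.33]))  # low score: dissimilar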

Representation of Embeddings

What does “dimensions” refer to in the context of vector embeddings?
Dimensions in vector embeddings refer to the number of components or features used to represent each item (such as a word, document, or image) in a high-dimensional space. For example, in a word embedding model, the word “king” might be represented as a vector in a 300-dimensional space. Each dimension of this vector could capture different linguistic aspects such as gender, royalty, and age.

What is LLM?
An LLM, or Large Language Model, is a type of artificial intelligence model that processes and generates human-like text based on the patterns and information it has learned from large amounts of text data. It’s designed to understand and generate natural language, making it useful for tasks like answering questions, writing essays, translating languages, and more. Examples of LLMs include GPT-3, GPT-4, and BERT.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Workflow:

1. Convert PDF to text.
2. Create embeddings:
  • Recursively split the extracted text into chunks and create an embedding for each chunk.
  • Use an embedding model to create the embeddings. In our case we use model="nomic-embed-text", provided by the Ollama library.
  • Store the embeddings in a vector database (in our example, ChromaDB).
3. Take the user's question and create an embedding for the question text.
4. Query your vector database to find the most similar embeddings, specifying the number of results you need. ChromaDB performs a similarity search to return the best matches.
5. Pass the user's question plus the similar results as context to an LLM for a more precise and fitting answer. In our example we use model="llama3".
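
As a bird's-eye view before we start, here is the same flow as pseudo-code; the names are placeholders, and each step is built out properly in the sections below:

# Pseudo-code outline only; the real implementations follow later in the post.
#
#   text        = extract_text_from_pdf("my_pdf.pdf")              # step 1
#   chunks      = recursively_split_into_chunks(text)              # step 2
#   chromadb_collection.add(embed(chunk) for chunk in chunks)      # step 2
#   q_embedding = embed(user_question)                             # step 3
#   context     = chromadb_collection.query(q_embedding, n=5)      # step 4
#   answer      = llama3(context + user_question)                  # step 5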

Prerequisite:

  • Install Python.
  • To run models locally, download Ollama from “https://ollama.com/”. Ollama is an open-source project that serves as a powerful and user-friendly platform for running LLMs on your local machine.
  • If you wish, you can use other models provided by OpenAI and Hugging Face.
  • For a quick start, just run: ollama run llama3
  • To get the embedding model, run: ollama pull nomic-embed-text
    - Choose a suitable model from the Ollama library.
  • Install Jupyter and create a .ipynb notebook.

Test installation:

import requests

# Ollama runs on port 11434 by default.
res = requests.post('http://localhost:11434/api/embeddings',
                    json={
                        'model': 'nomic-embed-text',
                        'prompt': 'Hello world'
                    })

print(res.json())

# In our example we will be using the LangChain framework.
# LangChain provides a library to interact with Ollama.
Output: you will get the embedding (a list of floats) as the response.

What is langchain?
LangChain is a framework designed to simplify the development of applications that leverage large language models (LLMs) for various natural language processing (NLP) tasks. It provides tools and abstractions to help developers build, manage, and deploy applications that use LLMs efficiently and effectively.
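
As a minimal sketch of what that looks like in practice (assuming Ollama is installed and the llama3 model has been pulled, as described in the prerequisites):

from langchain_community.llms import Ollama

llm = Ollama(model="llama3")  # talks to the local Ollama server
print(llm.invoke("Explain embeddings in one sentence."))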

Install Dependencies and import required libraries:

!pip3 install numpy requests
!pip3 install chromadb
!pip3 install jupyter --upgrade
!pip3 install langchain-community

from PyPDF2 import PdfReader # used to extract text from pdf
from langchain.text_splitter import RecursiveCharacterTextSplitter # split text in smaller snippets
from langchain_community.llms import Ollama # interact with ollama local server
from langchain_community.embeddings import OllamaEmbeddings # create embeddings via the ollama server

import requests
import chromadb
import numpy
import uuid

Define constants:

DIMENSION = len(res.json()['embedding']) # 768 dimensions for nomic-embed-text embedding model
EXTRACTED_TEXT_FILE_PATH = "pdf_txt.txt" # text extracted from pdf
PDF_FILE_PATH = "my_pdf.pdf"
CHUNK_SIZE = 150 # chunk size to create snippets
CHUNK_OVERLAP = 20 # overlap between consecutive snippets
OUTPUT_RESULT_COUNT = 5 # number of matching chunks to retrieve from the vector database

Initialize chromaDB:

chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="test_chat_nomic")
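
Note that chromadb.Client() keeps everything in memory. If you want the collection to survive notebook restarts, ChromaDB also offers a persistent client (the path below is just an example):

# Optional alternative: persist the collection to disk.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="test_chat_nomic")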

Initialize Ollama-embeddings and test:

embeddings = OllamaEmbeddings(model="nomic-embed-text")

embedding = embeddings.embed_query("Hello World")
dimension = len(embedding)
print(len(embedding)) #768

Extract text from pdf:

def extract_text_from_pdf(file_path: str):
    # Open the PDF file using the specified file_path
    reader = PdfReader(file_path)
    # Get the total number of pages in the PDF
    number_of_pages = len(reader.pages)

    # Initialize an empty string to store extracted text
    pdf_text = ""

    # Loop through each page of the PDF
    for i in range(number_of_pages):
        # Get the i-th page
        page = reader.pages[i]
        # Extract text from the page and append it to pdf_text
        pdf_text += page.extract_text()
        # Add a newline after each page's text for readability
        pdf_text += "\n"

    # Specify the file path for the new text file
    file_path = EXTRACTED_TEXT_FILE_PATH

    # Write the content to the text file
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(pdf_text)

Create Embeddings and Save it to chromaDB collection:

def create_embeddings(file_path: str):
    # Initialize a list to store text snippets
    snippets = []
    # Initialize a RecursiveCharacterTextSplitter with the specified settings
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

    # Read the content of the file specified by file_path
    with open(file_path, "r", encoding="utf-8") as file:
        file_text = file.read()

    # Split the text into snippets using the specified settings
    snippets = text_splitter.split_text(file_text)
    print(len(snippets))

    # Pre-allocate a matrix to hold one embedding per snippet
    x = numpy.zeros((len(snippets), dimension), dtype='float32')
    ids = []

    for i, snippet in enumerate(snippets):
        print(snippet)
        embedding = embeddings.embed_query(snippet)
        ids.append(get_uuid())
        x[i] = numpy.array(embedding)

    # Store the snippets, their embeddings, and generated ids in ChromaDB
    collection.add(embeddings=x,
                   documents=snippets,
                   ids=ids)

def get_uuid():
    return str(uuid.uuid4())

Execute:

extract_text_from_pdf(PDF_FILE_PATH)
create_embeddings(EXTRACTED_TEXT_FILE_PATH)

To see what's in the collection, run:

data = collection.get(include=['embeddings', 'documents', 'metadatas'])
print(data['embeddings'])
print(data['ids'])
print(data['metadatas'])
print(data['documents'])

Retrieve n matching answers from ChromaDB for a user's query and pass those answers as context to an LLM model (here LLaMA3) to frame a response.

def answer_users_question(user_question):
    # Embed the user's question and find the most similar chunks in ChromaDB
    embedding_arr = embeddings.embed_query(user_question)
    result = collection.query(
        query_embeddings=embedding_arr,
        n_results=OUTPUT_RESULT_COUNT
    )

    return frame_response(result['documents'][0], user_question)

def frame_response(results, ques):
    # Join the retrieved chunks and append the question to form the prompt
    joined_string = "\n".join(results)
    prompt = joined_string + "\n Given this information, " + ques
    llm = Ollama(
        model="llama3"
    )
    return llm.invoke(prompt)
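
To see what frame_response actually sends to the model, here is a hypothetical call; the retrieved snippets below are invented for illustration, and in practice they come from the ChromaDB query:

# Invented snippets, standing in for real chunks returned by ChromaDB.
retrieved = [
    "Shashwat has 5 years of experience in backend development.",
    "Skills: Python, Java, distributed systems.",
]
# The prompt passed to llama3 becomes the snippets joined by newlines,
# followed by "\n Given this information, <question>".
print(frame_response(retrieved, "what are the candidate's key skills?"))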

Start an infinite loop, allowing the user to ask questions.

while True:

    # Prompt the user to input a question
    print("👤USER:")

    # Read the user's question from the console
    user_question = input("Enter question: ")
    print(user_question)

    # Print a separator for readability
    print("----------------------")

    # Check if the user typed "exit"
    if user_question == "exit":

        # If so, exit the loop
        break
    else:

        # Otherwise, proceed to generate a response
        print("🤖 BOT:")

        # Call the function to generate an answer based on the user's question
        # and print the bot's response
        print(answer_users_question(user_question=user_question))

        # Print a separator for readability
        print("----------------------")
Output: ChatBot

Github: https://github.com/shashwat12june/pdftochatbot

FAQs

  1. Why use embedding search over normal search?
    Embedding search is better than normal keyword search because it understands the meaning and context of words, not just exact matches. This means it can find more relevant results even if the words used are different but have the same meaning. It’s useful in many applications, like finding similar documents or images and making personalized recommendations.
    For example, in embedding search, “car” and “automobile” would be recognized as similar because they sit close together in the embedding space. In contrast, normal search might miss relevant results if the exact keyword isn’t used, as it doesn’t account for synonyms or variations. (A short sketch at the end of this FAQ section illustrates the difference.)
  2. Are there any alternatives to vector databases?
    Yes. The main alternatives to a dedicated vector database are:
    Relational databases:
    • PostgreSQL and MySQL: can store embeddings as arrays or JSON fields, with custom implementations for vector operations.
    • Extensions like pgvector enhance PostgreSQL’s capabilities for efficient vector searches.
    NoSQL databases:
    • MongoDB: offers flexibility in storing embeddings but lacks native support for efficient vector operations.
    Search engines:
    • Elasticsearch: can be extended with plugins (e.g., elasticsearch-vector-scoring) to handle vector data, and provides scalability and advanced search capabilities suitable for specific applications.

Each alternative has its own strengths and trade-offs in terms of performance, scalability, and ease of implementation, depending on the specific requirements of your use case.

3. What challenges did I encounter during implementation?
The setup process was straightforward, but I encountered accuracy issues when using the llama3 embedding model, despite its 4096 dimensions. I also tested the mxbai-embed-large and nomic-embed-text models, finding that the nomic-embed-text model, with its 768 dimensions, provided the best results. This shows that higher dimensions do not necessarily guarantee better performance. Additionally, I had to adjust parameters like chunk size, chunk overlap, and output chunk counts in the vector database to optimize performance.
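
To make the comparison in FAQ 1 concrete, here is a small sketch using the same nomic-embed-text embeddings object defined earlier; exact scores vary by model, but "automobile" should score much closer to "car" than an unrelated word does:

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

car = embeddings.embed_query("car")
automobile = embeddings.embed_query("automobile")
banana = embeddings.embed_query("banana")

print(cosine(car, automobile))  # relatively high: synonyms land close together
print(cosine(car, banana))      # lower: unrelated concepts sit further apart

# A plain keyword match, by contrast, sees no overlap between the strings:
print("car" in "automobile")    # False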

****************************** THANK YOU ********************************
