API Reference

This section provides detailed API documentation for the DataVerse ChatBot components.

chatbot Package

rag Module

BaseRAG

class BaseRAG

The base class for all RAG implementations.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the RAG system.

Parameters:

content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – LLM model name
chunking_type – Method for chunking text

get_response(query, user_id)

Get a response for the user query.

Parameters:

query – User query text
user_id – Unique user identifier

Returns:

Generated response

_create_embeddings(texts, is_query=False)

Create embeddings for text chunks.

Parameters:

texts – List of text chunks
is_query – Whether these are query embeddings

Returns:

List of embeddings

_generate_system_prompt(query, user_id, context, include_query=True, include_context=True, include_prev_conv=True)

Generate a standardized system prompt for all LLMs.

Parameters:

query – User query
user_id – User identifier
context – Retrieved context
include_query – Whether to include the query
include_context – Whether to include context
include_prev_conv – Whether to include previous conversation

Returns:

Formatted system prompt
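
A minimal sketch of how the three include_* flags might gate sections of the prompt. The function name build_system_prompt, the section wording, and the history argument (standing in for whatever previous conversation is looked up via user_id) are assumptions for illustration, not the project's actual template.

```python
def build_system_prompt(query, context, history="",
                        include_query=True, include_context=True,
                        include_prev_conv=True):
    """Assemble a prompt from optional sections (hypothetical layout)."""
    sections = ["You are a helpful assistant. Answer using the provided context."]
    if include_prev_conv and history:
        sections.append("Previous conversation:\n" + history)
    if include_context and context:
        sections.append("Context:\n" + context)
    if include_query:
        sections.append("Question:\n" + query)
    return "\n\n".join(sections)


prompt = build_system_prompt(
    "What are the opening hours?",
    context="The office is open 9:00-17:00.",
    include_prev_conv=False,
)
```

Keeping each section behind its own flag lets a provider that passes the query as a separate user message turn off include_query without touching the rest of the prompt.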

_get_index_path(content_path)

Generate a unique index path based on the content.

Parameters: content_path – Path to content
Returns: Path to index
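
One plausible way to derive a stable, unique index location is to hash the resolved content path. This is only a sketch: the helper name index_path_for, the "indexes" directory, and the choice to hash the path rather than the file contents are all assumptions, since the reference does not say which scheme is used.

```python
import hashlib
from pathlib import Path


def index_path_for(content_path, indexes_dir="indexes"):
    """Map a content path to a stable, unique index directory (assumed scheme)."""
    resolved = str(Path(content_path).resolve())
    digest = hashlib.sha256(resolved.encode("utf-8")).hexdigest()[:16]
    return Path(indexes_dir) / f"index_{digest}"


path_a = index_path_for("data/web_content/example.txt")
path_b = index_path_for("data/web_content/other.txt")
```

The same input always maps to the same directory, so a later run can find and reuse an index that was built earlier.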

_clean_html_content(content)

Clean HTML content and convert to markdown.

Parameters: content – HTML content
Returns: Cleaned markdown content

_create_chunks(text)

Create chunks using the specified chunking method.

Parameters: text – Text to chunk
Returns: List of text chunks
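
To illustrate what a chunking step produces, here is a simplified fixed-size chunker with overlap. It is a stand-in, not the 'recursive' method named by chunking_type; the function name chunk_text and the default sizes are assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij" * 5, chunk_size=20, overlap=5)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval.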

_load_or_create_vectorstore(content_path)

Load an existing index or create a new one.

Parameters: content_path – Path to content
Returns: FAISS vectorstore

_create_vectorstore(content_path)

Create a FAISS vectorstore from content, saving embeddings incrementally.

Parameters: content_path – Path to content
Returns: FAISS vectorstore

_update_vectorstore(new_content)

Update the existing vectorstore with new content.

Parameters: new_content – New content to add
Returns: None

_save_vectorstore(vectorstore, path)

Save the vectorstore to disk.

Parameters:

vectorstore – FAISS vectorstore
path – Path to save to

Returns:

None

_load_vectorstore(path)

Load the vectorstore from disk.

Parameters: path – Path to load from
Returns: FAISS vectorstore

_rerank_docs(query, docs)

Rerank the top-k retrieved chunks by relevance to the query.

Parameters:

query – User query
docs – Retrieved documents

Returns:

Reranked documents

_find_relevant_context(query, top_k=5)

Find relevant context using similarity search.

Parameters:

query – User query
top_k – Number of top chunks to retrieve

Returns:

Relevant context as string
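
The similarity search can be pictured as ranking chunk vectors by cosine similarity to the query vector and joining the best top_k chunks. The sketch below uses plain Python lists instead of a FAISS index; cosine, top_k_context, and the sample vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_context(query_vec, chunk_vecs, chunks, top_k=5):
    """Return the top_k most similar chunks joined into one string."""
    ranked = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return "\n\n".join(chunk for chunk, _ in ranked[:top_k])


chunks = ["pricing details", "opening hours", "refund policy"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
context = top_k_context([0.1, 0.9], vectors, chunks, top_k=2)
```

A vector index does the same ranking far more efficiently on large collections; the brute-force version here is only meant to show the shape of the result.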

classmethod get_models()

Get available models for this RAG implementation.

Returns: List of available models

classmethod get_config_class()

Get the configuration class for this RAG implementation.

Returns: Configuration class

ClaudeRAG

class ClaudeRAG

RAG implementation using Anthropic’s Claude models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Claude RAG system.

Parameters:

content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Claude model name
chunking_type – Method for chunking text

get_response(query, user_id)

Get a Claude-powered response.

Parameters:

query – User query
user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize the Claude API client.

Returns: None

classmethod get_config_class()

Get the Claude configuration class.

Returns: ClaudeConfig class

OpenAIRAG

class OpenAIRAG

RAG implementation using OpenAI’s models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the OpenAI RAG system.

Parameters:

content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – OpenAI model name
chunking_type – Method for chunking text

get_response(query, user_id)

Get an OpenAI-powered response.

Parameters:

query – User query
user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize the OpenAI API client.

Returns: None

classmethod get_config_class()

Get the OpenAI configuration class.

Returns: OpenAIConfig class

CohereRAG

class CohereRAG

RAG implementation using Cohere’s models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Cohere RAG system.

Parameters:

content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Cohere model name
chunking_type – Method for chunking text

get_response(query, user_id)

Get a Cohere-powered response.

Parameters:

query – User query
user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize the Cohere API client.

Returns: None

classmethod get_config_class()

Get the Cohere configuration class.

Returns: CohereConfig class

GeminiRAG

class GeminiRAG

RAG implementation using Google’s Gemini models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Gemini RAG system.

Parameters:

content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Gemini model name
chunking_type – Method for chunking text

get_response(query, user_id)

Get a Gemini-powered response.

Parameters:

query – User query
user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize the Gemini API client.

Returns: None

classmethod get_config_class()

Get the Gemini configuration class.

Returns: GeminiConfig class

MistralRAG

class MistralRAG

RAG implementation using Mistral AI models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Mistral RAG system.

Parameters:

content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Mistral model name
chunking_type – Method for chunking text

get_response(query, user_id)

Get a Mistral-powered response.

Parameters:

query – User query
user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize the Mistral API client.

Returns: None

classmethod get_config_class()

Get the Mistral configuration class.

Returns: MistralConfig class

DeepseekRAG

class DeepseekRAG

RAG implementation using Deepseek models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Deepseek RAG system.

Parameters:

content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Deepseek model name
chunking_type – Method for chunking text

get_response(query, user_id)

Get a Deepseek-powered response.

Parameters:

query – User query
user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize the Deepseek API client.

Returns: None

classmethod get_config_class()

Get the Deepseek configuration class.

Returns: DeepSeekConfig class

GrokRAG

class GrokRAG

RAG implementation using Grok models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Grok RAG system.

Parameters:

content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Grok model name
chunking_type – Method for chunking text

get_response(query, user_id)

Get a Grok-powered response.

Parameters:

query – User query
user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize the Grok API client.

Returns: None

classmethod get_config_class()

Get the Grok configuration class.

Returns: GrokConfig class

embeddings Module

BaseEmbedding

class BaseEmbedding

Base class for text embedding providers.

__init__(api_key=None)

Initialize the embedding provider.

Parameters: api_key – API key for the embedding service

embed(texts, is_query=False)

Create embeddings for a list of texts.

Parameters:

texts – List of text strings
is_query – Whether these are query embeddings

Returns:

List of embedding vectors
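
The embed(texts, is_query=False) contract is one vector per input text, with is_query available for providers that distinguish query and document embeddings. The sketch below shows that shape with a deterministic toy provider; EmbeddingProvider and ToyHashEmbedding are hypothetical stand-ins, and the hash-based vectors carry no semantic meaning.

```python
import hashlib
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    """Stand-in for the documented base class (hypothetical name)."""

    def __init__(self, api_key=None):
        self.api_key = api_key

    @abstractmethod
    def embed(self, texts, is_query=False):
        """Return one vector per input text."""


class ToyHashEmbedding(EmbeddingProvider):
    """Deterministic toy embedding; not semantically meaningful."""

    def embed(self, texts, is_query=False):
        # A real provider might use is_query to choose a query/document mode.
        prefix = "query: " if is_query else "passage: "
        vectors = []
        for text in texts:
            digest = hashlib.sha256((prefix + text).encode("utf-8")).digest()
            vectors.append([byte / 255.0 for byte in digest[:8]])
        return vectors


vectors = ToyHashEmbedding().embed(["hello", "world"])
```

Because every provider returns the same list-of-vectors shape, the RAG classes can swap one embedding backend for another without changing the retrieval code.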

CohereEmbedding

class CohereEmbedding

Cohere embedding provider.

__init__(api_key=None)

Initialize the Cohere embedding provider.

Parameters: api_key – Cohere API key

embed(texts, is_query=False)

Create embeddings using Cohere.

Parameters:

texts – List of text strings
is_query – Whether these are query embeddings

Returns:

List of embedding vectors

MistralEmbedding

class MistralEmbedding

Mistral embedding provider.

__init__(api_key=None)

Initialize the Mistral embedding provider.

Parameters: api_key – Mistral API key

embed(texts, is_query=False)

Create embeddings using Mistral.

Parameters:

texts – List of text strings
is_query – Whether these are query embeddings

Returns:

List of embedding vectors

OpenAIEmbedding

class OpenAIEmbedding

OpenAI embedding provider.

__init__(api_key=None)

Initialize the OpenAI embedding provider.

Parameters: api_key – OpenAI API key

embed(texts, is_query=False)

Create embeddings using OpenAI.

Parameters:

texts – List of text strings
is_query – Whether these are query embeddings

Returns:

List of embedding vectors

HuggingFaceEmbedding

class HuggingFaceEmbedding

Hugging Face embedding provider.

__init__(model_name='sentence-transformers/all-MiniLM-L6-v2', device='cpu')

Initialize the Hugging Face embedding provider.

Parameters:

model_name – Name of the Hugging Face model
device – Device to run the model on (cpu or cuda)

embed(texts, is_query=False)

Create embeddings using Hugging Face models.

Parameters:

texts – List of text strings
is_query – Whether these are query embeddings

Returns:

List of embedding vectors

crawler Module

Crawler

class Crawler

Web content crawler.

__init__(base_url, domain_name, max_depth=2, max_pages=50, wait_time=1.0, follow_links=True, ignore_query_params=True, client='crawl4ai')

Initialize the crawler.

Parameters:

base_url – Starting URL
domain_name – Target domain name
max_depth – Maximum crawl depth
max_pages – Maximum pages to crawl
wait_time – Time to wait between requests
follow_links – Whether to follow links
ignore_query_params – Whether to ignore URL query parameters
client – Client library to use for crawling
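
As an illustration of what ignore_query_params could mean in practice, the sketch below normalizes URLs so that pages differing only in query string or fragment are treated as one page. The helper normalize_url is hypothetical; the crawler's real deduplication logic is not shown in this reference.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, ignore_query_params=True):
    """Drop the fragment, and optionally the query string, from a URL."""
    parts = urlsplit(url)
    query = "" if ignore_query_params else parts.query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


# Three links that all point at the same underlying page.
seen = {normalize_url(u) for u in [
    "https://example.com/docs?page=1",
    "https://example.com/docs?page=2#top",
    "https://example.com/docs",
]}
```

Normalizing before checking the visited set keeps a crawl from spending its max_pages budget on near-duplicate URLs.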

extract_content(link, webpage_only=True, max_depth=None)

Extract content from a webpage or multiple webpages.

Parameters:

link – URL to crawl
webpage_only – Whether to only extract content from a single page
max_depth – Maximum depth to crawl

Returns:

Path to the extracted content

_clean_html(html)

Clean HTML content.

Parameters: html – HTML content
Returns: Cleaned text content

_save_extracted_content(text, file_path)

Save extracted content to a file.

Parameters:

text – Content to save
file_path – Path to save the content to

Returns:

Path to the saved content

utils Module

General Utilities

create_folder(path)

Create a folder if it doesn’t exist.

Parameters: path – Path to create
Returns: Path object of the created folder
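
A helper with this behaviour can be written in a few lines with pathlib. The sketch below is an equivalent under the assumption that parent folders should also be created; ensure_folder is an illustrative name, not the project's function.

```python
import tempfile
from pathlib import Path


def ensure_folder(path):
    """Create the folder (and parents) if missing; return it as a Path."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    return folder


with tempfile.TemporaryDirectory() as tmp:
    created = ensure_folder(Path(tmp) / "data" / "indexes")
    created_again = ensure_folder(created)  # second call is a no-op
    existed = created.is_dir()
```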

File Operations

class FileLoader

File loading and processing utility.

__init__(file_path, content_path=None, client='docling')

Initialize the file loader.

Parameters:

file_path – Path to the file to load
content_path – Path to save extracted content
client – Document processing client to use

extract_from_file()

Extract and process content from a file.

Returns: List of document objects

_get_extension(file_path)

Get the extension of a file.

Parameters: file_path – Path to the file
Returns: File extension

supported_extensions()

Get the list of supported file extensions.

Returns: List of supported extensions

Database Operations

class DatabaseOps

Database operations utility.

__init__(db_path=None)

Initialize database operations.

Parameters: db_path – Path to SQLite database

_init_db()

Initialize database tables if they don’t exist.

Returns: None

get_chat_history(user_id=None, last_n=3, full_history=False, last_n_hours=24)

Retrieve chat history for a user.

Parameters:

user_id – User identifier
last_n – Maximum entries to retrieve when not using full_history
full_history – Whether to retrieve full history for all users
last_n_hours – Number of hours to look back when using full_history

Returns:

Chat history as formatted string or list of interactions
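
The last_n behaviour can be illustrated with an in-memory SQLite database. The table name, columns, and the helper last_interactions below are assumptions made for the example; the project's actual schema is not documented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chat_history ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "user_id TEXT, question TEXT, answer TEXT)"
)
rows = [("u1", f"question {i}", f"answer {i}") for i in range(1, 6)]
conn.executemany(
    "INSERT INTO chat_history (user_id, question, answer) VALUES (?, ?, ?)", rows
)


def last_interactions(conn, user_id, last_n=3):
    """Return the user's last_n (question, answer) pairs, oldest first."""
    cursor = conn.execute(
        "SELECT question, answer FROM chat_history "
        "WHERE user_id = ? ORDER BY id DESC LIMIT ?",
        (user_id, last_n),
    )
    return list(reversed(cursor.fetchall()))


history = last_interactions(conn, "u1")
```

Selecting in descending order with LIMIT and then reversing returns the most recent turns while preserving chronological order for the prompt.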

append_chat_history(user_id, question, answer, model_used, embedding_model_used)

Save chat interaction to database.

Parameters:

user_id – User identifier
question – User question
answer – System response
model_used – LLM model used
embedding_model_used – Embedding model used

Returns:

None

append_cost(user_id, model_used, embedding_model_used, input_tokens, output_tokens, cost_per_input_token, cost_per_output_token)

Track token usage and cost.

Parameters:

user_id – User identifier
model_used – LLM model used
embedding_model_used – Embedding model used
input_tokens – Number of input tokens
output_tokens – Number of output tokens
cost_per_input_token – Cost per million input tokens
cost_per_output_token – Cost per million output tokens

Returns:

None
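
Since the two price parameters are described as costs per million tokens, the cost of a single request works out as below. The function name request_cost and the prices used are hypothetical.

```python
def request_cost(input_tokens, output_tokens,
                 cost_per_input_token, cost_per_output_token):
    """Cost of one request when prices are quoted per million tokens."""
    input_cost = input_tokens * cost_per_input_token / 1_000_000
    output_cost = output_tokens * cost_per_output_token / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output.
cost = request_cost(2_000, 500, 3.0, 15.0)
```

With these numbers the input side contributes 0.006 and the output side 0.0075, so the request costs about 0.0135 in whatever currency the prices are quoted in.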

get_monitored_resp()

Get monitored responses from the last 24 hours.

Returns: List of question-answer tuples

append_bot_sub(user_id, first_name, platform)

Add a new bot subscriber.

Parameters:

user_id – User identifier
first_name – User’s first name
platform – Platform (Telegram, WhatsApp)

Returns:

None

get_bot_sub(user_id=None)

Get bot subscribers.

Parameters: user_id – Optional user ID to filter by
Returns: List of subscribers or single subscriber

Email Services

class EmailService

Email notification service.

__init__(smtp_server=None, smtp_port=None, sender_email=None, sender_password=None, receiver_email=None)

Initialize the email service.

Parameters:

smtp_server – SMTP server address
smtp_port – SMTP server port
sender_email – Sender email address
sender_password – Sender email password
receiver_email – Receiver email address

subscribe(callback)

Allow other classes to subscribe to email state changes.

Parameters: callback – Callback function to notify
Returns: None

unsubscibe(callback)

Remove a subscriber.

Parameters: callback – Callback function to remove
Returns: None

_notify_subscribers(old_email, new_email)

Notify subscribers of email changes.

Parameters:

old_email – Previous email
new_email – New email

Returns:

None
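
The subscribe and notify methods above describe an observer pattern around the receiver address. Here is a minimal sketch of that pattern; the class ReceiverEmail and its method names are this example's own and are not copied from the project.

```python
class ReceiverEmail:
    """Holds an email address and notifies subscribers when it changes."""

    def __init__(self, receiver_email=None):
        self._receiver_email = receiver_email
        self._subscribers = []

    def subscribe(self, callback):
        if callback not in self._subscribers:
            self._subscribers.append(callback)

    def unsubscribe(self, callback):
        if callback in self._subscribers:
            self._subscribers.remove(callback)

    @property
    def receiver_email(self):
        return self._receiver_email

    @receiver_email.setter
    def receiver_email(self, value):
        old, self._receiver_email = self._receiver_email, value
        if old != value:
            # Iterate over a copy so callbacks may unsubscribe themselves.
            for callback in list(self._subscribers):
                callback(old, value)


changes = []
service = ReceiverEmail("old@example.com")
service.subscribe(lambda old, new: changes.append((old, new)))
service.receiver_email = "new@example.com"
```

Routing the change through a property setter means callers only assign to receiver_email; they never have to remember to trigger the notification themselves.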

_format_email_content(unknowns)

Format the email content with a table of uncertain responses.

Parameters: unknowns – List of (question, answer) tuples
Returns: Formatted HTML content

_send_without_attachment(message, unknowns)

Prepare message without attachments for uncertain responses.

Parameters:

message – The email message object
unknowns – List of (question, answer) tuples

Returns:

HTML content for the message

_add_file_attachment(message, file_path, content_type=None)

Add a file attachment to the email message.

Parameters:

message – The email message object
file_path – Path to the file to attach
content_type – Content type of the file

Returns:

True if attachment was successful, False otherwise

send_email_with_attachments(subject, message_body, file_paths=None)

Send an email with multiple file attachments.

Parameters:

subject – The email subject
message_body – The email body text
file_paths – List of file paths to attach

Returns:

None

_send_with_attachment(message, json_data, filename)

Add JSON data as an attachment to the email.

Parameters:

message – The email message object
json_data – JSON data to attach
filename – Filename for the attachment

Returns:

The JSON attachment

send_email(subject, unknowns=None, json_data=None, filename='conversations.json')

Send an email with either uncertain responses or JSON data.

Parameters:

subject – The email subject line
unknowns – List of uncertain responses
json_data – JSON data to attach
filename – Filename for JSON attachment

Returns:

None

property receiver_email

Get the receiver email address.

Returns: Email address

receiver_email.setter()

Set the receiver email address.

Parameters: value – New email address
Returns: None

Data Processing

count_labels(df, column)

Count the occurrences of each label in a DataFrame column.

Parameters:

df – Pandas DataFrame
column – Column name to count

Returns:

Series with label counts

standardize_length(df, max_length=250)

Standardize the length of text in a DataFrame column.

Parameters:

df – Pandas DataFrame
max_length – Maximum length for text

Returns:

DataFrame with standardized text

truncate_to_n_tokens(text, tokenizer, max_tokens=50)

Truncate text to a maximum number of tokens.

Parameters:

text – Text to truncate
tokenizer – Tokenizer to use
max_tokens – Maximum number of tokens

Returns:

Truncated text
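
Token-level truncation usually means encoding, slicing, and decoding. The sketch below assumes the tokenizer exposes encode and decode methods, which this reference does not confirm; WhitespaceTokenizer and truncate_tokens are toy stand-ins.

```python
class WhitespaceTokenizer:
    """Toy tokenizer exposing encode/decode (an assumed interface)."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


def truncate_tokens(text, tokenizer, max_tokens=50):
    """Keep at most max_tokens tokens of text."""
    tokens = tokenizer.encode(text)
    if len(tokens) <= max_tokens:
        return text  # already short enough; return unchanged
    return tokenizer.decode(tokens[:max_tokens])


short = truncate_tokens("one two three four five", WhitespaceTokenizer(), max_tokens=3)
```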

Monitoring Services

class UncertainResponseMonitor

Monitor and detect uncertain responses.

__init__(email_service, every_hours=24, start_service=True)

Initialize the monitor.

Parameters:

email_service – Email service for notifications
every_hours – Check frequency in hours
start_service – Whether to start monitoring immediately

check_for_uncertain_responses()

Check database for potentially uncertain responses.

Returns: List of uncertain responses

_start_monitoring()

Start the monitoring service.

Returns: None

_stop_monitoring()

Stop the monitoring service.

Returns: None

_schedule_monitoring()

Schedule periodic monitoring.

Returns: None

_on_exception(e)

Handle exceptions during monitoring.

Parameters: e – Exception object
Returns: None

class ChatHistoryMonitor

Monitor chat history and generate reports.

__init__(email_service, every_hours=24, start_service=True)

Initialize the monitor.

Parameters:

email_service – Email service for notifications
every_hours – Check frequency in hours
start_service – Whether to start monitoring immediately

generate_report()

Generate usage report from chat history.

Returns: Report data

_start_monitoring()

Start the monitoring service.

Returns: None

_stop_monitoring()

Stop the monitoring service.

Returns: None

_schedule_monitoring()

Schedule periodic monitoring.

Returns: None

Path Management

BASE_DIR: Base directory of the project.

DATA_DIR: Directory for all data.

WEB_CONTENT_DIR: Directory for web content.

DATASETS_DIR: Directory for datasets.

DATABASE_DIR: Directory for database files.

INDEXES_DIR: Directory for vector indexes.

VOICES_DIR: Directory for voice recordings.

MODELS_DIR: Directory for ML models.

LOGS_DIR: Directory for log files.

TRAIN_FILES_DIR: Directory for training files.

CHAT_HIST_DIR: Directory for chat history.

FONTS_DIR: Directory for fonts.

CLF_PATH: Path to the classifier model.

config Module

get_api_key(provider)

Get the API key for the specified provider.

Parameters: provider – Provider name (e.g., "OPENAI", "COHERE")
Returns: API key string
Raises: MissingAPIKeyError – If the API key is not found
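
A lookup of this kind is commonly backed by environment variables. The sketch below assumes a <PROVIDER>_API_KEY naming scheme, which is a guess rather than something this reference states; lookup_api_key and MissingKeyError are illustrative names.

```python
import os


class MissingKeyError(Exception):
    """Raised when no API key is configured for a provider."""


def lookup_api_key(provider, env=None):
    """Read <PROVIDER>_API_KEY from the environment (assumed naming scheme)."""
    env = os.environ if env is None else env
    name = f"{provider.upper()}_API_KEY"
    key = env.get(name)
    if not key:
        raise MissingKeyError(f"{name} is not set")
    return key


fake_env = {"OPENAI_API_KEY": "sk-test"}
key = lookup_api_key("OPENAI", env=fake_env)
try:
    lookup_api_key("COHERE", env=fake_env)
    missing_raised = False
except MissingKeyError:
    missing_raised = True
```

Raising a dedicated exception instead of returning None makes a missing key fail at startup, not at the first API call.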

MissingAPIKeyError

exception MissingAPIKeyError: Exception raised when an API key is missing.

Config

class Config

Global configuration container.

TEMPERATURE: float: Temperature setting for language models (0.0-1.0)

MAX_TOKENS: int: Maximum tokens for LLM responses

CHUNKING_CONFIGS: dict: Configuration for different text chunking methods

AVAILABLE_MODELS: list: List of available language models

LLM Provider Configs

class OpenAIConfig

OpenAI-specific configuration.

AVAILABLE_MODELS: list: List of available OpenAI models

class ClaudeConfig

Claude-specific configuration.

AVAILABLE_MODELS: list: List of available Claude models

class CohereConfig

Cohere-specific configuration.

AVAILABLE_MODELS: list: List of available Cohere models

class GeminiConfig

Gemini-specific configuration.

AVAILABLE_MODELS: list: List of available Gemini models

class MistralConfig

Mistral-specific configuration.

AVAILABLE_MODELS: list: List of available Mistral models

class DeepSeekConfig

DeepSeek-specific configuration.

AVAILABLE_MODELS: list: List of available DeepSeek models

class GrokConfig

Grok-specific configuration.

AVAILABLE_MODELS: list: List of available Grok models

web Package

Chat Web App

home()

Serve the iframe HTML interface.

Returns: HTML response with the chat interface

chat(request)

Handle chat requests from the web interface.

Parameters: request – ChatRequest object containing the query
Returns: JSON response with the chatbot’s answer

transcribe_audio(file)

Handle audio transcription requests.

Parameters: file – Uploaded audio file
Returns: JSON response with the transcription

class ChatRequest

Pydantic model for chat requests.

query: str: The user’s query text

Admin Dashboard

serve_layout()

Create the dashboard layout.

Returns: Dash HTML layout components

register_callbacks(app)

Register all dashboard callbacks.

Parameters: app – Dash application instance

authenticate_user(username, password)

Authenticate a user against stored credentials.

Parameters:

username – Username to authenticate
password – Password to verify

Returns:

True if authentication succeeds, False otherwise
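
How credentials are stored is not described here. As one hedged illustration, the sketch below checks a password against a salted PBKDF2 hash using a constant-time comparison; the in-memory store and the names hash_password and check_credentials are assumptions.

```python
import hashlib
import hmac
import os


def hash_password(password, salt):
    """Derive a key from the password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def check_credentials(username, password, store):
    """Return True only if username exists and the password hash matches."""
    record = store.get(username)
    if record is None:
        return False
    salt, expected = record
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(hash_password(password, salt), expected)


salt = os.urandom(16)
store = {"admin": (salt, hash_password("s3cret", salt))}
ok = check_credentials("admin", "s3cret", store)
bad = check_credentials("admin", "wrong", store)
```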

generate_metrics()

Generate system metrics for the dashboard.

Returns:: Dictionary of metrics (users, queries, token usage, costs)

Bot Implementations

Telegram Bot

class TelegramBot

Telegram bot implementation.

EMAIL_REGEX: str: Regular expression for validating email addresses

ADMINS: list: List of admin user IDs

__init__(link)

Initialize the Telegram bot.

Parameters: link – Website link to initialize the RAG system with

extract_domain_name(link)

Extract domain name from a URL.

Parameters: link – URL to extract domain from
Returns: Domain name
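
Domain extraction can be done with the standard library's URL parser. In the sketch below, dropping the port and a leading "www." are assumptions about the desired output, and domain_of is a hypothetical helper.

```python
from urllib.parse import urlparse


def domain_of(link):
    """Return the host part of a URL, without port or a leading 'www.'."""
    host = urlparse(link).hostname or ""
    return host[4:] if host.startswith("www.") else host


domain = domain_of("https://www.example.com:8080/docs/index.html?x=1")
```

Input without a scheme, such as "example.com/docs", has no parsed hostname and yields an empty string here, so callers may want to validate links first.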

fetch_content(link, domain_name, max_depth=None, file_path=None, webpage_only=True)

Fetch content from a URL or file.

Parameters:

link – URL to fetch content from
domain_name – Domain name
max_depth – Maximum crawl depth
file_path – Path to file
webpage_only – Whether to only fetch a single page

Returns:

Path to fetched content

_init_rag_system()

Initialize the RAG system with content from the website.

Returns: None

_setup_handlers()

Set up Telegram command and message handlers.

Returns: None

start(update, context)

Handle the /start command.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

Next conversation state

transcribe(audio_buffer)

Transcribe voice messages to text.

Parameters: audio_buffer – Audio buffer containing voice message
Returns: Transcribed text

add_content(update, context)

Add new content to the RAG system.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

None

add_admin(update, context)

Add a new admin user.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

None

remove_admin(update, context)

Remove an admin user.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

None

get_admins(update, context)

List current admin users.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

None

_is_admin(user_id)

Check if a user is an admin.

Parameters: user_id – User ID to check
Returns: True if user is admin, False otherwise

_user_exists(id)

Check if a user exists in the database.

Parameters: id – User ID to check
Returns: True if user exists, False otherwise

_run_rag_query(question, user_id)

Run a RAG query in a separate thread.

Parameters:

question – User question
user_id – User ID

Returns:

RAG response
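
Running a blocking RAG call in a separate thread keeps an asyncio-based bot responsive while the model works. A minimal sketch with asyncio.to_thread (Python 3.9+) follows; blocking_rag_call is a stand-in for a synchronous get_response, and the bot's real threading mechanism may differ.

```python
import asyncio
import time


def blocking_rag_call(question, user_id):
    """Stand-in for a synchronous get_response call."""
    time.sleep(0.05)  # simulate network and model latency
    return f"answer to {question!r} for {user_id}"


async def run_rag_query(question, user_id):
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(blocking_rag_call, question, user_id)


async def main():
    # Two queries overlap instead of running back to back.
    return await asyncio.gather(
        run_rag_query("What is RAG?", "u1"),
        run_rag_query("Who are you?", "u2"),
    )


answers = asyncio.run(main())
```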

_extract_question(msg, context)

Extract question from text or voice message.

Parameters:

msg – Message from Telegram
context – CallbackContext for the bot

Returns:

Extracted question text

handle_question(update, context)

Process questions from users.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

Next conversation state

cancel_conversation(update, context)

Cancel the current conversation.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

End of conversation

set_email(update, context)

Set email for receiving notifications.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

None

_is_valid_email(new_email)

Check if an email address is valid.

Parameters: new_email – Email address to check
Returns: True if valid, False otherwise
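
Validation against a regular expression typically uses a full-string match. The pattern below is a deliberately simple assumption and is not the value of EMAIL_REGEX, which this reference does not show.

```python
import re

# Deliberately simple pattern (an assumption); the class's EMAIL_REGEX may differ.
SIMPLE_EMAIL_REGEX = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def is_valid_email(address):
    """Return True if the whole string looks like an email address."""
    return re.fullmatch(SIMPLE_EMAIL_REGEX, address) is not None


checks = {addr: is_valid_email(addr) for addr in
          ["user@example.com", "no-at-sign.example.com", "a@b"]}
```

Using fullmatch instead of search rejects strings that merely contain an address somewhere inside other text.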

broadcast(update, context)

Broadcast a message to all bot subscribers.

Parameters:

update – Update from Telegram
context – CallbackContext for the bot

Returns:

None

run_async()

Run the bot asynchronously.

Returns: None

_handle_exit(signum, frame)

Handle exit signals gracefully.

Parameters:

signum – Signal number
frame – Current stack frame

Returns:

None

run()

Run the bot using asyncio.run.

Returns: None

WhatsApp Bot

class TwilioClient

WhatsApp bot implementation using Twilio.

TWILIO_SID: str: Twilio account SID

TWILIO_PHONE_NUMBER: str: Twilio phone number

TWILIO_AUTH_TOKEN: str: Twilio authentication token

__init__(): Initialize the Twilio client for WhatsApp messaging.

send_whatsapp_message(to_number, message)

Send a WhatsApp message.

Parameters:

to_number – Recipient’s phone number
message – Message to send

Returns:

Success status

webhook(request)

Handle incoming webhook requests from Twilio.

Parameters: request – FastAPI request object
Returns: TwiML response for Twilio

app.post("/sms")(twilio_bot.webhook)

Route for Twilio SMS webhook.

Returns: Response from webhook handler