API Reference

This section provides detailed API documentation for the DataVerse ChatBot components.

chatbot Package

rag Module

BaseRAG

class BaseRAG

The base class for all RAG implementations.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the RAG system.

Parameters:
  • content_path – Path to content files

  • index_path – Path to store/load vector indexes

  • rerank – Whether to enable reranking

  • model_name – LLM model name

  • chunking_type – Method for chunking text

get_response(query, user_id)

Get a response for the user query.

Parameters:
  • query – User query text

  • user_id – Unique user identifier

Returns:

Generated response

_create_embeddings(texts, is_query=False)

Create embeddings for text chunks.

Parameters:
  • texts – List of text chunks

  • is_query – Whether these are query embeddings

Returns:

List of embeddings

_generate_system_prompt(query, user_id, context, include_query=True, include_context=True, include_prev_conv=True)

Generate a standardized system prompt for all LLMs.

Parameters:
  • query – User query

  • user_id – User identifier

  • context – Retrieved context

  • include_query – Whether to include the query

  • include_context – Whether to include context

  • include_prev_conv – Whether to include previous conversation

Returns:

Formatted system prompt

_get_index_path(content_path)

Generate a unique index path based on the content.

Parameters:

content_path – Path to content

Returns:

Path to index

_clean_html_content(content)

Clean HTML content and convert to markdown.

Parameters:

content – HTML content

Returns:

Cleaned markdown content
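The real method converts HTML to markdown; a minimal stdlib-only sketch of the cleaning half (the `_TextExtractor` helper is hypothetical, and this version emits plain text rather than markdown):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style blocks."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def clean_html_content(content):
    parser = _TextExtractor()
    parser.feed(content)
    parser.close()
    return "\n".join(parser.parts)
```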

_create_chunks(text)

Create chunks using the specified chunking method.

Parameters:

text – Text to chunk

Returns:

List of text chunks
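For `chunking_type='recursive'`, a common strategy is to split on progressively finer separators and then merge neighbouring pieces back up toward the chunk size. A sketch of that idea (separator order and size limit are assumptions, not the package's actual configuration):

```python
def create_chunks(text, chunk_size=500, separators=("\n\n", "\n", " ")):
    """Recursively split on coarser separators first, then merge
    adjacent pieces so each chunk approaches chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-split at the size limit.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(create_chunks(part, chunk_size, rest))
    # Merge small neighbouring pieces back together.
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```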

_load_or_create_vectorstore(content_path)

Load an existing index from disk or create a new one.

Parameters:

content_path – Path to content

Returns:

FAISS vectorstore

_create_vectorstore(content_path)

Create FAISS vectorstore from content with incremental embedding saving.

Parameters:

content_path – Path to content

Returns:

FAISS vectorstore

_update_vectorstore(new_content)

Update existing vectorstore with new content.

Parameters:

new_content – New content to add

Returns:

None

_save_vectorstore(vectorstore, path)

Save vectorstore to disk.

Parameters:
  • vectorstore – FAISS vectorstore

  • path – Path to save to

Returns:

None

_load_vectorstore(path)

Load vectorstore from disk.

Parameters:

path – Path to load from

Returns:

FAISS vectorstore

_rerank_docs(query, docs)

Rerank the top-k retrieved chunks by relevance to the query.

Parameters:
  • query – User query

  • docs – Retrieved documents

Returns:

Reranked documents

_find_relevant_context(query, top_k=5)

Find relevant context using similarity search.

Parameters:
  • query – User query

  • top_k – Number of top chunks to retrieve

Returns:

Relevant context as string

classmethod get_models()

Get available models for this RAG implementation.

Returns:

List of available models

classmethod get_config_class()

Get the configuration class for this RAG implementation.

Returns:

Configuration class
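The class methods above suggest a template-method design: BaseRAG owns retrieval and prompt assembly, while each provider subclass supplies its config class and LLM call. A minimal sketch of that contract (EchoRAG, EchoConfig, and `_call_llm` are hypothetical stand-ins, not part of the package):

```python
class BaseRAG:
    def __init__(self, content_path, model_name=None):
        self.content_path = content_path
        # Fall back to the provider's first listed model.
        self.model_name = model_name or self.get_models()[0]

    @classmethod
    def get_models(cls):
        return cls.get_config_class().AVAILABLE_MODELS

    @classmethod
    def get_config_class(cls):
        raise NotImplementedError

    def get_response(self, query, user_id):
        context = self._find_relevant_context(query)
        prompt = self._generate_system_prompt(query, user_id, context)
        return self._call_llm(prompt)


class EchoConfig:
    AVAILABLE_MODELS = ["echo-1"]


class EchoRAG(BaseRAG):
    @classmethod
    def get_config_class(cls):
        return EchoConfig

    def _find_relevant_context(self, query, top_k=5):
        return "stub context"  # the real method does similarity search

    def _generate_system_prompt(self, query, user_id, context, **kwargs):
        return f"Context: {context}\nQuestion: {query}"

    def _call_llm(self, prompt):
        return f"[{self.model_name}] {prompt}"
```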

ClaudeRAG

class ClaudeRAG

RAG implementation using Anthropic’s Claude models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Claude RAG system.

Parameters:
  • content_path – Path to content files

  • index_path – Path to store/load vector indexes

  • rerank – Whether to enable reranking

  • model_name – Claude model name

  • chunking_type – Method for chunking text

get_response(query, user_id)

Get a Claude-powered response.

Parameters:
  • query – User query

  • user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize Claude API client.

Returns:

None

classmethod get_config_class()

Get the Claude configuration class.

Returns:

ClaudeConfig class

OpenAIRAG

class OpenAIRAG

RAG implementation using OpenAI’s models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the OpenAI RAG system.

Parameters:
  • content_path – Path to content files

  • index_path – Path to store/load vector indexes

  • rerank – Whether to enable reranking

  • model_name – OpenAI model name

  • chunking_type – Method for chunking text

get_response(query, user_id)

Get an OpenAI-powered response.

Parameters:
  • query – User query

  • user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize OpenAI API client.

Returns:

None

classmethod get_config_class()

Get the OpenAI configuration class.

Returns:

OpenAIConfig class

CohereRAG

class CohereRAG

RAG implementation using Cohere’s models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Cohere RAG system.

Parameters:
  • content_path – Path to content files

  • index_path – Path to store/load vector indexes

  • rerank – Whether to enable reranking

  • model_name – Cohere model name

  • chunking_type – Method for chunking text

get_response(query, user_id)

Get a Cohere-powered response.

Parameters:
  • query – User query

  • user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize Cohere API client.

Returns:

None

classmethod get_config_class()

Get the Cohere configuration class.

Returns:

CohereConfig class

GeminiRAG

class GeminiRAG

RAG implementation using Google’s Gemini models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Gemini RAG system.

Parameters:
  • content_path – Path to content files

  • index_path – Path to store/load vector indexes

  • rerank – Whether to enable reranking

  • model_name – Gemini model name

  • chunking_type – Method for chunking text

get_response(query, user_id)

Get a Gemini-powered response.

Parameters:
  • query – User query

  • user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize Gemini API client.

Returns:

None

classmethod get_config_class()

Get the Gemini configuration class.

Returns:

GeminiConfig class

MistralRAG

class MistralRAG

RAG implementation using Mistral AI models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Mistral RAG system.

Parameters:
  • content_path – Path to content files

  • index_path – Path to store/load vector indexes

  • rerank – Whether to enable reranking

  • model_name – Mistral model name

  • chunking_type – Method for chunking text

get_response(query, user_id)

Get a Mistral-powered response.

Parameters:
  • query – User query

  • user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize Mistral API client.

Returns:

None

classmethod get_config_class()

Get the Mistral configuration class.

Returns:

MistralConfig class

DeepseekRAG

class DeepseekRAG

RAG implementation using Deepseek models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Deepseek RAG system.

Parameters:
  • content_path – Path to content files

  • index_path – Path to store/load vector indexes

  • rerank – Whether to enable reranking

  • model_name – Deepseek model name

  • chunking_type – Method for chunking text

get_response(query, user_id)

Get a Deepseek-powered response.

Parameters:
  • query – User query

  • user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize Deepseek API client.

Returns:

None

classmethod get_config_class()

Get the Deepseek configuration class.

Returns:

DeepSeekConfig class

GrokRAG

class GrokRAG

RAG implementation using Grok models.

__init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')

Initialize the Grok RAG system.

Parameters:
  • content_path – Path to content files

  • index_path – Path to store/load vector indexes

  • rerank – Whether to enable reranking

  • model_name – Grok model name

  • chunking_type – Method for chunking text

get_response(query, user_id)

Get a Grok-powered response.

Parameters:
  • query – User query

  • user_id – User identifier

Returns:

Generated response

_initialize_models()

Initialize Grok API client.

Returns:

None

classmethod get_config_class()

Get the Grok configuration class.

Returns:

GrokConfig class

embeddings Module

BaseEmbedding

class BaseEmbedding

Base class for text embedding providers.

__init__(api_key=None)

Initialize the embedding provider.

Parameters:

api_key – API key for the embedding service

embed(texts, is_query=False)

Create embeddings for a list of texts.

Parameters:
  • texts – List of text strings

  • is_query – Whether these are query embeddings

Returns:

List of embedding vectors

CohereEmbedding

class CohereEmbedding

Cohere embedding provider.

__init__(api_key=None)

Initialize the Cohere embedding provider.

Parameters:

api_key – Cohere API key

embed(texts, is_query=False)

Create embeddings using Cohere.

Parameters:
  • texts – List of text strings

  • is_query – Whether these are query embeddings

Returns:

List of embedding vectors

MistralEmbedding

class MistralEmbedding

Mistral embedding provider.

__init__(api_key=None)

Initialize the Mistral embedding provider.

Parameters:

api_key – Mistral API key

embed(texts, is_query=False)

Create embeddings using Mistral.

Parameters:
  • texts – List of text strings

  • is_query – Whether these are query embeddings

Returns:

List of embedding vectors

OpenAIEmbedding

class OpenAIEmbedding

OpenAI embedding provider.

__init__(api_key=None)

Initialize the OpenAI embedding provider.

Parameters:

api_key – OpenAI API key

embed(texts, is_query=False)

Create embeddings using OpenAI.

Parameters:
  • texts – List of text strings

  • is_query – Whether these are query embeddings

Returns:

List of embedding vectors

HuggingFaceEmbedding

class HuggingFaceEmbedding

Hugging Face embedding provider.

__init__(model_name='sentence-transformers/all-MiniLM-L6-v2', device='cpu')

Initialize the Hugging Face embedding provider.

Parameters:
  • model_name – Name of the Hugging Face model

  • device – Device to run the model on (cpu or cuda)

embed(texts, is_query=False)

Create embeddings using Hugging Face models.

Parameters:
  • texts – List of text strings

  • is_query – Whether these are query embeddings

Returns:

List of embedding vectors
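All providers share the `embed(texts, is_query=False)` contract, where `is_query` lets providers that distinguish document and query embeddings pass the appropriate input type. A toy, offline-safe sketch of the contract (HashEmbedding is a hypothetical stand-in, not a real provider):

```python
import hashlib


class BaseEmbedding:
    def __init__(self, api_key=None):
        self.api_key = api_key

    def embed(self, texts, is_query=False):
        raise NotImplementedError


class HashEmbedding(BaseEmbedding):
    """Deterministic 8-dimensional vectors derived from a hash.
    Real providers call their API; the prefix mimics how is_query
    can change the embedding for the same text."""

    DIM = 8

    def embed(self, texts, is_query=False):
        prefix = "query:" if is_query else "doc:"
        vectors = []
        for text in texts:
            digest = hashlib.sha256((prefix + text).encode()).digest()
            vectors.append([b / 255.0 for b in digest[: self.DIM]])
        return vectors
```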

crawler Module

Crawler

class Crawler

Web content crawler.

__init__(base_url, domain_name, max_depth=2, max_pages=50, wait_time=1.0, follow_links=True, ignore_query_params=True, client='crawl4ai')

Initialize the crawler.

Parameters:
  • base_url – Starting URL

  • domain_name – Target domain name

  • max_depth – Maximum crawl depth

  • max_pages – Maximum pages to crawl

  • wait_time – Time to wait between requests

  • follow_links – Whether to follow links

  • ignore_query_params – Whether to ignore URL query parameters

  • client – Client library to use for crawling

extract_content(link, webpage_only=True, max_depth=None)

Extract content from a webpage or multiple webpages.

Parameters:
  • link – URL to crawl

  • webpage_only – Whether to only extract content from a single page

  • max_depth – Maximum depth to crawl

Returns:

Path to the extracted content

_clean_html(html)

Clean HTML content.

Parameters:

html – HTML content

Returns:

Cleaned text content

_save_extracted_content(text, file_path)

Save extracted content to a file.

Parameters:
  • text – Content to save

  • file_path – Path to save the content to

Returns:

Path to the saved content
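The `ignore_query_params` and `domain_name` options imply URL canonicalisation and domain filtering before a page enters the visited set. A sketch of those two helpers (function names are assumptions):

```python
from urllib.parse import urlparse, urlunparse


def normalize_url(url, ignore_query_params=True):
    """Canonicalise a URL so ?utm_source=... variants and trailing
    slashes map to the same visited-set entry."""
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    query = "" if ignore_query_params else parts.query
    return urlunparse((parts.scheme, parts.netloc.lower(), path, "", query, ""))


def in_domain(url, domain_name):
    """Keep the crawl inside the target domain (subdomains included)."""
    return urlparse(url).netloc.lower().endswith(domain_name.lower())
```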

utils Module

General Utilities

create_folder(path)

Create a folder if it doesn’t exist.

Parameters:

path – Path to create

Returns:

Path object of the created folder

File Operations

class FileLoader

File loading and processing utility.

__init__(file_path, content_path=None, client='docling')

Initialize the file loader.

Parameters:
  • file_path – Path to the file to load

  • content_path – Path to save extracted content

  • client – Document processing client to use

extract_from_file()

Extract and process content from a file.

Returns:

List of document objects

_get_extension(file_path)

Get the extension of a file.

Parameters:

file_path – Path to the file

Returns:

File extension

supported_extensions()

Get list of supported file extensions.

Returns:

List of supported extensions

Database Operations

class DatabaseOps

Database operations utility.

__init__(db_path=None)

Initialize database operations.

Parameters:

db_path – Path to SQLite database

_init_db()

Initialize database tables if they don’t exist.

Returns:

None

get_chat_history(user_id=None, last_n=3, full_history=False, last_n_hours=24)

Retrieve chat history for a user.

Parameters:
  • user_id – User identifier

  • last_n – Maximum entries to retrieve when not using full_history

  • full_history – Whether to retrieve full history for all users

  • last_n_hours – Number of hours to look back when using full_history

Returns:

Chat history as formatted string or list of interactions

append_chat_history(user_id, question, answer, model_used, embedding_model_used)

Save chat interaction to database.

Parameters:
  • user_id – User identifier

  • question – User question

  • answer – System response

  • model_used – LLM model used

  • embedding_model_used – Embedding model used

Returns:

None

append_cost(user_id, model_used, embedding_model_used, input_tokens, output_tokens, cost_per_input_token, cost_per_output_token)

Track token usage and cost.

Parameters:
  • user_id – User identifier

  • model_used – LLM model used

  • embedding_model_used – Embedding model used

  • input_tokens – Number of input tokens

  • output_tokens – Number of output tokens

  • cost_per_input_token – Cost per million input tokens

  • cost_per_output_token – Cost per million output tokens

Returns:

None

get_monitored_resp()

Get monitored responses from the last 24 hours.

Returns:

List of question-answer tuples

append_bot_sub(user_id, first_name, platform)

Add a new bot subscriber.

Parameters:
  • user_id – User identifier

  • first_name – User’s first name

  • platform – Platform (Telegram, WhatsApp)

Returns:

None

get_bot_sub(user_id=None)

Get bot subscribers.

Parameters:

user_id – Optional user ID to filter by

Returns:

List of subscribers or single subscriber
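A minimal sketch of the chat-history path through DatabaseOps, using an in-memory SQLite database (column names and the oldest-first return order are assumptions):

```python
import sqlite3


class DatabaseOps:
    """Sketch of the chat-history table and its append/read pair."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS chat_history (
                   user_id TEXT, question TEXT, answer TEXT,
                   model_used TEXT, embedding_model_used TEXT)"""
        )
        self.conn.commit()

    def append_chat_history(self, user_id, question, answer,
                            model_used, embedding_model_used):
        self.conn.execute(
            "INSERT INTO chat_history VALUES (?, ?, ?, ?, ?)",
            (user_id, question, answer, model_used, embedding_model_used),
        )
        self.conn.commit()

    def get_chat_history(self, user_id, last_n=3):
        rows = self.conn.execute(
            "SELECT question, answer FROM chat_history "
            "WHERE user_id = ? ORDER BY rowid DESC LIMIT ?",
            (user_id, last_n),
        ).fetchall()
        return list(reversed(rows))  # oldest-first for prompt building
```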

Email Services

class EmailService

Email notification service.

__init__(smtp_server=None, smtp_port=None, sender_email=None, sender_password=None, receiver_email=None)

Initialize the email service.

Parameters:
  • smtp_server – SMTP server address

  • smtp_port – SMTP server port

  • sender_email – Sender email address

  • sender_password – Sender email password

  • receiver_email – Receiver email address

subscribe(callback)

Allow other classes to subscribe to email state changes.

Parameters:

callback – Callback function to notify

Returns:

None

unsubscibe(callback)

Remove a subscriber.

Parameters:

callback – Callback function to remove

Returns:

None

_notify_subscribers(old_email, new_email)

Notify subscribers of email changes.

Parameters:
  • old_email – Previous email

  • new_email – New email

Returns:

None

_format_email_content(unknowns)

Format the email content with a table of uncertain responses.

Parameters:

unknowns – List of (question, answer) tuples

Returns:

Formatted HTML content

_send_without_attachment(message, unknowns)

Prepare message without attachments for uncertain responses.

Parameters:
  • message – The email message object

  • unknowns – List of (question, answer) tuples

Returns:

HTML content for the message

_add_file_attachment(message, file_path, content_type=None)

Add a file attachment to the email message.

Parameters:
  • message – The email message object

  • file_path – Path to the file to attach

  • content_type – Content type of the file

Returns:

True if attachment was successful, False otherwise

send_email_with_attachments(subject, message_body, file_paths=None)

Send an email with multiple file attachments.

Parameters:
  • subject – The email subject

  • message_body – The email body text

  • file_paths – List of file paths to attach

Returns:

None

_send_with_attachment(message, json_data, filename)

Add JSON data as an attachment to the email.

Parameters:
  • message – The email message object

  • json_data – JSON data to attach

  • filename – Filename for the attachment

Returns:

The JSON attachment

send_email(subject, unknowns=None, json_data=None, filename='conversations.json')

Send an email with either uncertain responses or JSON data.

Parameters:
  • subject – The email subject line

  • unknowns – List of uncertain responses

  • json_data – JSON data to attach

  • filename – Filename for JSON attachment

Returns:

None

property receiver_email

Get the receiver email address.

Returns:

Email address

receiver_email (setter)

Set the receiver email address.

Parameters:

value – New email address

Returns:

None
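The subscribe/notify methods and the `receiver_email` property together form an observer pattern: setting a new receiver address presumably triggers `_notify_subscribers`. A sketch of that mechanism (SMTP details omitted; note the sketch uses the conventional spelling `unsubscribe`, while the class's method is listed as `unsubscibe`):

```python
class EmailService:
    """Sketch of the observer mechanism around receiver_email."""

    def __init__(self, receiver_email=None):
        self._receiver_email = receiver_email
        self._subscribers = []

    def subscribe(self, callback):
        if callback not in self._subscribers:
            self._subscribers.append(callback)

    def unsubscribe(self, callback):
        if callback in self._subscribers:
            self._subscribers.remove(callback)

    def _notify_subscribers(self, old_email, new_email):
        for callback in self._subscribers:
            callback(old_email, new_email)

    @property
    def receiver_email(self):
        return self._receiver_email

    @receiver_email.setter
    def receiver_email(self, value):
        old = self._receiver_email
        self._receiver_email = value
        if old != value:
            self._notify_subscribers(old, value)
```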

Data Processing

count_labels(df, column)

Count the occurrences of each label in a DataFrame column.

Parameters:
  • df – Pandas DataFrame

  • column – Column name to count

Returns:

Series with label counts

standardize_length(df, max_length=250)

Standardize the length of text in a DataFrame column.

Parameters:
  • df – Pandas DataFrame

  • max_length – Maximum length for text

Returns:

DataFrame with standardized text

truncate_to_n_tokens(text, tokenizer, max_tokens=50)

Truncate text to a maximum number of tokens.

Parameters:
  • text – Text to truncate

  • tokenizer – Tokenizer to use

  • max_tokens – Maximum number of tokens

Returns:

Truncated text
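`truncate_to_n_tokens` presumably encodes, cuts, and decodes with the supplied tokenizer. A sketch with a whitespace tokenizer standing in for a real (e.g. Hugging Face) one:

```python
def truncate_to_n_tokens(text, tokenizer, max_tokens=50):
    """Encode, cut at max_tokens, and decode back to text."""
    tokens = tokenizer.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return tokenizer.decode(tokens[:max_tokens])


class WhitespaceTokenizer:
    """Hypothetical stand-in exposing the encode/decode pair
    the function relies on."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)
```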

Monitoring Services

class UncertainResponseMonitor

Monitor and detect uncertain responses.

__init__(email_service, every_hours=24, start_service=True)

Initialize the monitor.

Parameters:
  • email_service – Email service for notifications

  • every_hours – Check frequency in hours

  • start_service – Whether to start monitoring immediately

check_for_uncertain_responses()

Check database for potentially uncertain responses.

Returns:

List of uncertain responses

_start_monitoring()

Start the monitoring service.

Returns:

None

_stop_monitoring()

Stop the monitoring service.

Returns:

None

_schedule_monitoring()

Schedule periodic monitoring.

Returns:

None

_on_exception(e)

Handle exceptions during monitoring.

Parameters:

e – Exception object

Returns:

None

class ChatHistoryMonitor

Monitor chat history and generate reports.

__init__(email_service, every_hours=24, start_service=True)

Initialize the monitor.

Parameters:
  • email_service – Email service for notifications

  • every_hours – Check frequency in hours

  • start_service – Whether to start monitoring immediately

generate_report()

Generate usage report from chat history.

Returns:

Report data

_start_monitoring()

Start the monitoring service.

Returns:

None

_stop_monitoring()

Stop the monitoring service.

Returns:

None

_schedule_monitoring()

Schedule periodic monitoring.

Returns:

None

Path Management

BASE_DIR

Base directory of the project.

DATA_DIR

Directory for all data.

WEB_CONTENT_DIR

Directory for web content.

DATASETS_DIR

Directory for datasets.

DATABASE_DIR

Directory for database files.

INDEXES_DIR

Directory for vector indexes.

VOICES_DIR

Directory for voice recordings.

MODELS_DIR

Directory for ML models.

LOGS_DIR

Directory for log files.

TRAIN_FILES_DIR

Directory for training files.

CHAT_HIST_DIR

Directory for chat history.

FONTS_DIR

Directory for fonts.

CLF_PATH

Path to the classifier model.

config Module

get_api_key(provider)

Get API key for a specified provider.

Parameters:

provider – Provider name (e.g., “OPENAI”, “COHERE”)

Returns:

API key string

Raises:

MissingAPIKeyError – If the API key is not found

MissingAPIKeyError

exception MissingAPIKeyError

Exception raised when an API key is missing.
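A plausible sketch of the lookup, assuming keys live in environment variables named `<PROVIDER>_API_KEY` (the exact variable naming is an assumption):

```python
import os


class MissingAPIKeyError(Exception):
    """Raised when a provider's API key cannot be found."""


def get_api_key(provider):
    """Look up e.g. OPENAI_API_KEY for provider='OPENAI'."""
    key = os.environ.get(f"{provider.upper()}_API_KEY")
    if not key:
        raise MissingAPIKeyError(
            f"No API key found for provider '{provider}'. "
            f"Set the {provider.upper()}_API_KEY environment variable."
        )
    return key
```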

Config

class Config

Global configuration container.

TEMPERATURE: float

Temperature setting for language models (0.0-1.0)

MAX_TOKENS: int

Maximum tokens for LLM responses

CHUNKING_CONFIGS: dict

Configuration for different text chunking methods

AVAILABLE_MODELS: list

List of available language models

LLM Provider Configs

class OpenAIConfig

OpenAI-specific configuration.

AVAILABLE_MODELS: list

List of available OpenAI models

class ClaudeConfig

Claude-specific configuration.

AVAILABLE_MODELS: list

List of available Claude models

class CohereConfig

Cohere-specific configuration.

AVAILABLE_MODELS: list

List of available Cohere models

class GeminiConfig

Gemini-specific configuration.

AVAILABLE_MODELS: list

List of available Gemini models

class MistralConfig

Mistral-specific configuration.

AVAILABLE_MODELS: list

List of available Mistral models

class DeepSeekConfig

DeepSeek-specific configuration.

AVAILABLE_MODELS: list

List of available DeepSeek models

class GrokConfig

Grok-specific configuration.

AVAILABLE_MODELS: list

List of available Grok models

web Package

Chat Web App

home()

Serve the iframe HTML interface.

Returns:

HTML response with the chat interface

chat(request)

Handle chat requests from the web interface.

Parameters:

request – ChatRequest object containing the query

Returns:

JSON response with the chatbot’s answer

transcribe_audio(file)

Handle audio transcription requests.

Parameters:

file – Uploaded audio file

Returns:

JSON response with the transcription

class ChatRequest

Pydantic model for chat requests.

query: str

The user’s query text

Admin Dashboard

serve_layout()

Create the dashboard layout.

Returns:

Dash HTML layout components

register_callbacks(app)

Register all dashboard callbacks.

Parameters:

app – Dash application instance

authenticate_user(username, password)

Authenticate a user against stored credentials.

Parameters:
  • username – Username to authenticate

  • password – Password to verify

Returns:

True if authentication succeeds, False otherwise
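A sketch of checking a password against stored credentials with salted hashing and a constant-time comparison (the `stored` mapping and `hash_password` helper are hypothetical; the real function reads its credential store internally):

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None):
    """PBKDF2-SHA256 hash for storage; returns (salt, digest)."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest


def authenticate_user(username, password, stored):
    """stored maps username -> (salt, digest).
    hmac.compare_digest avoids timing side channels."""
    record = stored.get(username)
    if record is None:
        return False
    salt, digest = record
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, digest)
```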

generate_metrics()

Generate system metrics for the dashboard.

Returns:

Dictionary of metrics (users, queries, token usage, costs)

Bot Implementations

Telegram Bot

class TelegramBot

Telegram bot implementation.

EMAIL_REGEX: str

Regular expression for validating email addresses

ADMINS: list

List of admin user IDs

__init__(link)

Initialize the Telegram bot.

Parameters:

link – Website link to initialize the RAG system with

extract_domain_name(link)

Extract domain name from a URL.

Parameters:

link – URL to extract domain from

Returns:

Domain name
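A sketch of the extraction with `urllib.parse` (a real implementation may handle multi-part TLDs differently; this version only strips the port and a leading `www.`):

```python
from urllib.parse import urlparse


def extract_domain_name(link):
    """Return the host portion of a URL, lowercased,
    without port or a leading 'www.' prefix."""
    netloc = urlparse(link).netloc.lower()
    netloc = netloc.split(":", 1)[0]  # drop any port
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]
    return netloc
```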

fetch_content(link, domain_name, max_depth=None, file_path=None, webpage_only=True)

Fetch content from a URL or file.

Parameters:
  • link – URL to fetch content from

  • domain_name – Domain name

  • max_depth – Maximum crawl depth

  • file_path – Path to file

  • webpage_only – Whether to only fetch a single page

Returns:

Path to fetched content

_init_rag_system()

Initialize the RAG system with content from the website.

Returns:

None

_setup_handlers()

Set up Telegram command and message handlers.

Returns:

None

start(update, context)

Handle the /start command.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

Next conversation state

transcribe(audio_buffer)

Transcribe voice messages to text.

Parameters:

audio_buffer – Audio buffer containing voice message

Returns:

Transcribed text

add_content(update, context)

Add new content to the RAG system.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

None

add_admin(update, context)

Add a new admin user.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

None

remove_admin(update, context)

Remove an admin user.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

None

get_admins(update, context)

List current admin users.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

None

_is_admin(user_id)

Check if a user is an admin.

Parameters:

user_id – User ID to check

Returns:

True if user is admin, False otherwise

_user_exists(id)

Check if a user exists in the database.

Parameters:

id – User ID to check

Returns:

True if user exists, False otherwise

_run_rag_query(question, user_id)

Run a RAG query in a separate thread.

Parameters:
  • question – User question

  • user_id – User ID

Returns:

RAG response

_extract_question(msg, context)

Extract question from text or voice message.

Parameters:
  • msg – Message from Telegram

  • context – CallbackContext for the bot

Returns:

Extracted question text

handle_question(update, context)

Process questions from users.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

Next conversation state

cancel_conversation(update, context)

Cancel the current conversation.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

End of conversation

set_email(update, context)

Set email for receiving notifications.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

None

_is_valid_email(new_email)

Check if an email address is valid.

Parameters:

new_email – Email address to check

Returns:

True if valid, False otherwise
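The class's `EMAIL_REGEX` attribute suggests a regex-based check. A sketch with a pragmatic pattern (similar in spirit to, but not necessarily identical with, the bot's actual `EMAIL_REGEX`; deliberately stricter than full RFC 5322):

```python
import re

# Requires local@domain.tld; rejects bare hostnames like user@localhost.
EMAIL_REGEX = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"


def is_valid_email(new_email):
    return re.match(EMAIL_REGEX, new_email) is not None
```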

broadcast(update, context)

Broadcast a message to all bot subscribers.

Parameters:
  • update – Update from Telegram

  • context – CallbackContext for the bot

Returns:

None

run_async()

Run the bot asynchronously.

Returns:

None

_handle_exit(signum, frame)

Handle exit signals gracefully.

Parameters:
  • signum – Signal number

  • frame – Current stack frame

Returns:

None

run()

Run the bot using asyncio.run.

Returns:

None

WhatsApp Bot

class TwilioClient

WhatsApp bot implementation using Twilio.

TWILIO_SID: str

Twilio account SID

TWILIO_PHONE_NUMBER: str

Twilio phone number

TWILIO_AUTH_TOKEN: str

Twilio authentication token

__init__()

Initialize the Twilio client for WhatsApp messaging.

send_whatsapp_message(to_number, message)

Send a WhatsApp message.

Parameters:
  • to_number – Recipient’s phone number

  • message – Message to send

Returns:

Success status

webhook(request)

Handle incoming webhook requests from Twilio.

Parameters:

request – FastAPI request object

Returns:

TwiML response for Twilio

app.post("/sms")(twilio_bot.webhook)

Route for Twilio SMS webhook.

Returns:

Response from webhook handler