API Reference
This section provides detailed API documentation for the DataVerse ChatBot components.
chatbot Package
rag Module
BaseRAG
- class BaseRAG
The base class for all RAG implementations.
- __init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')
Initialize the RAG system.
- Parameters:
content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – LLM model name
chunking_type – Method for chunking text
- get_response(query, user_id)
Get a response for the user query.
- Parameters:
query – User query text
user_id – Unique user identifier
- Returns:
Generated response
- _create_embeddings(texts, is_query=False)
Create embeddings for text chunks.
- Parameters:
texts – List of text chunks
is_query – Whether these are query embeddings
- Returns:
List of embeddings
- _generate_system_prompt(query, user_id, context, include_query=True, include_context=True, include_prev_conv=True)
Generate a standardized system prompt for all LLMs.
- Parameters:
query – User query
user_id – User identifier
context – Retrieved context
include_query – Whether to include the query
include_context – Whether to include context
include_prev_conv – Whether to include previous conversation
- Returns:
Formatted system prompt
- _get_index_path(content_path)
Generate a unique index path based on the content.
- Parameters:
content_path – Path to content
- Returns:
Path to index
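One plausible way to obtain a unique, repeatable index path is to hash the resolved content path and use part of the digest as a directory name. The sketch below is an assumption for illustration, not the actual `_get_index_path` implementation: `index_path_for` and the `indexes` root directory are hypothetical names, and hashing the file contents instead of the path would be an equally reasonable choice.

```python
import hashlib
from pathlib import Path


def index_path_for(content_path: str, indexes_root: str = "indexes") -> Path:
    """Map a content path to a stable index directory (hypothetical helper)."""
    # Hash the absolute path so the same content location always maps
    # to the same index directory, and different locations do not collide.
    resolved = str(Path(content_path).resolve())
    digest = hashlib.sha256(resolved.encode("utf-8")).hexdigest()[:16]
    return Path(indexes_root) / f"index_{digest}"
```

Because the digest depends only on the input path, calling the helper twice with the same path yields the same directory, which is what lets a later run find and reuse an index built earlier.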
- _clean_html_content(content)
Clean HTML content and convert to markdown.
- Parameters:
content – HTML content
- Returns:
Cleaned markdown content
- _create_chunks(text)
Create chunks using the specified chunking method.
- Parameters:
text – Text to chunk
- Returns:
List of text chunks
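To illustrate what a `'recursive'` chunking type usually means, here is a minimal sketch that splits on the coarsest separator first and only falls back to finer separators for pieces that are still too long. It is an assumption about the general technique, not the code behind `_create_chunks`; the function name, the `max_len` parameter, and the separator list are all hypothetical.

```python
from typing import List, Sequence


def recursive_chunks(text: str, max_len: int = 200,
                     separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split text into chunks of at most max_len characters (illustrative)."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to fixed-width slices.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(part) > max_len:
            # This piece alone is too long: split it with finer separators.
            chunks.extend(recursive_chunks(part, max_len, finer))
            current = ""
        else:
            current = part
    if current.strip():
        chunks.append(current)
    return chunks
```

The design choice here is to prefer paragraph boundaries, then line boundaries, then word boundaries, so chunks stay as semantically coherent as the length limit allows.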
- _load_or_create_vectorstore(content_path)
Load an existing index or create a new one.
- Parameters:
content_path – Path to content
- Returns:
FAISS vectorstore
- _create_vectorstore(content_path)
Create a FAISS vectorstore from content, saving embeddings incrementally.
- Parameters:
content_path – Path to content
- Returns:
FAISS vectorstore
- _update_vectorstore(new_content)
Update the existing vectorstore with new content.
- Parameters:
new_content – New content to add
- Returns:
None
- _save_vectorstore(vectorstore, path)
Save vectorstore to disk.
- Parameters:
vectorstore – FAISS vectorstore
path – Path to save to
- Returns:
None
- _load_vectorstore(path)
Load vectorstore from disk.
- Parameters:
path – Path to load from
- Returns:
FAISS vectorstore
- _rerank_docs(query, docs)
Rerank the top-k retrieved chunks by relevance to the query.
- Parameters:
query – User query
docs – Retrieved documents
- Returns:
Reranked documents
- _find_relevant_context(query, top_k=5)
Find relevant context using similarity search.
- Parameters:
query – User query
top_k – Number of top chunks to retrieve
- Returns:
Relevant context as string
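The idea behind a top-k similarity search can be sketched in plain Python with cosine similarity. The reference above states that the real index is a FAISS vectorstore, so the code below is only a simplified stand-in: `cosine`, `top_k_context`, and the `(text, vector)` pair layout are hypothetical and chosen for readability.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_context(query_vec: Sequence[float],
                  chunks: List[Tuple[str, Sequence[float]]],
                  top_k: int = 5) -> str:
    """Return the top_k most similar chunk texts joined into one string."""
    # Rank every chunk by similarity to the query, best first.
    ranked = sorted(chunks, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return "\n\n".join(text for text, _ in ranked[:top_k])
```

A linear scan like this is fine for a handful of chunks; a vector index exists precisely to avoid comparing the query against every stored embedding.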
- classmethod get_models()
Get available models for this RAG implementation.
- Returns:
List of available models
- classmethod get_config_class()
Get the configuration class for this RAG implementation.
- Returns:
Configuration class
ClaudeRAG
- class ClaudeRAG
RAG implementation using Anthropic’s Claude models.
- __init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')
Initialize the Claude RAG system.
- Parameters:
content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Claude model name
chunking_type – Method for chunking text
- get_response(query, user_id)
Get a Claude-powered response.
- Parameters:
query – User query
user_id – User identifier
- Returns:
Generated response
- _initialize_models()
Initialize Claude API client.
- Returns:
None
- classmethod get_config_class()
Get the Claude configuration class.
- Returns:
ClaudeConfig class
OpenAIRAG
- class OpenAIRAG
RAG implementation using OpenAI’s models.
- __init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')
Initialize the OpenAI RAG system.
- Parameters:
content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – OpenAI model name
chunking_type – Method for chunking text
- get_response(query, user_id)
Get an OpenAI-powered response.
- Parameters:
query – User query
user_id – User identifier
- Returns:
Generated response
- _initialize_models()
Initialize OpenAI API client.
- Returns:
None
- classmethod get_config_class()
Get the OpenAI configuration class.
- Returns:
OpenAIConfig class
CohereRAG
- class CohereRAG
RAG implementation using Cohere’s models.
- __init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')
Initialize the Cohere RAG system.
- Parameters:
content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Cohere model name
chunking_type – Method for chunking text
- get_response(query, user_id)
Get a Cohere-powered response.
- Parameters:
query – User query
user_id – User identifier
- Returns:
Generated response
- _initialize_models()
Initialize Cohere API client.
- Returns:
None
- classmethod get_config_class()
Get the Cohere configuration class.
- Returns:
CohereConfig class
GeminiRAG
- class GeminiRAG
RAG implementation using Google’s Gemini models.
- __init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')
Initialize the Gemini RAG system.
- Parameters:
content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Gemini model name
chunking_type – Method for chunking text
- get_response(query, user_id)
Get a Gemini-powered response.
- Parameters:
query – User query
user_id – User identifier
- Returns:
Generated response
- _initialize_models()
Initialize Gemini API client.
- Returns:
None
- classmethod get_config_class()
Get the Gemini configuration class.
- Returns:
GeminiConfig class
MistralRAG
- class MistralRAG
RAG implementation using Mistral AI models.
- __init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')
Initialize the Mistral RAG system.
- Parameters:
content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Mistral model name
chunking_type – Method for chunking text
- get_response(query, user_id)
Get a Mistral-powered response.
- Parameters:
query – User query
user_id – User identifier
- Returns:
Generated response
- _initialize_models()
Initialize Mistral API client.
- Returns:
None
- classmethod get_config_class()
Get the Mistral configuration class.
- Returns:
MistralConfig class
DeepseekRAG
- class DeepseekRAG
RAG implementation using DeepSeek models.
- __init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')
Initialize the DeepSeek RAG system.
- Parameters:
content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – DeepSeek model name
chunking_type – Method for chunking text
- get_response(query, user_id)
Get a DeepSeek-powered response.
- Parameters:
query – User query
user_id – User identifier
- Returns:
Generated response
- _initialize_models()
Initialize DeepSeek API client.
- Returns:
None
- classmethod get_config_class()
Get the DeepSeek configuration class.
- Returns:
DeepSeekConfig class
GrokRAG
- class GrokRAG
RAG implementation using xAI’s Grok models.
- __init__(content_path, index_path=None, rerank=True, model_name=None, chunking_type='recursive')
Initialize the Grok RAG system.
- Parameters:
content_path – Path to content files
index_path – Path to store/load vector indexes
rerank – Whether to enable reranking
model_name – Grok model name
chunking_type – Method for chunking text
- get_response(query, user_id)
Get a Grok-powered response.
- Parameters:
query – User query
user_id – User identifier
- Returns:
Generated response
- _initialize_models()
Initialize Grok API client.
- Returns:
None
- classmethod get_config_class()
Get the Grok configuration class.
- Returns:
GrokConfig class
embeddings Module
BaseEmbedding
- class BaseEmbedding
Base class for text embedding providers.
- __init__(api_key=None)
Initialize the embedding provider.
- Parameters:
api_key – API key for the embedding service
- embed(texts, is_query=False)
Create embeddings for a list of texts.
- Parameters:
texts – List of text strings
is_query – Whether these are query embeddings
- Returns:
List of embedding vectors
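The provider classes in this module share one `embed(texts, is_query=False)` interface. A minimal sketch of how such a common interface can be expressed with an abstract base class follows. `EmbeddingProvider` and `LengthEmbedding` are hypothetical names, and the toy subclass produces meaningless two-number vectors purely so the example runs without any external service.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class EmbeddingProvider(ABC):
    """Illustrative base class mirroring the interface described above."""

    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key

    @abstractmethod
    def embed(self, texts: List[str], is_query: bool = False) -> List[List[float]]:
        """Return one vector per input text."""


class LengthEmbedding(EmbeddingProvider):
    """Toy provider for illustration only; not a real embedding model."""

    def embed(self, texts: List[str], is_query: bool = False) -> List[List[float]]:
        # Each "vector" is [character count, word count by spaces].
        return [[float(len(t)), float(t.count(" ") + 1)] for t in texts]
```

Callers can then depend on the base type alone, which is what makes providers interchangeable.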
CohereEmbedding
- class CohereEmbedding
Cohere embedding provider.
- __init__(api_key=None)
Initialize the Cohere embedding provider.
- Parameters:
api_key – Cohere API key
- embed(texts, is_query=False)
Create embeddings using Cohere.
- Parameters:
texts – List of text strings
is_query – Whether these are query embeddings
- Returns:
List of embedding vectors
MistralEmbedding
- class MistralEmbedding
Mistral embedding provider.
- __init__(api_key=None)
Initialize the Mistral embedding provider.
- Parameters:
api_key – Mistral API key
- embed(texts, is_query=False)
Create embeddings using Mistral.
- Parameters:
texts – List of text strings
is_query – Whether these are query embeddings
- Returns:
List of embedding vectors
OpenAIEmbedding
- class OpenAIEmbedding
OpenAI embedding provider.
- __init__(api_key=None)
Initialize the OpenAI embedding provider.
- Parameters:
api_key – OpenAI API key
- embed(texts, is_query=False)
Create embeddings using OpenAI.
- Parameters:
texts – List of text strings
is_query – Whether these are query embeddings
- Returns:
List of embedding vectors
HuggingFaceEmbedding
- class HuggingFaceEmbedding
Hugging Face embedding provider.
- __init__(model_name='sentence-transformers/all-MiniLM-L6-v2', device='cpu')
Initialize the Hugging Face embedding provider.
- Parameters:
model_name – Name of the Hugging Face model
device – Device to run the model on (cpu or cuda)
- embed(texts, is_query=False)
Create embeddings using Hugging Face models.
- Parameters:
texts – List of text strings
is_query – Whether these are query embeddings
- Returns:
List of embedding vectors
crawler Module
Crawler
- class Crawler
Web content crawler.
- __init__(base_url, domain_name, max_depth=2, max_pages=50, wait_time=1.0, follow_links=True, ignore_query_params=True, client='crawl4ai')
Initialize the crawler.
- Parameters:
base_url – Starting URL
domain_name – Target domain name
max_depth – Maximum crawl depth
max_pages – Maximum pages to crawl
wait_time – Time to wait between requests
follow_links – Whether to follow links
ignore_query_params – Whether to ignore URL query parameters
client – Client library to use for crawling
- extract_content(link, webpage_only=True, max_depth=None)
Extract content from a webpage or multiple webpages.
- Parameters:
link – URL to crawl
webpage_only – Whether to only extract content from a single page
max_depth – Maximum depth to crawl
- Returns:
Path to the extracted content
- _clean_html(html)
Clean HTML content.
- Parameters:
html – HTML content
- Returns:
Cleaned text content
- _save_extracted_content(text, file_path)
Save extracted content to a file.
- Parameters:
text – Content to save
file_path – Path to save the content to
- Returns:
Path to the saved content
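The `ignore_query_params` option suggests that URLs differing only in their query string are treated as the same page. One way to get that effect is to normalize each URL before checking whether it was already visited. The helper below is a sketch under that assumption; `normalize_url` is a hypothetical name and the trailing-slash handling is an extra choice not stated in the reference.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, ignore_query_params: bool = True) -> str:
    """Return a canonical form of url for de-duplication (illustrative)."""
    parts = urlsplit(url)
    # Optionally drop the query string; always drop the fragment.
    query = "" if ignore_query_params else parts.query
    # Treat "/docs/" and "/docs" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))
```

A crawler could keep a set of normalized URLs and skip any link whose normalized form is already in the set.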
utils Module
General Utilities
- create_folder(path)
Create a folder if it doesn’t exist.
- Parameters:
path – Path to create
- Returns:
Path object of the created folder
File Operations
- class FileLoader
File loading and processing utility.
- __init__(file_path, content_path=None, client='docling')
Initialize the file loader.
- Parameters:
file_path – Path to the file to load
content_path – Path to save extracted content
client – Document processing client to use
- extract_from_file()
Extract and process content from a file.
- Returns:
List of document objects
- _get_extension(file_path)
Get the extension of a file.
- Parameters:
file_path – Path to the file
- Returns:
File extension
- supported_extensions()
Get list of supported file extensions.
- Returns:
List of supported extensions
Database Operations
- class DatabaseOps
Database operations utility.
- __init__(db_path=None)
Initialize database operations.
- Parameters:
db_path – Path to SQLite database
- _init_db()
Initialize database tables if they don’t exist.
- Returns:
None
- get_chat_history(user_id=None, last_n=3, full_history=False, last_n_hours=24)
Retrieve chat history for a user.
- Parameters:
user_id – User identifier
last_n – Maximum entries to retrieve when not using full_history
full_history – Whether to retrieve full history for all users
last_n_hours – Number of hours to look back when using full_history
- Returns:
Chat history as formatted string or list of interactions
- append_chat_history(user_id, question, answer, model_used, embedding_model_used)
Save chat interaction to database.
- Parameters:
user_id – User identifier
question – User question
answer – System response
model_used – LLM model used
embedding_model_used – Embedding model used
- Returns:
None
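The store-then-fetch pattern behind `append_chat_history` and `get_chat_history` can be sketched with the standard `sqlite3` module. The table name, column layout, and the two function names below are assumptions made for this example; the actual schema created by `_init_db` is not documented here.

```python
import sqlite3
from typing import List, Tuple

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS chat_history (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           user_id TEXT, question TEXT, answer TEXT,
           model_used TEXT, embedding_model_used TEXT,
           timestamp DATETIME DEFAULT CURRENT_TIMESTAMP)"""
)


def append_chat(user_id: str, question: str, answer: str,
                model_used: str, embedding_model_used: str) -> None:
    """Insert one interaction using a parameterized query."""
    conn.execute(
        "INSERT INTO chat_history "
        "(user_id, question, answer, model_used, embedding_model_used) "
        "VALUES (?, ?, ?, ?, ?)",
        (user_id, question, answer, model_used, embedding_model_used),
    )
    conn.commit()


def last_n_chats(user_id: str, last_n: int = 3) -> List[Tuple[str, str]]:
    """Return the user's last_n (question, answer) pairs, oldest first."""
    rows = conn.execute(
        "SELECT question, answer FROM chat_history "
        "WHERE user_id = ? ORDER BY id DESC LIMIT ?",
        (user_id, last_n),
    ).fetchall()
    return rows[::-1]
```

Selecting in descending order with a limit and then reversing keeps the query cheap while still returning the conversation in chronological order, which is the order a prompt builder would normally want.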
- append_cost(user_id, model_used, embedding_model_used, input_tokens, output_tokens, cost_per_input_token, cost_per_output_token)
Track token usage and cost.
- Parameters:
user_id – User identifier
model_used – LLM model used
embedding_model_used – Embedding model used
input_tokens – Number of input tokens
output_tokens – Number of output tokens
cost_per_input_token – Cost per million input tokens
cost_per_output_token – Cost per million output tokens
- Returns:
None
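The parameter descriptions above give both rates as a cost per million tokens. Under that reading, the total cost of one request would be (input_tokens × input rate + output_tokens × output rate) / 1,000,000. The helper below is a sketch of that arithmetic only; `usage_cost` is a hypothetical name and the rates in the usage note are made-up numbers.

```python
def usage_cost(input_tokens: int, output_tokens: int,
               cost_per_input_token: float,
               cost_per_output_token: float) -> float:
    """Total cost, treating both rates as prices per one million tokens."""
    return (input_tokens * cost_per_input_token
            + output_tokens * cost_per_output_token) / 1_000_000
```

For example, with an assumed rate of 3.0 per million input tokens and 15.0 per million output tokens, `usage_cost(2_000, 500, 3.0, 15.0)` adds 0.006 for the input and 0.0075 for the output.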
- get_monitored_resp()
Get monitored responses from the last 24 hours.
- Returns:
List of question-answer tuples
- append_bot_sub(user_id, first_name, platform)
Add a new bot subscriber.
- Parameters:
user_id – User identifier
first_name – User’s first name
platform – Platform (Telegram, WhatsApp)
- Returns:
None
- get_bot_sub(user_id=None)
Get bot subscribers.
- Parameters:
user_id – Optional user ID to filter by
- Returns:
List of subscribers or single subscriber
Email Services
- class EmailService
Email notification service.
- __init__(smtp_server=None, smtp_port=None, sender_email=None, sender_password=None, receiver_email=None)
Initialize the email service.
- Parameters:
smtp_server – SMTP server address
smtp_port – SMTP server port
sender_email – Sender email address
sender_password – Sender email password
receiver_email – Receiver email address
- subscribe(callback)
Allow other classes to subscribe to email state changes.
- Parameters:
callback – Callback function to notify
- Returns:
None
- unsubscibe(callback)
Remove a subscriber.
- Parameters:
callback – Callback function to remove
- Returns:
None
- _notify_subscribers(old_email, new_email)
Notify subscribers of email changes.
- Parameters:
old_email – Previous email
new_email – New email
- Returns:
None
- _format_email_content(unknowns)
Format the email content with a table of uncertain responses.
- Parameters:
unknowns – List of (question, answer) tuples
- Returns:
Formatted HTML content
- _send_without_attachment(message, unknowns)
Prepare message without attachments for uncertain responses.
- Parameters:
message – The email message object
unknowns – List of (question, answer) tuples
- Returns:
HTML content for the message
- _add_file_attachment(message, file_path, content_type=None)
Add a file attachment to the email message.
- Parameters:
message – The email message object
file_path – Path to the file to attach
content_type – Content type of the file
- Returns:
True if attachment was successful, False otherwise
- send_email_with_attachments(subject, message_body, file_paths=None)
Send an email with multiple file attachments.
- Parameters:
subject – The email subject
message_body – The email body text
file_paths – List of file paths to attach
- Returns:
None
- _send_with_attachment(message, json_data, filename)
Add JSON data as an attachment to the email.
- Parameters:
message – The email message object
json_data – JSON data to attach
filename – Filename for the attachment
- Returns:
The JSON attachment
- send_email(subject, unknowns=None, json_data=None, filename='conversations.json')
Send an email with either uncertain responses or JSON data.
- Parameters:
subject – The email subject line
unknowns – List of uncertain responses
json_data – JSON data to attach
filename – Filename for JSON attachment
- Returns:
None
- property receiver_email
Get the receiver email address.
- Returns:
Email address
- receiver_email.setter()
Set the receiver email address.
- Parameters:
value – New email address
- Returns:
None
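The subscribe, notify, and `receiver_email` setter entries above describe an observer pattern: callbacks register once and are called with the old and new address whenever the receiver changes. The class below is a minimal sketch of that pattern and not the `EmailService` implementation; `ReceiverEmailState` and its method names are hypothetical.

```python
from typing import Callable, List

EmailCallback = Callable[[str, str], None]


class ReceiverEmailState:
    """Hypothetical stand-in showing subscribe/notify on a property change."""

    def __init__(self, receiver_email: str):
        self._receiver_email = receiver_email
        self._subscribers: List[EmailCallback] = []

    def subscribe(self, callback: EmailCallback) -> None:
        if callback not in self._subscribers:
            self._subscribers.append(callback)

    def unsubscribe(self, callback: EmailCallback) -> None:
        if callback in self._subscribers:
            self._subscribers.remove(callback)

    @property
    def receiver_email(self) -> str:
        return self._receiver_email

    @receiver_email.setter
    def receiver_email(self, value: str) -> None:
        old, self._receiver_email = self._receiver_email, value
        if old != value:
            # Iterate over a copy so callbacks may unsubscribe themselves.
            for callback in list(self._subscribers):
                callback(old, value)
```

Routing the notification through the property setter means any code that assigns a new address triggers the callbacks, without having to remember to call a separate notify method.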
Data Processing
- count_labels(df, column)
Count the occurrences of each label in a DataFrame column.
- Parameters:
df – Pandas DataFrame
column – Column name to count
- Returns:
Series with label counts
- standardize_length(df, max_length=250)
Standardize the length of text in a DataFrame column.
- Parameters:
df – Pandas DataFrame
max_length – Maximum length for text
- Returns:
DataFrame with standardized text
- truncate_to_n_tokens(text, tokenizer, max_tokens=50)
Truncate text to a maximum number of tokens.
- Parameters:
text – Text to truncate
tokenizer – Tokenizer to use
max_tokens – Maximum number of tokens
- Returns:
Truncated text
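Token-based truncation generally means encoding the text, keeping the first `max_tokens` tokens, and decoding them back. The sketch below assumes a tokenizer object exposing `encode` and `decode`; that interface, the `WhitespaceTokenizer` class, and the `truncate_tokens` name are assumptions for illustration, since the tokenizer type expected by `truncate_to_n_tokens` is not specified above.

```python
from typing import List


class WhitespaceTokenizer:
    """Toy tokenizer that treats whitespace-separated words as tokens."""

    def encode(self, text: str) -> List[str]:
        return text.split()

    def decode(self, tokens: List[str]) -> str:
        return " ".join(tokens)


def truncate_tokens(text: str, tokenizer, max_tokens: int = 50) -> str:
    """Keep at most max_tokens tokens of text (illustrative)."""
    tokens = tokenizer.encode(text)
    if len(tokens) <= max_tokens:
        return text  # already short enough, return unchanged
    return tokenizer.decode(tokens[:max_tokens])
```

Truncating by tokens instead of characters keeps the limit aligned with what a model actually counts.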
Monitoring Services
- class UncertainResponseMonitor
Monitor and detect uncertain responses.
- __init__(email_service, every_hours=24, start_service=True)
Initialize the monitor.
- Parameters:
email_service – Email service for notifications
every_hours – Check frequency in hours
start_service – Whether to start monitoring immediately
- check_for_uncertain_responses()
Check database for potentially uncertain responses.
- Returns:
List of uncertain responses
- _start_monitoring()
Start the monitoring service.
- Returns:
None
- _stop_monitoring()
Stop the monitoring service.
- Returns:
None
- _schedule_monitoring()
Schedule periodic monitoring.
- Returns:
None
- _on_exception(e)
Handle exceptions during monitoring.
- Parameters:
e – Exception object
- Returns:
None
- class ChatHistoryMonitor
Monitor chat history and generate reports.
- __init__(email_service, every_hours=24, start_service=True)
Initialize the monitor.
- Parameters:
email_service – Email service for notifications
every_hours – Check frequency in hours
start_service – Whether to start monitoring immediately
- generate_report()
Generate usage report from chat history.
- Returns:
Report data
- _start_monitoring()
Start the monitoring service.
- Returns:
None
- _stop_monitoring()
Stop the monitoring service.
- Returns:
None
- _schedule_monitoring()
Schedule periodic monitoring.
- Returns:
None
Path Management
- BASE_DIR
Base directory of the project.
- DATA_DIR
Directory for all data.
- WEB_CONTENT_DIR
Directory for web content.
- DATASETS_DIR
Directory for datasets.
- DATABASE_DIR
Directory for database files.
- INDEXES_DIR
Directory for vector indexes.
- VOICES_DIR
Directory for voice recordings.
- MODELS_DIR
Directory for ML models.
- LOGS_DIR
Directory for log files.
- TRAIN_FILES_DIR
Directory for training files.
- CHAT_HIST_DIR
Directory for chat history.
- FONTS_DIR
Directory for fonts.
- CLF_PATH
Path to the classifier model.
config Module
- get_api_key(provider)
Get API key for a specified provider.
- Parameters:
provider – Provider name (e.g., “OPENAI”, “COHERE”)
- Returns:
API key string
- Raises:
MissingAPIKeyError – If the API key is not found
MissingAPIKeyError
- exception MissingAPIKeyError
Exception raised when an API key is missing.
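A common way to implement a lookup that raises when a key is missing is to read an environment variable derived from the provider name. The sketch below assumes a `<PROVIDER>_API_KEY` naming scheme, which is not confirmed by this reference; `read_api_key` and `MissingKeyError` are hypothetical stand-ins for `get_api_key` and `MissingAPIKeyError`.

```python
import os


class MissingKeyError(Exception):
    """Raised when no API key is configured for a provider (illustrative)."""


def read_api_key(provider: str) -> str:
    """Look up the provider's API key in the environment."""
    # Assumed naming scheme: "OPENAI" -> "OPENAI_API_KEY".
    env_name = f"{provider.upper()}_API_KEY"
    key = os.environ.get(env_name)
    if not key:
        raise MissingKeyError(f"{env_name} is not set")
    return key
```

Raising a dedicated exception lets callers distinguish a configuration problem from other failures and report which variable needs to be set.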
Config
LLM Provider Configs
- class OpenAIConfig
OpenAI-specific configuration.
- AVAILABLE_MODELS: list
List of available OpenAI models
- class ClaudeConfig
Claude-specific configuration.
- AVAILABLE_MODELS: list
List of available Claude models
- class CohereConfig
Cohere-specific configuration.
- AVAILABLE_MODELS: list
List of available Cohere models
- class GeminiConfig
Gemini-specific configuration.
- AVAILABLE_MODELS: list
List of available Gemini models
- class MistralConfig
Mistral-specific configuration.
- AVAILABLE_MODELS: list
List of available Mistral models
web Package
Chat Web App
- home()
Serve the iframe HTML interface.
- Returns:
HTML response with the chat interface
- chat(request)
Handle chat requests from the web interface.
- Parameters:
request – ChatRequest object containing the query
- Returns:
JSON response with the chatbot’s answer
- transcribe_audio(file)
Handle audio transcription requests.
- Parameters:
file – Uploaded audio file
- Returns:
JSON response with the transcription
Admin Dashboard
- serve_layout()
Create the dashboard layout.
- Returns:
Dash HTML layout components
- register_callbacks(app)
Register all dashboard callbacks.
- Parameters:
app – Dash application instance
- authenticate_user(username, password)
Authenticate a user against stored credentials.
- Parameters:
username – Username to authenticate
password – Password to verify
- Returns:
True if authentication succeeds, False otherwise
- generate_metrics()
Generate system metrics for the dashboard.
- Returns:
Dictionary of metrics (users, queries, token usage, costs)
Bot Implementations
Telegram Bot
- class TelegramBot
Telegram bot implementation.
- EMAIL_REGEX: str
Regular expression for validating email addresses
- ADMINS: list
List of admin user IDs
- __init__(link)
Initialize the Telegram bot.
- Parameters:
link – Website link to initialize the RAG system with
- extract_domain_name(link)
Extract domain name from a URL.
- Parameters:
link – URL to extract domain from
- Returns:
Domain name
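Extracting a domain name from a URL can be done with the standard `urllib.parse` module. The sketch below assumes that "domain name" means the host without a leading `www.`; the actual return format of `extract_domain_name` may differ, and `domain_of` is a hypothetical name.

```python
from urllib.parse import urlparse


def domain_of(link: str) -> str:
    """Return the host part of link without a leading 'www.' (illustrative)."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(link if "://" in link else f"https://{link}")
    host = parsed.hostname or ""
    return host[4:] if host.startswith("www.") else host
```

Using the parsed `hostname` attribute also strips any port number and lowercases the host, which keeps the result stable across equivalent URLs.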
- fetch_content(link, domain_name, max_depth=None, file_path=None, webpage_only=True)
Fetch content from a URL or file.
- Parameters:
link – URL to fetch content from
domain_name – Domain name
max_depth – Maximum crawl depth
file_path – Path to file
webpage_only – Whether to only fetch a single page
- Returns:
Path to fetched content
- _init_rag_system()
Initialize the RAG system with content from the website.
- Returns:
None
- _setup_handlers()
Set up Telegram command and message handlers.
- Returns:
None
- start(update, context)
Handle the /start command.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
Next conversation state
- transcribe(audio_buffer)
Transcribe voice messages to text.
- Parameters:
audio_buffer – Audio buffer containing voice message
- Returns:
Transcribed text
- add_content(update, context)
Add new content to the RAG system.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
None
- add_admin(update, context)
Add a new admin user.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
None
- remove_admin(update, context)
Remove an admin user.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
None
- get_admins(update, context)
List current admin users.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
None
- _is_admin(user_id)
Check if a user is an admin.
- Parameters:
user_id – User ID to check
- Returns:
True if user is admin, False otherwise
- _user_exists(id)
Check if a user exists in the database.
- Parameters:
id – User ID to check
- Returns:
True if user exists, False otherwise
- _run_rag_query(question, user_id)
Run a RAG query in a separate thread.
- Parameters:
question – User question
user_id – User ID
- Returns:
RAG response
- _extract_question(msg, context)
Extract question from text or voice message.
- Parameters:
msg – Message from Telegram
context – CallbackContext for the bot
- Returns:
Extracted question text
- handle_question(update, context)
Process questions from users.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
Next conversation state
- cancel_conversation(update, context)
Cancel the current conversation.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
End of conversation
- set_email(update, context)
Set email for receiving notifications.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
None
- _is_valid_email(new_email)
Check if an email address is valid.
- Parameters:
new_email – Email address to check
- Returns:
True if valid, False otherwise
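Email validation with a regular expression can be sketched as follows. The pattern here is a deliberately simple assumption and is not the value of `EMAIL_REGEX`, which this reference does not show; `EMAIL_PATTERN` and `looks_like_email` are hypothetical names.

```python
import re

# Simple pattern: local part, "@", domain labels, and a 2+ letter TLD.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_like_email(address: str) -> bool:
    """Return True if address matches the simple pattern above."""
    return EMAIL_PATTERN.fullmatch(address.strip()) is not None
```

A pattern like this only checks the overall shape of an address. Confirming that the mailbox exists requires sending a message to it.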
- broadcast(update, context)
Broadcast a message to all bot subscribers.
- Parameters:
update – Update from Telegram
context – CallbackContext for the bot
- Returns:
None
- run_async()
Run the bot asynchronously.
- Returns:
None
- _handle_exit(signum, frame)
Handle exit signals gracefully.
- Parameters:
signum – Signal number
frame – Current stack frame
- Returns:
None
- run()
Run the bot using asyncio.run.
- Returns:
None
WhatsApp Bot
- class TwilioClient
WhatsApp bot implementation using Twilio.
- TWILIO_SID: str
Twilio account SID
- TWILIO_PHONE_NUMBER: str
Twilio phone number
- TWILIO_AUTH_TOKEN: str
Twilio authentication token
- __init__()
Initialize the Twilio client for WhatsApp messaging.
- send_whatsapp_message(to_number, message)
Send a WhatsApp message.
- Parameters:
to_number – Recipient’s phone number
message – Message to send
- Returns:
Success status
- webhook(request)
Handle incoming webhook requests from Twilio.
- Parameters:
request – FastAPI request object
- Returns:
TwiML response for Twilio
- app.post("/sms")(twilio_bot.webhook)
Route for Twilio SMS webhook.
- Returns:
Response from webhook handler