Overview

System Architecture

DataVerse ChatBot follows a modular architecture designed with extensibility and maintainability in mind:

┌─────────────────┐       ┌───────────────────┐       ┌─────────────────┐
│  Data Ingestion │       │ Vector Processing │       │   LLM Systems   │
│                 │       │                   │       │                 │
│ - Web Crawling  │──────▶│ - Embeddings     │──────▶│ - RAG Interface │
│ - File Loading  │       │ - Vectorization  │       │ - LLM Providers │
│                 │       │ - FAISS Indexing │       │                 │
└─────────────────┘       └───────────────────┘       └─────────────────┘
        ▲                                                     │
        │                                                     ▼
        │                 ┌───────────────────┐       ┌─────────────────┐
        │                 │  Chat Interfaces  │       │   Monitoring    │
        └─────────────────│                   │◀──────│                 │
                          │ - Telegram Bot    │       │ - Uncertainty   │
                          │ - WhatsApp Bot    │       │ - Email Alerts  │
                          │ - Web Interface   │       │ - Chat History  │
                          └───────────────────┘       └─────────────────┘

Key Components

  1. Data Ingestion

    • Crawler: Extracts content from websites using configurable crawlers

    • FileLoader: Processes various file formats into uniform text representations

  2. Vector Processing

    • BaseEmbedding: Creates vector embeddings from text

    • CohereEmbedding, OpenAIEmbedding, etc.: Provider-specific embedding implementations

    • FAISS integration: Efficient similarity search for document retrieval

  3. RAG Systems

    • BaseRAG: Core Retrieval-Augmented Generation functionality

    • ClaudeRAG, OpenAIRAG, etc.: LLM-specific implementations

    • Context management and reranking

  4. Chat Interfaces

    • Telegram bot for messenger integration

    • WhatsApp bot via Twilio integration

    • Web-based chat interface with iframe embedding support

  5. Monitoring & Utilities

    • Uncertainty detection via trained classifier

    • Response monitoring and email alerting

    • Database operations for chat history and usage tracking

Technology Stack

  • Programming Language: Python 3.11+

  • Vector Database: FAISS (Facebook AI Similarity Search)

  • LLM Providers: OpenAI, Anthropic, Google, Mistral, Cohere, DeepSeek, Grok

  • Web Crawling: Crawl4AI, ScrapegraphAI

  • Data Processing: LangChain, DocLing, unstructured

  • Classification: scikit-learn, XGBoost

  • Embedding Models: sentence-transformers, provider-specific embedding APIs

  • Web Framework: FastAPI, Flask

  • Messenger Integrations: python-telegram-bot, Twilio

  • Database: SQLite

  • Voice Support: OpenAI Whisper, TTS libraries