Infoly AI Chatbot

Problem Statement

Organizations often struggle with timely access to internal information, especially when employees need quick answers to domain-specific questions. In many companies, Go-to-Market (GTM) teams—including sales, marketing, and customer success—frequently rely on the product team to clarify product details, technical specifications, or feature-related queries. However, due to the product team's workload and competing priorities, responses can be delayed, slowing down the GTM teams' workflows and reducing their overall efficiency.

The goal is to build an intelligent, chat-based AI assistant that is trained on internal company datasets such as product documentation, wikis, knowledge bases, and training manuals. This AI chatbot should enable employees to instantly access accurate information and get their queries resolved in real time without human intervention.

In addition, the company wants to provide a similar experience to external users—such as customers, prospects, and partners—by training the AI model on publicly available website data (e.g., FAQs, product pages, support documentation). The chatbot will be exposed over a public URL, allowing users to ask questions and receive immediate, reliable answers, thus improving customer engagement and reducing support ticket volume.

Solutioning Approach

To address the challenge of enabling real-time, accurate responses to user queries based on internal company datasets, we implemented a multi-layered AI solution that prioritized data security, scalability, and precision.

  1. Fine-Tuning the AI Model on Internal Datasets

    We began by selecting a high-performing open-source large language model (LLM) as the base for our solution. Before fine-tuning, we performed a comprehensive data preprocessing step that included anonymization and the removal of Personally Identifiable Information (PII) from all internal documents. This was crucial not only for ensuring compliance with data privacy regulations but also for maintaining the confidentiality of sensitive organizational information.
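
    The sketch below illustrates the kind of rule-based PII scrubbing applied during preprocessing; the regex patterns and placeholder tokens are illustrative assumptions, and the production pipeline used a broader set of detectors (names, addresses, employee IDs, etc.).

    ```python
    import re

    # Illustrative patterns only; the real pipeline covered many more PII types
    # before any document was allowed into the fine-tuning corpus.
    PII_PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "PHONE": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def anonymize(text: str) -> str:
        """Replace detected PII spans with typed placeholder tokens."""
        for label, pattern in PII_PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    print(anonymize("Reach Jane at jane.doe@acme.com or +1 (555) 123-4567."))
    # -> "Reach Jane at [EMAIL] or [PHONE]."
    ```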

    The fine-tuning process was then carried out using the cleaned dataset, which consisted of product documentation, internal wikis, playbooks, training material, and other proprietary knowledge sources. By customizing the model with this internal context, we enabled it to understand domain-specific terminology and respond accurately to company-specific queries.
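
    A minimal sketch of what this supervised fine-tuning step can look like with a Hugging Face-style causal LM is shown below. The base model name, hyperparameters, record fields, and prompt template are placeholders, not the client's actual configuration.

    ```python
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any open-source causal LM

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

    # Cleaned, PII-free Q/A pairs derived from internal docs (illustrative records).
    records = [{"prompt": "Which regions does the Pro plan cover?",
                "answer": "The Pro plan is available in NA and EU regions."}]

    def to_text(rec):
        return {"text": f"### Question:\n{rec['prompt']}\n### Answer:\n{rec['answer']}"}

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=1024)

    dataset = (Dataset.from_list(records)
               .map(to_text)
               .map(tokenize, batched=True, remove_columns=["prompt", "answer", "text"]))

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                               per_device_train_batch_size=2, fp16=True),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    ```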

  2. Retrieval-Augmented Generation (RAG) for Scalable Contextualization

    Given the extensive volume of internal documentation, it was not feasible to fit the entire knowledge base into the model's prompt context window. To address this, we implemented a Retrieval-Augmented Generation (RAG) architecture on top of the fine-tuned model.

    RAG allowed the system to dynamically retrieve relevant pieces of information from a pre-indexed document store in real time, based on the user's query. These relevant document snippets were then appended to the user query as context before being passed to the model for inference. This approach significantly enhanced the accuracy and relevance of the responses, especially for complex or nuanced questions.
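
    The sketch below shows the core RAG loop under simplified assumptions: chunks are embedded with an off-the-shelf sentence-transformers model and ranked by cosine similarity in memory, whereas the production system stored vectors in OpenSearch (see the following sections) and re-ranked results before inference.

    ```python
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Stand-in embedding model and toy chunks; the real corpus was pre-indexed.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = [
        "The Pro plan includes SSO and audit logs.",
        "Data is retained for 90 days on the Standard plan.",
        "Support hours are 9am-6pm ET on weekdays.",
    ]
    chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Return the k chunks most similar to the query (cosine similarity)."""
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = chunk_vectors @ q
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

    def build_prompt(query: str) -> str:
        """Attach the retrieved snippets to the user query before inference."""
        context = "\n".join(f"- {c}" for c in retrieve(query))
        return (f"Answer using only the context below.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

    print(build_prompt("Does the Pro plan support SSO?"))
    ```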

  3. Secure and Compliant Model Deployment

    To meet the client's stringent security and compliance requirements, the entire AI model deployment was hosted within their private cloud infrastructure. Specifically, the model was deployed on a dedicated GPU-enabled instance within the client's secure cloud environment, ensuring that sensitive data and model inferences remained entirely within their controlled perimeter.

  4. Vector Store Integration Using OpenSearch

    The document embeddings used for retrieval in the RAG system were stored in a vector database built on Amazon OpenSearch Service, which was already part of the client's AWS environment. This ensured that all vector search operations occurred within the same secure cloud infrastructure and no data was transmitted to external servers or third-party platforms. The use of OpenSearch also allowed for seamless scalability, high availability, and low-latency retrievals.
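
    A minimal sketch of the OpenSearch-backed vector store is shown below, assuming the opensearch-py client and the k-NN plugin's knn_vector field type; the endpoint, credentials, embedding dimension, and index settings are placeholders and deliberately simplified.

    ```python
    from opensearchpy import OpenSearch

    # Connection details are placeholders; the real cluster ran as an Amazon
    # OpenSearch Service domain inside the client's AWS account.
    client = OpenSearch(hosts=[{"host": "vpc-endpoint.example.com", "port": 443}],
                        http_auth=("user", "password"), use_ssl=True)

    INDEX = "doc-chunks"

    # k-NN enabled index with a vector field for chunk embeddings (settings simplified).
    client.indices.create(index=INDEX, body={
        "settings": {"index": {"knn": True}},
        "mappings": {"properties": {
            "text": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 384},
        }},
    })

    # Index one chunk; the embedding comes from the same embedder used at query time.
    client.index(index=INDEX, body={"text": "The Pro plan includes SSO.",
                                    "embedding": [0.01] * 384}, refresh=True)

    # Retrieve the top-k nearest chunks for a query embedding.
    hits = client.search(index=INDEX, body={
        "size": 3,
        "query": {"knn": {"embedding": {"vector": [0.01] * 384, "k": 3}}},
    })["hits"]["hits"]
    print([h["_source"]["text"] for h in hits])
    ```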

  5. Enhanced Chunk Ranking with Cohere's Re-Ranker on AWS Bedrock

    To further improve the quality of the retrieved documents and ensure that the most relevant information was presented to the AI model, we integrated Cohere's Re-Ranker via AWS Bedrock. The re-ranker scored and sorted the retrieved chunks based on semantic relevance to the user query. This re-ranking mechanism played a critical role in handling complex queries with high accuracy, ensuring that the AI model always received the most pertinent information for response generation.
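
    The sketch below shows one way to call a Cohere Rerank model through the Bedrock runtime with boto3. The model ID and the request/response field names follow Cohere's Rerank schema as exposed on Bedrock and may differ by model version, so treat them as assumptions rather than the exact production integration.

    ```python
    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
        """Score retrieved chunks against the query and return the most relevant ones.

        Model ID and payload fields are illustrative assumptions based on
        Cohere's Rerank schema; verify against the deployed model version.
        """
        response = bedrock.invoke_model(
            modelId="cohere.rerank-v3-5:0",
            body=json.dumps({"query": query, "documents": chunks,
                             "top_n": top_n, "api_version": 2}),
        )
        results = json.loads(response["body"].read())["results"]
        return [chunks[r["index"]] for r in results]

    best = rerank("Does the Pro plan support SSO?",
                  ["Support hours are 9am-6pm ET.", "The Pro plan includes SSO."])
    ```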

Challenges and Mitigation Strategies

As with any enterprise-grade AI deployment, building the Infoly AI Chatbot involved navigating several technical, operational, and infrastructural challenges. Below is a breakdown of the key challenges we encountered and the strategies we used to overcome them:

  1. Model Accuracy and Evaluation

    Ensuring high accuracy in responses was critical to the chatbot's adoption and effectiveness, especially for internal GTM teams relying on precise, domain-specific information.

    • Evaluation Dataset Creation: To benchmark the model's performance, we developed a comprehensive evaluation dataset that closely mimicked real-world user queries; a simplified scoring harness over such a dataset is sketched after this list. The dataset was built in collaboration with key stakeholders from the GTM teams to reflect the breadth and depth of questions typically encountered in their day-to-day roles.
    • Human-in-the-Loop (HITL) QA Process: Our Quality Assurance team conducted detailed manual evaluations of the model outputs through a HITL framework. This iterative feedback loop helped fine-tune both the model and retrieval pipeline to ensure consistent and reliable performance across query categories.
    • Domain-Specific Fine-Tuning: The base LLM was fine-tuned using proprietary internal documentation, which allowed the model to adapt to the client's domain-specific vocabulary, tone, and structure. This domain adaptation was crucial for enhancing the model's understanding of nuanced terms and internal acronyms.
    • Customized Re-Ranker Training: In addition to fine-tuning the LLM, we also fine-tuned Cohere's Re-Ranker on the client's specific document-query pairs. This ensured that document chunk relevance scoring was aligned with the actual needs of the end users, further improving response accuracy.
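
    Below is a deliberately simple sketch of the automated scoring pass that can run over such an evaluation dataset before the human-in-the-loop review; the keyword-overlap metric, record fields, and the `ask` callable are illustrative assumptions, not the actual evaluation methodology.

    ```python
    import json

    # Illustrative records; the real dataset was curated with GTM stakeholders.
    EVAL_SET = [
        {"question": "Which plans include SSO?",
         "expected_keywords": ["pro", "enterprise"]},
    ]

    def keyword_score(answer: str, expected_keywords: list[str]) -> float:
        """Fraction of expected keywords present in the answer (a crude proxy metric)."""
        answer = answer.lower()
        hits = sum(1 for kw in expected_keywords if kw in answer)
        return hits / len(expected_keywords)

    def evaluate(ask):  # `ask` is the chatbot call, e.g. ask(question) -> str
        rows = []
        for item in EVAL_SET:
            answer = ask(item["question"])
            rows.append({"question": item["question"],
                         "score": keyword_score(answer, item["expected_keywords"]),
                         "answer": answer})  # kept for the human-in-the-loop review pass
        print(json.dumps(rows, indent=2))
        return sum(r["score"] for r in rows) / len(rows)
    ```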
  2. Model Latency and Infrastructure Optimization

    Achieving high throughput with minimal latency was a key challenge, especially given the real-time expectations of internal users.

    • A100 GPU Deployment: To ensure low-latency inference and handle concurrent query loads efficiently, we deployed the LLM on NVIDIA A100 GPUs within the client's private cloud infrastructure. These high-performance GPUs significantly reduced response times and ensured smooth performance during peak usage hours.
    • Inference Optimization: We implemented multiple optimizations at the model-serving layer, including dynamic batching, precision tuning (FP16), and token streaming, to further reduce latency without compromising output quality; the FP16 loading and token-streaming pieces are sketched below.
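
    A minimal sketch of FP16 loading and token streaming is shown below, assuming a Hugging Face checkpoint and the transformers TextIteratorStreamer; dynamic batching is not shown, as it was handled at the serving layer. The checkpoint path and generation parameters are placeholders.

    ```python
    from threading import Thread
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

    MODEL = "ft-out"  # placeholder path to the fine-tuned checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    # FP16 weights roughly halve memory traffic on the A100s and speed up inference.
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16,
                                                 device_map="auto")

    def stream_answer(prompt: str):
        """Yield text fragments as they are generated so the UI can render immediately."""
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
                                        skip_special_tokens=True)
        Thread(target=model.generate,
               kwargs=dict(**inputs, max_new_tokens=256, streamer=streamer)).start()
        for piece in streamer:
            yield piece

    for piece in stream_answer("Question: Does the Pro plan support SSO?\nAnswer:"):
        print(piece, end="", flush=True)
    ```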
  3. End-to-End System Integration

    Making the AI chatbot easily accessible to both internal and external users required seamless integration with various platforms and user touchpoints.

    • Slack Integration for Internal Users: To align with existing workflows, we integrated the model's APIs into the client's internal Slack workspace (a minimal bot sketch follows this list). This allowed GTM team members to query the AI chatbot directly within their daily communication tool, leading to higher adoption and usability.
    • External Access via Shareable Chatbot Link: For external stakeholders (customers, partners, prospects), we developed a standalone web-based chatbot interface. This chatbot was powered by a large-context LLM such as Gemini, grounded exclusively in the client's publicly available website data (e.g., FAQs, documentation, product pages). The result was a highly intuitive, branded experience that could be easily shared via a link.
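
    A minimal sketch of the Slack integration is shown below, assuming a Socket Mode bot built with slack_bolt that forwards @-mentions to the chatbot's HTTP API; the endpoint URL, token names, and response schema are placeholders.

    ```python
    import os
    import requests
    from slack_bolt import App
    from slack_bolt.adapter.socket_mode import SocketModeHandler

    # Tokens and the chatbot endpoint are placeholders for the client's deployment.
    app = App(token=os.environ["SLACK_BOT_TOKEN"])
    CHATBOT_URL = "https://chatbot.internal.example.com/query"

    @app.event("app_mention")
    def handle_mention(event, say):
        """Forward the user's question to the chatbot API and reply in the same thread."""
        question = event["text"]
        answer = requests.post(CHATBOT_URL, json={"query": question},
                               timeout=30).json()["answer"]
        say(text=answer, thread_ts=event["ts"])

    if __name__ == "__main__":
        SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
    ```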
  4. Real-Time Model Updates and Data Freshness

    Keeping the chatbot updated with the latest internal and external information was vital for maintaining trust and usefulness.

    • Custom Update and Retraining Pipeline: We built a modular, automated retraining pipeline that continuously ingested new data—such as updated documentation, support articles, or release notes—into the model's knowledge base.
    • Graph-Based RAG System: To support efficient updates, we engineered a graph-based RAG system. This structure enabled hierarchical and relationship-aware document retrieval, which significantly reduced redundancy and allowed the retrieval layer to be refreshed in near real time whenever new data was added; a simplified retrieval expansion over such a graph is sketched below.
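
    The sketch below illustrates the relationship-aware retrieval idea on a toy graph using networkx: vector-search hits act as seed nodes and related nodes are pulled in along edges. Graph construction and the incremental update pipeline are not shown, and the node names and texts are illustrative.

    ```python
    import networkx as nx

    # Toy knowledge graph: document sections as nodes, hierarchy and cross-references
    # as edges. The production system maintained this graph as new documents arrived.
    graph = nx.Graph()
    graph.add_node("pricing/pro", text="The Pro plan includes SSO and audit logs.")
    graph.add_node("security/sso", text="SSO supports SAML 2.0 and OIDC providers.")
    graph.add_node("support/hours", text="Support hours are 9am-6pm ET on weekdays.")
    graph.add_edge("pricing/pro", "security/sso")  # cross-reference between topics

    def expand_context(seed_nodes: list[str], hops: int = 1) -> list[str]:
        """Start from vector-search hits and pull in related nodes along graph edges."""
        selected = set(seed_nodes)
        frontier = set(seed_nodes)
        for _ in range(hops):
            frontier = {nbr for node in frontier for nbr in graph.neighbors(node)} - selected
            selected |= frontier
        return [graph.nodes[n]["text"] for n in selected]

    # A query about Pro-plan SSO seeds "pricing/pro"; the graph pulls in the SSO details too.
    print(expand_context(["pricing/pro"]))
    ```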
  5. Mitigating Model Hallucinations

    One of the most critical risks in deploying LLMs is hallucination—where the model generates information that is not grounded in actual source data.

    • LLM-as-a-Judge Evaluation Framework: To combat hallucinations, we implemented an LLM-as-a-Judge (LaaJ) framework, wherein a separate LLM was used to critically evaluate whether the chatbot's responses were properly grounded in the retrieved context. This acted as a secondary validation layer before final delivery to the user; a simplified grounding check is sketched after this list.
    • Graph-Based Context Grounding: The use of a graph-based RAG architecture provided stronger semantic and structural grounding by maintaining relationships between topics and content nodes. This architecture played a pivotal role in significantly reducing hallucinations and ensuring contextual fidelity.
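
    Below is a simplified sketch of the grounding check; the judge prompt, the JSON verdict schema, and the generate/retrieve/judge_llm callables are illustrative assumptions rather than the production framework.

    ```python
    import json

    JUDGE_PROMPT = """You are a strict grader. Given the retrieved context and a draft answer,
    decide whether every factual claim in the answer is supported by the context.
    Respond with JSON: {{"grounded": true/false, "unsupported_claims": [...]}}

    Context:
    {context}

    Draft answer:
    {answer}
    """

    def grounding_check(context: str, answer: str, judge_llm) -> dict:
        """Run the LLM-as-a-Judge pass; `judge_llm` is any callable prompt -> completion.

        Assumes the judge reliably returns the requested JSON verdict.
        """
        return json.loads(judge_llm(JUDGE_PROMPT.format(context=context, answer=answer)))

    def answer_with_guardrail(query, generate, retrieve, judge_llm):
        """Only release an answer if the judge confirms it is grounded in the context."""
        context = "\n".join(retrieve(query))
        draft = generate(query, context)
        if grounding_check(context, draft, judge_llm)["grounded"]:
            return draft
        # Fall back rather than ship an unsupported answer.
        return "I couldn't find a reliable answer in the documentation for that question."
    ```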

Outcome

As a result of these measures, the chatbot achieved over 98% accuracy on the evaluation dataset described above, with hallucination rates reduced to negligible levels.