Hybrid Large Language Model to Query Regulatory Standards in Medical Devices
6/21/2025
Leveraging AI for Queries in a Chatbot Accessing Regulatory Documents
My motivation for developing this solution stems from a critical need in medical device compliance to quickly find relevant and accurate information from regulatory documents:
- Locate pertinent text passages without manually searching through the PDF file.
- Get reliable answers in conversational style without the hallucinations that plague general-purpose AI systems.
In our highly regulated sector, imprecise responses can lead to misunderstandings, confusion as to what is factually correct, and even to non-conformities.
The solution lies in Retrieval-Augmented Generation (RAG), a technique that combines document retrieval with AI generation. Instead of relying solely on an LLM's training data, RAG first searches through the specific documents to find relevant passages. It then uses those passages to generate accurate, grounded responses.
This approach maintains the precision of source documents while leveraging the LLM's broader knowledge for context and interpretation. Through proper prompt engineering, the query can be limited to verbatim quotes only, if required. It is up to the user to define how much supplementary information should be used to broaden the context beyond the document.
Understanding Vector Databases and Embeddings
At the heart of RAG systems are vector databases. These are specialized storage systems that enable a semantic search. How does that work?
Embeddings transform text into high-dimensional numerical vectors that capture semantic meaning. Similar concepts cluster together, allowing the system to find relevant content even when exact keywords don't match. For example, "cybersecurity risk assessment" and "security vulnerability evaluation" would have similar vector representations despite different wording.
Vector databases like FAISS store these embeddings and enable fast similarity searches across thousands of document chunks. When asking a question, the system converts the query into a vector and finds the most semantically similar document passages.
The key insight is that the distance between vectors quantitatively reflects how similar or related the data points are to each other. In practice, the similarity between the question and the existing text in the database is calculated by rather straightforward mathematical operations (mostly linear algebra, e.g. the Euclidean distance or the dot product between two vectors).
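As an illustration, these similarity measures reduce to a few lines of plain Python. The toy 3-dimensional vectors below stand in for real embeddings, which have thousands of dimensions (text-embedding-3-large produces 3072-dimensional vectors):

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity: dot product normalized by the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors, as used by IndexFlatL2."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "embeddings" for illustration only.
query             = [0.9, 0.1, 0.2]   # e.g. "cybersecurity risk assessment"
similar_passage   = [0.8, 0.2, 0.1]   # e.g. "security vulnerability evaluation"
unrelated_passage = [0.1, 0.9, 0.8]   # e.g. "packaging and labelling"

# Semantically close texts score higher (cosine) and lie closer together (L2).
assert cosine_similarity(query, similar_passage) > cosine_similarity(query, unrelated_passage)
assert euclidean_distance(query, similar_passage) < euclidean_distance(query, unrelated_passage)
```

FAISS performs exactly this kind of comparison, only over thousands of high-dimensional vectors at once.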
The Local vs. Cloud Dilemma
Initially, I pursued a local LLM approach, keeping the AI model, the queries, and the databases with their interpretations on my own PC. This would avoid the costs of API calls to a cloud provider for embeddings and queries. Local solutions offer complete privacy, offline access, and full data control.
However, several experiments with solutions such as GPT4All and LocalGPT revealed a critical limitation: embedding quality.
Local embedding models significantly underperform cloud-based solutions in:
- Semantic depth: OpenAI's text-embedding-3-large leverages massive, diverse training data
- Domain adaptation: Superior handling of specialized regulatory terminology
- Mathematical precision: More nuanced vector representations for subtle concept relationships
The Hybrid Solution
This led to my current approach: leveraging OpenAI's superior embeddings and language capabilities while implementing local caching for fast queries. The system processes documents once, stores embeddings locally, and enables rapid offline querying while maintaining semantic precision.
The following implementation in Windows 11 (x86-64) demonstrates this regulatory RAG system using LangChain for orchestration, FAISS for vector storage, and OpenAI's API for embeddings and generation, creating a solution that showed promising performance and accuracy (verification & validation is still ongoing).
LangChain
LangChain is an open-source orchestration framework designed to help developers build applications on top of LLMs like OpenAI’s GPT, Google Gemini, Anthropic Claude, and open-source alternatives. It provides a generic interface to LLMs through Python or JavaScript libraries.
LangChain operates between the Python application and the LLM. It provides standardized interfaces and utilities in a chain-like sequence of functions:
- Connects to different LLM providers via an API (OpenAI, Llama, etc.)
- Handles prompt formatting and API calls
- Retrieves the relevant data from the document
- Parses the output
LangChain also provides prompt templates that help steer queries in a desired direction. Moreover, it is possible to retain the chatbot's conversation history.
Technical Overview - Installation and Setup Steps
Environment Setup:
- OS: Windows 11 (x86-64)
- Python 3.12 (the version proven to work well in this setup)
- Install Python packages via pip install
- Set OpenAI API key as environment variable
- Create .env file with the OpenAI API key for persistent configuration
- Create local directories
System Architecture Components:
- OpenAI Embeddings: Using text-embedding-3-large model for document vectorization
- FAISS Vector Store: Local vector database with IndexFlatL2 for similarity search
- LangChain Retrieval Chain: Combining document retrieval with GPT-4 for Q&A
- Cached Embeddings: Local file store to avoid re-computing embeddings
- Vision Integration: GPT-4o for analyzing technical diagrams and images
- Document Processing: MarkdownTextSplitter with 1000 char chunks and 200 char overlap
Chunking is Essential for RAG Systems
Chunking is the process of breaking down large documents into smaller, manageable segments that can be effectively processed by AI systems. It can be thought of as dividing a large book into chapters and sections - each chunk becomes a digestible unit that preserves meaningful context while remaining small enough for efficient processing.
In the document analysis system, chunking serves several critical purposes:
Model Limitations: Large language models and embedding models have maximum input size constraints measured in tokens. Tokens are the basic units that AI models use to process text - roughly equivalent to words or word fragments. For example, "cybersecurity" might be one token, while "cyber-security" could be two or three tokens. Every model has a finite input limit (text-embedding-3-large, for instance, accepts at most 8,191 tokens per input), and processing smaller chunks is more efficient and accurate.
Token-to-Chunk Relationship: the 1000-character chunks typically contain 150-250 tokens, depending on the text complexity. This size ensures each chunk stays well within model limits while preserving complete regulatory concepts.
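As a rough sanity check on that relationship: English text averages about four characters per token, so chunk sizes can be estimated without calling a tokenizer (the exact count always depends on the model's tokenizer, so treat this as an approximation):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: English text averages ~4 characters per token."""
    return round(len(text) / chars_per_token)

# A ~1000-character chunk of regulatory-style text.
chunk = ("The manufacturer shall establish, document and maintain "
         "a risk management process. ") * 12

print(len(chunk), estimate_tokens(chunk))  # ~1000 characters, ~250 tokens
```

A 1000-character chunk therefore lands comfortably below any model's input limit, in line with the 150-250 token range quoted above.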
Semantic Coherence: Rather than forcing the AI to process entire documents at once, chunking allows the system to focus on specific, contextually relevant sections when answering queries. This prevents information overload and improves retrieval precision.
Vector Database Efficiency: FAISS and other vector databases work optimally with consistently-sized chunks, enabling faster similarity searches and more accurate semantic matching.
Chunking Configuration Explained
My implementation uses MarkdownTextSplitter with 1000 character chunks and 200 character overlap - a specialized approach well suited to regulatory documents.
MarkdownTextSplitter: This content-aware chunking strategy offered by LangChain respects the markdown document structure. Unlike simple character-based splitting, it understands markdown syntax (headers, tables, figure captions) and splits at logical boundaries rather than mid-sentence. I found markdown to be the best format for correct parsing. It requires a lot of manual cleaning, but it gave far better results than parsing from a PDF document.
1000 Character Chunks: This size represents an optimal balance for regulatory content - large enough to capture complete regulatory concepts, small enough for precise retrieval without overwhelming the embedding model.
200 Character Overlap: This 20% overlap maintains context continuity by preventing context loss, preserving regulatory relationships between sections, and improving retrieval accuracy when queries match content near chunk boundaries.
Parameter Optimization
These parameters are optimized for regulatory documents' dense, technical language. The 1000-character size captures complete regulatory statements, while the overlap ensures cross-references and related requirements remain connected. This configuration balances thoroughness with speed, creating high-quality vector representations for fast similarity searches.
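Conceptually, the overlap mechanics can be sketched with a naive character-based splitter. This is a simplification for illustration only - the actual MarkdownTextSplitter additionally respects markdown structure (headers, tables) and prefers logical boundaries over fixed offsets:

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive splitter illustrating the chunk-size/overlap idea: each chunk
    repeats the last `overlap` characters of its predecessor, so content near
    a boundary always appears whole in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 2600
chunks = split_with_overlap(doc)

# Each chunk ends with the 200 characters that start the next one.
for a, b in zip(chunks, chunks[1:]):
    assert a[-200:] == b[:200]
```

The 200-character overlap is what keeps a regulatory requirement intact even when the splitter happens to cut right through it.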
Installation and runtime process:
- Get the Python version 3.12 installation file from https://www.python.org/
- Install Python 3.12 (remember to add PATH)
- Using Git Bash (or any other CLI), create a root project folder and change into it
- To reduce problems with dependencies, create a separate environment and activate it:
py -3.12 -m venv langchain_env
source langchain_env/Scripts/activate
python --version
(On Ubuntu, the commands to create and activate an environment are slightly different:
python3.12 -m venv langchain_env
source langchain_env/bin/activate
)
- Install the LLM-specific software:
pip install langchain-openai
pip install faiss-cpu
pip install python-dotenv
pip install langchain-community
pip install langchain-text-splitters
Create the environment file and place in it the API key that you obtain from https://openai.com/:
echo "OPENAI_API_KEY=enter-openai-key-here" > .env
mkdir documents
mkdir embeddings_cache
mkdir scripts
At the top of the Python script that manages the LLM pipeline and provides a CLI interface for querying and retrieving from the vector database, write:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import MarkdownTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
The Python script, named langchain_rag.py, uses LangChain index operations to load regulatory documents and process them into overlapping chunks with MarkdownTextSplitter.
The script - via LangChain - creates and manages a FAISS vector database for storing embeddings and performs similarity searches. It then implements the complete RAG workflow: it converts user queries to embeddings, retrieves relevant document chunks, and combines them with GPT-4 to generate accurate responses.
The script provides an interactive command-line interface for querying documents while maintaining the precision required for compliance applications (see topmost figure for an example).
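Stripped of LangChain, OpenAI, and FAISS, the retrieval workflow above can be sketched in plain Python. The bag-of-words "embedding" and the three example chunks below are simplifications for illustration only; the real system uses dense text-embedding-3-large vectors and a FAISS index:

```python
from collections import Counter
import math

# Stand-in corpus of "chunks"; in the real system these come from
# MarkdownTextSplitter and are embedded once, then cached locally.
chunks = [
    "The manufacturer shall establish a risk management process per ISO 14971.",
    "Labelling shall include the name and address of the manufacturer.",
    "Cybersecurity controls shall address vulnerabilities over the device lifecycle.",
]

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector. A real embedding model
    maps text to a dense vector capturing semantics, not just shared words."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank all chunks by similarity to the query and keep the top k
    (what FAISS does, but efficiently over many dense vectors)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Combine the retrieved chunks with the question; the real system sends
    the assembled prompt to GPT-4 via LangChain's retrieval chain."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("Which risk management process must the manufacturer establish?"))
```

The grounding comes from the prompt: the model is instructed to answer from the retrieved context, which is what keeps responses anchored to the source documents.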
All this is accomplished by activating the environment and running the LLM script:
python scripts/langchain_rag.py
Multiple Documents and Using Environments
The notion of environments becomes important when you distinguish between use cases. One case is a chatbot relying on a single source document only (e.g. a particular regulatory document such as the MDR). In that case, you create a sandbox with all software dependencies and source documents (text and images).
In another use case you might prefer to have the content of the MDR merged with domain-specific standards in, say, risk management. So you add ISO 14971 and ISO/TR 24971 to form a 'context' in which document borders vanish and you have access to a combined body of knowledge. Note that questions such as "give me the section of the MDR that covers risk management" won't work here, because the LLM sees all documents as one data 'lake'.
Utilizing different AI models (e.g. Anthropic instead of ChatGPT) can also be a good reason to maintain separate environments, even when using the same documents.
AI Model Selection
After some literature research on the suitability of AI models for regulatory source documents, I decided to use GPT-4, because it showed the best results in handling legal and regulatory text [1].
Costs, Prerequisites and Efforts
This setup requires neither dedicated GPUs nor large amounts of RAM. The computationally heavy AI processing happens remotely at OpenAI. The cost of creating the embeddings for one document of approximately 2000 lines of formatted text is around $1.
Image interpretation can be costly: OpenAI Vision may cost a similar amount for analyzing the content of about five schematic diagrams in PNG format. On the other hand, OpenAI Vision is very good at extracting the meaning of schematic drawings, which adds substantial value to the chatbot.
The good part is that once the embeddings and the vector database are available, the similarity searches run on local resources and the cached embeddings incur no further cost. The delay between the question and the chatbot's first answer appearing on the screen is usually 2-3 s. The printout speed is more than sufficient for productive work.
A chatbot empowers the learning process because it serves as a mentor. It is fun and helps to raise self-learning to a new level!
Experience with the command line (Linux CLI commands or PowerShell) and Python programming skills are highly recommended. Even so, you do not need to be a professional software engineer: AI can assist in setting up the system and coding the necessary scripts.
Formal validation is a separate, important topic that requires careful planning and execution. As a start, place the original text side-by-side with the chatbot's answers and assess the accuracy, and the tendency of the AI model to augment its replies with other sources of information.
References
[1] Cao et al., "Large Language Model in Financial Regulatory Interpretation", May 14, 2024, https://arxiv.org/pdf/2405.06808v1