Building a RAG Chatbot with Microsoft's Native C#/.NET Stack

6/5/2026

A follow-up to my LangChain implementation – same problem, different philosophy

In a previous article, I described how to build a hybrid LLM query system for regulatory documents using Python, LangChain, and FAISS (LinkedIn article). The approach worked well, but it relied on Python's ecosystem and a file-based vector store. This time I rebuilt the same application using Microsoft's native stack: C# / .NET 10, Microsoft Kernel Memory, and Qdrant. The contrast is instructive.

Why a second implementation?

LangChain is a well-established standard in the Python RAG world, but it is not the only option. Microsoft has built a production-grade alternative inside its Semantic Kernel ecosystem: Microsoft Kernel Memory.

Many organisations, and my own website at https://dirkmueller8.com/, run on .NET. For them this approach integrates naturally with Azure and ASP.NET Core, avoids Python dependency management, and provides strongly typed configuration, which makes writing and deploying the code safer.

My implementation is available with a README file (incl. architecture diagrams) at https://github.com/DirkMueller8/RegulatoryRagBot.

The Building Blocks

1. The Document Store – Qdrant

Where the LangChain version used FAISS (a local, file-based index), this implementation uses Qdrant — a dedicated vector database running in Docker. The practical difference: Qdrant persists independently of the application process.

I can restart the app, update the code, or deploy a new version without losing my indexed documents. For regulatory work where ingestion can be expensive (both in time and API cost), this matters.

2. The Orchestration Layer – Microsoft Kernel Memory

Kernel Memory handles the full pipeline: document ingestion, chunking, embedding, storage, retrieval, and answer generation. It is the .NET equivalent of LangChain's document loaders, retrievers, and chains, but with a single unified API surface. Three lines of configuration replace what took dozens of lines in LangChain, where I also spent considerable time fixing bugs and working around breaking changes after major framework updates.

3. The Embedding Model – text-embedding-3-large

Both implementations use OpenAI's text-embedding-3-large. At 3072 dimensions it is the most accurate embedding model available from OpenAI, which matters for regulatory text, where two clauses can be semantically similar yet legally distinct. The cost premium over text-embedding-3-small is considerable in relative terms (about 6.5× at the launch list prices of $0.13 versus $0.02 per million tokens), but for a corpus of regulatory standards that is embedded once, the absolute cost is negligible relative to the precision gained.

4. The Chat Model – GPT-4o

Answer generation uses gpt-4o. Regulatory language is dense with cross-references, defined terms, and long clauses. GPT-4o's reasoning capability handles this better than smaller models. Once output quality is validated, gpt-4o-mini is a viable cost-reduction step.

The Parameters That Matter

This is where the implementation decisions become engineering decisions.

Chunk size: 1024 tokens

The LangChain version used 1000-character chunks (~150–250 tokens). This implementation uses 1024 tokens, roughly 4–7× larger. Regulatory text does not split cleanly into small chunks. A single ISO standard clause can span multiple sentences with back-references to definitions in earlier sections. A 200-token chunk captures a fragment; a 1024-token chunk captures the clause in context. The trade-off is that larger chunks produce less precise similarity scores, hence the importance of the overlap and relevance threshold described next.

Overlap: 256 tokens (25%)

Chunk boundaries are arbitrary. A cross-reference or conditional clause split across two chunks will be underrepresented in both. With a 25% overlap (256 of 1024 tokens), consecutive chunks start 768 tokens apart and share 256 tokens, so with fixed-size windows any passage of up to 256 tokens appears whole in at least one chunk. The LangChain version used a 20% character overlap, which is conceptually identical, but token-based overlap is more precise because embedding models operate on tokens, not characters.
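
To make the window arithmetic concrete, here is a minimal sketch in Python, chosen only because it is compact; it is not taken from either implementation. It assumes plain fixed-size token windows, whereas production chunkers typically also respect sentence and paragraph boundaries, so real chunk edges will differ. The function names chunk_windows and covering_window are hypothetical.

```python
def chunk_windows(n_tokens, size=1024, overlap=256):
    """Return (start, end) token offsets of fixed-size windows with overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if n_tokens <= 0:
        return []
    stride = size - overlap  # 768 for the values used in this article
    windows = []
    start = 0
    while True:
        end = min(start + size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:  # the last window reaches the end of the document
            break
        start += stride
    return windows


def covering_window(windows, span_start, span_end):
    """Return the first window that contains the whole span, or None."""
    for start, end in windows:
        if start <= span_start and span_end <= end:
            return (start, end)
    return None


windows = chunk_windows(2560)
print(windows)                               # [(0, 1024), (768, 1792), (1536, 2560)]
print(covering_window(windows, 1000, 1100))  # (768, 1792)
```

A clause spanning tokens 1000 to 1100 straddles the end of the first window but sits entirely inside the second. With overlap=0 the same span would be cut in two and no single chunk would contain it.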

Minimum relevance: 0.55 (cosine similarity)

This is the threshold below which retrieved passages are discarded before being sent to GPT-4o. Set it too high and valid passages are excluded; set it too low and noise enters the context, producing hallucinated or diluted answers.

A value of 0.4–0.55 works well for a mature, well-structured corpus. For a first ingestion, especially of documents with dense technical terminology, 0.3–0.4 is a safer starting point while validating quality. Tune upward once a baseline of expected answers is established.
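
The effect of the threshold can be sketched in a few lines of Python. This is a language-neutral illustration, not Kernel Memory's code: the Passage type, the filter_by_relevance helper, the file names, and the scores are all hypothetical, and in a real pipeline the similarity values come from the vector store.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str       # hypothetical document name
    relevance: float  # cosine similarity between query and passage


def filter_by_relevance(passages, min_relevance=0.55):
    """Drop passages below the threshold; return the rest, best first."""
    kept = [p for p in passages if p.relevance >= min_relevance]
    return sorted(kept, key=lambda p: p.relevance, reverse=True)


# Made-up scores, for illustration only.
retrieved = [
    Passage("standard_a.pdf", 0.41),
    Passage("standard_b.pdf", 0.72),
    Passage("guidance_c.pdf", 0.58),
    Passage("standard_d.pdf", 0.33),
]

for threshold in (0.55, 0.40, 0.30):
    kept = filter_by_relevance(retrieved, threshold)
    print(threshold, [p.source for p in kept])
```

In this toy example, lowering the threshold from 0.55 to 0.30 doubles the number of passages handed to the chat model, from two to four. That is the trade-off between excluding valid passages and admitting noise.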

What the Microsoft Stack Adds

Beyond the code differences, three things stand out:

Persistence by design. Qdrant + SimpleFileStorage means the application state survives restarts. FAISS requires manual management of serialisation.
Skip-on-reingest. Kernel Memory tracks document IDs and skips already-indexed files on subsequent runs. The embedding cost is paid once per document, not once per run.
Citations out of the box. The API returns RelevantSources with passage-level relevance scores, making it transparent and straightforward to show users exactly which document and paragraph produced an answer.

Conclusion

LangChain and Microsoft Kernel Memory solve the same problem from different starting points. If you are in the Python ecosystem, LangChain remains the more flexible choice. If you are in .NET, or if you want a tighter integration with Azure services down the road, Microsoft Kernel Memory gives a cleaner path to production.

The regulatory domain is an ideal fit for RAG: the documents are authoritative, updates are infrequent, and the cost of a wrong answer is high. Both implementations reflect that — but the parameter choices described here are what separate a prototype from something you can stand behind in an audit.
