Building a Public-Data RAG System End-to-End

May 5, 2026 · 20 min read

Introduction

This project started with a practical question:

How far can I get in a legal research workflow with only publicly available data?

I built this as an end-to-end RAG system, not just a retrieval prototype. The scope includes source discovery, ingestion, normalization, chunking, retrieval, and evaluation. The legal sphere is an example domain here, not the core point of the article. What I want to showcase is the full RAG workflow, the techniques involved, and where quality actually breaks once you move beyond a narrow demo. The overall retrieval-augmented generation pattern follows the original RAG paper by Lewis et al. (Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks).

I picked the legal domain as the example because I have worked in it for the last five years. That gave me a rough understanding of the data available, the research needs of legal professionals, and the constraints that make this kind of public-data setup interesting to test.

The current benchmark slices already make the first concrete legal areas explicit: Werkvertrag, Schuldrecht B2B, and GBR. I am using those three areas as the first public research tracks while continuing to refine the broader professional legal research workflow around them.

Another goal was to design a workflow that is measurable and easy to iterate on. I used a medallion-style structure to separate raw ingestion, structured processing, retrieval, and evaluation so that each layer can be improved independently, following the general Bronze/Silver layering pattern described by Databricks (What is the medallion lakehouse architecture?).

The current benchmark vs RAG deltas are still weak, which is an important result in itself. What this project already demonstrates is the kind of engineering work I want to showcase: building real pipelines around messy data, making tradeoffs explicit, and evaluating systems honestly instead of treating early output as proof.

None of the core building blocks here are proprietary tricks. The stack combines standard public RAG techniques: layered data preparation, chunking, hybrid retrieval, and dataset-based evaluation, all of which are common in production RAG systems as well as research workflows (Retrieval-augmented generation (RAG) in Azure AI Search).

Project Snapshot

At a high level, this experiment combines four pieces:

a public-data ingestion layer built around sitemaps, direct URLs, search-intent discovery, and raw document intake
a normalization and chunking pipeline that turns mixed legal content into retrieval-ready documents
a retrieval setup with vector, BM25, and hybrid modes
an evaluation workflow that compares benchmark-only generation against retrieval-augmented generation across correctness, completeness, structure, and grounding

The current state is promising but still early. The full loop is running end to end, and the system is now structured well enough to make targeted improvements and measure whether they actually help.

Why This Experiment Exists

The starting hypothesis was simple: retrieval should improve answer quality for a domain with fragmented public knowledge.

I used legal research for professionals as the concrete end-to-end use case because it is a domain I know reasonably well and because it makes the retrieval problem very visible. The current public benchmark focuses on Werkvertrag, Schuldrecht B2B, and GBR, but the workflow focus is already clear: support research-oriented questions where source quality, grounding, and traceability matter. High-value public data is unevenly accessible:

A lot of detailed commentary is paywalled.
Public datasets are fragmented.
Judgments are hard to obtain: they are either paywalled or heavily access-restricted.

That makes it a useful case study for RAG techniques. Model choice is only one part of the problem. The bigger constraint is data logistics: sourcing, structuring, retrieval quality, and evaluation discipline.

So I approached the problem from both sides: build the pipeline end-to-end while exploring different retrieval and evaluation techniques against one realistic example use case.

What I Built End-to-End

I implemented an end-to-end flow from source discovery and ingestion to benchmark comparison, organized into clear layers.

Bronze: Discovery, Ingestion, and Raw Collection

The Bronze layer exists because legal public data is not available through one clean, structured channel. The first design decision was therefore to treat ingestion as a multi-path acquisition problem instead of assuming one canonical feed.

The sitemap, search-intent, and direct-URL paths all converge on the same web-scraper ingestion logic; sitemap and search intent only add discovery steps in front. Search-intent discovery starts from legal-style user phrasing and filters the results into usable source content. PDFs and uploaded documents use a separate file-intake flow that does not go through the web scraper.

Every ingestion path converges into one raw Bronze table before Silver-specific structuring begins:

table: raw_documents_bronze
source_type: sitemap | search_intent | direct_url | s3_pdf
source_locator: canonical URL or file key
raw_markdown: cleaned page or extracted document text
content_hash: stable change detection hash
fetched_at: ingestion timestamp
silver_status: ready_for_normalization

Metadata shaping, quality flags, and chunk generation happen in the Silver layer described in the next section.

The Bronze layer currently supports multiple entry paths:

sitemap ingestion
search-intent URL discovery
direct URL ingestion
S3-based document drop flows

I chose this setup because the corpus has to be assembled opportunistically. Some useful sources expose clean sitemaps. Others are only reachable through targeted search queries. Some materials are available only as raw PDFs or manually collected files. A single ingestion strategy would have made corpus growth too brittle.
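As a minimal sketch of how the four entry paths could share one record shape, the snippet below models a Bronze row in Python. The field names follow the raw_documents_bronze layout; the SourceType enum, the BronzeRecord class, the make_record helper, and the choice of SHA-256 are illustrative assumptions, not the project's actual code.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class SourceType(str, Enum):
    """The four entry paths named in the article."""
    SITEMAP = "sitemap"
    SEARCH_INTENT = "search_intent"
    DIRECT_URL = "direct_url"
    S3_PDF = "s3_pdf"


@dataclass(frozen=True)
class BronzeRecord:
    """Hypothetical in-memory shape of one raw_documents_bronze row."""
    source_type: SourceType
    source_locator: str   # canonical URL or file key
    raw_markdown: str     # cleaned page or extracted document text
    content_hash: str     # stable change-detection hash (SHA-256 assumed)
    fetched_at: datetime  # ingestion timestamp
    silver_status: str = "ready_for_normalization"


def make_record(source_type: SourceType, locator: str, markdown: str) -> BronzeRecord:
    """Build one Bronze row; every ingestion path would call the same helper."""
    digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
    return BronzeRecord(
        source_type=source_type,
        source_locator=locator,
        raw_markdown=markdown,
        content_hash=digest,
        fetched_at=datetime.now(timezone.utc),
    )
```

Keeping the record shape identical across paths is what lets Silver stay unaware of where a document came from, apart from the source_type field.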

The first priority was to make source expansion cheap. Sitemaps and direct URLs gave me a reliable base of trusted, crawlable sources. PDF and raw-file intake covered the document types that are common in legal education and public reference material. Search-intent discovery became the scaling mechanism once the initial source pool was in place.

The main tradeoff in Bronze is breadth versus noise. SEO-driven discovery scales well and keeps the corpus closer to real user phrasing, but it also brings in weaker sources and more irrelevant pages. That is acceptable at this stage because I would rather have a pipeline that can grow and then be filtered than a narrow corpus that looks clean but cannot expand.

Beyond source intake, Bronze also handles the regular crawler lifecycle:

track the last refresh timestamp per source so recrawls can be scheduled incrementally
skip unnecessary fetches when a source was refreshed recently or no change is detected
download source content and clean it into normalized Markdown files
persist a content hash so changes can be detected quickly and only updated documents move forward

That part of the design matters because iteration speed is part of the product here. If ingestion is too expensive or too manual, evaluation loops slow down and retrieval tuning becomes guesswork.
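The refresh and change-detection steps of that crawler lifecycle can be sketched roughly as follows. This is an assumption-laden illustration: the seven-day refresh interval, the whitespace normalization before hashing, and the function names are hypothetical, and the real pipeline may decide differently.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed value for illustration; a real setup would tune this per source.
MIN_REFRESH_INTERVAL = timedelta(days=7)


def content_hash(markdown: str) -> str:
    """Hash the cleaned Markdown, ignoring purely cosmetic whitespace changes."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def should_fetch(last_refreshed: Optional[datetime], now: datetime) -> bool:
    """Skip the download when the source was refreshed recently."""
    if last_refreshed is None:
        return True  # never crawled before
    return now - last_refreshed >= MIN_REFRESH_INTERVAL


def has_changed(previous_hash: Optional[str], new_markdown: str) -> bool:
    """Only documents whose hash differs from the stored one move forward."""
    return previous_hash != content_hash(new_markdown)
```

A crawler loop would call should_fetch before downloading and has_changed after cleaning, so unchanged documents never reach Silver again.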

The next Bronze improvements are mostly about source quality control: tighter domain filtering, better source scoring, and stronger rules for deciding which discovered documents are worth carrying into Silver.

Silver: Normalization and Chunking

The Silver layer turns noisy raw text into a retrieval asset. Its job is not just cleanup. It is the point where the corpus becomes structured enough to support ranking, filtering, and later diagnosis.

The current transformation includes:

normalization
metadata shaping
chunk generation

Silver transformation: from Bronze rows to retrieval-ready chunks

Silver is where one cleaned Bronze document becomes a structured asset that retrieval can rank, filter, and inspect later.

Bronze input (raw_documents_bronze): each row arrives as one canonical raw document with source information, cleaned markdown, and a stable content hash.

document_id: bronze_1842
source_type: search_intent
raw_markdown: "Der Unternehmer muss erkennbare Bedenken ..."
content_hash: 8b9f...e21

Step 1, normalize content: tighten headings, remove leftover page noise, and preserve the document structure that matters for later diagnosis.

headings cleaned
section order preserved
parser noise reduced

Step 2, shape metadata: attach fields that make retrieval and debugging more precise, such as keywords, legal references, and quality signals.

keywords: werkvertrag, hinweispflicht
paragraph_refs: § 633 BGB, § 241 Abs. 2 BGB
quality_flags: weak_heading_map

Step 3, generate chunks: split the normalized document into coherent retrieval units so ranking works on meaningful legal passages instead of raw page-sized text.

chunk_01: "Der Unternehmer muss erkennbare Bedenken ..."
chunk_02: "Wird die Entwässerung nicht mitgedacht ..."
chunk_count: 6

Silver output (retrieval-ready chunk records): the result is a chunk-level asset with structured metadata, better observability, and the shape needed for downstream ranking and filtering.

document_id: bronze_1842
chunk_id: bronze_1842_03
chunk_text: "Wird die Entwässerung nicht mitgedacht ..."
keywords: werkvertrag, entwässerung
paragraph_refs: § 633 BGB, § 241 Abs. 2 BGB
quality_flags: weak_heading_map

I deliberately put metadata extraction and chunk generation in the same stage because legal retrieval depends heavily on structure. It is not enough to store text embeddings alone. I want the system to know, where possible, which legal paragraphs are mentioned, which keywords characterize the chunk, and which quality issues appeared during parsing.
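To make the paragraph-reference part concrete, here is a small regex-based sketch that pulls references such as "§ 633 BGB" or "§ 241 Abs. 2 BGB" out of a chunk. It is an assumption about how such extraction could look, not the project's extractor; the pattern and the function name are hypothetical, and it deliberately ignores harder forms such as "§§" ranges, "Satz", or "Nr." qualifiers.

```python
import re

# Simplified pattern for German statute references (illustrative only).
PARAGRAPH_REF = re.compile(
    r"§\s*(?P<number>\d+[a-z]?)"            # section number, e.g. 633 or 312a
    r"(?:\s+Abs\.\s*(?P<absatz>\d+))?"      # optional subsection, e.g. "Abs. 2"
    r"\s+(?P<code>(?:[A-Z][a-z]*){2,6})\b"  # statute abbreviation, e.g. BGB, StGB
)


def extract_paragraph_refs(text: str) -> list[str]:
    """Return normalized, de-duplicated references in order of first mention."""
    refs: list[str] = []
    for match in PARAGRAPH_REF.finditer(text):
        ref = f"§ {match['number']}"
        if match["absatz"]:
            ref += f" Abs. {match['absatz']}"
        ref += f" {match['code']}"
        if ref not in refs:
            refs.append(ref)
    return refs
```

Normalizing the spacing at extraction time means a chunk that writes "§241 Abs. 2 BGB" and one that writes "§ 241 Abs. 2 BGB" end up with the same metadata value, which matters if the field is later used for filtering.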

The main tradeoff here is cost versus coherence. I chose LLM-based semantic chunking instead of naive token windows because legal explanations often break badly when split mechanically. The more coherent chunk boundaries are worth the added cost at this phase because retrieval quality is still a larger bottleneck than processing efficiency. That tradeoff is consistent with common RAG chunking guidance: chunking strategy materially affects retrieval quality, and the right choice depends on document structure, cost, and downstream retrieval behavior (RAG chunking phase).
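For contrast with the LLM-based chunker, the sketch below shows the two cheaper baselines such a comparison would typically start from: a naive fixed-size window and a paragraph-packing splitter. Neither is the method used in this project; both functions and the max_chars budget are assumptions for illustration.

```python
def chunk_fixed_window(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_paragraphs(markdown: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks without splitting inside a paragraph.

    A single paragraph longer than max_chars is kept intact as its own chunk.
    """
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # Start a new chunk when this paragraph would exceed the budget.
        if current and current_len + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The fixed window will happily cut a legal argument in the middle of a sentence, while the paragraph packer at least respects author-made boundaries. Semantic chunking goes one step further by deciding boundaries from meaning rather than layout, which is where the added cost comes from.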

The next Silver iteration is to tighten chunk and metadata quality. That includes better handling of noisy page structure, stronger paragraph-reference extraction, and more explicit quality flags for weak documents before they distort retrieval results.

Retrieval Layer

For retrieval, I wanted a baseline that is broad enough to compare approaches without pretending that one method already won. That is why I implemented vector, lexical (BM25), and hybrid retrieval paths instead of optimizing one retrieval mode too early.

This matters in legal research because lexical overlap still carries real signal for statute names, paragraph references, and recurring legal terminology, while semantic retrieval helps when the query and source use different wording. Hybrid retrieval is the practical compromise, not a theoretical preference, and it is also a standard public production pattern rather than a project-specific invention (Hybrid search | Elastic Docs).
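One common way to merge the two signals is a weighted sum of normalized scores, sketched below. The article does not spell out its fusion formula, so this is an assumption: alpha is treated as the weight on the vector score (with 1 - alpha on BM25), min-max normalization is one choice among several, and the function names are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and similarity scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    alpha: float = 0.7,
    top_k: int = 6,
) -> list[tuple[str, float]]:
    """Fuse both result sets; a chunk missing from one list scores 0 there."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    # Sort by fused score, breaking ties by id so the order is deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Under this convention alpha=1.0 degenerates to vector-only ranking and alpha=0.0 to BM25-only ranking, which makes the weighting easy to reason about when comparing runs.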

The current tradeoff is simplicity versus ranking quality. The system can already compare modes and parameters, but ranking is still intentionally basic. I have not yet added the deeper reranking and filtering logic that would be needed to claim retrieval is fully tuned.

The next retrieval work is straightforward: tune top-k, tune the hybrid weighting, improve candidate filtering, and introduce reranking once the corpus quality is stable enough that those changes are worth measuring. That second-stage reranking pattern is also well established in public retrieval stacks (Retrieve & Re-Rank | Sentence Transformers).

Retrieval modes explored in this article

The retrieval layer was intentionally kept broad first so lexical, semantic, and merged strategies could be compared before deeper ranking logic was added. Each mode captures a different signal. The current article shows why hybrid was the most practical compromise rather than an assumption made up front.

Vector: semantic retrieval when the query and source use different wording.
BM25: lexical retrieval for statutes, paragraph references, and recurring legal terminology.
Hybrid: practical merge of lexical and semantic signals. Best current lift in this article.

Current tuning levers:

top-k
alpha / hybrid weighting
candidate filtering
reranking (next step)

This layer is still exploratory rather than fully optimized. Ranking logic is still simple, and retrieval quality still depends heavily on chunk and metadata quality coming out of Silver.

Evaluation Layer

The evaluation layer is where I wanted the project to be stricter than a normal prototype. A legal AI system is easy to overstate if the benchmark design is weak, so I built evaluation in from the start instead of treating it as documentation after the fact.

I added repeatable workflows to compare:

benchmark-only generation
retrieval-augmented generation

Both notebook and scripted execution paths exist. That is deliberate: notebooks are useful for close inspection of individual research runs, while scripted runs are necessary if I want repeated comparisons to stay consistent across benchmark slices. That kind of dataset-based offline evaluation and experiment comparison is standard practice for LLM systems, not an extra presentation layer added after the fact (LangSmith Evaluation).

I also push experiments and datasets to LangSmith so runs can be inspected at trace level. That makes it easier to see not just whether RAG won or lost, but where the failure happened: retrieval, generation, or the benchmark setup itself.

The biggest tradeoff in evaluation has been realism versus cleanliness. I tested different dataset options and evaluation paths:

Legal blog and forum material looked promising at first, but collecting high-quality labels and validating correctness at scale was too costly for this phase.
Law-student exam prompts provided clear target answers, but benchmark-only runs reached near-100% performance, a strong sign of leakage or memorization effects from public training data.
That result reinforced an important follow-up question: how much proprietary legal commentary can models already reproduce without explicit retrieval context? I will tackle this in a separate experiment.

The most useful current option came from a Hugging Face dataset: DomainLLM/gerlayqa-bgb-paraphrased. It contains a large set of practical legal research-style prompts and reference answers across domains, often with law references, and is a better fit for this use case than academic exam-style prompts.

The next evaluation improvement is not mainly about more runs. It is about better benchmark hygiene: reducing leakage risk further, segmenting results by research task type, and making failure analysis easier to inspect than a single aggregate score.

Benchmark paths explored in this article

The evaluation work explored not just benchmark vs RAG, but also which dataset choices and execution modes were realistic enough to trust and repeat. The main question was not only “does RAG help?” but also “which benchmark setup is clean enough to measure that honestly?”

Workflows:

Benchmark only: generation without retrieval as the control path.
RAG: retrieval-augmented generation with fixed retrieval settings.

Datasets:

Blog / forum QA: promising, but too expensive to label and validate well at this phase.
Law-student exams: clear targets, but benchmark-only performance looked too strong and likely leaked.
GerLayQA: most useful current benchmark option and the main comparison set used here.

Execution and observability:

notebooks for inspection
scripts for repeatable runs
LangSmith for trace-level failure analysis

How I Measure Progress

One thing I wanted to avoid in this project was vague progress reporting. The pipeline is set up so I can inspect improvement at three levels instead of relying on a single impressionistic demo.

1. Corpus and pipeline quality

At the ingestion and normalization level, I care about whether the corpus is actually becoming more useful:

how many sources and documents are available per ingestion path
how many documents make it successfully into the normalized Silver layer
how many chunks contain useful metadata such as keywords or legal paragraph references

These metrics matter because retrieval quality in this domain is heavily constrained by source coverage and document quality long before model choice becomes the main factor.
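A rough sketch of how those three corpus-level numbers could be computed from in-memory rows is shown below. The dictionary shapes, the corpus_metrics function, and the "normalized" status value are assumptions made for the example; the article only documents the ready_for_normalization status.

```python
from collections import Counter


def corpus_metrics(documents: list[dict], chunks: list[dict]) -> dict:
    """Summarize source coverage, Silver throughput, and metadata coverage."""
    docs_per_path = Counter(d["source_type"] for d in documents)
    # "normalized" is a hypothetical status for rows that reached Silver.
    normalized = sum(1 for d in documents if d.get("silver_status") == "normalized")
    # A chunk counts as covered when it has keywords or paragraph references.
    with_meta = sum(
        1 for c in chunks if c.get("keywords") or c.get("paragraph_refs")
    )
    return {
        "documents_per_path": dict(docs_per_path),
        "normalized_ratio": normalized / len(documents) if documents else 0.0,
        "chunk_metadata_coverage": with_meta / len(chunks) if chunks else 0.0,
    }
```

Tracking these ratios per ingestion run makes it visible when a new source path adds many documents but few usable chunks.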

2. Retrieval quality

At the retrieval level, I compare vector, BM25, and hybrid retrieval rather than assuming semantic search is automatically better.

The core questions are:

which retrieval mode returns the most useful context for legal research questions
whether hybrid retrieval improves grounding without adding too much noise
how sensitive the system is to top-k, candidate limits, and hybrid weighting
whether metadata such as paragraph mentions can be used to filter or rerank results more effectively

This is the layer where I expect a lot of the eventual quality lift to come from, so I want the article to show retrieval behavior directly, not just final answer scores.
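The metadata question in particular lends itself to a simple experiment: boost candidates whose paragraph references overlap with references found in the query. The sketch below is one hypothetical way to do that; the candidate dictionary shape, the additive boost, and its default value are assumptions, not something the current system implements.

```python
def rerank_by_paragraph_refs(
    query_refs: list[str],
    candidates: list[dict],
    boost: float = 0.2,
) -> list[dict]:
    """Add a fixed boost per paragraph reference shared with the query.

    Each candidate is a dict with 'chunk_id', 'score', and 'paragraph_refs'.
    The input list is not modified; rescored copies are returned.
    """
    wanted = set(query_refs)
    rescored = []
    for cand in candidates:
        overlap = len(wanted & set(cand.get("paragraph_refs", [])))
        rescored.append({**cand, "score": cand["score"] + boost * overlap})
    # sorted() is stable, so candidates with equal scores keep their order.
    return sorted(rescored, key=lambda c: c["score"], reverse=True)
```

Because the boost is additive and small, it only reorders candidates that retrieval already considered close, which keeps the effect easy to measure against the unboosted run.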

3. Research output quality

At the output layer, I compare benchmark-only generation against retrieval-augmented generation across four dimensions:

correctness
completeness
structure
grounding

Those dimensions are scored consistently across runs so I can compare benchmark and RAG outputs on the same task set instead of relying on anecdotal examples.
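The arithmetic behind a weighted total and its delta can be sketched as follows. The article does not state the actual weights or scoring scale, so the weights below are placeholders chosen only to make the example run, and the function names are hypothetical.

```python
from statistics import mean

# Placeholder weights: the real weighting is not published in the article.
WEIGHTS = {
    "correctness": 0.4,
    "completeness": 0.3,
    "structure": 0.1,
    "grounding": 0.2,
}


def weighted_total(scores: dict[str, float]) -> float:
    """Combine per-dimension scores for one answer into a single number."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def delta_vs_benchmark(benchmark_runs: list[dict], rag_runs: list[dict]) -> float:
    """Mean weighted total of the RAG answers minus that of the benchmark answers."""
    rag = mean(weighted_total(s) for s in rag_runs)
    base = mean(weighted_total(s) for s in benchmark_runs)
    return rag - base
```

Scoring both paths on the same task set with the same weights is what makes the delta interpretable; changing the weights between runs would make snapshots incomparable.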

The evaluation entrypoint is simple enough to rerun with fixed settings, which makes it practical to publish comparable snapshots later:

python scripts/run_gerlayqa_evaluation.py \
  --include-rag \
  --rag-retrieval-mode hybrid \
  --rag-source-id search_intent_discovery \
  --rag-top-k 6 \
  --rag-alpha 0.7
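To rerun that entrypoint over a small set of settings, a helper along these lines could generate the command lines. It only uses the flags shown in the command above; the one-at-a-time variation around k=6 and alpha=0.7 and the helper names are assumptions, and the sketch builds the commands without executing them.

```python
import shlex

BASE_TOP_K, BASE_ALPHA = 6, 0.7


def build_command(top_k: int, alpha: float) -> list[str]:
    """Assemble one evaluation call using only the flags shown in the article."""
    return [
        "python", "scripts/run_gerlayqa_evaluation.py",
        "--include-rag",
        "--rag-retrieval-mode", "hybrid",
        "--rag-source-id", "search_intent_discovery",
        "--rag-top-k", str(top_k),
        "--rag-alpha", str(alpha),
    ]


def one_at_a_time_sweep(top_ks=(4, 8), alphas=(0.5, 0.85)) -> list[list[str]]:
    """Vary one parameter at a time around the base setting."""
    settings = [(BASE_TOP_K, BASE_ALPHA)]
    settings += [(k, BASE_ALPHA) for k in top_ks]
    settings += [(BASE_TOP_K, a) for a in alphas]
    return [build_command(k, a) for k, a in settings]


if __name__ == "__main__":
    for cmd in one_at_a_time_sweep():
        print(shlex.join(cmd))  # print only; pass cmd to subprocess.run to execute
```

Varying one parameter at a time keeps the number of runs small and makes it clear which lever caused a change, at the cost of missing interactions between top-k and alpha.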

I also log runs to LangSmith so I can inspect traces, retrieved chunks, and output behavior at the individual-task level when an aggregate score changes.

Why I Still Favor the SEO-First Approach

Even with only modest gains so far, I still consider SEO-first ingestion a valuable strategy for this stage. It gives me a clear growth path and high control over corpus expansion, which matters in a domain where public data is scarce and fragmented.

It is easy to automate, easy to scale, and easy to adjust based on observed relevance. Most importantly, it keeps the ingestion loop aligned with real user-intent queries, which is likely to matter for retrieval quality over time.

Current Results: Small But Real RAG Lift

I now have three stable 10-question comparisons on fixed GerLayQA slices: Werkvertrag, Schuldrecht B2B, and GBR. I am keeping the GerLayQA name visible in public copy because it is the actual benchmark currently backing these comparisons, but it should be read as the current baseline benchmark for these research slices rather than the full shape of the long-term workflow. The results diverge enough that averaging them into one number hides the useful story, so I now show them separately instead of squashing them together.

Overall, the lift is still not dramatic, but it is enough to show that retrieval can help in this setup when it is tuned carefully. Just turning retrieval on is not sufficient, and the best settings are not equally portable across slices.

Here is the current comparison:

Slice: Werkvertrag
Setup	Weighted total	Delta vs benchmark
Benchmark only	2.770	0.000
Hybrid k=6 · α=0.7	3.105	+0.335
Hybrid k=4 · α=0.7	2.895	+0.125
Hybrid k=8 · α=0.7	3.055	+0.285
Hybrid k=6 · α=0.5	2.930	+0.160
Hybrid k=6 · α=0.85	3.090	+0.320
Vector only	2.850	+0.080
BM25 only	2.770	0.000

What stands out: this slice responds clearly to retrieval tuning. The best hybrid run improves correctness, completeness, normalized precision, structure, and grounding at the same time, with the strongest gain coming from correctness.

Slice: Schuldrecht B2B
Setup	Weighted total	Delta vs benchmark
Benchmark only	2.855	0.000
Hybrid k=6 · α=0.7	2.855	0.000
Hybrid k=4 · α=0.7	2.680	-0.175
Hybrid k=8 · α=0.7	2.895	+0.040
Hybrid k=6 · α=0.5	2.805	-0.050
Hybrid k=6 · α=0.85	2.750	-0.105
Vector only	2.855	0.000
BM25 only	2.830	-0.025

What stands out: this slice is much flatter. Only the wider `top_k=8` hybrid run produces a small lift. That run improves correctness, completeness, and normalized precision slightly, keeps structure flat, and gives back a bit of grounding.

Slice: GBR
Setup	Weighted total	Delta vs benchmark
Benchmark only	2.900	0.000
Hybrid k=6 · α=0.7	2.915	+0.015
Hybrid k=4 · α=0.7	2.910	+0.010
Hybrid k=8 · α=0.7	2.940	+0.040
Hybrid k=6 · α=0.5	2.700	-0.200
Hybrid k=6 · α=0.85	2.990	+0.090
Vector only	3.020	+0.120
BM25 only	2.760	-0.140

What stands out: this slice behaves differently again. The best run is `vector only`, which improves correctness, completeness, and normalized precision, keeps structure flat, and gives back a little grounding. Hybrid helps only in a narrow range here, while BM25 and low-alpha hybrid both regress clearly.

Three things stand out from this split view:

Werkvertrag is where hybrid retrieval currently helps most. The best run there is hybrid at k=6 and alpha=0.7, with a +0.335 weighted lift over the benchmark.
Schuldrecht B2B is much flatter. The best run there is hybrid at k=8 and alpha=0.7, but the gain is only +0.040, and several nearby settings regress below the benchmark.
GBR shifts the picture again. Its best run is vector only at +0.120, which suggests the best retrieval mode is still slice-specific rather than stable across the whole benchmark.
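The per-slice reading above can be reproduced directly from the three tables. The snippet below copies the weighted totals from this article and picks the best non-benchmark setup per slice; the RESULTS structure and the best_setup helper are illustrative names, not part of the evaluation scripts.

```python
BENCHMARK = "Benchmark only"

# Weighted totals copied from the three result tables in this article.
RESULTS = {
    "Werkvertrag": {
        BENCHMARK: 2.770, "Hybrid k=6 · α=0.7": 3.105, "Hybrid k=4 · α=0.7": 2.895,
        "Hybrid k=8 · α=0.7": 3.055, "Hybrid k=6 · α=0.5": 2.930,
        "Hybrid k=6 · α=0.85": 3.090, "Vector only": 2.850, "BM25 only": 2.770,
    },
    "Schuldrecht B2B": {
        BENCHMARK: 2.855, "Hybrid k=6 · α=0.7": 2.855, "Hybrid k=4 · α=0.7": 2.680,
        "Hybrid k=8 · α=0.7": 2.895, "Hybrid k=6 · α=0.5": 2.805,
        "Hybrid k=6 · α=0.85": 2.750, "Vector only": 2.855, "BM25 only": 2.830,
    },
    "GBR": {
        BENCHMARK: 2.900, "Hybrid k=6 · α=0.7": 2.915, "Hybrid k=4 · α=0.7": 2.910,
        "Hybrid k=8 · α=0.7": 2.940, "Hybrid k=6 · α=0.5": 2.700,
        "Hybrid k=6 · α=0.85": 2.990, "Vector only": 3.020, "BM25 only": 2.760,
    },
}


def best_setup(slice_results: dict[str, float]) -> tuple[str, float]:
    """Return the best retrieval setup and its delta against the benchmark."""
    baseline = slice_results[BENCHMARK]
    name, score = max(
        ((n, s) for n, s in slice_results.items() if n != BENCHMARK),
        key=lambda item: item[1],
    )
    return name, round(score - baseline, 3)


if __name__ == "__main__":
    for slice_name, rows in RESULTS.items():
        print(slice_name, *best_setup(rows))
```

Keeping the slices separate in the data structure mirrors the point of the split view: a single averaged number would hide that the winning mode changes from slice to slice.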

Looking at the dimensions, Werkvertrag improves across correctness, completeness, normalized precision, structure, and grounding. Schuldrecht B2B is much more mixed: its best run improves correctness, completeness, and normalized precision slightly, keeps structure flat, and gives back a bit of grounding. GBR sits in between: the best vector run improves correctness, completeness, and normalized precision, keeps structure flat, and sacrifices a bit of grounding.

The practical interpretation is straightforward: retrieval quality matters more than the mere presence of retrieval. The current evidence does not support a broad claim that “RAG beats the benchmark” in this domain by default. It does support the narrower claim that tuned retrieval (hybrid on Werkvertrag, vector only on GBR) can produce a measurable, if still modest, improvement.

What Is Already Valuable

Even with limited uplift so far, this phase produced durable value:

A complete and reproducible data-to-evaluation loop now exists.
Experiment cycles are faster because ingestion, retrieval, and eval are connected.
The repository and workflow are cleaner and easier to extend.
I can now isolate bottlenecks instead of guessing where quality breaks.

In other words: I now have a real system, not disconnected notebooks and one-off scripts.

What Are The Next Steps

This kickoff established the baseline. The next work is straightforward and can now be tracked as a concrete todo list for the next iteration of the end-to-end RAG setup.

Integrate judgments as a first-class source type and measure their impact on grounding and harder legal research questions.

Test commentary reproducibility to evaluate how much proprietary commentary current models can already reproduce from pretraining and memorization, and how that can be used safely in the pipeline.

Build a stronger agent workflow: add a commentary sub-agent, a second retrieval pass using metadata filters, a direct legal-code lookup step for cited paragraphs, and a reviewer loop to compare answer quality with and without RAG.

Tune retrieval quality: focus on top-k and alpha settings, candidate generation and reranking, deduplication and relevance filtering, and chunk / metadata quality.

Explore reranking more explicitly: compare lightweight and stronger reranking approaches after initial retrieval to see whether candidate ordering, grounding, and final answer quality improve measurably.

Final Thought

The core takeaway from this kickoff is straightforward:

I did not prove that public-data RAG out of the box beats the benchmark right away. I did prove that I can run a full, repeatable RAG cycle on a realistic end-to-end use case and now improve it systematically.

That is the right starting point for the next experiments.

References

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33 (2020). https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
Databricks. “What is the medallion lakehouse architecture?” https://docs.databricks.com/aws/en/lakehouse/medallion
Microsoft Learn. “RAG chunking phase.” https://learn.microsoft.com/azure/architecture/ai-ml/guide/rag/rag-chunking-phase
Microsoft Learn. “Retrieval-augmented generation (RAG) in Azure AI Search.” https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview
Elastic Docs. “Hybrid search.” https://www.elastic.co/docs/solutions/search/hybrid-search
Sentence Transformers. “Retrieve & Re-Rank.” https://www.sbert.net/examples/applications/retrieve_rerank/README.html
LangSmith Docs. “LangSmith Evaluation.” https://docs.langchain.com/langsmith/evaluation