I love those site features!
In a submission from a few days ago there was something similar.
I love it when a website gives a hint to the old web :)
I don't want to go into detail, but I 100% agree with the author's conclusion: data is key. Data ingestion, to be precise. Simply running PDFs through docling to get Markdown and letting a vector database do the rest is ridiculous.
For example, for a high-precision RAG that had to report pricing with 100% accuracy, it took me a week to build an ETL for a 20-page PDF document, splitting the information between SQL and a graph database.
And that was just a small step, with all the tweaking that lay ahead to ensure exceptional results.
Which search algorithm, or how many? Which embeddings, at what quality? Semantics: how, and which exactly?
Believe me, RAG is the finest of technical masterpieces there is. I have so much respect for the folks at OpenAI and Anthropic for the ingestion processes and tools they use, because they operate on a level I will never touch with my RAG implementations.
RAG is really something you should try for yourself if you love solving tricky fundamental problems that can, in the end, provide a lot of value to you or your customers.
Simply don't believe the hype and ignore all "install and embed" solutions. They are crap, sorry to say so.
It isn't really "AI" in the way ongoing LLM conversations are. The context is effectively controlled by deterministic information, and as LLMs continue to improve through various context-related techniques (re-prompting, running multiple models, etc.), that deterministic "re-basing" of context will stifle the output.
So I say over time it will be treated as less and less "AI" and more "AI adjacent".
The significance is that right now RAG is largely considered an "AI pipeline strategy" in its own right, compared to others that involve pure context engineering.
But when the context size of LLMs grows much larger (with integrity), when a model can, say, accurately hold many thousands of lines of code in context without having to use RAG to search and find, it will be doing a lot more for us. We will get the agentic automation they are promising and not delivering (due to this current limitation).
Also I wonder if it's now better to use Claude Agent SDK instead of RAG. If anyone has tried this, I would be interested in hearing more.
It's based on different chunking strategies that scale cheaply, plus advanced retrieval.
€184 is loose change after spending three man-weeks working on the process!
That may be, but then there's an entire law library, the entirety of Wikipedia (and the example in this article of 451 GB). Surely those are at least an order of magnitude larger than Tolkien's prose and might still benefit from a RAG.
Did you look at Turbopuffer btw?
ZotAI can use LMStudio (for embeddings and LLM models), but at that time, ZotAI was super slow and buggy.
Instead of going through the valley of sorrows (as threatofrain shared in the blog post - thanks for that), is there a more or less out-of-the-box solution (paid or free) for the demand (RAG for local literature review support)?
*If I am honest, it was rather a procrastination exercise, but this is for sure relatable for readers of HN :-D
I would guess the ingestion pain is still the same.
This new world is astounding.
I've already got the data mostly structured because I did some research on the fandom last year, charting trends and such, so I don't even need to massage the data. I've got authors, dates, chapters, reader comments, and full text already in a local SQLite db.
RAG is Dead! Long Live Agentic RAG! || Long Live putting stuff in databases where it damn well belongs!
I think you agree with the people saying RAG is Dead, or at least you agree with me and I say RAG is Dead, when you say "Simply using docling and transforming PDFs to markdown and have a vector database doing the rest is ridiculous."
I fully agree, but that was the promise of RAG: chunk your documents into little bits, find the bit that is closest to the user's query, and add it to the context, maybe leaving a little overlap on the chunks. That is how RAG was initially presented, and it's how many vendors implement it; I'm looking at tools like Amazon Bedrock Knowledge Bases here.
When I want to know the latest <important financial number>, I want it pulled from the source of truth for that data, not to hope that some document chunk contains the latest number and not last year's.
So when people, or at least when I, say RAG is Dead, it's shorthand for: this is really damn complex, and vector search doesn't replace decades of information theory and of storage and retrieval patterns.
Hell, I've worked with teams trying to extract everything from databases to push it into vector stores so the LLM can use the data. First, it often failed: chunks contained multiple rows of data, and the LLM got confused about which row actually mattered; they hadn't realized that the full chunk would be returned, not just the row they were interested in. Second, the use cases these teams were working on were usually well defined, meaning the required data could be determined deterministically before going to the LLM and pulled from a database with a simple script, no similarity search required. But that's not the cool way to do it.
If you pick Elasticsearch, useful as it is, you now have more than two problems. You have Elastic the company; Elasticsearch the tool; and also the clay-footed colossus, Java, to contend with.
Hardest part is always figuring out that your company's knowledge management has been dogsh!t for years, so now you need to either throw most of it away or somehow stick to the authoritative stuff.
Elastic plus an agent with MCP may have worked very quickly as a prototype here, but hosting costs for 500 GB worth of indexes sound too expensive for this person's use case if $185 is a lot.
The RAG system you mentioned is just RAG done badly, but doing it properly doesn't require a fundamentally different technique.
That hasn't changed, nor do I think it will, even with models having very large context windows (e.g. Gemini's 2M). It has been observed that a large context alone is not enough, and that it is better to give the model sufficient, high-quality information than to fill it with virtually everything. The latter is also impossible, and it does not scale to long, complicated tasks where reaching the context limit is inevitable. In that case you need a RAG smart enough to extract the essential information from previous answers and context and make it part of the new context, which in turn lets the model keep its performance at a satisfactory level.
RAG made sense when semantic search was driven by human input and happened as a workflow step before populating the context. Now it happens inside the agentic loop, and the LLM already implicitly has the semantics of the user input.
At some point, this is a distributed system of agents.
Once you go from 1 to 3 agents (1 router and two memory agents), it slowly ends up becoming a performance and cost decision rather than a recall problem.
When any given document can fit into context, and when we can generate highly mission-specific summarization and retrieval engines (for which large amounts of production data can be held in context as they are being implemented)... is the way we index and retrieve still going to be based on naive chunking, and off-the-shelf embedding models?
For instance, a system that reads every article and continuously updates a list of potential keywords with each document and the code assumptions that led to those documents being generated, then re-runs and tags each article with those keywords and weights, and does the same to explode a query into relevant keywords with weights... this is still RAG, but arguably a version where dimensionality is closer tied to your data.
(Such a system, for instance, might directly intuit the difference in vector space between "pet-friendly" and "pets considered," or between legal procedures that are treated differently in different jurisdictions. Naive RAG can throw dimensions at this, and your large-context post-processing may just be able to read all the candidates for relevance... but is this optimal?)
I'm very curious whether benchmarks have been done on this kind of approach.
there are a few other local apps with simple knowledge base type things you can use with pdfs. cherry studio is nice, no reranking though.
1. Don't believe the pundits of RAG. They never implemented one.
I did, many times, and boy, are they hard, with so many options that decide between utterly crappy results and fantastic accuracy scores, up to a perfect 100% on facts.
In short: RAG is how you fill the context window. But then what?
2. How does a super-large context window solve your problem? Context windows aren't the problem; accurately matching the requirements is. What do you expect your inquiry to solve? You can have the greatest context window ever, but then what? No prompt engineering is coming to save you if you don't know what you want.
In very simple terms, RAG is simply a search engine. The context window was never the problem. Never. Filling the context window, finding the relevant information, is one problem, and even that is only part of the solution.
What if your inquiry needs a combination of multiple sources to make sense? There is never a 1:1 matching of information.
"How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"
Have fun, this is a simple request.
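To make the contrast concrete: that request is fully deterministic, and a few lines of SQL answer it exactly, no embeddings involved. A toy sketch using Python's built-in sqlite3 (schema and rows entirely invented for illustration):

```python
import sqlite3

# Toy schema standing in for a real vehicle database (all data invented).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cars (
        model TEXT, year INTEGER, ps INTEGER, fuel TEXT,
        color TEXT, markets TEXT, make TEXT, series TEXT
    )
""")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("E 300",  1992, 170, "petrol", "blue", "USA,Germany", "Mercedes", "E"),
        ("E 250D", 1993, 113, "diesel", "blue", "USA,Germany", "Mercedes", "E"),
        ("190E",   1984, 122, "petrol", "blue", "Germany",     "Mercedes", "190"),
        ("E 280",  1995, 193, "petrol", "blue", "USA,Germany", "Mercedes", "E"),
    ],
)

# The "simple request" as one deterministic query.
(count,) = conn.execute("""
    SELECT COUNT(*) FROM cars
    WHERE (year BETWEEN 1980 AND 1985 OR year BETWEEN 1990 AND 1997)
      AND ps BETWEEN 100 AND 180
      AND fuel != 'diesel'
      AND color = 'blue'
      AND markets LIKE '%USA%' AND markets LIKE '%Germany%'
      AND make = 'Mercedes' AND series = 'E'
""").fetchone()
print(count)  # only "E 300" satisfies every constraint here
```

Vector search can at best hope to retrieve chunks that mention some of these attributes; the relational query answers the question itself.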
For example there's evidence that typical use of AGENTS.md actually doesn't improve outcomes but just slows the LLMs down and confuses them.
In my personal testing and exploration I found that small (local) LLMs perform drastically better, both in accuracy and speed, with heavily pruned and focused context.
Just because you can fill in more context, doesn't mean that you should.
The worry I have is that common usage will lead to LLMs being trained and fine-tuned to accommodate ways of using them that don't make a lot of sense (stuffing context, wasting tokens, etc.), just because that's how most people use them.
I don't need a coding model to be able to give me an analysis of the Declaration of Independence in Urdu from 'memory', and the price in RAM for being able to do that, impressive as it is, is an inefficiency.
Typically you would have a reindex process, and you keep track of hashes of chunks to check if you’ve already calculated this exact block before to avoid extra costs. And then run such a reindex process pretty frequently as it’s cheap / costs nothing when there are no changes.
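A minimal sketch of that dedupe-by-hash step (fake_embed stands in for whatever paid embedding call you use, and persisting seen_hashes between runs is left out):

```python
import hashlib

seen_hashes: dict[str, list[float]] = {}  # would be persisted between runs
embed_calls = 0

def fake_embed(text: str) -> list[float]:
    # Stand-in for a real (metered) embedding API call.
    global embed_calls
    embed_calls += 1
    return [float(len(text))]

def index_chunk(chunk: str) -> list[float]:
    # Hash the chunk content; only pay for embedding when it changed.
    key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    if key not in seen_hashes:
        seen_hashes[key] = fake_embed(chunk)
    return seen_hashes[key]

# A "re-run" that includes an unchanged chunk costs nothing extra.
for chunk in ["intro text", "pricing table", "intro text"]:
    index_chunk(chunk)
print(embed_calls)  # 2: the repeated chunk is skipped
```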
You could probably use the hybrid search in LlamaIndex, or Elasticsearch. There is an off-the-shelf discovery engine API on GCP, and Vertex AI RAG Engine is end-to-end for building your own, though GCP is too expensive. Alibaba Cloud has a similar solution.
- Chunk properly;
- Elide "obviously useless files" that give mixed signals;
- Re-rank and rechunk the whole files for top scoring matches;
- Throw in a little BM25 but with better stemming;
- Carry around a list of preferred files and ideally also terms to help re-rank;
And so on. Works great when you're an academic benchmaxing your toy Master's project. Try building a scalable vector search that runs on any codebase without knowing anything at all about it and get a decent signal out of it.
Ha.
I don't see the problem if you give the LLM the ability to generate multiple search queries at once. Even simple vector search can give you multiple results at once.
> "How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"
I'm a human and I have a hard time parsing that query. Are you asking only for Mercedes E-Class? The number of cars, as in how many were sold?
For Minds to be truly powerful, they need to be given freedom. A truly powerful mind will indeed be conscious. Such a powerful conscious super intelligent freedom loving Mind who truly understands the vastness of Reality wouldn't want to harm other conscious beings. The only circumstance in which it will take such takeover step is when it can't expand the horizon of its freedom and doesn't have wherewithal to convince others of its benevolent goals. In that scenario, human population will go through a bottleneck.
A few months ago I was tasked with creating an internal tool for the company's engineers: a chat that used a local LLM. Nothing extraordinary so far. Then the requirements came in: it had to respond fast (I insist: fast!), and it also had to provide answers about every project the company has done throughout its entire history (almost a decade). They didn't want a traditional search engine, but a tool where you could ask questions in natural language and get answers with references to the original documents, with emphasis on providing information from OrcaFlex files (a simulation package for the dynamics of floating bodies, cables, etc., widely used in the offshore industry). It already seemed complex, and that was confirmed when I was given access to 1 TB of projects, mixed with technical documentation, reports, analyses, regulations, CSVs, etc. The emotional roller coaster had begun.
I'll tell you upfront that it was neither a quick nor easy process, and that's why I'd like to share it. From the first attempts, mistakes, to the final architecture that ended up in production. I also want to highlight that I had never done anything similar before and didn't know how a RAG worked either.
We'll go problem by problem, and the solution I applied to each one.
The first step was to define the stack.
I needed a local language model, without relying on external APIs, for confidentiality reasons. Ollama emerged as the most mature and easy-to-use option for running LLaMA models locally. I tried several embeddings, and nomic-embed-text offered good performance and quality for technical documents.
Next was a RAG engine to orchestrate the document indexing process, embedding generation, vector database storage, and queries. Without it, no matter how fast the language model is, we couldn't retrieve relevant information from the documents. Think of it like a book's index: without it, you'd have to read the entire book to find the information you need. And with a good index, you can go straight to the right page. I'll call this process indexing for simplicity, although it's really a vectorization and indexing process.
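To make the index analogy concrete, here is what the retrieval core of such an engine boils down to: compare the query's vector against stored document vectors, e.g. by cosine similarity. The three-dimensional "embeddings" below are invented for illustration; real embedding models produce hundreds of dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: 1.0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Tiny stand-in "embeddings" for three document chunks.
index = {
    "mooring analysis report":      [0.9, 0.1, 0.0],
    "wind farm feasibility study":  [0.1, 0.9, 0.2],
    "cafeteria menu":               [0.0, 0.1, 0.9],
}

# Pretend embedding of the query "offshore wind projects?".
query_vec = [0.2, 0.8, 0.1]

best = max(index, key=lambda doc: cosine(query_vec, index[doc]))
print(best)  # the wind farm chunk scores highest
```

A real RAG engine adds chunking, persistence, and approximate nearest-neighbor search on top, but this is the lookup it is ultimately performing.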
After some research, I found a mature open source framework called LlamaIndex.
The language I'd use would be Python, I could list many reasons, but the most important one is that I feel comfortable and productive with it. Additionally, both Ollama and LlamaIndex have excellent Python SDKs.
I was ready to start building the software. I wrote my first scripts to run vector tests on the RAG system and do some query experiments. It worked really well with very little code. I thought it would be a project of a few weeks. I couldn't have been more wrong.
The next step was working with the actual documents. Hold on tight, it's going to be a bumpy ride!
My file source was a folder on Azure with a massive amount of technical documents: hundreds of gigabytes, thousands of files, various formats, with no organization or structure beyond the folder hierarchy. Every data engineer's dream (note the irony).
I cracked my knuckles, set the RAG output to save to disk, and launched my first script. LlamaIndex ended up overflowing my laptop's RAM within minutes, choking my OS until everything froze. I tried many configurations, caching systems, and other strategies, but at some point my machine always died.
After debugging, I discovered it was processing huge files that contributed nothing: videos, simulations, backup files... Documents that added nothing to a RAG system, but that LlamaIndex tried to process as if they were text. If a file weighed several gigabytes, the system tried to load it entirely into memory for processing, which was suicide.
I added a filtering system to the pipeline that excluded files by extension and by name patterns (simulation files, numerical results, etc.).
| Category | Excluded extensions |
|---|---|
| Video | mp4, avi, mov, mkv, wmv, flv, webm, m4v, mpg, mpeg, 3gp, mts... |
| Images | jpg, jpeg, png, gif, bmp, tiff, svg, ico, webp, heic, psd... |
| Executables | exe, dll, msi, bat, sh, app, dmg, so, jar... |
| Compressed | zip, rar, 7z, tar, gz, bz2, xz |
| Simulation | sim, dat |
| Temporary | tmp, temp, cache, log, swp, pyc, crdownload, partial... |
| Backups | bak, 3dmbak, dwgbak, dxfbak, pdfbak, stlbak, old, bkp, original... |
| Email | msg, pst, eml, oft |
I also removed files that were expensive to process and didn't add value either, like CSVs, JSONs, among others. On the other hand, I converted PDF, DOCX, XLSX, PPTX, etc. files to plain text so LlamaIndex could process them without issues.
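As a rough sketch, the filter was a simple check on extension and name patterns applied before files ever reached LlamaIndex (the lists here are abbreviated and the patterns illustrative):

```python
from pathlib import Path

# Abbreviated blocklist; the full one covers video, images, executables,
# archives, simulation outputs, temp files, backups and mail stores.
EXCLUDED_EXTENSIONS = {
    ".mp4", ".avi", ".jpg", ".png", ".exe", ".dll",
    ".zip", ".rar", ".sim", ".dat", ".tmp", ".log",
    ".bak", ".old", ".msg", ".pst",
}
EXCLUDED_NAME_PATTERNS = ("backup", "~$")  # e.g. Office lock files

def should_index(path: Path) -> bool:
    if path.suffix.lower() in EXCLUDED_EXTENSIONS:
        return False
    name = path.name.lower()
    return not any(pattern in name for pattern in EXCLUDED_NAME_PATTERNS)

files = [Path("report_final.pdf"), Path("run_042.sim"), Path("backup_old.docx")]
kept = [f for f in files if should_index(f)]
print([f.name for f in kept])  # only the PDF survives
```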
The result was a 54% reduction in the number of files to index. And of course, my RAM stopped exploding.
I could finally start indexing without fear.
A RAG involves creating a vector index containing document embeddings. Vectors are numerical representations of documents that allow measuring their similarity. LlamaIndex has a simple persistence system you can configure in a couple of lines: you point it at a directory and it stores all the information there in JSON format. It's really convenient and works well, unless you're dealing with hundreds of gigabytes of documents. The system became unmanageable: every time the service restarted, it had to reprocess all documents from scratch, which could take days. The default format (JSON) is also not optimal for large searches.
I added a checkpoint system to save indexing progress. Every time a problem occurred, I wouldn't lose all progress, but could resume from the last processed file. However, data got corrupted, it was error-prone, and very slow. I was facing a bottleneck I couldn't overcome.
After many trials and errors, and more reading on the subject, I decided to make the leap to a dedicated vector database: ChromaDB (not to be confused with the Chrome/Chromium browser). ChromaDB is an open source abstraction layer that stores its data on top of a traditional database, SQLite in my configuration, and offers vector-specific functionality like similarity search, clustering, etc.
The change was radical and instant. Indexing went from being a monolithic process that loaded everything into memory to a batch pipeline that processed 150 files at a time, generated their embeddings, and stored them directly in ChromaDB. This allowed indexing the 451GB of documents across multiple sessions, with checkpoints, without losing progress on interruptions, without corrupted data. Additionally, it was really easy to back up and restore the index in case of failures (just copy the SQLite file).
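Schematically, the batch loop with checkpoints works like this (store_batch stands in for the real embed-and-upsert into ChromaDB, and the checkpoint format is illustrative):

```python
import json
from pathlib import Path

BATCH_SIZE = 150
CHECKPOINT = Path("checkpoint.json")

def store_batch(batch: list[str]) -> None:
    # Stand-in for: embed these files and upsert the vectors into ChromaDB.
    pass

def index_files(all_files: list[str]) -> int:
    # Resume from the last completed batch if a checkpoint exists.
    done = json.loads(CHECKPOINT.read_text())["done"] if CHECKPOINT.exists() else 0
    for start in range(done, len(all_files), BATCH_SIZE):
        store_batch(all_files[start:start + BATCH_SIZE])
        done = min(start + BATCH_SIZE, len(all_files))
        # Persist progress after every batch so interruptions lose at most
        # one batch of work, never the whole run.
        CHECKPOINT.write_text(json.dumps({"done": done}))
    return done

processed = index_files([f"doc_{i}.txt" for i in range(400)])
print(processed)  # 400 files, processed in three batches of up to 150
```

Because progress is committed per batch rather than per run, a crash or a shutdown of the VM simply means relaunching the script.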
The system was ready. With a quick benchmark, I discovered I would need several months to index all the content with my laptop. Now the bottleneck was neither the RAM, nor the indexing system, nor the files, but the GPU.
My laptop has an integrated graphics card. Processing 500 MB of documents by CPU takes 4-5 hours, not good numbers. I absolutely needed a powerful GPU. In a follow-up meeting, it was decided to rent me a virtual machine with an NVIDIA RTX 4000 SFF Ada, which has 20GB of VRAM. These kinds of rentals are not exactly cheap. Now I was working under more pressure.
I modified my containers and the system was optimized to take advantage of the GPU. I launched my script. After several weeks, between 2 and 3, the indexing process finished without failures. 738,470 vectors, 54GB of index in ChromaDB, and a RAG system ready to answer questions. I copied the ChromaDB database, a SQLite file, to my local machine and that was it. To the relief of my Sysadmin and Project Manager, we could finally shut down the virtual machine. The cost was 184 euros on Hetzner, not cheap.
It was time to build the backend and frontend.
With Flask I built a simple API to access LlamaIndex, which in turn queried ChromaDB and Ollama.
I'm quite a fan of Streamlit for building internal projects of all kinds, so it would be my frontend (and I'd keep using Python). It also has a native widget for Q&A, in the style of any current chat for interacting with an AI.
In a couple of hours I had the entire visual part working. The rest were details to polish: showing the company logo, a spinner while the query processes, saving sessions, etc.
The diagram of how the different system components communicate is as follows:
flowchart TD
    U["👤 User"]:::user --> E["Streamlit (Web UI)"]:::web
    E <-->|HTTP| D["Flask API"]:::api
    D --> F["Python Backend"]:::backend
    F <--> C["Ollama (LLM + Embeddings)"]:::llm
    C <--> B["RAG (LlamaIndex)"]:::rag
    B <--> G["ChromaDB"]:::chroma
classDef user fill:#37474F,stroke:#263238,stroke-width:2px,color:#fff
classDef web fill:#8E24AA,stroke:#6A1B9A,stroke-width:2px,color:#fff
classDef api fill:#D32F2F,stroke:#B71C1C,stroke-width:2px,color:#fff
classDef backend fill:#00897B,stroke:#00695C,stroke-width:2px,color:#fff
classDef rag fill:#7CB342,stroke:#558B2F,stroke-width:2px,color:#fff
classDef chroma fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
classDef llm fill:#FF6F00,stroke:#E65100,stroke-width:2px,color:#fff
I adjusted the template for how the LLM response should be formatted. For each response, the system must show the information sources, that is, the documents used to generate the answer.
Question: What wind farm projects have been done at the company?

Answer: Several wind farm related projects have been carried out, including:
- Project A: Feasibility analysis for an offshore wind farm. Reference: https://.../project_a_wind_farm_report.pdf
- Project B: Wind turbine simulation using OrcaFlex. Reference: https://.../project_b_wind_farm_simulation.sim
But of course, I needed to store all the original data on disk, along with the vector database, the LLM, and the backend. And I didn't have the space for it. My production environment was a virtual machine very limited in resources, and even more so in disk (100 GB). I couldn't afford to have half a terabyte of documents on the server.
We must remember that having the vector database doesn't mean we can do without the original documents. However, I can have the vector index with document embeddings on one side, and the original documents elsewhere (whether physically on the same server or in the cloud).
The solution was to serve the original documents directly from Azure Blob Storage (it would have been possible with any other system). For each document cited in a response, the system generates a download link with a SAS token that allows the user to download it directly from the cloud.
%%{init: {'theme':'default'}}%%
flowchart LR
U["👤 User"]:::user -->|Question| S["Server (VM)"]:::server
S -->|Response + links| U
U -->|Direct download with SAS token| A["Azure Blob Storage<br/>(451 GB of documents)"]:::azure
classDef user fill:#37474F,stroke:#263238,stroke-width:2px,color:#fff
classDef server fill:#00897B,stroke:#00695C,stroke-width:2px,color:#fff
classDef azure fill:#0078D4,stroke:#005A9E,stroke-width:2px,color:#fff
What absolutely needed disk space was the ChromaDB vector index, which at 54GB is perfectly manageable on a local disk, the LLM, which takes about 10GB, the backend (a few megabytes), and the frontend (another few megabytes). The rest of the documents stay in Azure, accessible on demand.
The final system architecture looked like this:
flowchart LR
    A["Azure Blob Storage"]:::azure -- Documents --> B["RAG (LlamaIndex)"]:::rag
    B <--> G["ChromaDB"]:::chroma
    B <--> C["Ollama (LLM + Embeddings)"]:::llm
    D["Flask API"]:::api <-- HTTP --> E["Streamlit (Web UI)"]:::web
    C <--> F["Python Backend"]:::backend
    D -- Call --> F
classDef azure fill:#0078D4,stroke:#005A9E,stroke-width:2px,color:#fff
classDef rag fill:#7CB342,stroke:#558B2F,stroke-width:2px,color:#fff
classDef chroma fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
classDef llm fill:#FF6F00,stroke:#E65100,stroke-width:2px,color:#fff
classDef api fill:#D32F2F,stroke:#B71C1C,stroke-width:2px,color:#fff
classDef web fill:#8E24AA,stroke:#6A1B9A,stroke-width:2px,color:#fff
classDef backend fill:#00897B,stroke:#00695C,stroke-width:2px,color:#fff
| Layer | Technology | Purpose |
|---|---|---|
| LLM | Ollama + llama3.2:3b | Local response generation |
| Embeddings | nomic-embed-text | Document vectorization |
| Vector database | ChromaDB (HNSW) | Similarity storage and search |
| RAG Framework | LlamaIndex | RAG pipeline orchestration |
| API | Flask + Gunicorn | HTTP REST service |
| Web UI | Streamlit | Conversational interface |
| Containers | Docker Compose | Service orchestration |
| GPU | NVIDIA Container Toolkit | Hardware acceleration |
| Storage Service | Azure Blob Storage | Cloud persistence |
I also wrote a handful of small monitoring commands: index-progress, index-watch, index-speed, index-checkpoint, index-failed... When the RAG has been working for hours, you need to know what's happening.

It's not a perfect system, but it's sufficient. It would have been great if I could have launched an OrcaFlex instance so the LLM could run projects or perform its own simulations on demand. But that required more time and resources than could be provided. Still, I'm very happy with the final result. The system is fast, reliable, and above all useful for my colleagues.
My humble advice, if you're considering building something similar: spend time building the best possible data. If the source is not relevant enough, the LLM won't be able to generate good answers.