Such a great write-up. The more accessible RAG is, the more widespread the adoption of LLMs becomes. For example, chunking and preprocessing, which remain manual steps that many aren't experienced with, can be semi-automated based on the task at hand. A superior chunking method can lead to a double-digit increase in performance.
Totally agree! We are already seeing some really great analysis on how to do things like chunking and data cleaning way better (see the link to the databricks RAG article in further resources for example). Hopefully this trend continues
Thank you so much for this deep dive, very informative! I'm a UX researcher that's new to conversational AI design (no ML practitioner, so please bare with me!) and for our LLM-powered voice assistant and chatbot, RAG has obviously become immensely important. I was wondering, when it comes to preprocessing data and curating a knowledge base for RAG, do the data for the external knowledge base need to be formatted/"cleaned" in a specific way? For example, is it sufficient to simply use links to a website and embedded PDFs/documents are then automatically included in the knowledge base, or would one need to extract and separately add the contents of the PDFs/documents to the database?
Such a great write-up. The more accessible RAG is, the more widespread the adoption of LLMs becomes. For example, chunking and preprocessing, which remain manual steps that many aren't experienced with, can be semi-automated based on the task at hand. A superior chunking method can lead to a double-digit increase in performance.
Totally agree! We are already seeing some really great analysis on how to do things like chunking and data cleaning way better (see the link to the databricks RAG article in further resources for example). Hopefully this trend continues
Thank you so much for this deep dive, very informative! I'm a UX researcher that's new to conversational AI design (no ML practitioner, so please bare with me!) and for our LLM-powered voice assistant and chatbot, RAG has obviously become immensely important. I was wondering, when it comes to preprocessing data and curating a knowledge base for RAG, do the data for the external knowledge base need to be formatted/"cleaned" in a specific way? For example, is it sufficient to simply use links to a website and embedded PDFs/documents are then automatically included in the knowledge base, or would one need to extract and separately add the contents of the PDFs/documents to the database?
You need to extract the raw textual data and add it into the database!