In the last couple of months, I have presented Qlik Answers a dozen times at events and for customers. One question I get asked every single time is: can we add Excel files to a Knowledge Base and create an Assistant to "talk" to the data? My answer is usually the following: "No. Qlik Answers is designed for unstructured data like PDF, DOC, HTML, and others. For your request, you need to use Insight Advisor." However, that "no, you can't" bothers me. Then I remembered: "This is not just any LLM tool; this is Qlik, the Jedi Master of Data Analytics and Science." I decided to put it to the test and check if and how we can use Qlik Answers to help you have a nice chat with your structured data.
This is not easy, but I decided to wear my Master Jedi (of data) robe, grab my green (of course) lightsaber (aka mouse), and hop on the Millennium Falcon for this intriguing challenge. Oh, and sorry for all the Star Wars references—I will try to keep them to a minimum.
In a Cloud Far, Far Away: Introduction
Before we begin, let's briefly understand some very important concepts.
What Is a Chunk in the Context of LLM?
When dealing with large documents, Large Language Models (LLMs) like Qlik Answers break down the text into smaller, manageable pieces called "chunks." This process helps the model process and understand the data more efficiently. Each chunk is a segment of text that the model can analyze individually, which is crucial because LLMs have a limit on the amount of text they can process at once. By dividing the text into chunks, the model can maintain context and provide more accurate responses.
What Are Structured and Unstructured Data?
Structured Data: This is highly organized information formatted in a way that is easily searchable in databases. Examples include data stored in Excel spreadsheets, SQL databases, or CSV files. This type of data is arranged in tables with rows and columns, making it straightforward to analyze using traditional data processing methods.
Unstructured Data: This lacks a predefined format or organization. It includes text-heavy documents like PDFs, Word documents, emails, and HTML pages. Unstructured data is more challenging to process because it doesn't fit neatly into relational databases but is well-suited for analysis using LLMs and natural language processing techniques.
A New Hope: Load the Structured Data to a Knowledge Base
Hopping to "hack the Death Star" and use structured data in a tool designed for unstructured data, I started my new adventure. For this exercise, I am using an Excel file with inventory information for a clothing store. See the next image for a preview of this dataset. The data is displayed as a pivot table. However, if I try to add Excel, CSV, or other structured data files to a Knowledge Base, I quickly notice I can't. Knowledge Bases are designed for unstructured data. Therefore, the only file formats available are things like PDF, DOC, HTML, MD, etc.
The first step is to save my Excel file as a PDF. This is easily accomplished with the "Save as a Copy" option in Microsoft Excel.
Now that my data table is in PDF format, I can add it to my Knowledge Base, index it, and create an Assistant or attach it to an existing one.
The Empire Strikes Back: Asking Questions to a PDF Table
I performed a lot of testing on this. Showing it all would make this article too repetitive and boring, so let's focus on one row of data and three different questions. The product I chose is Lace Shoes. The next image shows its values.
The first question is simple: can it find the value of a cell? For example, what is the Total Stock for that product? There is no need to perform any aggregation; we already have that as a column. What I am really testing here is what makes this type of task so difficult for a language model: how to correlate a row with a column and find the intersection (cell). That sounds simple, but in fact, it is very complex for a language model because the data is embedded and stored internally in a vector database by chunks of text, as I explained in the introduction. The columns of a row might end up in different chunks, which makes the retrieval task much more complex.
Now, back to the first challenge. The correct answer is 124, but the Assistant gave me 224. Too bad for Qlik Answers.
For the second question, let's ask the value of the farthest (last) column: Description. Again, it could not find it. However, it was a bit worse than that: it tried to guess the category of the product even though that column did not exist in the data. A bit of hallucination, which is extremely rare for Qlik Answers.
The third question was to test its ability to compare multiple columns. So I asked, What size has the most products for Lace Shoes? Some size columns are blank; some are not. The correct answer is Size 9-11, with 60 units, but again, Answers could not find it.
Is that the end of my mission? Well, as a Jedi Master and a Brazilian, I never give up.
Return of the Jedi: Trying Something New
We data scientists always say that correlation is not causation. Is the failure of Qlik Answers a limitation of the tool or simply a disturbance in the Force? In other words, is the cause of the issues I found in the indexed Knowledge Base, or is it the documents in the Knowledge Base? Let's revisit how I was able to add structured data to Qlik's Knowledge Base: I had to save the Excel file as a PDF. Experience told me that could be the problem.
When you read a PDF file, what you see is not what exists. PDFs are not just the displayed content. Internally, each file contains four pieces: header, body, cross-reference table, and trailer. It is designed for its visual representation, not for its capacity as a data source. It's like a Sith Lord.
To really test Qlik Answers' capability to read tables, we have to provide a document that contains only the data (text and tables) and nothing more. In past data science projects where I had to use AI to process complex documents, I found that processing Markdown files is much faster and more accurate than doing it against PDFs or DOCs. If you've never heard of Markdown, let me explain.
What Is Markdown?
Markdown is a lightweight markup language that allows you to format text using plain-text syntax. It was created to be an easy-to-read and easy-to-write format that can be converted to HTML or other types. Markdown files are simple and contain only the text and basic formatting, making them easy for both humans and machines to read and process.
You can write simple code in any language (like R or Python) to convert a PDF to Markdown. Actually, this could be a video for the Data Voyagers YouTube channel. Leave a comment in case you would like this to happen. Anyway, if you want something quicker, there are numerous websites that perform this conversion for you. Here is an example.
Once I had the Markdown file ready, I created a new Knowledge Base (just for testing) and uploaded the file to it. After indexing was complete, I removed the PDF Knowledge Base from my Assistant, added the new one, and tried the same questions again. In total, for each scenario—PDF and Markdown—I asked the same 42 questions. For this article, I will summarize with the same three questions.
The first one: What is the Total Stock of Lace Shoes? Now it gives the right answer: 124. Not only that, Qlik Answers replies with the correct quantity for each product size. Brilliant!
However, it is yet too early for celebrations. Let's try the second question and access the second column by asking, What is the description for the product? Again, another correct answer. I can already feel the Force favoring the Green Side.
For the last and more complex question: What size has the most Lace Shoes? Qlik Answers gets it correct one more time.
The final score is Markdown 3, PDF 0.
Conclusion: The Force Awakens
Does this mean I can replace Insight Advisor with Qlik Answers? No. Insight Advisor is still the best tool for structured data. It can read your data model, generate charts, provide insights, learn from your feedback, integrate with multiple devices (have you seen our Alexa interface?), and much more. However, this is a sign that in a larger unstructured document, data displayed inside tables can be used to provide the answers you need.
Moreover, it shows a good practice for your future projects: convert your files to Markdown before adding them to the Knowledge Base. You can do it even inside the Qlik platform through Automations. Should I convert all files? The answer is: it depends. Markdown files have a downside: it usually generates more indexed pages than its PDF version. If you have a lower Knowledge Base monthly capacity available, consider converting only the files that contain tabular data in it.
Finally, I have performed more tests in this endeavor that I will be writing about soon. Spoiler alert: reading tables that are inside larger documents, reading tables that have several columns and rows, reading tables that contain information available on the Internet (does Qlik Answers hallucinate in those scenarios?), asking for some basic aggregations, reading tables with subtotals and hierarchy, and a few more.
I hope you liked this new adventure. If you did, remember to share it, and May the Force be with you.
This is fantastic!
I was tryin to think about data formats I can produce in a Qlik load script and by messing with the file extension make it available for Answers. so for any tabular data we could try JSON for sure.
I think the version 2 of answers should consider how to merge the insight advisor and Answers so in addition to unstructured documents I could attach an app with enabled business logic.
Thanks
Hi, you described that you removed a file from the Knowledge Base. I cannot find a way to remove a file without removing the whole Base. There is an option to click away an individual file but after page refresh or re-index the files that file still is in the base. Do you have an alternative method? Thank you jkassing@kwf.nl