Fact Check with DataGemma

How do you get numbers out of an LLM?

Ertuğrul Demir
8 min read · Sep 13, 2024

Almost every person who has interacted with large language models (LLMs) has encountered the term “hallucinations” or experienced them without knowing the term. Hallucinations occur frequently, even with the most advanced LLMs today, due to the probabilistic nature of generative models. These models are especially prone to generating factually incorrect information when responding to queries involving numerical or statistical data.

If a request falls outside the scope of an LLM’s pretraining or fine-tuning, the model may produce factually incorrect, nonsensical, or unrelated answers, or refuse to answer at all. Current solutions to this problem involve significant human intervention, such as providing context (using techniques like Retrieval-Augmented Generation) or constraining models through prompting techniques.

Gödel’s First Incompleteness Theorem offers an interesting parallel to this challenge. The theorem states that in any consistent mathematical system powerful enough to describe basic arithmetic, there will always be true statements that cannot be proven within that system. In terms of LLMs, this suggests that no system can be both complete (able to prove every truth) and consistent (without contradictions) if it includes basic arithmetic. Gödel proved that no matter how well-designed the rules of a system are, there will always be some statements about numbers that are true but cannot be proven using the available rules.

Consequently, one cannot be certain of the factual accuracy of information from an LLM (which can be viewed as a set of rules in the context of Gödel’s theorem). However, we can minimize the effects of this phenomenon by increasing the available information, effectively defining a larger system that encompasses the original one.

To express this more technically in terms of Gödel’s theorem, we can define a larger system F’ that contains the whole of the original system F. In mathematical logic, F might represent our initial formal system (analogous to an LLM’s initial training), while F’ represents an expanded system with additional axioms or rules (analogous to augmenting an LLM with additional context or data). By expanding to F’, we can potentially prove statements that were unprovable in F alone. However, it’s crucial to note that according to Gödel’s theorem, even in F’ there will still be true but unprovable statements. This parallels the idea that while we can improve an LLM’s performance by providing more information, we can never fully eliminate the gap between what the system can state and what it can guarantee.
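The analogy can be written compactly. This is a loose formalization of the parallel drawn above, not a rigorous statement of the incompleteness theorem itself:

```latex
F \subset F', \qquad
\exists \varphi:\ \varphi \text{ true},\ F \nvdash \varphi,\ F' \vdash \varphi,
\qquad \text{and yet}\quad
\exists \varphi':\ \varphi' \text{ true},\ F' \nvdash \varphi'.
```

In words: the expanded system F’ proves some statements F could not, but F’ has its own unprovable truths in turn.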

So how do we build such an expanded system F’? Do we have a vast knowledge base to draw on? Fortunately, we do:

Data Commons

Data Commons, a Google-led open-source initiative, aims to centralize and democratize access to global public datasets. With over 250 billion data points from hundreds of sources, it covers a wide range of statistical information from various public entities worldwide. The project’s key innovations include normalizing diverse datasets into a unified Knowledge Graph using Schema.org, and implementing a natural language interface for user queries. This interface allows users to explore the vast database using everyday language, making the wealth of public data more accessible and easier to utilize.

Think of Data Commons as a constantly expanding repository of reliable, public information on a wide range of topics, from health and economics to demographics and the environment. Users can interact with this wealth of data using everyday language. This makes Data Commons an accessible and powerful tool for exploring and utilizing extensive public data resources, opening up new possibilities for research, analysis, and decision-making across various fields.
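To build some intuition for what a knowledge-graph lookup boils down to, here is a toy, in-memory sketch. The triple layout mirrors how Data Commons organizes observations (place, statistical variable, date), but the variable name and the numbers below are made up for illustration; this is not the real Data Commons API:

```python
# Toy stand-in for a statistical knowledge graph: observations keyed by
# (place, statistical_variable, year). Values are illustrative placeholders.
toy_graph = {
    ("country/TUR", "GDP_PerCapita_USD", 2014): 12_158,
    ("country/TUR", "GDP_PerCapita_USD", 2023): 13_106,
}

def lookup(place: str, variable: str, year: int):
    """Return the observation for a (place, variable, year) triple, or None."""
    return toy_graph.get((place, variable, year))

print(lookup("country/TUR", "GDP_PerCapita_USD", 2014))
print(lookup("country/TUR", "GDP_PerCapita_USD", 1999))  # not in the graph
```

The real system adds the hard parts on top of this: normalizing hundreds of heterogeneous sources into one schema and mapping a natural-language question onto the right place and variable identifiers.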

Leveraging Data Commons with LLMs: Introducing DataGemma

To leverage this pool of factual information using LLMs, we need a specific model to navigate this vast jungle of data. This is where DataGemma comes in. DataGemma is designed to bridge the gap between the extensive factual information in Data Commons and the powerful language processing capabilities of LLMs. By integrating DataGemma with LLMs, we can potentially create a system that combines the vast knowledge base of Data Commons with the natural language understanding of LLMs, resulting in more accurate and reliable responses to queries involving numerical and statistical data.

Source: Knowing When to Ask — Bridging Large Language Models and Data Commons

So how does DataGemma leverage Data Commons? It employs two well-established techniques: Retrieval-Augmented Generation (RAG) and Retrieval-Interleaved Generation (RIG). You might be thinking, “There’s nothing new here; these are already widely used!” While that’s true, DataGemma is specifically fine-tuned for three crucial tasks:

  • Determining when to use external sources
  • Identifying which external sources should be used
  • Constructing appropriate queries to fetch the required data

This specialized fine-tuning of Gemma models makes it particularly effective for working with Data Commons and LLMs.
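To make the RIG idea concrete: the model interleaves its generated numbers with retrieval annotations that record the query it asked and the value it got back. The marker syntax below is my own assumption for illustration (the real format is handled internally by the data_gemma tooling); the sketch just shows how such annotations could be pulled out of a response:

```python
import re

# Hypothetical RIG-style response: model text with inline retrieval
# annotations of the assumed form [__DC__("query") --> "retrieved value"].
response = (
    'Turkey\'s GDP per capita was '
    '[__DC__("what was the gdp per capita of turkey in 2014?") --> "$12,100"] '
    'in 2014 and '
    '[__DC__("what was the gdp per capita of turkey in 2023?") --> "$13,100"] '
    'in 2023.'
)

# Extract (query, value) pairs from the annotations.
pattern = re.compile(r'\[__DC__\("(.+?)"\)\s*-->\s*"(.+?)"\]')
annotations = pattern.findall(response)

for query, value in annotations:
    print(f"{query} -> {value}")
```

Each annotation pins a generated statistic to the exact query that produced it, which is what makes the footnoted, source-backed answers shown later possible.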

Practice Time

Great! We’ve covered the theoretical aspects, and now it’s time to dive into the practical application. If you’re interested in learning more about the underlying concepts, I’ll be adding references to the original paper at the end of this article. Let’s get started with the actual use case! The best part? You can use all of this for free, as it’s completely open source!

Before we jump into the code, let’s quickly go over what you’ll need to get started. You’ll want to have a machine with at least 16 GB of VRAM to run this smoothly. As for setup, make sure you’ve got a Python environment set up with torch, transformers, and data_gemma installed. Don’t forget to set up your token for downloading from Hugging Face, and you’ll also need a Data Commons API key. Once you’ve got all that sorted out and your environment is good to go with the right packages and API access, you’ll be all set to dive into the code. It might sound like a bit to handle, but trust me, it’s pretty straightforward once you get into it. Ready to get coding?

DataGemma RIG

Along with the essential imports, we’ll also include a custom helper function that stylizes the output from our LLM, making the results more visually appealing and easier to read.

#!pip install -q git+https://github.com/datacommonsorg/llm-tools
#!pip install -q bitsandbytes accelerate

import textwrap

import torch
from IPython.display import Markdown
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

import data_gemma as dg

HF_TOKEN = "YOUR_HF_TOKEN_HERE"
DC_API_KEY = "YOUR_DATACOMMONS_KEY_HERE"
dc = dg.DataCommons(api_key=DC_API_KEY)

def display_chat(prompt, text):
    # Render the user prompt and the model's answer as styled Markdown.
    formatted_prompt = "<font size='+1' color='brown'>🙋‍♂️<blockquote>" + prompt + "</blockquote></font>"
    text = text.replace('•', ' *')
    text = textwrap.indent(text, '> ', predicate=lambda _: True)
    formatted_text = "<font size='+1' color='teal'>🤖\n\n" + text + "\n</font>"
    return Markdown(formatted_prompt + formatted_text)

Next, we’ll set up our quantization configuration using BitsAndBytes. Quantization is a technique that reduces the precision of the model’s weights, allowing us to run larger models with limited computational resources. Since we’re working with limited VRAM, we’ll use 4-bit quantization, which significantly reduces memory usage while maintaining reasonable model performance.

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
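Some back-of-the-envelope arithmetic shows why 4-bit quantization matters for a 27B-parameter model (weights only; activations, the KV cache, and quantization constants add some overhead on top):

```python
params = 27e9  # approximate parameter count of the 27B model

bytes_bf16 = params * 2    # bfloat16: 2 bytes per weight
bytes_nf4 = params * 0.5   # 4-bit NF4: 0.5 bytes per weight

print(f"bf16 weights: ~{bytes_bf16 / 1e9:.0f} GB")  # ~54 GB
print(f"nf4 weights:  ~{bytes_nf4 / 1e9:.0f} GB")   # ~14 GB
```

That roughly 4x reduction is what brings the model within reach of a single 16 GB GPU.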

model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
    token=HF_TOKEN,
)

After configuring the quantization, we load the datagemma-rig model specifically for the RIG (Retrieval-Interleaved Generation) task. RIG interleaves information retrieval with text generation, allowing the model to query external knowledge as it composes its response.

It’s worth noting that if you plan to feed your outputs into models with longer context windows, such as Gemini, you might consider using the datagemma-rag model instead. RAG pursues the same goal as RIG but retrieves the relevant data up front and passes it as context, making it particularly well-suited to models that can handle more extensive context.

Next, we set our query, the data_gemma wrapper for our model, and the RIGFlow that produces the final answer.

input_text = "YOUR_QUERY_GOES_HERE"

# Wrap the Hugging Face model so the data_gemma flows can call it.
datagemma_model_wrapper = dg.HFBasic(model, tokenizer)

# Run the RIG flow: generate, query Data Commons, and merge the results.
ans = dg.RIGFlow(llm=datagemma_model_wrapper, data_fetcher=dc, verbose=False).query(query=input_text)

display_chat(input_text, ans.answer())

For test purposes, I asked about Turkey’s GDP per capita statistics for the last 10 years, which is a good example for testing the factual capabilities of the DataGemma model. And here’s the result:

DataGemma output.

As you can see, it gave a pretty detailed and factual answer, complete with footnotes! The model used multiple sources and queried them before generating the final output.

I didn’t want to copy-paste the whole answer as text, but you can test it yourself with the code examples above. Now let’s compare it to the same model (gemma-2-27b) without the extra capabilities of DataGemma.

Standard Gemma:

Standard Gemma 2 27b Result

Well… the standard Gemma model refuses to give a specific answer. That’s still better than hallucinating, but at the same time it’s not very useful.

In contrast to standard Gemma, the DataGemma model issued multiple related queries, using its fine-tuned capabilities and the rich data sources it has access to through Data Commons. Here you can see some of the queries it used:

... calling HF Pipeline API "how is the turkish(türkiye) gdp per capita from 20..."
... calling DC with "what was the gdp per capita of turkey in 2014?"
... calling DC with "what was the gdp per capita of turkey in 2023?"
... calling DC with "what is the gdp per capita of turkey in 2024?"
... calling DC with "how many jobs were created in turkey between 2014 and 2024?"
... calling DC with "what is the poverty rate in turkey?"

As you can see, it breaks the task down into smaller queries and retrieves the related data for the final answer.
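As a quick illustration of how you might inspect such a trace, here is a small sketch that extracts the sub-queries from verbose log lines in the format shown above. The log format comes from the library’s verbose output as printed above; the parsing itself is my own helper, not part of data_gemma:

```python
# Sample of the verbose trace printed by the RIG flow.
log = """\
... calling HF Pipeline API "how is the turkish(türkiye) gdp per capita from 20..."
... calling DC with "what was the gdp per capita of turkey in 2014?"
... calling DC with "what was the gdp per capita of turkey in 2023?"
... calling DC with "what is the poverty rate in turkey?"
"""

# Keep only the Data Commons calls and pull out the quoted sub-query.
dc_queries = [
    line.split('"')[1]
    for line in log.splitlines()
    if "calling DC with" in line
]

for q in dc_queries:
    print(q)
```

Listing the sub-queries this way makes it easy to audit exactly which facts the model went out and fetched for a given answer.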

Eval Results:

The paper claims:

  • DataGemma’s Retrieval-Interleaved Generation (RIG) system improved factual accuracy from 5–17% to 58%
  • Additionally, DataGemma’s Retrieval-Augmented Generation (RAG) system achieved 99% accuracy for statistical claims when citing directly from the data table.

Conclusions

The challenge of hallucinations in LLMs has been a persistent issue in the field of artificial intelligence. This article has introduced DataGemma as a promising solution to mitigate this problem, particularly when dealing with numerical and statistical data. By leveraging the vast repository of public information available through Data Commons, DataGemma demonstrates a significant improvement in the factual accuracy of LLM responses.

The key advantages of DataGemma lie in its ability to:

  1. Determine when to use external sources
  2. Identify which external sources to use
  3. Construct appropriate queries to fetch the required data

These capabilities, combined with the extensive dataset provided by Data Commons, allow DataGemma to produce more reliable and factually accurate responses compared to standard LLMs.

In our example, the difference is easy to see. While the standard model refused to provide specific information, DataGemma was able to break down the query into smaller, manageable parts and retrieve relevant data to construct a comprehensive and accurate response.

This advancement has promising implications for the use of LLMs in various fields, particularly those requiring precise numerical or statistical information.

Future developments may focus on expanding the range of data sources, improving the efficiency of data retrieval and integration, and fine-tuning the models for even more specialized applications.

As we continue to push the boundaries of AI capabilities, solutions like DataGemma represent an important step towards creating more trustworthy and dependable artificial intelligence systems.

Resources:

  • Knowing When to Ask — Bridging Large Language Models and Data Commons (the DataGemma paper)
  • Data Commons: https://datacommons.org

Written by Ertuğrul Demir

Data Scientist, Google Developer Expert — Machine Learning, Kaggle Grandmaster, Interested in Deep Learning and Machine Learning