Existing work on Temporal Question Answering (TQA) has predominantly focused on questions anchored to specific timestamps or events (e.g. "Who was the US president in 1970?"). Little work has studied questions whose temporal context is relative to the present time (e.g. "Who was the previous US president?"). We refer to this problem as Present-Anchored Temporal QA (PATQA). PATQA poses unique challenges: (1) large language models (LLMs) may have outdated knowledge, (2) complex temporal relationships (e.g. 'before', 'previous') are hard to reason over, (3) multi-hop reasoning may be required, and (4) the gold answers of benchmarks must be continuously updated. To address these challenges, we introduce the PAT-Questions benchmark, which includes single-hop and multi-hop temporal questions. The answers in PAT-Questions can be automatically refreshed by re-running SPARQL queries on a knowledge graph, when one is available. We evaluate several state-of-the-art LLMs and a SOTA temporal reasoning model (TEMPREASON-T5) on PAT-Questions through direct prompting and retrieval-augmented generation (RAG). The results highlight the limitations of existing solutions on PATQA and motivate the need for new methods to improve PATQA reasoning capabilities.
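For concreteness, the sketch below shows how a present-anchored answer could be refreshed by re-running a SPARQL query against the Wikidata endpoint. It is not the benchmark's actual refresh script: the query (which retrieves the most recently ended "head of state" statement for the United States, i.e. the previous US president from the abstract's example), the helper name `refresh_answer`, and the user-agent string are illustrative assumptions.

```python
# Minimal sketch of refreshing a present-anchored gold answer by re-running
# a SPARQL query on Wikidata. The query, entity/property IDs, and helper
# names are illustrative, not taken from the benchmark's release.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

PREVIOUS_US_PRESIDENT = """
SELECT ?personLabel WHERE {
  wd:Q30 p:P35 ?stmt .            # United States -> head-of-state statements
  ?stmt ps:P35 ?person ;
        pq:P582 ?end .            # keep only statements that have an end date
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?end)
LIMIT 1
"""

def refresh_answer(query: str) -> list:
    """Re-run a SPARQL query and return the up-to-date answer label(s)."""
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "PAT-Questions-refresh-example/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [b["personLabel"]["value"] for b in bindings]

if __name__ == "__main__":
    print(refresh_answer(PREVIOUS_US_PRESIDENT))
```

Because the answer is defined by the query rather than stored as a fixed string, re-running such queries keeps the gold labels current as the knowledge graph is updated.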
Our main contributions are:
The examples reveal that Llama-2 responds to the multi-hop questions with outdated answers (e.g. Diego Simeone, who coached Estudiantes de La Plata, the current team, from 2006-07) or incomplete answers (e.g. Dnipro, which is the name of the current team). When a timestamp is mentioned in the question, the models respond with completely false information that was never true at any point in time (Sosa never played for Club Atletico, and Dmytro was never a coach for any of Filippov's teams). The other open-source models perform worse on multi-hop PAT-Questions (only Mistral-7B is comparable to Llama2-7B).
This example reveals that although New Bing has access to current knowledge, it still provides outdated answers (Desna Chernihiv); it finds correct one-hop answers but cannot find the two-hop answers (Estudiantes de La Plata). In some cases, it provides completely false answers, or fails to understand the question and responds with other facts about the subjects.
Our findings indicate that pre-trained Large Language Models (LLMs) face challenges with PAT-Questions, both single and multi-hop, showing very low EM and F1 scores. Accuracy improves to some extent with document retrieval, especially for single-hop questions, due to the retrieval of up-to-date and relevant documents.
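For reference, the EM and F1 numbers mentioned above are the standard QA metrics; the snippet below is a SQuAD-style scorer showing how they are typically computed. The normalization steps follow the common convention and may differ in detail from the exact evaluation script used in the paper.

```python
# SQuAD-style exact-match (EM) and token-level F1, the usual QA metrics.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# An outdated or partially correct answer gets zero EM but some F1 credit.
print(exact_match("the current coach", "the previous coach"))  # 0.0
print(f1_score("the current coach", "the previous coach"))     # partial credit
```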
An example multi-hop question from our PAT-Questions dataset is shown above. PAT-Questions contains 2882 single-hop and 3290 multi-hop questions with varied subjects and time-dependent relations from Wikidata. Each question in our dataset has seven common fields: 'question', 'subject', 'text answers', 'answer annotations', 'relations', 'template', and 'uniq_id'. Multi-hop questions have an extra field named 'intermediate entities' denoting the one-hop answers to the multi-hop questions.
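To illustrate the schema, below is a hypothetical multi-hop record built from the fields listed above. The question text, relation names, placeholder answer values, and the helper `is_multi_hop` are assumptions for illustration, not taken from the released files.

```python
# Hypothetical PAT-Questions record; field names follow the description above,
# field values and formats are illustrative placeholders.
import json

sample_record = {
    "uniq_id": "mh-0001",
    "question": "Who is the head coach of the team Roberto Sosa currently plays for?",
    "subject": "Roberto Sosa",
    "relations": ["member of sports team", "head coach"],
    "template": "Who is the head coach of the team {subject} currently plays for?",
    "intermediate entities": ["<current team>"],     # one-hop answers (multi-hop only)
    "text answers": ["<current head coach>"],
    "answer annotations": ["<Wikidata ID of the answer>"],
}

def is_multi_hop(record: dict) -> bool:
    """Multi-hop questions carry the extra 'intermediate entities' field."""
    return "intermediate entities" in record

print(is_multi_hop(sample_record))  # True
print(json.dumps(sample_record, indent=2))
```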
@article{meem2024pat,
title={PAT-Questions: A Self-Updating Benchmark for Present-Anchored Temporal Question-Answering},
author={Meem, Jannat Ara and Rashid, Muhammad Shihab and Dong, Yue and Hristidis, Vagelis},
journal={arXiv preprint arXiv:2402.11034},
year={2024}
}