Existing work on Temporal Question Answering (TQA) has predominantly focused on questions anchored to specific timestamps or events (e.g. "Who was the US president in 1970?"). Little work has studied questions whose temporal context is relative to the present time (e.g. "Who was the previous US president?"). We refer to this problem as Present-Anchored Temporal QA (PATQA). PATQA poses unique challenges: (1) large language models (LLMs) may have outdated knowledge, (2) complex temporal relationships (e.g. 'before', 'previous') are hard to reason over, (3) multi-hop reasoning may be required, and (4) the gold answers of benchmarks must be continuously updated. To address these challenges, we introduce the PAT-Questions benchmark, which includes single-hop and multi-hop temporal questions. The answers in PAT-Questions can be automatically refreshed by re-running SPARQL queries on a knowledge graph, when one is available. We evaluate several state-of-the-art LLMs and a SOTA temporal reasoning model (TEMPREASON-T5) on PAT-Questions through direct prompting and retrieval-augmented generation (RAG). The results highlight the limitations of existing solutions on PATQA and motivate the need for new methods to improve PATQA reasoning capabilities.
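For concreteness, the sketch below shows how a present-anchored answer could be refreshed by re-running a SPARQL query against the Wikidata endpoint. It is not the benchmark's actual refresh script: the query (which retrieves the most recently ended "head of state" statement for the United States, i.e. the previous US president from the abstract's example), the helper name `refresh_answer`, and the user-agent string are illustrative assumptions.

```python
# Minimal sketch of refreshing a present-anchored gold answer by re-running
# a SPARQL query on Wikidata. The query, entity/property IDs, and helper
# names are illustrative, not taken from the benchmark's release.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

PREVIOUS_US_PRESIDENT = """
SELECT ?personLabel WHERE {
  wd:Q30 p:P35 ?stmt .            # United States -> head-of-state statements
  ?stmt ps:P35 ?person ;
        pq:P582 ?end .            # keep only statements that have an end date
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?end)
LIMIT 1
"""

def refresh_answer(query: str) -> list:
    """Re-run a SPARQL query and return the up-to-date answer label(s)."""
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "PAT-Questions-refresh-example/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [b["personLabel"]["value"] for b in bindings]

if __name__ == "__main__":
    print(refresh_answer(PREVIOUS_US_PRESIDENT))
```

Because the answer is defined by the query rather than stored as a fixed string, re-running such queries keeps the gold labels current as the knowledge graph is updated.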
Our main contributions are:
The examples reveal that Llama-2 responds to the multi-hop questions with outdated answers (e.g. Diego Simeone, who coached Estudiantes de La Plata, the current team, from 2006-07) or incomplete answers (e.g. Dnipro, which is the name of the current team). When a timestamp is mentioned in the question, the models respond with completely false information that was never true at any point in time (Sosa never played for Club Atletico, and Dmytro was never a coach for any of Filippov's teams). The other open-source models perform worse on multi-hop PAT-Questions (only Mistral-7B is comparable to Llama2-7B).
This example reveals that although New Bing has access to current knowledge, it still provides outdated answers (Desna Chernihiv); it finds correct one-hop answers but cannot find the two-hop answers (Estudiantes de La Plata). In some cases, it provides completely false answers, or fails to understand the question and responds with other facts about the subjects.
Our findings indicate that pre-trained Large Language Models (LLMs) face challenges with PAT-Questions, both single and multi-hop, showing very low EM and F1 scores. Accuracy improves to some extent with document retrieval, especially for single-hop questions, due to the retrieval of up-to-date and relevant documents.
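For reference, the EM and F1 numbers mentioned above are the standard QA metrics; the snippet below is a SQuAD-style scorer showing how they are typically computed. The normalization steps follow the common convention and may differ in detail from the exact evaluation script used in the paper.

```python
# SQuAD-style exact-match (EM) and token-level F1, the usual QA metrics.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# An outdated or partially correct answer gets zero EM but some F1 credit.
print(exact_match("the current coach", "the previous coach"))  # 0.0
print(f1_score("the current coach", "the previous coach"))     # partial credit
```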
An example multi-hop question from our PAT-Questions dataset is shown above. PAT-Questions contains 2882 single-hop and 3290 multi-hop questions with varied subjects and time-dependent relations from Wikidata. Each question in our dataset has seven common fields: 'question', 'subject', 'text answers', 'answer annotations', 'relations', 'template', and 'uniq_id'. Multi-hop questions have an extra field named 'intermediate entities' denoting the one-hop answers to the multi-hop questions.
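To illustrate the schema, below is a hypothetical multi-hop record built from the fields listed above. The question text, relation names, placeholder answer values, and the helper `is_multi_hop` are assumptions for illustration, not taken from the released files.

```python
# Hypothetical PAT-Questions record; field names follow the description above,
# field values and formats are illustrative placeholders.
import json

sample_record = {
    "uniq_id": "mh-0001",
    "question": "Who is the head coach of the team Roberto Sosa currently plays for?",
    "subject": "Roberto Sosa",
    "relations": ["member of sports team", "head coach"],
    "template": "Who is the head coach of the team {subject} currently plays for?",
    "intermediate entities": ["<current team>"],     # one-hop answers (multi-hop only)
    "text answers": ["<current head coach>"],
    "answer annotations": ["<Wikidata ID of the answer>"],
}

def is_multi_hop(record: dict) -> bool:
    """Multi-hop questions carry the extra 'intermediate entities' field."""
    return "intermediate entities" in record

print(is_multi_hop(sample_record))  # True
print(json.dumps(sample_record, indent=2))
```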
@article{meem2024pat,
title={PAT-Questions: A Self-Updating Benchmark for Present-Anchored Temporal Question-Answering},
author={Meem, Jannat Ara and Rashid, Muhammad Shihab and Dong, Yue and Hristidis, Vagelis},
journal={arXiv preprint arXiv:2402.11034},
year={2024}
}