AI Summary • Published on Dec 28, 2023
Information Extraction (IE) is a critical component of Natural Language Processing (NLP) that aims to convert unstructured text into structured knowledge, and it underpins downstream applications such as knowledge graph construction and question answering. Traditional IE tasks, including Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE), face significant challenges: diverse information sources, complex and evolving domain requirements, and the need to deploy multiple independent models for different subtasks, which incurs substantial development and training costs. The recent emergence of Large Language Models (LLMs), with their strong text comprehension and generation capabilities, presents a new paradigm for IE. Unlike traditional discriminative methods, this generative approach has the potential to handle complex schemas and to unify diverse IE tasks under a single framework.
This survey systematically explores the integration of Large Language Models (LLMs) into generative Information Extraction (IE). It categorizes existing approaches along two primary taxonomies: the specific IE subtasks addressed and the underlying techniques employed. IE subtasks such as Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE) are reframed as generative problems in which an LLM is trained to maximize the conditional probability of a target output sequence given the input text and a tailored prompt. The survey also delves into universal IE frameworks, differentiating between natural language-based LLMs (NL-LLMs) and code-based LLMs (Code-LLMs), both designed to model various IE tasks uniformly. Key techniques discussed include Data Augmentation (e.g., synthetic data generation, knowledge retrieval, inverse generation), Prompt Design (strategies such as Question Answering, Chain-of-Thought, and Self-Improvement), Constrained Decoding Generation, and the learning settings of Zero-shot Learning, Few-shot Learning, and Supervised Fine-tuning. Each technique is examined for its role in adapting LLMs to IE tasks.
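To make the generative reframing concrete, here is a minimal sketch of how a single NER instance could be cast as prompted sequence generation with an off-the-shelf seq2seq model from Hugging Face Transformers. The prompt wording, the "entity: type" linearization of the target, and the `t5-base` backbone are illustrative assumptions, not the survey's prescribed format.

```python
# Minimal sketch: casting NER as prompted sequence generation.
# The prompt template and the "entity: type" target linearization
# are illustrative assumptions, not the survey's prescribed format.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-base"  # any seq2seq LLM backbone would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Barack Obama was born in Hawaii."
prompt = (
    "Extract all named entities with their types "
    "(person, location, organization) from the text: " + text
)
# Target sequence the model is trained to generate.
target = "Barack Obama: person; Hawaii: location"

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Training maximizes the conditional probability of the target sequence
# given the prompted input; the cross-entropy loss below is the negative
# log-likelihood of that autoregressive factorization.
outputs = model(**inputs, labels=labels)
print("per-token negative log-likelihood:", outputs.loss.item())

# At inference time, the structured output is simply decoded as text
# and parsed back into (entity, type) pairs.
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```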
The survey's empirical analysis highlights several critical observations about LLMs in generative IE. For Named Entity Recognition (NER), methods evaluated in few-shot and zero-shot settings trail those using supervised fine-tuning (SFT) or data augmentation (DA) by a notable margin. In-context learning (ICL) methods show wide performance variation, with approaches like GPT-NER demonstrating significant F1 improvements. SFT models with substantially different backbone sizes exhibit only minor performance differences, although their effectiveness can vary considerably across datasets, particularly for universal models, owing to domain distribution gaps. Data augmentation strategies such as EnTDA consistently prove robust. In Relation Extraction (RE), universal IE models generally perform better on complex "Relation Strict" problems by leveraging inter-task dependencies, and performance differences among RE models are more pronounced than in NER. For Event Extraction (EE), the majority of methods rely on SFT, with generative methods markedly outperforming discriminative ones, especially in argument classification. Overall, universal IE models fine-tuned with SFT generally outperform task-specific models across NER, RE, and EE datasets.
The ongoing development of LLMs for generative Information Extraction presents numerous avenues for future research. One significant direction is further enhancing universal IE frameworks for greater adaptability across diverse domains and tasks, potentially by integrating insights from specialized task-specific models. For resource-limited IE scenarios, advanced in-context learning strategies merit exploration, with a focus on better example selection and on robust cross-domain techniques such as domain adaptation and multi-task learning. Research into efficient LLM-powered data annotation methods is also needed. Prompt design remains a vital area: aligning input/output formats with LLM pre-training objectives (e.g., code generation), refining prompts to elicit better reasoning (e.g., Chain-of-Thought), and developing interactive designs such as multi-turn question answering. Finally, while LLMs show promise in Open IE settings, further investigation is needed to overcome performance challenges in more complex, open-ended extraction tasks.
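To illustrate the example-selection idea in concrete terms, the sketch below assembles a few-shot NER prompt by choosing the demonstrations most similar to the query sentence. Jaccard word overlap stands in for the embedding- or retrieval-based selection methods the survey discusses, and the labeled pool and prompt template are invented for illustration.

```python
# Sketch of similarity-based demonstration selection for in-context IE.
# Jaccard word overlap is a stand-in for embedding-based retrieval;
# the labeled pool and prompt template are illustrative assumptions.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Hypothetical labeled pool: (sentence, linearized entity annotation)
pool = [
    ("Apple opened a new store in Berlin.", "Apple: organization; Berlin: location"),
    ("Marie Curie won the Nobel Prize.", "Marie Curie: person; Nobel Prize: miscellaneous"),
    ("The Amazon river flows through Brazil.", "Amazon: location; Brazil: location"),
]

def build_prompt(query: str, k: int = 2) -> str:
    # Pick the k demonstrations most similar to the query sentence.
    demos = sorted(pool, key=lambda ex: jaccard(ex[0], query), reverse=True)[:k]
    lines = ["Extract named entities with their types."]
    for sent, ann in demos:
        lines.append(f"Text: {sent}\nEntities: {ann}")
    lines.append(f"Text: {query}\nEntities:")
    return "\n\n".join(lines)

print(build_prompt("Google was founded in California."))
# The resulting prompt would be sent to an instruction-tuned LLM, whose
# completion is then parsed back into (entity, type) pairs.
```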