This is the official implementation for our ACL 2024 (Findings) paper: Unveiling the Art of Heading Design: A Harmonious Blend of Summarization, Neology, and Algorithm.
We introduce LOgogram, a novel heading-generation benchmark comprising 6,653 paper abstracts with corresponding descriptions and acronyms as headings.
To measure the generation quality, we propose a set of evaluation metrics from three aspects: summarization, neology, and algorithm.
Additionally, we explore three strategies (generation ordering, tokenization, and framework design) under prevalent learning paradigms (supervised fine-tuning, reinforcement learning, and in-context learning with Large Language Models).
We recommend creating a new conda virtual environment for running the code in this repository:
conda create -n logogram python=3.8
conda activate logogram
Then install PyTorch 1.13.1. For example, to install with pip and CUDA 11.6:
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
Finally, install the remaining packages using pip:
pip install -r requirements.txt
We crawl the ACL Anthology and then exclude examples whose headings do not contain acronyms.
The unfiltered dataset is saved in /raw-data/acl-anthology/data_acl_all.jsonl.
We further apply a set of tailored filtering rules, based on data inspection, to eliminate anomalies, and we replace acronyms in the abstracts with a mask token to prevent acronym leakage. The details are in src/data_processing.ipynb.
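As a rough illustration of the masking step, here is a minimal sketch; the mask token and the JSONL field names ("abstract", "shorthand") are assumptions and may differ from the actual schema in src/data_processing.ipynb:

import json
import re

MASK = "<mask>"  # assumed mask token; the actual token is defined in src/data_processing.ipynb

def mask_acronym(example):
    # Replace occurrences of the heading's acronym inside the abstract to prevent leakage.
    # The field names "shorthand" and "abstract" are assumptions about the JSONL schema.
    pattern = re.compile(re.escape(example["shorthand"]), flags=re.IGNORECASE)
    example["abstract"] = pattern.sub(MASK, example["abstract"])
    return example

with open("raw-data/acl-anthology/data_acl_all.jsonl") as f:
    examples = [mask_acronym(json.loads(line)) for line in f]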
We plot the distributions of text length and of the number of publications in our dataset in Figures 3 and 4 of our paper. To reproduce them, see src/data_statistics.ipynb.
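A minimal sketch of the text-length statistic, assuming an "abstract" field in the JSONL file (the actual plotting code is in src/data_statistics.ipynb):

import json
import matplotlib.pyplot as plt

with open("raw-data/acl-anthology/data_acl_all.jsonl") as f:
    abstracts = [json.loads(line)["abstract"] for line in f]  # "abstract" field name is an assumption

lengths = [len(a.split()) for a in abstracts]  # whitespace word counts
plt.hist(lengths, bins=50)
plt.xlabel("Abstract length (words)")
plt.ylabel("Number of examples")
plt.savefig("abstract_length_distribution.png")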
We evaluate the generated headings against summarization, neologistic, and algorithmic constraints. Specifically, we propose three novel metrics, WordLikeness (WL), WordOverlap (WO), and LCSRatio (LR), covering the neologistic and algorithmic aspects. To justify these metrics, we plot the density estimates of the individual metrics and their joint distribution in Figures 5 and 6, demonstrating that the gold-standard examples achieve high values on them. To reproduce, see src/data_statistics.ipynb.
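For intuition about the algorithmic constraint, below is a minimal sketch of an LCS-based ratio in the spirit of LCSRatio; the exact metric definitions (including WordLikeness and WordOverlap) are implemented in run_eval.py and may differ in details:

def lcs_length(a: str, b: str) -> int:
    # Standard dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_ratio(acronym: str, description: str) -> float:
    # Fraction of acronym characters recoverable, in order, from the description.
    if not acronym:
        return 0.0
    return lcs_length(acronym.lower(), description.lower()) / len(acronym)

print(lcs_ratio("PPO", "Proximal Policy Optimization"))  # -> 1.0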
We fine-tune the T5 model and explore the effectiveness of the generation ordering, tokenization, and framework design strategies.
- To fine-tune and run inference (description then acronym, acronym subword-level tokenization, one-stop framework), run:
accelerate launch t5_brute_finetune.py --model_name t5-base --model_mode abstract2description:shorthand --model_save_path ./models/t5-a2ds-token-base --save_total_limit 1
accelerate launch t5_brute_inference.py --model_name models/t5-a2ds-token-base/checkpoint-5 --model_mode abstract2description:shorthand --prediction_save_path ./prediction/brute_t5_a2ds_token_predictions.csv
- To fine-tune and run inference (acronym then description, acronym subword-level tokenization, one-stop framework), run:
accelerate launch t5_brute_finetune.py --model_name t5-base --model_mode abstract2shorthand:description --model_save_path ./models/t5-a2sd-token-base --save_total_limit 1
accelerate launch t5_brute_inference.py --model_name models/t5-a2sd-token-base/checkpoint-5 --model_mode abstract2shorthand:description --prediction_save_path ./prediction/brute_t5_a2sd_token_predictions.csv
- To fine-tune and run inference (description then acronym, acronym letter-level tokenization, one-stop framework), run:
accelerate launch t5_brute_finetune.py --model_name t5-base --model_mode abstract2description:shorthand --shorthand_mode character --model_save_path ./models/t5-a2ds-char-base --save_total_limit 1
accelerate launch t5_brute_inference.py --model_name models/t5-a2ds-char-base/checkpoint-5 --model_mode abstract2description:shorthand --shorthand_mode character --prediction_save_path ./prediction/brute_t5_a2ds_char_predictions.csv
- To fine-tune and run inference (description then acronym, acronym subword-level tokenization, pipeline framework), run:
accelerate launch t5_brute_finetune.py --model_name t5-base --model_mode abstract2description --model_save_path ./models/t5-a2ds-token-pipe/1 --save_total_limit 1
accelerate launch t5_brute_finetune.py --model_name t5-base --model_mode abstract-description2shorthand --model_save_path ./models/t5-a2ds-token-pipe/2 --save_total_limit 1
accelerate launch t5_brute_inference.py --model_name models/t5-a2ds-token-pipe/1/checkpoint-5 --model_mode abstract2description --prediction_save_path ./prediction/brute_t5_a2ds_token_pipe_predictions.csv
accelerate launch t5_brute_inference.py --model_name models/t5-a2ds-token-pipe/2/checkpoint-5 --model_mode abstract-description2shorthand --prediction_save_path ./prediction/brute_t5_a2ds_token_pipe_predictions.csv
The RL paradigm is built on top of the SFT paradigm. Specifically, we choose the Proximal Policy Optimization (PPO) algorithm. We evaluate all strategies except the pipeline framework design, since feedback mechanisms for pipeline language models under the RL paradigm remain relatively unexplored.
- To fine-tune and run inference (description then acronym, acronym subword-level tokenization, one-stop framework), run:
TOKENIZERS_PARALLELISM=false accelerate launch t5_ppo_finetune.py --model_mode abstract2description:shorthand --model_save_path ./models/t5-a2ds-token-ppo --save_total_limit 1
accelerate launch t5_brute_inference.py --model_name models/t5-a2ds-token-ppo --model_mode abstract2description:shorthand --prediction_save_path ./prediction/brute_t5_a2ds_token_ppo_predictions.csv
- To fine-tune and run inference (acronym then description, acronym subword-level tokenization, one-stop framework), run:
TOKENIZERS_PARALLELISM=false accelerate launch t5_ppo_finetune.py --model_mode abstract2shorthand:description --model_save_path ./models/t5-a2sd-token-ppo --save_total_limit 1
accelerate launch t5_brute_inference.py --model_name models/t5-a2sd-token-ppo --model_mode abstract2shorthand:description --prediction_save_path ./prediction/brute_t5_a2sd_token_ppo_predictions.csv
- To fine-tune and run inference (description then acronym, acronym letter-level tokenization, one-stop framework), run:
TOKENIZERS_PARALLELISM=false accelerate launch t5_ppo_finetune.py --model_mode abstract2description:shorthand --shorthand_mode character --model_save_path ./models/t5-a2ds-char-ppo --save_total_limit 1
accelerate launch t5_brute_inference.py --model_name models/t5-a2ds-char-ppo --model_mode abstract2description:shorthand --shorthand_mode character --prediction_save_path ./prediction/brute_t5_a2ds_char_ppo_predictions.csv
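The PPO commands above optimize a scalar reward for each generated heading. As a rough, hypothetical sketch of what such a reward could look like (the actual reward design and weights live in t5_ppo_finetune.py and may differ), one could combine a summarization score with the algorithmic constraint:

def unigram_f1(pred: str, ref: str) -> float:
    # Crude summarization proxy: unigram-overlap F1 between generated and reference descriptions.
    p, r = set(pred.lower().split()), set(ref.lower().split())
    common = len(p & r)
    if not p or not r or not common:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)

def heading_reward(pred_acronym: str, pred_description: str, ref_description: str) -> float:
    # Hypothetical equal weighting of summarization and algorithmic aspects;
    # lcs_ratio is the sketch from the evaluation-metrics section above.
    return 0.5 * unigram_f1(pred_description, ref_description) + 0.5 * lcs_ratio(pred_acronym, pred_description)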
To replicate the in-context learning (ICL) results, run:
python icl_main.py
The generation model can be selected from the LLMs supported in icl_main.py.
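For reference, a few-shot prompt for this task can be assembled roughly as follows; the template and field names are assumptions and may differ from what icl_main.py actually sends to the model:

def build_prompt(demonstrations, abstract):
    # Assemble a hypothetical few-shot prompt: a short instruction, k demonstrations, then the query abstract.
    parts = ["Generate a heading (a description and an acronym) for the given abstract."]
    for demo in demonstrations:
        parts.append(f"Abstract: {demo['abstract']}\nHeading: {demo['description']} ({demo['shorthand']})")
    parts.append(f"Abstract: {abstract}\nHeading:")
    return "\n\n".join(parts)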
To evaluate the generated acronyms, run:
python run_eval.py \
--file <CSV file> \
--eval_type shorthand \
--hypos-col <the column name of generated acronyms> \
--refs-col <the column name of ground truth acronyms>
For descriptions, run:
python run_eval.py \
--file <CSV file> \
--eval_type description \
--hypos-col <the column name of generated descriptions> \
--refs-col <the column name of ground truth descriptions>
By default, the prediction CSV files are saved in prediction/.
If you want to cite our dataset and paper, you can use this BibTeX entry:
@inproceedings{cui-etal-2024-unveiling,
title = "Unveiling the Art of Heading Design: A Harmonious Blend of Summarization, Neology, and Algorithm",
author = "Cui, Shaobo and
Feng, Yiyang and
Mao, Yisong and
Hou, Yifan and
Faltings, Boi",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-acl.368",
pages = "6149--6174",
abstract = "Crafting an appealing heading is crucial for attracting readers and marketing work or products. A popular way is to summarize the main idea with a refined description and a memorable acronym. However, there lacks a systematic study and a formal benchmark including datasets and metrics. Motivated by this absence, we introduce LOgogram, a novel benchmark comprising 6,653 paper abstracts with corresponding descriptions and acronyms. To measure the quality of heading generation, we propose a set of evaluation metrics from three aspects: summarization, neology, and algorithm. Additionally, we explore three strategies for heading generation(generation ordering, tokenization of acronyms, and framework design) under various prevalent learning paradigms(supervised fine-tuning, in-context learning with Large Language Models(LLMs), and reinforcement learning) on our benchmark. Our experimental results indicate the difficulty in identifying a practice that excels across all summarization, neologistic, and algorithmic aspects.",
}