a utility for using transformers summarization models on text docs 🖇
This package provides easy-to-use interfaces for using summarization models on text documents of arbitrary length. Currently implemented interfaces include a python API, CLI, and a shareable demo app.
For details, explanations, and docs, see the wiki.
- Install the package with pip:
pip install textsum
- Import the package and create a summarizer:
from textsum.summarize import Summarizer
summarizer = Summarizer() # loads default model and parameters
- Summarize a text string:
text = "This is a long string of text that will be summarized."
summary = summarizer.summarize_string(text)
print(f'Summary: {summary}')
Install using pip with Python 3.8 or later (after creating a virtual environment):
pip install textsum
The `textsum` package is now installed in your virtual environment. CLI commands are available in your terminal, and the python API is available in your python environment.
For a full installation, which includes additional features such as PDF OCR, Gradio UI demo, and Optimum, run the following commands:
git clone https://github.com/pszemraj/textsum.git
cd textsum
# create a virtual environment (optional)
pip install -e .[all]
The package also supports a number of optional extra features, which can be installed as follows:
- `8bit`: Install with `pip install -e .[8bit]`
- `optimum`: Install with `pip install -e .[optimum]`
- `PDF`: Install with `pip install -e .[PDF]`
- `app`: Install with `pip install -e .[app]`
- `unidecode`: Install with `pip install -e .[unidecode]`
Read below for more details on how to use these features.
Note: The `unidecode` extra is a GPL-licensed dependency that is not included by default with the `clean-text` python package. While it can be used for text cleaning pre-summarization, it should not make a significant difference in most use cases.
There are three ways to use this package: the python API, the CLI, and the demo app.
To use the python API, import the `Summarizer` class and instantiate it. This will load the default model and parameters.
You can then use the `summarize_string` method to summarize a long text string.
from textsum.summarize import Summarizer
summarizer = Summarizer() # loads default model and parameters
# summarize a long string
out_str = summarizer.summarize_string('This is a long string of text that will be summarized.')
print(f'summary: {out_str}')
You can also directly summarize a file:
out_path = summarizer.summarize_file('/path/to/file.txt')
print(f'summary saved to {out_path}')
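To process several files from Python rather than one at a time, the file-level call can be wrapped in a loop. The following is a minimal sketch, not part of the package: `summarize_files` is a hypothetical helper, and it assumes only that `summarize_file` takes a path and returns the path of the saved summary, as in the example above.

```python
from pathlib import Path


def summarize_files(summarizer, input_dir, pattern="*.txt"):
    """Summarize every file in input_dir matching pattern (hypothetical helper).

    Assumes summarizer.summarize_file(path) returns the path of the saved
    summary, as in the README example above.
    """
    outputs = {}
    # Sort for a stable, reproducible processing order.
    for path in sorted(Path(input_dir).glob(pattern)):
        outputs[path.name] = summarizer.summarize_file(str(path))
    return outputs
```

Calling `summarize_files(Summarizer(), "/path/to/dir")` would return a dict mapping each input file name to whatever `summarize_file` returned for it.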
To summarize a directory of text files, run the following command:
textsum-dir /path/to/dir
A full list of flags:
Flag | Description |
---|---|
`--output_dir` | Specify the output directory |
`--model` | Specify the model to use |
`--no_cuda` | Disable CUDA |
`--tf32` | Use TF32 precision |
`--force_cache` | Force cache usage |
`--load_in_8bit` | Load in 8-bit mode |
`--compile` | Compile the model |
`--optimum_onnx` | Use optimum ONNX |
`--batch_length` | Specify the batch length |
`--batch_stride` | Specify the batch stride |
`--num_beams` | Specify the number of beams |
`--length_penalty` | Specify the length penalty |
`--repetition_penalty` | Specify the repetition penalty |
`--max_length_ratio` | Specify the maximum length ratio |
`--min_length` | Specify the minimum length |
`--encoder_no_repeat_ngram_size` | Specify the encoder no repeat ngram size |
`--no_repeat_ngram_size` | Specify the no repeat ngram size |
`--early_stopping` | Enable early stopping |
`--shuffle` | Shuffle the input data |
`--lowercase` | Convert input to lowercase |
`--loglevel` | Specify the log level |
`--logfile` | Specify the log file |
`--file_extension` | Specify the file extension |
`--skip_completed` | Skip completed files |
Some useful options are:
Arguments:
- `input_dir`: The directory containing the input text files to be summarized.
- `--model`: Model name or path to use for summarization. (Optional)
- `--shuffle`: Shuffle the input files before processing. (Optional)
- `--skip_completed`: Skip already completed files in the output directory. (Optional)
- `--batch_length`: The maximum length of each input batch. Default is 4096. (Optional)
- `--output_dir`: The directory to write the summarized output files. Default is `./summarized/`. (Optional)
For more information, run the following:
textsum-dir --help
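When the CLI is driven from a script, it can help to assemble the command from the options described above. The sketch below is illustrative only: `build_textsum_dir_command` is a hypothetical helper, it covers just the options listed under "Arguments", and it assumes boolean flags are passed bare (without a value).

```python
import shlex


def build_textsum_dir_command(input_dir, model=None, output_dir=None,
                              batch_length=None, shuffle=False,
                              skip_completed=False):
    """Build an argument list for textsum-dir (hypothetical helper)."""
    cmd = ["textsum-dir", str(input_dir)]
    if model is not None:
        cmd += ["--model", str(model)]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    if batch_length is not None:
        cmd += ["--batch_length", str(batch_length)]
    # ASSUMPTION: boolean options are switched on by the bare flag.
    if shuffle:
        cmd.append("--shuffle")
    if skip_completed:
        cmd.append("--skip_completed")
    return cmd


cmd = build_textsum_dir_command("/path/to/dir", batch_length=2048,
                                skip_completed=True)
print(shlex.join(cmd))
# prints: textsum-dir /path/to/dir --batch_length 2048 --skip_completed
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell-quoting problems with paths that contain spaces.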
For convenience, a UI demo[^1] is provided using gradio. To make sure the dependencies are installed, run the following command (if you cloned the repo, `pip install -e .[app]` does the same):
pip install textsum[app]
To run the demo, run the following command:
textsum-ui
This will start a local server that you can access in your browser, and a shareable link will be printed to the console.
Summarization is a memory-intensive task, and the default model is relatively small and efficient for long-form text summarization. If you want to use a bigger model, you can specify the `model_name_or_path` argument when instantiating the `Summarizer` class.
summarizer = Summarizer(model_name_or_path='pszemraj/long-t5-tglobal-xl-16384-book-summary')
You can also use the `-m` argument when using the CLI:
textsum-dir /path/to/dir -m pszemraj/long-t5-tglobal-xl-16384-book-summary
Any text-to-text or summarization model from the HuggingFace model hub can be used. Models are automatically downloaded and cached in `~/.cache/huggingface/hub`.
Memory usage can also be reduced by adjusting the parameters for inference. This is discussed in detail in the project wiki.
tl;dr for this README: use `summarizer.set_inference_params()` to adjust the parameters for inference from either a python `dict` or a JSON file, and `summarizer.get_inference_params()` to retrieve them.
Support for `GenerationConfig` as the primary method to adjust inference parameters is planned for a future release.
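The dict-or-JSON workflow for inference parameters might look like the sketch below. Both helpers are hypothetical, and two things are assumptions rather than documented behavior: that `set_inference_params` accepts a dict as its first argument, and that the example keys (borrowed from the CLI flags listed earlier) are valid inference parameters. The wiki is the reference for the actual signature.

```python
import json
from pathlib import Path


def load_inference_params(json_path):
    """Read inference parameters from a JSON file into a plain dict."""
    params = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not isinstance(params, dict):
        raise ValueError("expected a JSON object of parameter names to values")
    return params


def apply_inference_params(summarizer, params):
    """Apply params, then return what the summarizer reports afterwards.

    ASSUMPTION: set_inference_params accepts a dict as its first argument.
    """
    summarizer.set_inference_params(params)
    return summarizer.get_inference_params()


# ASSUMPTION: these names mirror the CLI flags; they are not confirmed keys.
example_params = {"num_beams": 4, "no_repeat_ngram_size": 3,
                  "early_stopping": True}
```

Keeping the parameters in a JSON file makes it easy to reuse one configuration across runs and to check it into version control alongside the outputs.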
Some methods of reducing memory usage if you have compatible hardware include loading the model in 8-bit precision via LLM.int8 and using the `--tf32` flag to use TensorFloat32 precision. See the transformers docs for more details on how this works. Using LLM.int8 requires the `bitsandbytes` package, which can either be installed directly or via the `textsum[8bit]` extra:
pip install textsum[8bit]
To use these options, use the `-8bit` and `--tf32` flags when using the CLI:
textsum-dir /path/to/dir -8bit --tf32
Or in python, using the `load_in_8bit` argument:
summarizer = Summarizer(load_in_8bit=True)
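Because 8-bit loading depends on `bitsandbytes`, a script can check for the package before asking for it. This is a small sketch using only the standard library; `bitsandbytes_available` and `summarizer_kwargs` are illustrative names, not part of the package.

```python
import importlib.util


def bitsandbytes_available():
    """Return True if the bitsandbytes package can be found for import."""
    return importlib.util.find_spec("bitsandbytes") is not None


# Only request 8-bit loading when the dependency is present.
summarizer_kwargs = {"load_in_8bit": True} if bitsandbytes_available() else {}
```

The resulting dict can then be unpacked into the constructor, as in `Summarizer(**summarizer_kwargs)`, so the same script runs with or without the extra installed.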
If using the python API, it's better to enable TF32 yourself; see here for how.
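For reference, PyTorch exposes TF32 through two backend flags. The sketch below sets both before any model is loaded; `enable_tf32` is a hypothetical wrapper, and the flags only have an effect on NVIDIA GPUs that support TF32 (Ampere or newer).

```python
def enable_tf32():
    """Turn on TF32 for matmuls and cuDNN convolutions.

    Returns False if PyTorch is not installed, True otherwise.
    """
    try:
        import torch
    except ImportError:
        return False
    # Both flags are plain module-level switches in torch.backends.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True
```

Call `enable_tf32()` once at the start of the script, before creating the `Summarizer`.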
⚠️ Note: This feature is experimental and might not work as expected. Use at your own risk.⚠️ 🧪
ONNX Runtime is a performance-focused inference engine for ONNX models. It can be used to enhance the speed of model predictions, especially on Windows and in environments where GPU acceleration is not available. If you want to use ONNX runtime for inference, you need to set `optimum_onnx=True` when initializing the `Summarizer` class.
First, install with `pip install textsum[optimum]`. Then, you can use the following code to initialize the `Summarizer` class with ONNX runtime:
summarizer = Summarizer(optimum_onnx=True)
Notes:
- ONNX runtime+cuda needs an additional package. Manually install `onnxruntime-gpu` if you plan to use ONNX with GPU.
- Using ONNX runtime might lead to different behavior in certain models. It is recommended to test the model with and without ONNX runtime on the same input text before using it for anything important.
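One way to run the with-and-without comparison recommended in the notes is to feed the same text to two summarizers and compare the outputs. This is a sketch only: `compare_summaries` is a hypothetical helper that relies on nothing beyond `summarize_string`, and the character-level similarity from `difflib` is a rough signal, not a quality metric.

```python
from difflib import SequenceMatcher


def compare_summaries(baseline, candidate, text):
    """Summarize text with both objects and report how the outputs differ."""
    base_out = baseline.summarize_string(text)
    cand_out = candidate.summarize_string(text)
    return {
        "baseline": base_out,
        "candidate": cand_out,
        "identical": base_out == cand_out,
        # Ratio in [0, 1]; 1.0 means the two strings match exactly.
        "similarity": SequenceMatcher(None, base_out, cand_out).ratio(),
    }
```

For example, `compare_summaries(Summarizer(), Summarizer(optimum_onnx=True), text)` would show whether the ONNX path changes the summary for a given input. Loading two models at once doubles memory use, so a small test text and model are advisable.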
By default, the summarization model uses past computations to speed up decoding. If you want to force the model to always use cache irrespective of the model's default behavior, you can set `force_cache=True` when initializing the `Summarizer` class.
summarizer = Summarizer(force_cache=True)
Note: Setting `force_cache=True` might lead to different behavior in certain models.
By default, the model isn't compiled for efficient inference. If you want to compile the model for faster inference times, you can set `compile_model=True` when initializing the `Summarizer` class.
summarizer = Summarizer(compile_model=True)
Note: Compiling the model might not be supported on all platforms and requires PyTorch 2.0 or later.
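Since `torch.compile` was introduced in PyTorch 2.0, a script can guard the `compile_model` option with a version check. The helper below is a hypothetical sketch that parses a version string such as `torch.__version__`; it does not check platform support, which the note above also mentions.

```python
import re


def torch_supports_compile(version):
    """Return True if a version string like '2.1.0+cu118' is 2.0 or newer."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (2, 0)
```

A typical use would be `Summarizer(compile_model=torch_supports_compile(torch.__version__))`, which falls back to the uncompiled model on older installations.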
Contributions are welcome! Please open an issue or PR if you have any ideas or suggestions.
See the CONTRIBUTING.md file for details on how to contribute.
- add CLI for summarization of all text files in a directory
- python API for summarization of text docs
- add argparse CLI for UI demo
- put on PyPI
- LLM.int8 inference
- optimum inference integration
- better documentation in the wiki, details on improving performance (speed, quality, memory usage, etc.)
  - in-progress
- improvements to the PDF OCR helper module (TBD - may focus more on being a summarization tool)
Other ideas? Open an issue or PR!
Footnotes
[^1]: The demo is minimal but will be expanded to accept other arguments and options.