Thank you for coming here! It's always nice to have third-party contributors 🤗 Depending on the type of contribution, you may need to follow different steps.
The project is split into the following parts:

- `include/stringzilla/stringzilla.h` - single-header C implementation.
- `include/stringzilla/stringzilla.hpp` - single-header C++ wrapper.
- `python/*` - Python bindings.
- `javascript/*` - JavaScript bindings.
- `scripts/*` - scripts for benchmarking and testing.
For minimal test coverage, check the following scripts:

- `test.cpp` - tests the C++ API (not the underlying C) against the STL.
- `test.py` - tests the Python API against native strings.
- `test.js` - tests the JavaScript API.
At the C++ level, all benchmarks also validate the results against the STL baseline, serving as tests on real-world data. They have the broadest coverage of the library and are the most important to keep up to date:

- `bench_token.cpp` - token-level operations, like hashing, ordering, and equality checks.
- `bench_search.cpp` - bidirectional substring search, both exact and fuzzy.
- `bench_similarity.cpp` - all edit distance backends.
- `bench_sort.cpp` - sorting, partitioning, and merging.
- `bench_container.cpp` - STL containers with different string keys.
The role of the Python benchmarks is not so much to provide absolute numbers as to compare against popular tools in the Python ecosystem.

- `bench_search.(py|ipynb)` - compares against the native Python `str`.
- `bench_sort.(py|ipynb)` - compares against `pandas`.
- `bench_similarity.ipynb` - compares against `jellyfish`, `editdistance`, etc.
It's not always easy to find good datasets for benchmarking string workloads. I use several ASCII and UTF-8 international datasets. You can download them using the following commands:

```sh
# English Leipzig Corpora Collection
# 124 MB, 1'000'000 lines of ASCII, 8'388'608 tokens of mean length 5
wget --no-clobber -O leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt

# Hutter Prize "enwik9" dataset for compression
# 1 GB (0.3 GB compressed), 13'147'025 lines of ASCII, 67'108'864 tokens of mean length 6
wget --no-clobber -O enwik9.zip http://mattmahoney.net/dc/enwik9.zip
unzip enwik9.zip && rm enwik9.zip && mv enwik9 enwik9.txt

# XL Sum dataset for multilingual extractive summarization
# 4.7 GB (1.7 GB compressed), 1'004'598 lines of UTF8, 268'435'456 tokens of mean length 8
wget --no-clobber -O xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz
gzip -d xlsum.csv.gz

# Human chromosome dataset: 1'200 rows, each 800 characters long (939 KB), generated by:
# https://github.com/rghilduta/human-chromosome-data-generator/blob/main/generate_chromosome_data.sh
wget --no-clobber -O human_protein_1200row_800len.txt https://media.githubusercontent.com/media/rghilduta/human-chromosome-data-generator/main/examples/human_protein_1200row_800len.txt
```
The project was originally developed in VS Code and contains a set of configuration files for that IDE under `.vscode/`:

- `tasks.json` - build tasks for CMake.
- `launch.json` - debugger launchers for CMake.
- `extensions.json` - recommended extensions for VS Code, including:
  - `ms-vscode.cpptools-themes` - C++ language support.
  - `ms-vscode.cmake-tools`, `cheshirekow.cmake-format` - CMake integration.
  - `ms-python.python`, `ms-python.black-formatter` - Python language support.
  - `yzhang.markdown-all-in-one` - formatting Markdown.
  - `aaron-bond.better-comments` - color-coded comments.
The project uses `.clang-format` to enforce a consistent code style. Modern IDEs, like VS Code, can be configured to format the code automatically on save.
- Use East const over West const. Write `char const*` instead of `const char*`.
- For color-coded comments, start the line with `!` for warnings or `?` for questions.
- Sort the includes: standard libraries, third-party libraries, and only then internal project headers.
For C++ code:

- Explicitly use `std::` or `sz::` namespaces over global `memcpy`, `uint64_t`, etc.
- Explicitly mark `noexcept` or `noexcept(false)` for all library interfaces.
- Document all possible exceptions of an interface using `@throw` in Doxygen.
- Avoid C-style variadic arguments in favor of templates.
- Avoid C-style casts in favor of `static_cast`, `reinterpret_cast`, and `const_cast`, except where a C function is called.
- Use lower-case names for everything, except settings/conditions macros. Function-like macros that take arguments should be lowercase as well.
- In templates, prefer `typename` over `class`.
- Prepend "private" symbols with an `_` underscore.
For Python code:
- Use lower-case names for functions and variables.
The primary C implementation and the C++ wrapper are built with CMake. Given the extensive use of new SIMD intrinsics and recent C++ language features, a recent compiler is recommended. We prefer GCC 12, which is available from the default repositories starting with Ubuntu 22.04 LTS. If this is your first experience with CMake, use the following commands to get started on Ubuntu:

```sh
sudo apt-get update && sudo apt-get install cmake build-essential libjemalloc-dev g++-12 gcc-12
```
On macOS it's recommended to use Homebrew and install Clang, as opposed to "Apple Clang". Replacing the default compiler is not recommended, as it may break the system, but you can pass the alternative compiler explicitly:

```sh
brew install llvm
cmake -D CMAKE_BUILD_TYPE=Release -D STRINGZILLA_BUILD_TEST=1 \
    -D CMAKE_C_COMPILER="$(brew --prefix llvm)/bin/clang" \
    -D CMAKE_CXX_COMPILER="$(brew --prefix llvm)/bin/clang++" \
    -B build_release
cmake --build build_release --config Release
```
Using modern CMake syntax, this is how you build and run the test suite:

```sh
cmake -D STRINGZILLA_BUILD_TEST=1 -D CMAKE_BUILD_TYPE=Debug -B build_debug
cmake --build build_debug --config Debug # Which will produce the following targets:
build_debug/stringzilla_test_cpp20            # Unit test for the entire library compiled for current hardware
build_debug/stringzilla_test_cpp20_x86_serial # x86 variant compiled for Ivy Bridge - the last arch. before AVX2
build_debug/stringzilla_test_cpp20_arm_serial # Arm variant compiled without NEON
```
To use CppCheck for static analysis, make sure to export the compilation commands. Overall, CppCheck and Clang-Tidy are extremely noisy and not suitable for CI, but may be useful for local development.

```sh
sudo apt install cppcheck clang-tidy-11
cmake -B build_artifacts \
    -D CMAKE_BUILD_TYPE=RelWithDebInfo \
    -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
    -D STRINGZILLA_BUILD_BENCHMARK=1 \
    -D STRINGZILLA_BUILD_TEST=1
cppcheck --project=build_artifacts/compile_commands.json --enable=all
clang-tidy-11 -p build_artifacts
```
I'd recommend setting the following breakpoints:

- `__asan::ReportGenericError` - to detect illegal memory accesses.
- `__GI_exit` - to stop at exit points - the end of running any executable.
- `__builtin_unreachable` - to catch unexpected code paths.
- `_sz_assert_failure` - to catch StringZilla logic assertions.
For benchmarks, you can use the following commands:

```sh
cmake -D STRINGZILLA_BUILD_BENCHMARK=1 -B build_release
cmake --build build_release --config Release # Produces the following targets:
build_release/stringzilla_bench_memory <path>     # - for string copies and fills
build_release/stringzilla_bench_search <path>     # - for substring search
build_release/stringzilla_bench_token <path>      # - for hashing, equality comparisons, etc.
build_release/stringzilla_bench_similarity <path> # - for edit distances and alignment scores
build_release/stringzilla_bench_sort <path>       # - for sorting arrays of strings
build_release/stringzilla_bench_container <path>  # - for STL containers with string keys
```
When running on modern hardware, you may want to compile the code for older generations to compare the relative performance. The assumption is that newer ISA extensions provide better performance. On x86_64, you can use the following commands to compile for Ivy Bridge, Haswell, and Sapphire Rapids:
```sh
cmake -D CMAKE_BUILD_TYPE=Release -D STRINGZILLA_BUILD_BENCHMARK=1 \
    -D STRINGZILLA_TARGET_ARCH="ivybridge" -B build_release/ivybridge && \
    cmake --build build_release/ivybridge --config Release
cmake -D CMAKE_BUILD_TYPE=Release -D STRINGZILLA_BUILD_BENCHMARK=1 \
    -D STRINGZILLA_TARGET_ARCH="haswell" -B build_release/haswell && \
    cmake --build build_release/haswell --config Release
cmake -D CMAKE_BUILD_TYPE=Release -D STRINGZILLA_BUILD_BENCHMARK=1 \
    -D STRINGZILLA_TARGET_ARCH="sapphirerapids" -B build_release/sapphirerapids && \
    cmake --build build_release/sapphirerapids --config Release
```
Alternatively, you may want to compare the performance of the code compiled with different compilers. On x86_64, you may want to compare GCC, Clang, and ICX.
```sh
cmake -D CMAKE_BUILD_TYPE=Release -D STRINGZILLA_BUILD_BENCHMARK=1 -D STRINGZILLA_BUILD_SHARED=1 \
    -D CMAKE_CXX_COMPILER=g++-12 -D CMAKE_C_COMPILER=gcc-12 \
    -B build_release/gcc && cmake --build build_release/gcc --config Release
cmake -D CMAKE_BUILD_TYPE=Release -D STRINGZILLA_BUILD_BENCHMARK=1 -D STRINGZILLA_BUILD_SHARED=1 \
    -D CMAKE_CXX_COMPILER=clang++-14 -D CMAKE_C_COMPILER=clang-14 \
    -B build_release/clang && cmake --build build_release/clang --config Release
```
To simplify tracing and profiling, build with symbols using the `RelWithDebInfo` configuration. Here is an example for profiling one target - `stringzilla_bench_token`:

```sh
cmake -D STRINGZILLA_BUILD_BENCHMARK=1 \
    -D STRINGZILLA_BUILD_TEST=1 \
    -D STRINGZILLA_BUILD_SHARED=1 \
    -D CMAKE_BUILD_TYPE=RelWithDebInfo \
    -B build_profile
cmake --build build_profile --config Release --target stringzilla_bench_token

# Check that the debugging symbols are there with your favorite tool
readelf --sections build_profile/stringzilla_bench_token | grep debug
objdump -h build_profile/stringzilla_bench_token | grep debug

# Profile
sudo perf record -g build_profile/stringzilla_bench_token ./leipzig1M.txt
sudo perf report
```
It might be a good idea to check compatibility against the most popular Linux distributions. Docker is the go-to choice for that.

Alpine is one of the most popular Linux distributions for containers, due to its size. The base image is only ~3 MB, and it's based on musl libc, which differs from glibc.

```sh
sudo docker run -it --rm -v "$(pwd)":/workspace/StringZilla alpine:latest /bin/ash
cd /workspace/StringZilla
apk add --update make cmake g++ gcc
cmake -D STRINGZILLA_BUILD_TEST=1 -D CMAKE_BUILD_TYPE=Debug -B build_debug
cmake --build build_debug --config Debug
build_debug/stringzilla_test_cpp20
```
Clear Linux is a distribution optimized for Intel hardware and known for its performance. It has rolling releases and is based on `glibc`. It might be a good choice for compiling with Intel oneAPI compilers.

```sh
sudo docker run -it --rm -v "$(pwd)":/workspace/StringZilla clearlinux:latest /bin/bash
cd /workspace/StringZilla
swupd update
swupd bundle-add c-basic dev-utils
cmake -D STRINGZILLA_BUILD_TEST=1 -D CMAKE_BUILD_TYPE=Debug -B build_debug
cmake --build build_debug --config Debug
build_debug/stringzilla_test_cpp20
```
For benchmarks:

```sh
cmake -D STRINGZILLA_BUILD_TEST=1 -D STRINGZILLA_BUILD_BENCHMARK=1 -B build_release
cmake --build build_release --config Release
```
For CentOS-based Amazon Linux 2023:
```sh
sudo docker run -it --rm -v "$(pwd)":/workspace/StringZilla amazonlinux:2023 bash
cd /workspace/StringZilla
yum install -y make cmake3 gcc g++
cmake3 -D STRINGZILLA_BUILD_TEST=1 -D CMAKE_BUILD_TYPE=Debug \
    -D CMAKE_CXX_COMPILER=g++ -D CMAKE_C_COMPILER=gcc -D STRINGZILLA_TARGET_ARCH="ivybridge" \
    -B build_debug
cmake3 --build build_debug --config Debug --target stringzilla_test_cpp11
build_debug/stringzilla_test_cpp11
```
The CentOS-based Amazon Linux 2 is still used in older AWS Lambda functions. Sadly, the newest GCC version it supports is 10, and it can't handle AVX-512 instructions.
```sh
sudo docker run -it --rm -v "$(pwd)":/workspace/StringZilla amazonlinux:2 bash
cd /workspace/StringZilla
yum install -y make cmake3 gcc10 gcc10-c++
cmake3 -D STRINGZILLA_BUILD_TEST=1 -D CMAKE_BUILD_TYPE=Debug \
    -D CMAKE_CXX_COMPILER=g++ -D CMAKE_C_COMPILER=gcc -D STRINGZILLA_TARGET_ARCH="ivybridge" \
    -B build_debug
cmake3 --build build_debug --config Debug --target stringzilla_test_cpp11
build_debug/stringzilla_test_cpp11
```
> **Caution:** Even with GCC 10, compiling the tests will fail, as the STL implementation of the `insert` function doesn't conform to the standard. The `s.insert(s.begin() + 1, {'a', 'b', 'c'}) == (s.begin() + 1)` expression is ill-formed, as that `std::string::insert` overload returns `void`.
Don't forget to clean up Docker afterwards.
```sh
docker system prune -a --volumes
```
Unlike GCC, LLVM handles cross-compilation very easily. You just need to pass the right `TARGET_ARCH` and `BUILD_ARCH` to CMake. The list of toolchain packages includes:

- `crossbuild-essential-amd64` for 64-bit x86
- `crossbuild-essential-arm64` for 64-bit Arm
- `crossbuild-essential-armhf` for 32-bit Arm hard-float
- `crossbuild-essential-armel` for 32-bit Arm soft-float (emulates `float`)
- `crossbuild-essential-riscv64` for RISC-V
- `crossbuild-essential-powerpc` for PowerPC
- `crossbuild-essential-s390x` for IBM Z
- `crossbuild-essential-mips` for MIPS
- `crossbuild-essential-ppc64el` for PowerPC 64-bit little-endian
Here is an example for cross-compiling for Arm64 on an x86_64 machine:
```sh
sudo apt-get update
sudo apt-get install -y clang lld make crossbuild-essential-arm64 crossbuild-essential-armhf

export CC="clang"
export CXX="clang++"
export AR="llvm-ar"
export NM="llvm-nm"
export RANLIB="llvm-ranlib"

export TARGET_ARCH="aarch64-linux-gnu" # Or "x86_64-linux-gnu"
export BUILD_ARCH="arm64" # Or "amd64"

cmake -D CMAKE_BUILD_TYPE=Release \
    -D CMAKE_C_COMPILER_TARGET=${TARGET_ARCH} \
    -D CMAKE_CXX_COMPILER_TARGET=${TARGET_ARCH} \
    -D CMAKE_SYSTEM_NAME=Linux \
    -D CMAKE_SYSTEM_PROCESSOR=${BUILD_ARCH} \
    -B build_artifacts
cmake --build build_artifacts --config Release
```
Python bindings are implemented using pure CPython, so you won't need to install SWIG, PyBind11, or any other third-party library.

```sh
pip install -e . # To build locally from source
```
For testing we use PyTest, which may not be installed on your system.
```sh
pip install pytest # To install PyTest
pytest scripts/test.py -s -x # Runs tests printing logs and stops on the first failure
```
On a related note, StringZilla for Python seems to cover more OS and hardware combinations than NumPy. That's why NumPy isn't a required dependency. Still, many tests use NumPy, so consider installing it on mainstream platforms.

```sh
pip install numpy
```
Before you ship, please make sure the `cibuildwheel` packaging works and tests pass on other platforms.
Don't forget to use the right CLI arguments to avoid overloading your Docker runtime.
```sh
cibuildwheel
cibuildwheel --platform linux                 # works on any OS and builds all Linux backends
cibuildwheel --platform linux --archs x86_64  # 64-bit x86, the most common on desktops and servers
cibuildwheel --platform linux --archs aarch64 # 64-bit Arm for mobile devices, Apple M-series, and AWS Graviton
cibuildwheel --platform linux --archs i686    # 32-bit Linux
cibuildwheel --platform linux --archs s390x   # emulating big-endian IBM Z
cibuildwheel --platform macos                 # works only on macOS
cibuildwheel --platform windows               # works only on Windows
```
You may need root privileges for multi-architecture builds:
```sh
sudo $(which cibuildwheel) --platform linux
```
On Windows and macOS, to avoid frequent path resolution issues, you may want to use:

```sh
python -m cibuildwheel --platform windows
```
For high-performance low-latency benchmarking, stick to the native C/C++ benchmarks, as CPython is likely to become the bottleneck. For benchmarking in Python, the following scripts are provided.

```sh
python scripts/bench_search.py --haystack_path "your file" --needle "your pattern" # real data
python scripts/bench_search.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce" # synthetic data
python scripts/similarity_bench.py --text_path "your file" # edit distance computations
```
Alternatively, you can explore the Jupyter notebooks in the `scripts/` directory.
```sh
npm ci && npm test
```
```sh
swift build && swift test
```
Running Swift on Linux requires a couple of extra steps, as the Swift compiler is not available in the default repositories. Please get the most recent Swift tarball from the official website. At the time of writing, for 64-bit Arm CPU running Ubuntu 22.04, the following commands would work:
```sh
wget https://download.swift.org/swift-5.9.2-release/ubuntu2204-aarch64/swift-5.9.2-RELEASE/swift-5.9.2-RELEASE-ubuntu22.04-aarch64.tar.gz
tar xzf swift-5.9.2-RELEASE-ubuntu22.04-aarch64.tar.gz
sudo mv swift-5.9.2-RELEASE-ubuntu22.04-aarch64 /usr/share/swift
echo "export PATH=/usr/share/swift/usr/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
```
You can check the available images on the swift.org/download page.
For x86 CPUs, the following commands would work:
```sh
wget https://download.swift.org/swift-5.9.2-release/ubuntu2204/swift-5.9.2-RELEASE/swift-5.9.2-RELEASE-ubuntu22.04.tar.gz
tar xzf swift-5.9.2-RELEASE-ubuntu22.04.tar.gz
sudo mv swift-5.9.2-RELEASE-ubuntu22.04 /usr/share/swift
echo "export PATH=/usr/share/swift/usr/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
```
Alternatively, on Linux, the official Swift Docker image can be used for builds and tests:
```sh
sudo docker run --rm -v "$PWD:/workspace" -w /workspace swift:5.9 /bin/bash -cl "swift build -c release --static-swift-stdlib && swift test -c release --enable-test-discovery"
```
```sh
cargo test
```
If you are updating the package contents, you can validate the list of included files using the following command:
```sh
cargo package --list --allow-dirty
```
If you want to run benchmarks against third-party implementations, check out the `ashvardanian/memchr_vs_stringzilla` repository.
One of the hardest things to learn in HPC is when to stop optimizing, and where not to start. It doesn't make sense to optimize `sz_order`, because almost always the relative order of two strings is decided by their first few bytes; fetching more bytes is not worth it. In `sz_equal`, however, SIMD can help in rare cases - when the user is comparing two mostly similar strings with identical hashes or checksums.
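To see why, here is a small self-contained Python experiment - using synthetic random words, which is an assumption on my part rather than one of the benchmark datasets - measuring how many leading bytes a lexicographic comparison must inspect before the order is decided:

```python
import random
import string

random.seed(42)  # deterministic synthetic "tokens"
words = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(10_000)]

def deciding_byte(a: str, b: str) -> int:
    """Index of the first differing byte - how deep a comparison must
    look before it can return the ordering."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

pairs = [(random.choice(words), random.choice(words)) for _ in range(10_000)]
depths = [deciding_byte(a, b) for a, b in pairs]
share_early = sum(d < 4 for d in depths) / len(depths)
# Over a 26-letter alphabet, almost every pair differs in the very first byte,
# so fetching more than a handful of bytes rarely pays off.
print(f"decided within the first 4 bytes: {share_early:.2%}")
```

Real-world tokens share longer prefixes than uniformly random ones, but the skew toward the first bytes remains.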
One common avenue for performance optimization is minimizing unaligned loads. Such solutions are beautiful from the algorithmic perspective, but often lead to worse performance. It's often cheaper to issue two interleaving wide-register loads than to try to minimize them at the cost of juggling registers. Unaligned stores are a different story, especially on x86, where multiple reads can be issued in parallel, but only one write at a time.
Byte-level comparisons are simpler and often faster than n-gram comparisons with subsequent interleaving. In the following example we search for 4-byte needles in a haystack, loading at four different offsets and comparing the results as arrays of 32-bit integers.
```cpp
h0_vec.zmm = _mm512_loadu_epi8(h);
h1_vec.zmm = _mm512_loadu_epi8(h + 1);
h2_vec.zmm = _mm512_loadu_epi8(h + 2);
h3_vec.zmm = _mm512_loadu_epi8(h + 3);
matches0 = _mm512_cmpeq_epi32_mask(h0_vec.zmm, n_vec.zmm);
matches1 = _mm512_cmpeq_epi32_mask(h1_vec.zmm, n_vec.zmm);
matches2 = _mm512_cmpeq_epi32_mask(h2_vec.zmm, n_vec.zmm);
matches3 = _mm512_cmpeq_epi32_mask(h3_vec.zmm, n_vec.zmm);
if (matches0 | matches1 | matches2 | matches3)
    return h + sz_u64_ctz(_pdep_u64(matches0, 0x1111111111111111) | //
                          _pdep_u64(matches1, 0x2222222222222222) | //
                          _pdep_u64(matches2, 0x4444444444444444) | //
                          _pdep_u64(matches3, 0x8888888888888888));
```
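The `_pdep_u64` trick at the end is worth spelling out: each of the four masks is deposited into every fourth bit position, so a single OR followed by a count-trailing-zeros recovers the byte offset of the earliest match across all four load offsets. Here is a scalar Python sketch of that logic, with a software `pdep` standing in for the BMI2 instruction:

```python
def pdep(src: int, mask: int) -> int:
    """Software emulation of the BMI2 PDEP instruction: deposit the low
    bits of `src` into the set-bit positions of `mask`, low to high."""
    result, bit = 0, 0
    for pos in range(64):
        if (mask >> pos) & 1:
            if (src >> bit) & 1:
                result |= 1 << pos
            bit += 1
    return result

def ctz(x: int) -> int:
    """Count trailing zeros of a non-zero integer."""
    return (x & -x).bit_length() - 1

# Say lane 2 of `matches0` fired (byte offset 4*2 + 0 = 8) and lane 0 of
# `matches1` fired (byte offset 4*0 + 1 = 1): the earliest match is at 1.
matches0, matches1, matches2, matches3 = 0b100, 0b001, 0, 0
combined = (pdep(matches0, 0x1111111111111111) |
            pdep(matches1, 0x2222222222222222) |
            pdep(matches2, 0x4444444444444444) |
            pdep(matches3, 0x8888888888888888))
assert ctz(combined) == 1  # byte offset of the earliest match
```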
A simpler solution would be to compare byte-by-byte, but in that case we would need to populate multiple registers, broadcasting different letters of the needle into them. That may not be noticeable on a micro-benchmark, but it would be noticeable on real-world workloads, where the CPU will speculatively interleave those search operations with something else happening in that context.
It's important to keep compiler support in mind when extending to new instruction sets.
Check the most recent CI pipeline configurations in `prerelease.yml` and `release.yml` to see which compilers are used.
When implementing dynamic dispatch, avoid compiler intrinsics and OS-specific APIs, as they may not be available on all platforms.
Instead, use inline assembly to check the CPU feature flags and dispatch to the proper implementation.
When dealing with non-trivial algorithms, like edit distances, it's advisable to provide pseudo-code or a reference implementation in addition to the optimized one. Ideally, include it in `scripts/` as a Python Jupyter Notebook with explanations and visualizations.
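As an illustration, a reference Levenshtein distance in pure Python - the textbook Wagner-Fischer dynamic program - is only a few lines and makes a convenient oracle when testing the optimized backends. This is only a sketch of what such a reference might look like, not one of the project's notebooks:

```python
def levenshtein(a: str, b: str) -> int:
    """Reference Wagner-Fischer edit distance: O(len(a) * len(b)) time,
    O(len(b)) memory, keeping only the previous row of the DP matrix."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,              # deletion
                current[j - 1] + 1,           # insertion
                previous[j - 1] + (ca != cb), # substitution
            ))
        previous = current
    return previous[-1]

assert levenshtein("kitten", "sitting") == 3
```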
Sorting algorithms for strings are a deeply studied area. In general, string sorting algorithms discourage the use of comparisons, as they are expensive for variable-length data and also require pointer-chasing for most array layouts. They are also harder to accelerate with SIMD, as most layouts imply 16-byte entries, which are often too big to benefit from simple SIMD techniques.
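To make the comparison-free idea concrete, here is a small sketch of an MSD (most-significant-digit) radix sort over byte strings in Python: it buckets by one leading byte at a time and never invokes a whole-string comparator. This is purely illustrative and bears no relation to StringZilla's actual implementation:

```python
def msd_radix_sort(strings: list[bytes], depth: int = 0) -> list[bytes]:
    """MSD radix sort: bucket by the byte at `depth`, recurse into each
    bucket, and never compare two whole strings against each other."""
    if len(strings) <= 1:
        return strings
    # Strings exhausted at this depth sort before everything else.
    finished = [s for s in strings if len(s) <= depth]
    buckets: dict[int, list[bytes]] = {}
    for s in strings:
        if len(s) > depth:
            buckets.setdefault(s[depth], []).append(s)
    result = finished
    for byte in sorted(buckets):  # a 256-entry histogram in a real implementation
        result += msd_radix_sort(buckets[byte], depth + 1)
    return result

words = [b"banana", b"band", b"ban", b"apple", b"bandana", b"app"]
assert msd_radix_sort(words) == sorted(words)
```

The per-level bucketing touches only one byte of each string, which is exactly why such algorithms avoid comparisons and the pointer-chasing they imply.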