Releases: flame/blis
BLIS 1.0
This release contains several new features and optimizations related to threaded execution, as well as internal changes that improve maintainability and lay the groundwork for future refactoring. The build system and kernel sets saw lots of new code and tweaks to old code, and of course there were many bugfixes.
Improvements present in 1.0:
Framework:
- Initialize/finalize BLIS via a new `bli_pthread_switch_t` API. (Field Van Zee, Devin Matthews)
- Revamped `bli_init()` to use TLS where feasible. (Field Van Zee, Edward Smyth, Minh Quan Ho)
- Implemented support for fat multithreading.
- Implemented tile-level load balancing (tlb), or tile-level partitioning, in jr/ir loops for `gemm`, `gemmt`, and `trmm` macrokernels. (Field Van Zee, Devin Matthews, Leick Robinson, Minh Quan Ho)
- Added padding to `thrcomm_t` fields to avoid false sharing of cache lines. (Leick Robinson)
- Rewrote/fixed broken tree barrier implementation. (Leick Robinson)
- Refactored some `rntm_t` management code. (Field Van Zee, Devin Matthews)
- Initialize `rntm_t` nt/ways fields with 1 (not -1). (Field Van Zee, Jeff Diamond, Leick Robinson, Devin Matthews)
- Defined `invscalv`, `invscalm`, `invscald` operations.
- Added consistent `NaN`/`Inf` handling in `sumsqv`. (Devin Matthews)
- Implemented support for HPX as a threading backend option. (Christopher Taylor, Srinivas Yadav)
- Relocated the pba, sba pool (from the `rntm_t`), and `mem_t` (from the `cntl_t`) to the `thrinfo_t` object.
- Modified which communicator is associated with a given node of the `thrinfo_t` tree. (Devin Matthews)
- Refactored level-3 thread decorator into two parts: a thread launcher and a function to pass operands. (Devin Matthews)
- Refactored structure awareness in `bli_packm_blk_var1.c`. (Devin Matthews)
- Reimplemented `bli_l3_determine_kc()`. (Devin Matthews)
- Implemented `cntx_t` pointer caching in gks. (Field Van Zee, Harihara Sudhan S)
- Added `const` keyword to pointers in kernel APIs. (Field Van Zee, Nisanth M P)
- Migrated all kernel APIs to use `void*` pointers.
- Defined new global scalar constants: `BLIS_ONE_I`, `BLIS_MINUS_ONE_I`, `BLIS_NAN`. (Devin Matthews)
- Disabled modification of KC in the `gemmsup` kernels. (Devin Matthews)
- Defined `lt`, `lte`, `gt`, `gte` operations and other miscellaneous updates.
- Consolidated `INSERT_` macro sets via variadic macros. (Devin Matthews)
- De-templatized macrokernels for `gemmt`, `trmm`, and `trsm` to match that of `gemm`. (Devin Matthews)
- De-templatized `bli_l3_sup_var1n2m.c` and unified `_sup_packm_a/b()`. (Devin Matthews)
- Fixed 1m enablement for `herk`/`her2k`/`syrk`/`syr2k`. (Devin Matthews)
- Fixed `trmm[3]`/`trsm` performance bug introduced in `cf7d616`. (Field Van Zee, Leick Robinson)
- Fixed a 1m optimization bug in right-sided `hemm`/`symm`. (Field Van Zee, Nisanth M P)
- Fixed a bug in sup threshold registration. (Devin Matthews, Field Van Zee)
- Fixed brokenness in the small block allocator (sba) when the sba is disabled. (Field Van Zee, John Mather)
- Fixed type bug in `bli_cntx_set_ukr_prefs()`. (Field Van Zee, Leick Robinson, Devin Matthews, Jeff Diamond)
- Fixed incorrect `sizeof(type)` in edge case macros. (@moon-chilled)
- Fixed bugs and added sanity check in `bli_pool.c`. (Devin Matthews)
- Fixed a typo in the macro definition for `VEXTRACTF64X2` in `bli_x86_asm_macros.h`. (Harsh Dave)
- Fixed a typo in `bli_type_defs.h` where `BLIS_BLAS_INT_TYPE_SIZE` was misspelled. (Devin Matthews)
- Typecast `printf()` args in `bli_thread_range_tlb.c` to avoid compiler warnings. (Lee Killough)
- Minor tweaks to `bli_l3_check.c`.
- Partial addition of `const` to all interfaces above the (micro)kernels. (Devin Matthews)
- Fixed a harmless misspelling of `xpbys` in gemm macrokernel.
- Various internal API renaming/reorganization.
- Various other fixes.
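
The tile-level load balancing item in the list above can be pictured with a small sketch. It assumes the jr/ir loops of a macrokernel visit `n_iter` x `m_iter` microtiles and that the goal is to hand each thread a contiguous run of tiles whose length differs by at most one across threads. The names `tile_range_t` and `tlb_range_sketch()` are hypothetical and chosen only for illustration; this is not the code in `bli_thread_range_tlb.c`, and the real implementation may order or weight tiles differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type (not part of BLIS): half-open range [start, end) of
   tile indices assigned to one thread. */
typedef struct { size_t start; size_t end; } tile_range_t;

/* Split m_iter * n_iter microtiles as evenly as possible among nt threads.
   Assumption: tiles are numbered with the ir index varying fastest, so
   tile t corresponds to ( ir = t % m_iter, jr = t / m_iter ). */
tile_range_t tlb_range_sketch(size_t m_iter, size_t n_iter,
                              size_t nt, size_t tid)
{
    assert(nt > 0 && tid < nt);

    size_t n_tiles = m_iter * n_iter;
    size_t base    = n_tiles / nt;  /* every thread gets at least this many */
    size_t extra   = n_tiles % nt;  /* the first `extra` threads get one more */

    tile_range_t r;
    r.start = tid * base + (tid < extra ? tid : extra);
    r.end   = r.start + base + (tid < extra ? 1 : 0);
    return r;
}
```

With `m_iter = 3`, `n_iter = 5`, and four threads, this assigns 4, 4, 4, and 3 tiles. Splitting only the five jr iterations among the same four threads would instead leave one thread with two columns (6 tiles) and the others with one column (3 tiles) each, which is the kind of imbalance that partitioning at tile granularity is meant to reduce.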
Compatibility:
- Implemented `[cz]symv_()`, `[cz]syr_()`, `[cz]rot_()`. (Field Van Zee, James Foster)
- Fixed compilation errors when `BLIS_DISABLE_BLAS_DEFS` is defined. (Field Van Zee, Edward Smyth, Devin Matthews)
- Include `bli_config.h` before `bli_system.h` in `cblas.h` so that `BLIS_ENABLE_SYSTEM` is defined in time for proper OS detection. (Edward Smyth)
Kernels:
- Updated ARMv8a kernels to fix two prefetching issues and re-enable general stride IO. (Jeff Diamond)
- Restored general storage case to `armsve` kernels. (RuQing Xu)
- Added arm64 `dgemmsup` with extended MR and NR. (RuQing Xu)
- Reorganized the way `packm` kernels are stored within the `cntx_t` so that BLIS only stores two `packm` kernels per datatype: one for MRxk upanels and one for kxNR upanels. (Devin Matthews)
- Fixed bugs in `scal2v` reference kernel when alpha == 1.
- Fixed out-of-bounds read in `haswell` `gemmsup` kernels. (Daniël de Kok, Bhaskar Nallani, Madeesh Kannan)
- Fixed k = 0 edge case in `power10` microkernels. (Nisanth M P)
- Disabled `power10` kernels other than `sgemm`, `dgemm`. (Nisanth M P)
- Fixed `bli_gemm_small()` prototype mismatch. (Jeff Diamond)
Extras:
- Use the conventional level-3 sup thread decorator within the `gemmlike` sandbox.
- Fixed type-mismatch errors in `power10` sandbox. (Nisanth M P)
- Fixed `gemmlike` sandbox bug that stems from reuse of `bli_thrinfo_sup_grow()`.
Build system:
- Added two arm64 subconfigs: `altra` and `altramax`. (Jeff Diamond, Leick Robinson)
- Added support for RISC-V configuration targets. (Angelika Schwarz, Lee Killough)
- Auto-detect the RISC-V ABI of the compiler and use `-mabi=` during RISC-V builds. (Lee Killough)
- Added `sifive_x280` subconfig and kernel set. (Aaron Hutchinson, Lee Killough, Devin Matthews, and Angelika Schwarz)
- Added AddressSanitizer (--enable-asan) option to `configure`. (Devin Matthews)
- Added option to disable thread-local storage via `--disable-tls`. (Field Van Zee, Nick Knight)
- Exclude `-lrt` on Android with Bionic libraries. (Lee Killough)
- Omit `-fPIC` option when shared library build is disabled. (Field Van Zee, Nick Knight)
- Move `-fPIC` option insertion to subconfigs' `make_defs.mk` files. (Field Van Zee, Nick Knight)
- Install one-line helper headers to `INCDIR` prefix so that user can `#include "blis.h"` instead of `#include <blis/blis.h>` (and/or `"cblas.h"` instead of `<blis/cblas.h>` if CBLAS is enabled). (Field Van Zee, Jed Brown, Devin Matthews, Mo Zhou)
- Enhanced detection of Fortran compiler when checking the version string for the purposes of determining a default return convention for complex domain values. (Bart Oldeman)
- Added detection of the NVIDIA nvhpc compiler (`nvc`) in `configure`. (Ajay Panyala)
- Updated `zen3` subconfig to support NVHPC compilers. (Abhishek Bagusetty)
- Use kernel CFLAGS for `kernels` subdirs in addons. (AMD, Mithun Mohan)
- Created `power` umbrella configuration family (which currently includes `power9` and `power10` subconfigs). (Nisanth M P)
- Defined `BLIS_VERSION_STRING` in `blis.h` instead of via command line argument during compilation. (Field Van Zee, Mohsen Aznaveh, Tim Davis)
- Rewrote `regen-symbols.sh` as `gen-libblis-symbols.sh`. (Field Van Zee)
- Support `clang` targeting MinGW. (Isuru Fernando)
- Added autodetection (via `/proc/cpuinfo`) for POWER7, POWER9 and POWER10 microarchitectures. (Alexander Grund)
- Added `#line` directives to flattened `blis.h` to facilitate easier debugging. (Devin Matthews)
- Added `--nosup` and `--sup` shorthand options to `configure`.
- Use here-document syntax for `configure --help` output. (Lee Killough)
- Updated `configure` to pass all `shellcheck` checks. (Lee Killough)
- Tweaks to `.dir-locals.el` to enhance emacs formatting of C files. (Lee Killough)
- Removed buggy cruft from `power10` subconfig. (Field Van Zee, Nicholai Tukanov)
- Added missing `#include <io.h>` for Windows. (@h-vetinari)
- Fixed hardware auto-detection for `firestorm` (Apple M1) subconfig. (Devin Matthews)
- Fixed bug in detection of Fortran compiler vendor. (Devin Matthews)
- Fixed version check for `znver3`, which needs gcc >= 10.3. (Jed Brown)
- Fixed typo in `configure --help` text. (Lee Killough)
- Fixed warning about regular expressions with stray backslashes as the result of recent changes to `grep`.
- Added `output.testsuite` to `.gitignore`.
. - Minor changes to .gitignore and LICENSE files. (Jeff Diamond)
- Minor decluttering of top-level directory.
- Very minor tweaks to common.mk.
Testing:
- Rewrote `test/3` drivers to take parameters via command line arguments. (Field Van Zee, Jeff Diamond, Leick Robinson)
- Added `arm64` entry to `.travis.yml` so that Travis CI will compile/test ARM builds. (Field Van Zee, RuQing Xu)
- Test the `gemmlike` sandbox via AppVeyor. (Jeff Diamond)
- Added `-q` quiet mode option to testsuite.
- Fixed non-deterministic segfault in standalone `test/3` drivers. (Field Van Zee, Leick Robinson)
- Fixed a crash that occurs when either `cblat1` or `zblat1` are linked with a build of BLIS that was compiled with `--complex-return=intel`. (Bart Oldeman)
- Other minor fixes/tweaks.
Documentation:
- Added Discord documentation (`docs/Discord.md`) and logo to `README.md`.
- Added the `mm_algorithm` files (for bp and pb) to `docs/diagrams`.
- Added mention of Wilkinson Prize to `README.md`.
- Minor fixes and improvements to `docs/Multithreading.md`.
- Fix typos in docs + example code comments. (Igor Zhuravlov)
BLIS 0.9.0
This release contains a slew of improvements, new kernels and APIs, bugfixes, and more (including lots of code reduction). It also contains foundational support for an exciting new class of expert functionality: creating new operations without the need to duplicate the middleware that sits between the API and kernels.
Improvements present in 0.9.0:
Framework:
- Added various fields to `obj_t` that relate to storing function pointers to custom `packm` kernels, microkernels, etc. as well as accessor functions to set and query those fields. (Devin Matthews)
- Enabled user-customized `packm` microkernels and variants via the aforementioned new `obj_t` fields. (Devin Matthews)
- Moved edge-case handling out of the macrokernel and into the `gemm` and `gemmtrsm` microkernels. This also required updating of APIs and definitions of all existing microkernels in `kernels` directory. Edge-case handling functionality is now facilitated via new preprocessor macros found in `bli_edge_case_macro_defs.h`. (Devin Matthews)
- Avoid `gemmsup` thread barriers when not packing A or B. This boosts performance for many small multithreaded problems. (Field Van Zee, AMD)
- Allow the 1m method to operate normally when single and double real-domain microkernels mix row and column I/O preference. (Field Van Zee, Devin Matthews, RuQing Xu)
- Removed support for execution of complex-domain level-3 operations via the 3m and 4m methods.
- Refactored `herk`, `her2k`, `syrk`, `syr2k` in terms of `gemmt`. (Devin Matthews)
- Defined `setijv` and `getijv` to set/get vector elements.
- Defined `eqsc`, `eqv`, and `eqm` operations to test equality between two scalars, vectors, or matrices.
- Added new bounds checking to `setijm` and `getijm` to prevent use of negative indices.
- Renamed `membrk` files/variables/functions to `pba`.
- Store error-checking level as a thread-local variable. (Devin Matthews)
- Add `err_t*` "return" parameter to `bli_malloc_*()` and friends.
- Switched internal mutexes of the `sba` and `pba` to static initialization.
- Changed return value method of `bli_pack_get_pack_a()`, `bli_pack_get_pack_b()`.
- Fixed a bug that allows `bli_init()` to be called more than once (without segfaulting). (@lschork2, Minh Quan Ho, Devin Matthews)
- Removed a sanity check in `bli_pool_finalize()` that prevented BLIS from being re-initialized. (AMD)
- Fixed insufficient `pool_t`-growing logic in `bli_pool.c`, and always allocate at least one element in `.block_ptrs` array. (Minh Quan Ho)
- Cleanups related to the error message array in `bli_error.c`. (Minh Quan Ho)
- Moved language-related definitions from `bli_macro_defs.h` to a new header, `bli_lang_defs.h`.
- Renamed `BLIS_SIMD_NUM_REGISTERS` to `BLIS_SIMD_MAX_NUM_REGISTERS` and `BLIS_SIMD_SIZE` to `BLIS_SIMD_MAX_SIZE` for improved clarity. (Devin Matthews)
- Many minor bugfixes.
- Many cleanups, including removal of old and commented-out code.
Compatibility:
- Expanded BLAS layer to include support for `?axpby_()` and `?gemm_batch_()`. (Meghana Vankadari, AMD)
- Added `gemm3m` APIs to BLAS and CBLAS layers. (Bhaskar Nallani, AMD)
- Handle `?gemm_()` invocations where m or n is unit by calling `?gemv_()`. (Dipal M Zambare, AMD)
- Removed option to finalize BLIS after every BLAS call.
- Updated default definitions of `bli_slamch()` and `bli_dlamch()` to use constants from standard C library rather than values computed at runtime. (Devin Matthews)
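
The `?gemm_()` to `?gemv_()` redirection listed above rests on a simple identity: when n is 1, B and C are single columns, so C := alpha\*A\*B + beta\*C is exactly a matrix-vector product. The sketch below illustrates that idea with naive loops. `gemv_sketch()` and `gemm_sketch()` are hypothetical names, the code assumes column-major storage with no transposition, and it handles only the n == 1 case (the m == 1 case is analogous but needs a transposed, strided `gemv`). It is not the BLIS compatibility-layer code.

```c
#include <stddef.h>

/* y := alpha*A*x + beta*y, with A an m x n column-major matrix
   whose leading dimension is lda. */
void gemv_sketch(size_t m, size_t n, double alpha,
                 const double *a, size_t lda,
                 const double *x, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += a[i + j * lda] * x[j];
        /* When beta is zero, y is overwritten without being read. */
        y[i] = alpha * acc + (beta == 0.0 ? 0.0 : beta * y[i]);
    }
}

/* C := alpha*A*B + beta*C, with A (m x k), B (k x n), C (m x n), all
   column-major. Returns 1 if the call was redirected to gemv_sketch(),
   0 if the generic loops were used. */
int gemm_sketch(size_t m, size_t n, size_t k, double alpha,
                const double *a, size_t lda,
                const double *b, size_t ldb,
                double beta, double *c, size_t ldc)
{
    if (n == 1) {
        /* B is a k x 1 column and C is an m x 1 column. */
        gemv_sketch(m, k, alpha, a, lda, b, beta, c);
        return 1;
    }

    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc
                           + (beta == 0.0 ? 0.0 : beta * c[i + j * ldc]);
        }
    }
    return 0;
}
```

The return value exists only so that a caller or test can observe which path was taken; a real BLAS-compatible routine would return nothing.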
Kernels:
- Added 512-bit SVE-based `a64fx` subconfiguration that uses empirically-tuned blocksizes. (Stepan Nassyr, RuQing Xu)
- Added a vector-length agnostic `armsve` subconfig that computes blocksizes via an analytical model. (Stepan Nassyr)
- Added vector-length agnostic d/s/sh `gemm` kernels for Arm SVE. (Stepan Nassyr)
- Added `gemmsup` kernels to the `armv8a` kernel set for use in new Apple Firestorm subconfiguration. (RuQing Xu)
- Added 512-bit SVE `dpackm` kernels (16xk and 10xk) with in-register transpose. (RuQing Xu)
- Extended 256-bit SVE `dpackm` kernels by Linaro Ltd. to 512-bit for size 12xk. (RuQing Xu)
- Reorganized register usage in `bli_gemm_armv8a_asm_d6x8.c` to accommodate clang. (RuQing Xu)
- Added `saxpyf`/`daxpyf`/`caxpyf` kernels to `zen` kernel set. (Dipal M Zambare, AMD)
- Added `vzeroupper` instruction to `haswell` microkernels. (Devin Matthews)
- Added explicit `beta == 0` handling in s/d `armsve` and `armv7a` `gemm` microkernels. (Devin Matthews)
- Added a unique tag to branch labels to accommodate clang. (Devin Matthews, Jeff Hammond)
- Fixed a copy-paste bug in the loading of `kappa_i` in the two assembly `cpackm` kernels in `haswell` kernel set. (Devin Matthews)
- Fixed a bug in Mx1 `gemmsup` `haswell` kernels whereby the `vhaddpd` instruction is used with uninitialized registers. (Devin Matthews)
- Fixed a bug in the `power10` microkernel I/O. (Nicholai Tukanov)
- Many other Arm kernel updates and fixes. (RuQing Xu)
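
Explicit `beta == 0` handling, mentioned in the list above, matters because the BLAS convention is that C need not be initialized when beta is zero. A kernel that computes `beta * C` unconditionally would propagate any NaN or Inf already sitting in C. The following is a minimal reference-style sketch of a microkernel that overwrites C in that case. The name `gemm_ukr_sketch()`, the fixed 4x4 tile size, and the packed-panel layout are assumptions made for illustration and do not reproduce any BLIS kernel or its signature.

```c
#include <stddef.h>

/* Hypothetical microtile dimensions for this sketch. */
enum { MR = 4, NR = 4 };

/* C := beta*C + alpha*A*B for one MR x NR tile.
   a: packed micropanel, MR contiguous elements per k step.
   b: packed micropanel, NR contiguous elements per k step.
   c: output tile with row stride rs_c and column stride cs_c. */
void gemm_ukr_sketch(size_t k, double alpha,
                     const double *a, const double *b,
                     double beta, double *c, size_t rs_c, size_t cs_c)
{
    double ab[MR * NR] = { 0.0 };

    /* Accumulate the rank-k update in a local buffer. If k is zero the
       loop body never runs and ab stays all zeros. */
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                ab[i + j * MR] += a[p * MR + i] * b[p * NR + j];

    for (size_t j = 0; j < NR; ++j) {
        for (size_t i = 0; i < MR; ++i) {
            double *cij = &c[i * rs_c + j * cs_c];
            if (beta == 0.0)
                *cij = alpha * ab[i + j * MR];   /* overwrite; C is never read */
            else
                *cij = beta * (*cij) + alpha * ab[i + j * MR];
        }
    }
}
```

Calling this with a tile of C that is filled with NaN and `beta` set to zero yields a clean result, whereas the unconditional `beta * C + ...` form would leave every element NaN.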
Extras:
- Added support for addons, which are similar to sandboxes but do not require the user to implement any particular operation.
- Added a new `gemmlike` sandbox to allow rapid prototyping of `gemm`-like operations.
- Various updates and improvements to the `power10` sandbox, including a new testsuite. (Nicholai Tukanov)
Build system:
- Added explicit support for AMD's Zen3 microarchitecture. (Dipal M Zambare, AMD, Field Van Zee)
- Added runtime microarchitecture detection for Arm. (Dave Love, RuQing Xu, Devin Matthews)
- Added a new `configure` option `--[en|dis]able-amd-frame-tweaks` that allows BLIS to compile certain framework files (each with the `_amd` suffix) that have been customized by AMD for improved performance (provided that the targeted configuration is eligible). By default, the more portable counterparts to these files are compiled. (Field Van Zee, AMD)
- Added an explicit compiler predicate (`is_win`) for Windows in `configure`. (Devin Matthews)
- Use `-march=haswell` instead of `-march=skylake-avx512` on Windows. (Devin Matthews, @h-vetinari)
- Fixed `configure` breakage on MacOSX by accepting either `clang` or `LLVM` in vendor string. (Devin Matthews)
- Blacklist clang10/gcc9 and older for `armsve` subconfig.
- Added a `configure` option to control whether or not to use `@rpath`. (Devin Matthews)
- Added armclang detection to `configure`. (Devin Matthews)
- Use `@path`-based install name on MacOSX and use relocatable `RPATH` entries for testsuite binaries. (Devin Matthews)
- For environment variables `CC`, `CXX`, `FC`, `PYTHON`, `AR`, and `RANLIB`, `configure` will now print an error message and abort if a user specifies a specific tool and that tool is not found. (Field Van Zee, Devin Matthews)
- Added symlink to `blis.pc.in` for out-of-tree builds. (Andrew Wildman)
- Register optimized real-domain `copyv`, `setv`, and `swapv` kernels in `zen` subconfig. (Dipal M Zambare, AMD)
- Added Apple Firestorm (A14/M1) subconfiguration, `firestorm`. (RuQing Xu)
- Added `armsve` subconfig to `arm64` configuration family. (RuQing Xu)
- Allow using clang with the `thunderx2` subconfiguration. (Devin Matthews)
- Fixed a subtle substitution bug in `configure`. (Chengguo Sun)
- Updated top-level Makefile to reflect a dependency on the "flat" `blis.h` file for the BLIS and BLAS testsuite objects. (Devin Matthews)
- Mark `xerbla_()` as a "weak" symbol on MacOSX. (Devin Matthews)
- Fixed a long-standing bug in `common.mk` whereby the header path to `cblas.h` was omitted from the compiler flags when compiling CBLAS files within BLIS.
- Added a custom-made recursive `sed` script to `build` directory.
- Minor cleanups and fixes to `configure`, `common.mk`, and others.
Testing:
- Fixed a race condition in the testsuite when the SALT option (simulate application-level threading) is enabled. (Devin Matthews)
- Test 1m method execution during `make check`. (Devin Matthews)
- Test `make install` in Travis CI. (Devin Matthews)
- Test C++ in Travis CI to make sure `blis.h` is C++-compatible. (Devin Matthews)
- Disabled SDE testing of pre-Zen microarchitectures via Travis CI.
- Added Travis CI support for testing Arm SVE. (RuQing Xu)
- Updated SDE usage so that it is downloaded from a separate repository (ci-utils) in our GitHub organization. (Field Van Zee, Devin Matthews)
- Updated octave scripts in `test/3` to be robust against missing datasets as well as to fix a few minor issues.
- Added `test_axpbyv.c` and `test_gemm_batch.c` test driver files to `test` directory. (Meghana Vankadari, AMD)
- Support all four datatypes in `her`, `her2`, `herk`, and `her2k` drivers in `test` directory. (Madan mohan Manokar, AMD)
Documentation:
- Added documentation for: `setijv`, `getijv`, `eqsc`, `eqv`, `eqm`.
- Added `docs/Addons.md`.
- Added dedicated "Performance" and "Example Code" sections to `README.md`.
- Updated `README.md`.
- Updated `docs/Sandboxes.md`.
- Updated `docs/Multithreading.md`. (Devin Matthews)
- Updated `docs/KernelHowTo.md`.
- Updated `docs/Performance.md` to report Fujitsu A64fx (512-bit SVE) results. (RuQing Xu)
- Updated `docs/Performance.md` to report Graviton2 Neoverse N1 results. (Nicholai Tukanov)
- Updated `docs/FAQ.md` with new questions.
- Fixed typos in `docs/FAQ.md`. (Gaëtan Cassiers)
- Various other minor fixes.