
Hashed Lookup Table based Matrix Multiplication (halutmatmul) built on MADDness/bolt


a0917bc/halutmatmul

This branch is 250 commits behind joennlae/halutmatmul:master.

Latest commit: fdf57e9 · May 2, 2023

Halutmatmul

Algorithmic CI: GPU Tests (Vast.ai), PyTest, Linting, MyPy, C++ build

Hardware CI: HW Synth + PAR OpenROAD, RTL Linting, HW Design Verification

General Information

This repo is used for algorithmic exploration. I will try to update it with as much hardware information as I am allowed to publish.

Install

# install conda environment & activate
conda env create -f environment_gpu.yml
conda activate halutmatmul

# alternatively (IIS setup): create the env under a custom prefix
conda env create -f environment_gpu.yml --prefix /scratch/janniss/conda/halutmatmul_gpu

# install CLI
./scripts/install-cli.sh

# now use CLI with
halut --help

# or without install
./halut --help

Hacker News mention (comments only) and discussion

Hardware OpenROAD flow results


Total Circuit (M=2)

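| halut_matmul | ASAP7 | NanGate45 |
|---|---|---|
| Area [μm²] | 9643.6787 | 140647.7656 |
| Freq [MHz] | 666.7 | 333.3 |
| GE | 110.238 kGE | 176.25 kGE |
| Std Cells [#] | 68186 | 68994 |
| Voltage [V] | 0.77 | 1.1 |
| Util [%] | 45.0 | 59.2 |
| TNS (total negative slack) | -1086.59 | -0.31 |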

Encoder

| halut_encoder_4 | ASAP7 | NanGate45 |
|---|---|---|
| Area [μm²] | 4844.5405 | 69711.9531 |
| Freq [MHz] | 666.7 | 333.3 |
| GE | 55.378 kGE | 87.358 kGE |
| Std Cells [#] | 34334 | 33746 |
| Voltage [V] | 0.77 | 1.1 |
| Util [%] | 45.0 | 58.7 |
| TNS | 0.0 | 0.0 |

Decoder

| halut_decoder | ASAP7 | NanGate45 |
|---|---|---|
| Area [μm²] | 4749.8286 | 68923.7891 |
| Freq [MHz] | 666.7 | 333.3 |
| GE | 54.296 kGE | 86.37 kGE |
| Std Cells [#] | 33709 | 34395 |
| Voltage [V] | 0.77 | 1.1 |
| Util [%] | 44.4 | 58.9 |
| TNS | -11340.5098 | -0.66 |
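GE (gate equivalents) normalizes cell area by the area of a reference two-input NAND gate. Dividing each reported area by its GE count is a quick consistency check on the three tables. The short sketch below does only that, using the numbers above. The dictionary layout and the function name are illustrative and are not part of the repo. For every design, the ratio comes out at about 0.0875 μm² per GE for ASAP7 and about 0.798 μm² per GE for NanGate45. That pattern is consistent with a fixed reference cell per technology.

```python
# Area [um^2] and gate-equivalent counts copied from the three tables above
RESULTS = {
    ("halut_matmul", "ASAP7"): (9643.6787, 110_238),
    ("halut_matmul", "NanGate45"): (140647.7656, 176_250),
    ("halut_encoder_4", "ASAP7"): (4844.5405, 55_378),
    ("halut_encoder_4", "NanGate45"): (69711.9531, 87_358),
    ("halut_decoder", "ASAP7"): (4749.8286, 54_296),
    ("halut_decoder", "NanGate45"): (68923.7891, 86_370),
}


def implied_gate_area(area_um2: float, gate_equivalents: int) -> float:
    """Area of one gate equivalent (the reference NAND2 cell) implied by a table row."""
    return area_um2 / gate_equivalents


if __name__ == "__main__":
    for (design, tech), (area, ge) in RESULTS.items():
        print(f"{design:16s} {tech:10s} {implied_gate_area(area, ge):.4f} um^2 per GE")
```

You can use the same per-technology ratio the other way round, to convert a new area figure from the flow into kGE.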

Progress Slides


CUDA kernels

I am aware that a lot could still be optimized here (e.g., warp-level optimizations), but the kernels were developed only for fast analysis.

Results

Caveat: no retraining or fine-tuning has been done yet!

Single Layer replacement with C=32 and K=16
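The following sketch makes C and K concrete. It assumes the MADDNESS-style scheme that halutmatmul builds on. In that scheme, each input row is encoded into C codes of log2(K) bits, and each output is the sum of C lookups into precomputed tables. The helper below is a hypothetical back-of-the-envelope model, not code from the repo. Its name, its dictionary keys, and the example layer shape (D=576, M=64) are all assumptions. It compares that cost with a dense matmul.

```python
import math


def halut_layer_cost(C: int, K: int, D: int, M: int) -> dict:
    """Rough per-input-row cost of a LUT-based layer vs. a dense (D x M) matmul.

    Assumed model: one balanced binary tree of depth log2(K) per codebook for
    encoding, and one LUT read per codebook and output column for decoding.
    """
    if K < 2 or K & (K - 1):
        raise ValueError("K must be a power of two for a balanced binary tree")
    depth = int(math.log2(K))
    return {
        "encode_comparisons": C * depth,  # one threshold test per tree level
        "code_bits": C * depth,           # one log2(K)-bit code per codebook
        "lut_reads": C * M,               # summed to form the M outputs
        "lut_entries": C * K * M,         # size of the precomputed tables
        "dense_macs": D * M,              # multiply-accumulates of exact A @ B
    }


print(halut_layer_cost(C=32, K=16, D=576, M=64))
```

With C=32 and K=16, each input row is encoded in 128 bits regardless of D. The LUT-based layer then needs 32 lookups per output, where a dense layer needs D multiply-accumulates.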

LeViT (Source)

SOTA vision transformer on ImageNet-1K

Figure: LeViT results

ResNet-50 (only the layers of interest are analyzed)

Legacy classifier on ImageNet-1K

Figure: ResNet-50 results

Depthwise separable CNN (DS-CNN)

On Google Speech v2

Figure: DS-CNN results

C, K and encoding_algorithm parameter sweep for ResNet-50

Data visualizer

Offline learning convergence on ResNet-50

The goal was to find out how much offline training data is needed to get the maximum accuracy.

Figure: ResNet-50 convergence results

Formalism

Some definitions about the forward path.
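The forward path can be summarized as follows. This assumes the MADDNESS formulation that halutmatmul builds on, in its simplest product-quantization form; MADDNESS's additional prototype optimization is ignored.

- $A \in \mathbb{R}^{N \times D}$ is the input and $B \in \mathbb{R}^{D \times M}$ are the weights.
- $C$ is the number of codebooks and $K$ the number of prototypes per codebook.
- $P^{(c)}_k$ is prototype $k$ of codebook $c$.
- $b^{(c)}_m$ is the slice of column $m$ of $B$ that belongs to subspace $c$.
- $j^{(c)}_l$ is the split column and $t^{(c)}_{\cdot}$ are the node thresholds of the encoding tree.

```latex
\begin{aligned}
i_0 &= 0, \qquad
i_{l+1} = 2\,i_l + \mathbb{1}\!\left[a_{j^{(c)}_l} > t^{(c)}_{2^l - 1 + i_l}\right],
\qquad g^{(c)}(a) = i_{\log_2 K} && \text{(encode)}\\
L^{(c)}_{k,m} &= \left\langle P^{(c)}_k,\; b^{(c)}_m \right\rangle && \text{(LUTs, built offline)}\\
\hat{y}_{n,m} &= \sum_{c=1}^{C} L^{(c)}_{g^{(c)}(a_n),\,m} \;\approx\; (AB)_{n,m} && \text{(read and accumulate)}
\end{aligned}
```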

Encode kernel
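Below is a minimal NumPy sketch of the encode step. It assumes the MADDNESS hashing scheme: a balanced binary tree of depth 4 per codebook (K=16), one split column per tree level, and one threshold per internal node. The names `encode`, `split_dims`, and `thresholds`, and the level-by-level node layout, are illustrative. They are not the repo's API or its CUDA kernel.

```python
import numpy as np

DEPTH = 4       # assumed tree depth; K = 2**DEPTH = 16 buckets per codebook
K = 2 ** DEPTH


def encode(A: np.ndarray, split_dims: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Assign each row of A one bucket id per codebook via a balanced binary tree.

    A:          (N, D) input activations
    split_dims: (C, DEPTH) column of A compared at each tree level (shared by
                all nodes of that level)
    thresholds: (C, K - 1) one threshold per internal node, stored level by
                level so that level l starts at flat index 2**l - 1
    returns:    (N, C) integer codes in [0, K)
    """
    n_rows = A.shape[0]
    n_codebooks = split_dims.shape[0]
    codes = np.zeros((n_rows, n_codebooks), dtype=np.int64)
    for c in range(n_codebooks):
        idx = np.zeros(n_rows, dtype=np.int64)  # node position within the current level
        for level in range(DEPTH):
            node = (2 ** level - 1) + idx       # flat node id, one per row
            x = A[:, split_dims[c, level]]
            go_right = (x > thresholds[c, node]).astype(np.int64)
            idx = 2 * idx + go_right            # child index at the next level
        codes[:, c] = idx                       # leaf index == bucket id
    return codes
```

Each level costs only one comparison and a shift-add per row and codebook. This is why encoding is cheap compared with the D multiply-accumulates per output that it helps replace.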

Read and accumulate LUTs kernel
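Likewise, here is a minimal NumPy sketch of building the lookup tables and of the read-and-accumulate step. It assumes plain product-quantization-style prototypes over C contiguous column slices of A; MADDNESS additionally optimizes the prototypes, which is omitted here. `build_luts` and `read_accumulate` are hypothetical names and are not taken from the repo.

```python
import numpy as np


def build_luts(prototypes: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Precompute LUT[c, k, m] = <prototype k of codebook c, matching slice of B[:, m]>.

    prototypes: (C, K, S) one centroid per bucket over S = D // C contiguous columns
    B:          (D, M) weight matrix
    returns:    (C, K, M) lookup tables
    """
    C, _, S = prototypes.shape
    B_sub = B.reshape(C, S, B.shape[1])  # rows c*S .. (c+1)*S - 1 belong to codebook c
    return np.einsum("cks,csm->ckm", prototypes, B_sub)


def read_accumulate(codes: np.ndarray, luts: np.ndarray) -> np.ndarray:
    """Approximate A @ B as out[n, m] = sum_c luts[c, codes[n, c], m].

    codes:   (N, C) bucket ids from the encode step
    luts:    (C, K, M)
    returns: (N, M)
    """
    C = luts.shape[0]
    gathered = luts[np.arange(C), codes]  # (N, C, M): one LUT row per codebook
    return gathered.sum(axis=1)           # accumulate over codebooks
```

If each row of A coincides with its assigned prototypes, the result equals A @ B exactly. Otherwise, the error is the quantization error of the prototypes. At inference time, only table reads and additions remain.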


Languages

  • Python 89.4%
  • SystemVerilog 6.5%
  • Shell 1.7%
  • Makefile 1.4%
  • Other 1.0%