LearningPath - Vision LLM inference on Android with KleidiAI and MNN #1651

@@ -0,0 +1,101 @@
---
title: Build the MNN Android Demo with GUI
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Set up development environment
In this learning path, you will learn how to build and deploy a Vision Transformer (ViT) chat app to an Android device using MNN-LLM. You will learn how to build MNN-LLM and how to run the Qwen model in the Android application.

The first step is to prepare a development environment with the required software:

- Android Studio (latest version recommended)
- Android NDK (tested with version 28.0.12916984)
- CMake (4.0.0-rc1)
- Python3 (Optional)
- Git
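
Before you continue, you can optionally confirm that the main tools are available from your shell. This is only a sanity check; the exact version strings will differ from machine to machine, and on Windows Python may be invoked as `python` rather than `python3`:

```shell
# Confirm the required tools are on the PATH (versions will vary)
git --version
cmake --version
adb --version
python --version
```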

## Clone MNN repo
Open Windows PowerShell or Git Bash and check out the source tree:

```shell
cd C:\Users\$env:USERNAME
git clone https://github.com/HenryDen/MNN.git
cd MNN
git checkout 83b650fc8888d7ccd38dbc68330a87d048b9fe7a
```
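
You can confirm that you are on the expected commit by printing the current HEAD; it should match the hash used above:

```shell
git rev-parse HEAD
# expected output: 83b650fc8888d7ccd38dbc68330a87d048b9fe7a
```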

{{% notice Note %}}
The app code is not yet merged into the upstream MNN repository. The repository above is a fork of MNN.
{{% /notice %}}

## Build the app using Android Studio

Create a signing.gradle file in android/app using the following template:
```groovy
ext {
    signingConfigs = [
        release: [
            storeFile: file('PATH_TO_jks_file'),
            storePassword: "****",
            keyAlias: "****",
            keyPassword: "****"
        ]
    ]
}
```

If you don't need to build a release version of the app, you can skip the following steps for creating a signing key and fill signing.gradle with placeholder values.

- Navigate to **Build -> Generate Signed App Bundle or APK**.
- Select **APK** and click **Next**.
- Press **Create new** and fill in the required information.
- Enter the details of the newly generated JKS file into the signing.gradle template above.
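
If you prefer the command line, the keytool utility that ships with the JDK can generate the keystore instead of the Android Studio dialog. The keystore file name and alias below are placeholders; choose your own values and passwords:

```shell
# Generate a release keystore (file name and alias are placeholders)
keytool -genkeypair -v -keystore my-release-key.jks -alias my-key-alias -keyalg RSA -keysize 2048 -validity 10000
```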

Open the MNN/transformers/llm/engine/android directory in Android Studio and wait for the Gradle project sync to finish.

## Prepare the model
You can download the model from ModelScope: https://www.modelscope.cn/models/qwen/qwen2-vl-2b-instruct

or from Hugging Face: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct

If you want to test other vision transformer models, you can download them from https://modelscope.cn/organization/qwen?tab=model and convert them to MNN format.

```shell
# make sure git lfs is installed
$ git lfs install
$ git clone https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct
# install llm-export
$ git clone https://github.com/wangzhaode/llm-export && cd llm-export/
$ pip install .
# convert the model to MNN format
$ llmexport --path /path/to/mnn-llm/Qwen2-VL-2B-Instruct/ --export mnn --quant_bit 4 --quant_block 0 --dst_path Qwen2-VL-2B-Instruct-convert-4bit-per_channel --sym
```

- `--quant_bit`: the number of quantization bits; for example, 4 means q4 quantization.
- `--quant_block`: the quantization block size; 0 means per-channel quantization, 128 means block quantization with a block size of 128.
- `--sym`: use symmetric quantization.
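
For example, to experiment with block quantization instead of per-channel quantization, you could change the flags as follows; the output directory name here is just a suggestion:

```shell
# q4 quantization with a block size of 128 instead of per-channel
$ llmexport --path /path/to/mnn-llm/Qwen2-VL-2B-Instruct/ --export mnn --quant_bit 4 --quant_block 128 --dst_path Qwen2-VL-2B-Instruct-convert-4bit-block128 --sym
```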

## Build and run the app
Before launching the app, you need to push the model to the device manually:

```shell
$ adb shell mkdir /data/local/tmp/models/
$ adb push <path to the model folder> /data/local/tmp/models
```
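
You can verify that the push succeeded by listing the target directory on the device; the converted model folder should appear in the output:

```shell
$ adb shell ls /data/local/tmp/models/
```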

When you select **Run**, the project is built, and the app is then copied to and installed on the connected Android device.

After opening the app, you will see:

![Loading screenshot](Loading_page.png)

After the model is loaded, you can chat with the app.

![Chat screenshot](chat2.png)




@@ -0,0 +1,75 @@
---
title: Build the MNN Command-line ViT Demo
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Set up development environment
In this learning path, you will learn how to build and deploy a Vision Transformer (ViT) command-line chat demo to an Android device using MNN-LLM. You will learn how to cross-compile MNN-LLM and how to run the Qwen model on the device.

The first step is to prepare a development environment with the required software:

- Ubuntu Linux (20.04 or later)
- Android NDK (tested with version 28.0.12916984)
- CMake (4.0.0-rc1)
- Python3 (Optional)
- Git

## Build and run command-line demo

Push the model to the device. How to obtain the model is described on the previous page.
```shell
$ adb shell mkdir /data/local/tmp/models/
$ adb push <path to the model folder> /data/local/tmp/models
```

```shell
# Download an NDK package from https://developer.android.com/ndk/downloads/
$ unzip android-ndk-r27d-linux.zip
# use an absolute path so the build script can find the NDK from any directory
$ export ANDROID_NDK=$(pwd)/android-ndk-r27d/

$ git clone https://github.com/alibaba/MNN.git
$ cd MNN/project/android
$ mkdir build_64 && cd build_64
$ ../build_64.sh "-DMNN_LOW_MEMORY=true -DLLM_SUPPORT_VISION=true -DMNN_KLEIDIAI=true -DMNN_CPU_WEIGHT_DEQUANT_GEMM=true -DMNN_BUILD_LLM=true -DMNN_SUPPORT_TRANSFORMER_FUSE=true -DMNN_ARM82=true -DMNN_OPENCL=true -DMNN_USE_LOGCAT=true -DMNN_IMGCODECS=true -DMNN_BUILD_OPENCV=true"
$ adb push *.so llm_demo tools/cv/*.so /data/local/tmp/
$ adb shell
```

At this point you are inside the adb shell environment on the device.

```shell
$ cd /data/local/tmp/
$ chmod +x llm_demo
$ export LD_LIBRARY_PATH=./
# the <img> tag must reference an image that is present on the device, for example ./example.png
$ echo " <img>./example.png</img>Describe the content of the image." >prompt
$ ./llm_demo models/Qwen-VL-2B-convert-4bit-per_channel/config.json prompt
```
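
The prompt file follows a simple pattern: an `<img>` tag pointing at an image already on the device, followed by the question text. You can reuse the same pattern to ask other questions about the image, for example:

```shell
# hypothetical alternative prompt for the same image
$ echo "<img>./example.png</img>What animal appears in this image?" > prompt
$ ./llm_demo models/Qwen-VL-2B-convert-4bit-per_channel/config.json prompt
```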

Here is an example image:

![example image](example.png)

If the launch succeeds, you will see output similar to the following:

```shell
config path is models/Qwen-VL-2B-convert-4bit-per_channel/config.json
tokenizer_type = 3
prompt file is prompt
The image features a tiger standing in a grassy field, with its front paws raised and its eyes fixed on something or someone behind it. The tiger's stripes are clearly visible against the golden-brown background of the grass. The tiger appears to be alert and ready for action, possibly indicating a moment of tension or anticipation in the scene.

#################################
prompt tokens num = 243
decode tokens num = 70
vision time = 5.96 s
audio time = 0.00 s
prefill time = 1.80 s
decode time = 2.09 s
prefill speed = 135.29 tok/s
decode speed = 33.53 tok/s
##################################
```

@@ -0,0 +1,53 @@
---
title: Vision LLM inference on Android with KleidiAI and MNN

minutes_to_complete: 30

who_is_this_for: This is an advanced topic for Android developers who want to run Vision Transformer (ViT) models efficiently on Android devices.

learning_objectives:
- Run Vision-Transformer inference on an Android device with the Qwen Vision 2B model using the MNN inference framework.
- Download and convert a Qwen Vision model from Hugging Face.

prerequisites:
- An x86_64 development machine with Android Studio installed.
- A 64-bit Arm powered smartphone running Android that supports the i8mm and dotprod features.

author: Shuheng Deng, Arm

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Cortex-A
- Cortex-X
tools_software_languages:
- Android Studio
- KleidiAI
operatingsystems:
- Android



further_reading:
- resource:
title: "MNN : A UNIVERSAL AND EFFICIENT INFERENCE ENGINE"
link: https://arxiv.org/pdf/2002.12418
type: documentation
- resource:
title: MNN-Doc
link: https://mnn-docs.readthedocs.io/en/latest/
type: blog
- resource:
title: Vision transformer
link: https://en.wikipedia.org/wiki/Vision_transformer
type: website



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,28 @@
---
title: Background
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## MNN Introduction
MNN is a highly efficient and lightweight deep learning framework. It supports inference and training of deep learning models and delivers industry-leading performance for on-device inference and training. MNN is currently integrated into more than 30 apps from Alibaba Inc., such as Taobao, Tmall, Youku, DingTalk, and Xianyu, covering more than 70 usage scenarios including live broadcast, short video capture, search recommendation, product search by image, interactive marketing, equity distribution, and security risk control. MNN is also used on embedded devices, such as IoT devices.

MNN-LLM is a large language model runtime solution built on the MNN engine. Its mission is to deploy LLMs locally on everyone's devices (mobile phone, PC, or IoT). It supports popular large language models such as Qwen (Qianwen), Baichuan, Zhipu, LLaMA, and others.

KleidiAI is currently integrated into the MNN framework, enhancing the inference performance of large language models (LLMs) within MNN. The Android app on this page demonstrates Vision Transformer inference using the MNN framework, accelerated by KleidiAI.

## Vision Transformer (ViT)
The Vision Transformer (ViT) is a deep learning model designed for image recognition tasks. Unlike traditional convolutional neural networks (CNNs), which process images using convolutional layers, ViT leverages the transformer architecture originally developed for natural language processing (NLP).
The ViT workflow consists of the following steps:

- **Image Patching**: The input image is divided into fixed-size patches, similar to how text is tokenized in NLP tasks. For example, a 224x224 RGB image split into 16x16 patches yields 196 patches.
- **Linear Embedding**: Each image patch is flattened and linearly embedded into a vector.
- **Position Encoding**: Positional information is added to the patch embeddings to retain spatial information.
- **Transformer Encoder**: The embedded patches are fed into a standard transformer encoder, which uses self-attention mechanisms to process the patches and capture relationships between them.
- **Classification**: The output of the transformer encoder is used for image classification or other vision tasks.

ViT has shown competitive performance on various image classification benchmarks and has been widely adopted in computer vision research.

