
Transcription on device (phone) #1249

Open · kodjima33 opened this issue Nov 4, 2024 · 13 comments

@kodjima33 (Collaborator)

We all know that we need local transcription.

Both Vitalik Buterin and George Hotz said it when trying out our tech.

Creating this issue to aggregate feedback and prepare for the switch gradually

@kodjima33 kodjima33 moved this to Backlog in omi TODO / bounties Nov 4, 2024
@kodjima33 kodjima33 changed the title Local transcription Transcription on device (phone) Nov 4, 2024
@Ronuhz commented Nov 5, 2024

I agree, and Whisper is the way to go. Its on-device performance (I've only used it on an iPhone 12 mini, to develop something simple) is truly incredible. It should be downloaded on demand, because including it in the bundle would be a terrible idea.

@beastoin (Collaborator)

tell me more about your experience with Whisper + iPhone 12 mini please, @Ronuhz, such as transcript quality, speed, and battery drain.

@Ronuhz commented Nov 12, 2024

> tell me more about your experience with Whisper + iPhone 12 mini please, @Ronuhz, such as transcript quality, speed, and battery drain.

Here is a little demo running on an iPhone 12 mini, iOS 18.2 Beta 3, model: Whisper Tiny, using the Neural Engine for both encoding and decoding. The voice is streamed to the model in real time. Everything runs locally.

output.mp4

@Ronuhz commented Nov 12, 2024

In a native app using Swift and SwiftUI, it takes about 10-20 minutes to get this implemented using WhisperKit. In Flutter, I don't know.
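
For anyone curious what that looks like, here is a minimal sketch along the lines of the WhisperKit README. The exact initializer and return types vary between library versions, and the model identifier and audio path are placeholders, so treat this as illustrative only:

```swift
import WhisperKit

// Minimal WhisperKit sketch, adapted from the project README.
// The exact API surface differs between WhisperKit versions.
Task {
    // Loads (and on first use downloads) the tiny model.
    let pipe = try await WhisperKit(model: "tiny")
    // One-shot transcription of a recorded file; real-time streaming
    // needs extra wiring around the mic input.
    let results = try await pipe.transcribe(audioPath: "recording.wav")
    print(results.map(\.text).joined(separator: " "))
}
```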

@mdmohsin7 (Collaborator) commented Nov 15, 2024

Found this with a quick search, but it does not support transcribing in real time:

https://pub.dev/packages/whisper_flutter_plus

@kodjima33 (Collaborator, Author) commented Feb 14, 2025

ok, let's make this happen

We need to make omi FULLY LOCAL: fully local transcription.

It might be in React Native (I don't care about the stack).

The bounty is $20k.

I will lock it in for whoever shows the best MVP.
/bounty $20000

algora-pbc bot commented Feb 14, 2025

💎 $20,000 bounty • omi

Steps to solve:

  1. Start working: Comment /attempt #1249 with your implementation plan
  2. Submit work: Create a pull request including /claim #1249 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to BasedHardware/omi!


| Attempt | Started (GMT+0) | Solution |
| --- | --- | --- |
| 🟢 @yuvrajjsingh0 | Feb 15, 2025, 10:06:15 PM | WIP |
| 🟢 @Ritesh2351235 | Feb 17, 2025, 4:53:22 AM | WIP |

@kodjima33 kodjima33 added Paid Bounty 💰 flutter flutter work backend Backend Task (python) labels Feb 14, 2025
@kodjima33 kodjima33 moved this to Someday in omi TODO / bounties Feb 14, 2025
@ayewo commented Feb 14, 2025

@kodjima33 how can I get my hands on the omi hardware?

@yuvrajjsingh0 commented Feb 15, 2025

/attempt #1249

Hi, if we are doing it on device, I'd suggest using the device's default speech-to-text functionality, as that is hardware-accelerated and optimized for that device. It's available on both iOS and Android, and it can run in real time.

I will make use of the device's speech-to-text. Using Whisper is fine, but Whisper is a large transformer-based model, which can bloat the application, and using it on low-end devices will make the app suffer crashes. I have previously worked on integrating Tesseract natively on Android devices, and from that experience I can say that using Whisper locally is never an option, as it will only work well on high-end devices.
@kodjima33
Here's a sample app I created in Flutter and a demo of it on iOS:
https://github.com/user-attachments/assets/6511fc7a-7c15-433e-a8c5-79870658e270
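
On iOS, the platform route suggested above maps to Apple's Speech framework. A hedged sketch of real-time recognition (authorization prompts and error handling omitted for brevity):

```swift
import Speech
import AVFoundation

// Sketch of the built-in iOS speech-to-text path (SFSpeechRecognizer).
// Android's android.speech.SpeechRecognizer is the rough equivalent.
final class LiveTranscriber {
    private let engine = AVAudioEngine()
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private let request = SFSpeechAudioBufferRecognitionRequest()

    func start() throws {
        request.shouldReportPartialResults = true   // stream interim hypotheses
        request.requiresOnDeviceRecognition = true  // keep it local where supported

        let input = engine.inputNode
        input.installTap(onBus: 0, bufferSize: 1024,
                         format: input.outputFormat(forBus: 0)) { buffer, _ in
            self.request.append(buffer)             // feed mic audio to the recognizer
        }

        recognizer?.recognitionTask(with: request) { result, _ in
            if let result {
                print(result.bestTranscription.formattedString)
            }
        }

        engine.prepare()
        try engine.start()
    }
}
```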


@Ronuhz commented Feb 16, 2025

@yuvrajjsingh0 The problem with using the platform's own STT is that you won't have speaker separation. Whisper Tiny needs less than a GB of VRAM and storage. It should be downloaded on demand and NOT be included in the bundle. It can be run on the ANE on Apple devices at least; sadly, I can't speak to Android, because it's not my area of expertise.

@yuvrajjsingh0

@kodjima33 Okay, if we want to use Whisper, do we need this transcription in real time, or will we be doing it on saved audio?

There is also the option of running a speaker-recognition model over the audio that will tell us who is speaking in which timeframe, then using STT to transcribe each segment.

@Ritesh2351235 commented Feb 17, 2025

/attempt #1249
Hey @kodjima33, here is my take on local transcription for Omi.

Why Whisper Tiny?

Mobile-first: Tiny (39M params) is built for edge devices. I ran tests on an iPhone 11: ~150-300 ms per audio chunk, no server calls. For Android, TFLite/MediaPipe can handle it, though we'll need to optimize GPU delegation for weaker devices.

ANE on iOS: WhisperKit (Swift) taps into Apple's Neural Engine. Battery drain is minimal compared to CPU-only inference. Demo here: got it working in a test app with real-time streaming.

Supports multiple languages.

Avoid app bloat: ship the model (~150MB) via CDN (Hugging Face Hub?) post-install. No need to bake it into the bundle (sketch of that flow below).
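
A sketch of that download-on-demand idea, assuming a zipped model artifact hosted on some CDN; the URL and file names here are made-up placeholders:

```swift
import Foundation

// Fetch the model on first launch instead of bundling it.
// The CDN URL is a placeholder, not a real endpoint.
func ensureModelOnDisk() async throws -> URL {
    let dest = URL.documentsDirectory.appending(path: "whisper-tiny.zip")
    if FileManager.default.fileExists(atPath: dest.path) {
        return dest // already downloaded on a previous launch
    }
    let remote = URL(string: "https://example-cdn.invalid/whisper-tiny.zip")!
    let (tmp, _) = try await URLSession.shared.download(from: remote)
    try FileManager.default.moveItem(at: tmp, to: dest)
    return dest
}
```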

Alternatives I tested (and why they suck):

Platform STT (Android/iOS APIs):
Pros: low latency, free.
Cons: no speaker diarization, struggles with accents/background noise. Tried it; accuracy tanks in noisy environments.

Distil-Whisper/Hugging Face models:
Smaller, but multilingual support is spotty. Whisper Tiny handles ~100 languages out of the box.

Larger Whisper models (Base/Medium):
Overkill. Medium needs ~5GB of RAM; not happening on phones.

Implementation Plan

iOS:
Use WhisperKit (Swift) for ANE-accelerated inference. Wrote a PoC; it's ~20 lines of Swift to hook into mic input and stream to the model.

Android:
Option A: MediaPipe's TFLite build (C++ → Kotlin/JNI).

Option B: Transformers Android (Java), but it might need model quantization.

Speaker Diarization Hack:
Whisper doesn't do this natively. Workaround: add Silero VAD to detect pauses/speaker changes (toy sketch below). Not perfect, but it gets us 80% of the way there without cloud calls.

Using Whisper Tiny on the device is possible. The trade-offs are a slightly bigger app size after downloading and some tweaks needed for speaker identification. But it's worth it for better privacy and lower server costs.

@Ronuhz, I saw that you're working on Whisper Tiny. Let me know if you're open to collaborating on this.

@louis030195

what about using https://github.com/mediar-ai/screenpipe/tree/main/screenpipe-audio

it's pure Rust, meaning you can make it mobile-friendly easily
