> **Important:** Created by Preternatural AI, an exhaustive client-side AI infrastructure for Swift. This project and the frameworks used are presently in an alpha stage of development.
The PhotoTranslator app leverages OpenAI's Vision API to bring translations seamlessly into the user's surroundings. Users simply take a photo, and the app identifies objects within the image using an on-device YOLO model. Creative sentences in the target language are then generated about the picture as a whole and about each object specifically, along with foreign-language audio produced via the ElevenLabs API, making learning a new language an engaging and immersive experience.
To install and run the PhotoTranslator app:
- Download and open the project
- Add your OpenAI API key in the `LLMClientManager` file:

```swift
// AIManagers/LLMClientManager
private static let client: any LLMRequestHandling = OpenAI.Client(
    apiKey: "YOUR_API_KEY"
)
```
You can get an OpenAI API key on the OpenAI developer website. Note that you must set up billing and add a small amount of credit for the API calls to work (this will cost less than one dollar).
- Add your ElevenLabs API key in the `TTSClientManager` file:

```swift
// AIManagers/TTSClientManager
static let client = ElevenLabs.Client(apiKey: "YOUR_API_KEY")
```
ElevenLabs is a text-to-speech service used in the PhotoTranslator app to generate audio of the translated sentence in the foreign language. You can get your ElevenLabs API key on the ElevenLabs website; the key is located in your user profile.
- Select the target language for translation. The app is currently set to Hindi:

```swift
// AIManagers/LLMClientManager
private static let targetLanguage = "Hindi"
```
- Create the target language speaker in `AIManagers/Speakers`. The app is currently set to a `HindiSpeaker`:

```swift
// AIManagers/Speakers
// Change the speaker to your target language.
// You can find a voice for your target language on the ElevenLabs website.
struct HindiSpeaker: Speaker {
    let speakerName: String = "Akshay"
    let elevenLabsVoiceID = "qO2mI1DuN2aagyvZHwwt"
}
```
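As a sketch of what a speaker for another language might look like, here is a hypothetical Spanish speaker. The `Speaker` protocol below is a minimal assumption inferred from the `HindiSpeaker` example (the real protocol lives in the project), and both the speaker name and voice ID are placeholders you would replace with a voice chosen from the ElevenLabs voice library:

```swift
// Minimal sketch: the protocol shape is assumed from the
// HindiSpeaker example above, not copied from the project.
protocol Speaker {
    var speakerName: String { get }
    var elevenLabsVoiceID: String { get }
}

// Hypothetical Spanish speaker; replace the name and voice ID
// with a real voice from the ElevenLabs voice library.
struct SpanishSpeaker: Speaker {
    let speakerName: String = "Mateo"       // placeholder name
    let elevenLabsVoiceID = "YOUR_VOICE_ID" // placeholder ID
}
```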
- Run the app on a device (iPhone, iPad, or Mac), as the camera is required to take a photo.
- Take a photo and wait for the app to generate creative sentences about it in your target language, with English translations.
Note: there is currently a known bug where the photo is rotated 90 degrees on iPhone and iPad.
The PhotoTranslator app was developed to demonstrate the following key concepts:
- Using OpenAI's Vision API
- Function calling to get structured data from LLMs
- Integrating ElevenLabs Multilingual Audio generation
The following Preternatural Frameworks were used in this project:
- AI: The definitive, open-source Swift framework for interfacing with generative AI.
- Media: Media makes it stupid simple to work with media capture & playback in Swift.
The PhotoTranslator app combines several AI frameworks in the following steps:
- The user captures a photo.
- The photo is analyzed by the on-device YOLOv8 model, which detects and identifies individual objects within the image. Each object is highlighted with a uniquely colored, numbered box. See `PhotoObjectDetectionManager` for the implementation.
- The processed photo is sent to OpenAI via the completion API with function calling. This step generates creative sentences in the app's target language about the picture as a whole and about each individual object identified in it. A transliteration and an English translation are also provided for each sentence. See `LLMClientManager` for the implementation.
- Finally, the translated text is converted into spoken audio using ElevenLabs' voice synthesis technology, so the user can learn how to say each sentence in the app's target foreign language. See `TTSClientManager` for the implementation.
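To illustrate the function-calling step, the structured data returned by the LLM can be decoded into a plain `Codable` struct. The sketch below is a minimal, self-contained example: the field names (`sentence`, `transliteration`, `translation`) are hypothetical stand-ins for the actual schema defined in `LLMClientManager`.

```swift
import Foundation

// Hypothetical shape of one sentence returned by the function call;
// the real schema is defined in LLMClientManager.
struct TranslatedSentence: Codable {
    let sentence: String        // sentence in the target language
    let transliteration: String // romanized pronunciation
    let translation: String     // English translation
}

// Example JSON payload, as the function-calling response might arrive.
let json = """
{
  "sentence": "मेज़ पर एक सेब है",
  "transliteration": "Mez par ek seb hai",
  "translation": "There is an apple on the table"
}
""".data(using: .utf8)!

let decoded = try JSONDecoder().decode(TranslatedSentence.self, from: json)
print(decoded.translation) // prints "There is an apple on the table"
```

Modeling the function-call output as a `Codable` type keeps the rest of the pipeline (display and TTS) free of ad hoc JSON handling.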
As a result, the PhotoTranslator app exemplifies the effective integration of diverse AI technologies to create a comprehensive and interactive language learning tool.
This package is licensed under the MIT License.