
proposal: generative AI instrument sample generation #4322

Draft
wants to merge 2 commits into master

Conversation

haroon10725
Contributor

@walterbender @pikurasa Can you please review this?


@pikurasa pikurasa marked this pull request as draft January 30, 2025 16:05
@pikurasa
Collaborator

This is not quite ready as a "PR".

Some quick feedback:

  • Keyboard shortcuts should be disabled when this is open.
  • We should add a few pre-made prompts to help users understand how to use this.**
  • We'll need to implement an API to a backend that does the work.

**Perhaps we could use a random-word generator that strings together the following: instrument adjective + instrument noun + "with" + an additional instrument adjective. For example, it might generate the sentence "cedar top + acoustic guitar + with + a buzzy fret sound". This is somewhat inspired by the way Jitsi suggests room names to users. See https://meet.jit.si/
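A minimal sketch of that idea in Python (the word lists and the random_prompt helper are hypothetical placeholders, not part of this PR):

import random

ADJECTIVES = ["cedar top", "brassy", "mellow", "breathy"]
INSTRUMENTS = ["acoustic guitar", "clarinet", "trumpet", "upright bass"]
EXTRAS = ["a buzzy fret sound", "a warm vibrato", "a sharp attack"]

def random_prompt():
    # Builds a sentence such as "cedar top acoustic guitar with a buzzy fret sound".
    return f"{random.choice(ADJECTIVES)} {random.choice(INSTRUMENTS)} with {random.choice(EXTRAS)}"

print(random_prompt())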

Screenshot from 2025-01-30 11-18-01

@haroon10725
Contributor Author

haroon10725 commented Jan 30, 2025

@pikurasa Let's keep this as a draft PR. Thank you for your feedback.

@haroon10725 haroon10725 reopened this Jan 31, 2025
@haroon10725 haroon10725 changed the title feat: add user prompt in sampler widget feat: generative AI instrument sample generation Jan 31, 2025
@haroon10725 haroon10725 changed the title feat: generative AI instrument sample generation proposal: generative AI instrument sample generation Jan 31, 2025
@haroon10725
Contributor Author

@walterbender @pikurasa I think we should maintain consistency in the buttons and input fields used across widgets. There is an AI widget with a similar feature that takes user input and provides output. While the functionality is different, similar elements should have the same height, width, margin, etc. I think this will give users a better experience.
What are your opinions on it?

@walterbender
Member

I very much like the idea of this enhancement. But as Devin pointed out, we need to get the AI side working (and explore it some) before we settle on the UI/UX details.

@haroon10725
Contributor Author

haroon10725 commented Feb 1, 2025

[architecture diagram]

Before actually starting on the code, I think it is important to design the architecture (how it is going to work).

This is the design I came up with: I added an extra LLM layer between the user input and the Music LLM because the Music LLM requires a detailed prompt describing the sound font to generate high-quality and accurate results.

I believe students may struggle to write such a detailed prompt describing the sound font they have in mind. They might only provide a brief description, which may not accurately capture the sound font they envision.
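A rough sketch of that two-stage flow (the function names and the wording of the expansion instruction are placeholders for illustration, not a settled design):

# Placeholder stubs; the actual backends have not been chosen yet.
def call_text_llm(instruction):
    raise NotImplementedError("text LLM backend not selected")

def call_music_model(detailed_prompt):
    raise NotImplementedError("music model backend not selected")

def expand_prompt(short_description):
    # Stage 1: turn a brief student description into a detailed sound-font prompt.
    instruction = (
        "Rewrite this short instrument idea as a detailed description of the "
        "sound, covering timbre, attack, sustain, and character: " + short_description
    )
    return call_text_llm(instruction)

def generate_sample(short_description):
    # Stage 2: hand the expanded prompt to the music model.
    return call_music_model(expand_prompt(short_description))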

@walterbender @pikurasa What do you think about it?

@walterbender
Member

Probably this layered approach will be necessary.

@haroon10725
Contributor Author

haroon10725 commented Feb 3, 2025

I researched open-source models for generating sound fonts and came across https://audioldm.github.io/. I tried it, and the results were good. The model requires a prompt to generate the sound, and the better the prompt, the better the results.
The following sample was generated with it.

The prompt was "A smooth, warm clarinet with a clear, sharp attack, transitioning into a mellow sustain, offering a soothing, rich tone with natural woodiness and subtle vibrato"

techno.mp4

@walterbender @pikurasa What are your opinions on it?

@walterbender
Member

Seems like it has real promise.
It would be interesting to explore note duration as well.

@haroon10725
Contributor Author

Yes, I will be exploring it also.

@haroon10725
Contributor Author

haroon10725 commented Feb 4, 2025

@walterbender There is an audio_length_in_s argument; I think we can use it for note duration:
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=1.0).audios[0]

In the layered approach, we can extract both the note duration and the description of the sound font from the user's input. The note duration can be converted into seconds and passed as audio_length_in_s, while the description can be used as the prompt for the Music LLM.
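For reference, a minimal end-to-end sketch with the diffusers AudioLDM pipeline (the checkpoint name and the beats-to-seconds conversion are assumptions for illustration):

import scipy.io.wavfile
from diffusers import AudioLDMPipeline

# Assumed checkpoint; any AudioLDM checkpoint supported by diffusers should work.
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")

# Hypothetical note-to-seconds conversion: a quarter note at 60 bpm lasts 60 / 60 = 1.0 s.
note_duration_s = 60 / 60

prompt = "A smooth, warm clarinet with a clear, sharp attack and subtle vibrato"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=note_duration_s).audios[0]

# AudioLDM generates 16 kHz audio.
scipy.io.wavfile.write("sample.wav", rate=16000, data=audio)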

techno.mp4

@haroon10725
Contributor Author

@pikurasa What are your opinions on it?

@pikurasa
Collaborator

pikurasa commented Feb 5, 2025

Yes, this is going in a good direction. Thanks for the research @haroon10725

@haroon10725
Contributor Author

haroon10725 commented Feb 9, 2025

@walterbender @pikurasa Should I try to find some more open-source models, or is this fine?

@therealharshit
Member

@haroon10725 Can you please explain how you tested this model? I was also looking for an open-source model for the sample generator.

@pikurasa
Collaborator

@walterbender @pikurasa Should I try to find some more open-source models, or is this fine?

This model is probably fine, but it's nice to know what other models are available (if any).

@haroon10725
Contributor Author

@haroon10725 Can you please explain how you tested this model? I was also looking for an open-source model for the sample generator.

@therealharshit I tested this model on my computer.

@haroon10725
Contributor Author

@walterbender @pikurasa Should I try to find some more open-source models, or is this fine?

This model is probably fine, but it's nice to know what other models are available (if any).

@pikurasa Thank you for your feedback. I have found some other open-source models and will share the results soon.

@haroon10725
Contributor Author

I researched some more open-source models. I tried them, and the results were better than the previous one. The pro of this model is that it generates a good sound font without a detailed prompt. The con is that it is a heavy model and takes some time to generate the sound (since the model will be deployed on a server, I don't think that will be an issue). The results are as follows.

The prompt was "something between a clarinet and a human singing 'ah'"
https://github.com/user-attachments/assets/e32202e5-3bec-4a04-84ed-5a40c7d1426c

The prompt was "something between a heavy metal guitar and a lion roar"
https://github.com/user-attachments/assets/f14ea4b8-6953-4f85-ad74-5739c923a5be

(Note: The audio converter added extra seconds while converting from .wav to .mp4. Please listen to the first 5 seconds only)

The good part is that we have an option.
@walterbender @pikurasa What do you think about it?

@pikurasa
Collaborator

pikurasa commented Feb 11, 2025

The good part is that we have an option.

Yes, that's great.

@walterbender @pikurasa What do you think about it?

It's interesting.

Certainly, it's good that we are also working on how to process a sample for sound fonts (i.e., virtual instruments) over the summer, as it seems that all these generated sounds may need some extra processing before they can be useful for our needs.

@haroon10725
Contributor Author

haroon10725 commented Feb 21, 2025


https://huggingface.co/spaces/facebook/MusicGen
@pikurasa @walterbender You can try this open-source model; it is hosted (no need to download anything). It has all the functionality we discussed in the last meeting. Do give it a try and let me know your opinions.

@haroon10725
Contributor Author

haroon10725 commented Feb 23, 2025

@pikurasa @walterbender I think the server is busy. You can share some prompts or audio files and I can try them on my computer. I will also keep an eye on whether the server is up so that you can try it as well.

@haroon10725
Contributor Author

haroon10725 commented Feb 24, 2025

I found this interesting website: MusicGen by Facebook. It has a description of the model and some sound samples generated by it.

The good part is that it generates high-quality samples and is open source. I was thinking that we could use this model for sample generation. So far this model looks better to me than the previous one.
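For reference, a minimal sketch of calling MusicGen through the Hugging Face transformers API (the musicgen-small checkpoint, the prompt, and the token count are just example values):

import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["a warm clarinet playing a single sustained note"],
    padding=True,
    return_tensors="pt",
)

# max_new_tokens controls the output length; 256 tokens is roughly 5 seconds of audio.
audio_values = model.generate(**inputs, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_sample.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())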

@walterbender @pikurasa What do you think about it?

@walterbender
Member

Worth exploring. MIT License, which is good.
