proposal: generative AI instrument sample generation #4322
base: master
Conversation
This is not quite ready as a "PR". Some quick feedback:
Perhaps we use a random-words generator that strings together the following: instrument-adjective + instrument-noun + "with" + additional instrument adjective. For example, it may generate the sentence: "cedar top + acoustic guitar + with + a buzzy fret sound". This is somewhat inspired by the way Jitsi suggests room names for users. See https://meet.jit.si/
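The naming scheme described above could be sketched in Python as follows (the word lists here are placeholders; the real lists would be curated for Music Blocks):

```python
import random

# Placeholder word lists, following the pattern:
# instrument-adjective + instrument-noun + "with" + additional quality.
INSTRUMENT_ADJECTIVES = ["cedar top", "brass", "mellow", "bright"]
INSTRUMENT_NOUNS = ["acoustic guitar", "clarinet", "flute", "drum"]
EXTRA_QUALITIES = ["a buzzy fret sound", "a warm sustain", "subtle vibrato"]


def random_instrument_prompt(rng=random):
    """String together adjective + noun + 'with' + an extra quality."""
    return (f"{rng.choice(INSTRUMENT_ADJECTIVES)} "
            f"{rng.choice(INSTRUMENT_NOUNS)} with "
            f"{rng.choice(EXTRA_QUALITIES)}")
```

Passing a seeded `random.Random` instance as `rng` would make the suggestions reproducible for testing.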
@pikurasa Let's keep this as a draft PR. Thank you for your feedback.
@walterbender @pikurasa I think we should maintain consistency in the buttons and input fields used for widgets. There is an AI widget with a similar feature that takes user input and provides output. While the functionality is different, similar elements should have the same height, width, margin, etc. I think this will give the user a good experience.
I very much like the idea of this enhancement. But as Devin pointed out, we need to get the AI side working (and explore it some) before we settle in on the UI/UX details.
Before actually starting the coding part, I think designing the architecture (how it is going to work) is important. This is the design I came up with: I added an extra LLM layer between the user input and the Music LLM, because the Music LLM requires a detailed prompt describing the sound font to generate high-quality, accurate results. I believe students may struggle to write such a detailed prompt describing the sound font they have in mind. They might only provide a brief description, which may not accurately capture the sound font they envision. @walterbender @pikurasa What do you think about it?
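A minimal sketch of this layered approach, where the intermediate prompt-refinement LLM is stood in for by a simple template expansion (a real implementation would call an actual language model here):

```python
def refine_prompt(brief_description):
    """Stand-in for the intermediate LLM layer: expand a student's brief
    description into the detailed timbre prompt the music model needs."""
    return (f"A sustained instrument sample that sounds like "
            f"{brief_description}, with a clear attack, steady sustain, "
            f"and natural decay, recorded dry with no background noise")


def layered_generate(brief_description, music_model):
    """User input -> prompt-refinement layer -> Music LLM."""
    detailed_prompt = refine_prompt(brief_description)
    return music_model(detailed_prompt)
```

The point of the extra layer is that `music_model` always receives a fully elaborated prompt, even when the student only types a few words.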
Probably this layered approach will be necessary.
I researched open-source models for generating sound fonts and came across https://audioldm.github.io/. I tried it, and the results were good. The model requires a prompt to generate the sound, and the better the prompt, the better the results. The prompt was "A smooth, warm clarinet with a clear, sharp attack, transitioning into a mellow sustain, offering a soothing, rich tone with natural woodiness and subtle vibrato". (Attached sample: techno.mp4) @walterbender @pikurasa What are your opinions on it?
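For reference, AudioLDM is available through the Hugging Face `diffusers` library; a hedged sketch of how it might be invoked (the model id and step count are assumptions, and the pipeline requires `diffusers` and `torch` plus a model download, so the heavy call is kept inside a function):

```python
def build_generation_kwargs(prompt, seconds=5.0):
    """Collect the arguments we would pass to the AudioLDM pipeline.
    num_inference_steps is a guess at a speed/quality trade-off."""
    return {"prompt": prompt,
            "audio_length_in_s": seconds,
            "num_inference_steps": 10}


def generate_sample(prompt, seconds=5.0):
    """Run AudioLDM via diffusers (needs `pip install diffusers torch`)."""
    from diffusers import AudioLDMPipeline  # heavy import, done lazily
    pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
    result = pipe(**build_generation_kwargs(prompt, seconds))
    return result.audios[0]  # waveform as a NumPy array
```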
Seems like it has real promise. |
Yes, I will be exploring it also. |
@walterbender There is an `audio_length_in_s` argument; I think we can use it for the note duration. In the layered approach, we can extract the note duration and the description of the sound font. The note duration can be converted into seconds, while the description can be used as a prompt for a Music LLM, with the converted duration passed along. (Attached sample: techno.mp4)
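The duration conversion mentioned above could be a small helper like this (a sketch, assuming note values are expressed as fractions of a whole note and the beat is a quarter note):

```python
def note_duration_to_seconds(note_value, bpm=90.0):
    """Convert a note value (1.0 = whole note, 0.25 = quarter note)
    to seconds, assuming a quarter-note beat at the given tempo."""
    quarter_notes = note_value / 0.25
    return quarter_notes * 60.0 / bpm
```

The returned value could then be passed as the model's `audio_length_in_s` argument.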
@pikurasa What are your opinions on it? |
Yes, this is going in a good direction. Thanks for the research @haroon10725 |
@walterbender @pikurasa Should I try to find some more open-source models? Or is this fine?
@haroon10725 Can you please explain how you tested this model? I was also looking for an open-source model for the sample generator.
This model is probably fine, but it's nice to know what other models are available (if any). |
@therealharshit I tested this model on my computer. |
@pikurasa Thank you for your feedback. I have found some other open-source models and will share the results soon.
I researched some more open-source models. I tried them, and the results were better than the previous model's. The pro of this model is that it generated a good sound font without a detailed prompt. The con is that it is a heavy model and takes some time to generate the sound. (Since the model will be deployed on a server, I don't think this will be an issue.) The results are as follows. The first prompt was "something between a clarinet and a human singing 'ah'". The second prompt was "something between a heavy metal guitar and a lion roar". (Note: the audio converter added extra seconds while converting from .wav to .mp4. Please listen to the first 5 seconds only.) The good part is that we now have an option.
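The padding issue noted above suggests that, whichever model is chosen, generated audio may need to be trimmed back to its intended duration. A minimal sketch (treating the waveform as a flat sequence of samples):

```python
def trim_to_duration(samples, sample_rate, seconds):
    """Keep only the first `seconds` of audio, dropping any padding a
    generator or format converter may have appended at the end."""
    return samples[: int(sample_rate * seconds)]
```

If the audio is shorter than the requested duration, the slice simply returns it unchanged.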
Yes, that's great.
It's interesting. It's certainly good that we are also working on how to process a sample into sound fonts (i.e., virtual instruments) over the summer, as it seems all these generated sounds may need some extra processing before they can be useful for our needs.
https://huggingface.co/spaces/facebook/MusicGen |
@pikurasa @walterbender I think the server is busy. You can share some prompts or audio files, and I can try them on my computer. I will also keep an eye on whether the server is back up so you can try it too.
I found this interesting website, MusicGen by Facebook. It has a description of the model and some sound samples generated by it. The good part is that it generates high-quality samples and is open-source. I was thinking we could use this model for sample generation. So far this model looks better to me than the previous one. @walterbender @pikurasa What do you think about it?
Worth exploring. MIT License, which is good. |
@walterbender @pikurasa Can you please review this?