My personal insights after weeks of using Rave #300

ballerburg9005 · 2024-03-10T22:13:58Z

ballerburg9005
Mar 10, 2024

NEW: There are 3 official tutorials now that explains quite a few bits and pieces that were previously unknown (one about training). Unfortunately still a lot of questions that this guide wrestles with remain unanswered. I have since then rewritten my guide to reflect all of the recent insigths from this and my own.

Watch the new tutorials:

I have raised a Github issue to improve documentation, and one of the suggestions was that community members could do this. This is why I want to post my personal experiences as a first step here.

I want to stress that I don't really have deeper insights into how Rave is programmed, or the more academic side of machine learning, and also not music generation in general. I am just a former sysadmin doing ML stuff as a hobby, basically to test my 3090. Therefore my understanding might be limited. And what I have experienced to be true doesn't necessarily reflect on what is truly the case, although I try my best to check if those things are universally so.

My experiences are more or less limited to the v2 model, 1/2 channels, wasserstein and an accidental messed up "discrete" model. I am not considering special application such as low latency or low compute power. Also I will only talk about Pure Data to run nn~ and not the commercial alternatives.

Also as time passes, things might change, so what's now broken might be not some months later. I'm just trying to document what you can expect and where you have to be careful, and can run into problems. And maybe no amount of fiddling can fix it for you.

First, read the README.md of course, but also watch all the Youtube videos (second one is not in README.md!):

Then follow this guide (at minimum read "my command chain from start to finish"), which contains critical information not mentioned elsewhere.

what to expect, quality of output

I would say the quality with v2 is really good, but you can still notice it isn't perfect. From the impressiveness of it, I would say it is somewhat similar as if going from 128kbps to 96kbps, or 256kbps to 128kbps if you will. But this doesn't pertain only or actually to pure audio data, but generally how the model learns the entire sound. It sounds quite good, but just by a notch, you can still notice it is slightly inferior to the original. With a lot of sounds, like drums and a lot of instruments, this loss of quality doesn't matter at all and might not even be audible. Conversely, people have suggested to just train the model longer in both phases to yield quasi-indistinguishable quality (which I have not tested to the extreme). It probably still depends on the sound somewhat how much this is true.

I think RAVE is great, extremely usable, and just the state of the art. Probably Rave is the best thing out there so far.

I was asked why the model doesn't properly transform a humming sound (voice) into a violin sound. This is because the human voice sounds so radically different to a violin. The model only really understands it's training data, and there is no special logic in Rave that makes the model detect melodies or musical patterns, other than whatever the model believes to make sense on it's own, by pure magic of machine learning (which is often not really logical or intuitive in human terms). You can use nn~ and Pure Data and various filters on your voice input, so it essentially sounds more like a violin in a very crude kind of manner. And then the model will be more inclined to actually play the melody you hum, and it will of course then sound very much like a violin. But it will still transfer features from the input (which is what makes Rave great), so it doesn't sound identical to a real violin. If you know better methods, let me know. There are other projects like MusicGen that take different approaches, and those are designed around feeding it note sheets or text data and such. This is not something that Rave can do (but could potentially do in the future, if someone programmed a prior model for this). It is pretty much a very candid low-level kind of transformation, but one that can work and yields much better results than other approaches. And if you want something higher level, you have to build something on top of it (like simply input wav filtering, or more advanced new prior designs) or choose a different project.

From what I understand, to make a Rave model output more coherent patterns, you can train a prior, which operates on the latent space only. The latent space is essentially a choke point designed into the model (16 channels at 20Hz), where you have the opportunity to manipulate between encoder and decoder, whatever the model distilled to be really meaningful inside it's own mystery black box understanding about the input audio. The 16 different channels will randomly resemble some kind of (possibly ineffable) feature quality, like "loudness" or "pitchy" or "vibrato-ish". So the prior is an extension that works on top of your model in this dimensional space. From how I understand it, the prior can for example make your model speak rather (unintelligible) words and sentences, than just single syllables. I don't have experience with that much yet, unfortunately. But I believe as it is now to use it out of the box, this is mainly useful to make your model output more coherent sounds, when not providing any, or not very definitive input sounds. The general experience seems that the output of the priors currently leaves much to be desired. But ostensibly, potentially somehow you can do all sorts of things with a prior. Maybe some day even feed it text commands and note sheets like in MusicGen, if someone does the programming for this.

With msprior it should be true that currently only the prior configs "modulated_alibi" and "rwkv_semantic" give you a "semantic control" interface, as shown on the screenshot on the msprior Github page. You can then use joysticks or motion sensors to perform semantic manipulations (e.g. how entire sentences are phrased and shaped), rather than just skew the latent dimensions of the model directly (e.g. how the pitch of the voice is or how soft/harsh it sounds). I am not sure though if you need to pick "discrete" config in rave as architecture for this to work properly. Please read "the actual training" on how you cannot combine "discrete" with other configs.

Check out RAVE-Latent-Diffusion. It generates audio unconditionally on the latent space of Rave, and in their demo makes it output very well sounding and very coherent techno music. You can also easily combine this with latent manipulations as suggested in "python script" here. At the end of this guide there are audio examples I generated from both things.

hardware and time requirements, cost and recommendations

Any GPU less beefy than a 3090 is potentially problematic and too slow.

The initial v1 config has been designed for 16GB VRAM, but with later configs this got bigger and bigger. The default batch size is 8, which directly correlates with VRAM used, and can be lowered if your GPU has too little VRAM. So batch size of 1 would be 2GB instead of 16GB ... however no one I spoke to knows if changing batch size or changing it by a lot affects (and potentially ruins) training results significantly. Rave is both compute-heavy and VRAM heavy. Changing batch size from 8 to 4 is about 5% slower.

Do not use Google Colab. They changed their plan some year back, and now you get like 10-20x less compute power for the same price. Before you could do 3/4 of a run or 1/2 of a run in 1 month on $10 plan (V100, almost half as fast as 3090). Depending on your settings, a 6M run takes somewhere in the range of 150-250 hours on V100, which currently costs about $400-$600 on Colab.

A used 3090 now costs $600 on Ebay and renting it is like $0.44/h (someone claimed $0.22/h is possible). For my country power is €0.33/kWh, so I pay like €20-€40 (€28-€55 with wasserstein and increased num_signal) for a sort of "production-level" model (2 ch takes twice as long) and it takes like 6/12 days (1ch/2ch), but at minimum 3/6 days to make it sound somewhat usable/workable (which again would be more like 8/16 days and 4/8 days with wasserstein +num_signal). To test the tools I used mono in my first run. Then there was an issue due to lack of / ambiguous documentation and I had to discard second run. So now with my third run, I am close to €100 in power cost alone.

I hope this guide helps you to avoid this extra cost, and why I try to be as comprehensive and verbose as possible.

On v2 2ch wasserstein causal with 4x the sample window on 3090, I currently get 250k steps per day in phase 1 and somewhat less than half that in phase 2 (without wasserstein training is about 40% faster overall, given that phase 1 is equal in length, which it is not by default). In phase 2 it uses 22GB VRAM with batch size 4. V2 without wasserstein, 1 channel and 2x sample window only took 20GB with batch size 8 (I think).

I did increase num_signals (sample window) though to this and that, which inflates thoses numbers quite some.

As I revised this guide, I am now inclined to recommend against using num_signal at all, because of such and maybe other unintended consequences.

If talking speed always use steps per day/hour as seen in Tensorboard, since the other measurements differ with different parameters.

running on Windows

If you use WSL2 on Windows, you basically get a Ubuntu Linux VM with CUDA Integration, so it shouldn't be an issue to do all this under Windows as well. If there are any issues, let me know so I can update this guide. The only annoying thing I remember about WSL2 is that you have to download it from the Microsoft store, and that the store needs you to register an account and malfunctions upon first use (packages claimed to be unavailable).

general procedure

Follow the README.md or this guide to train and export your model.

During training your model will generate checkpoints, which can be used to either export a model or resume training. The best.ckpt is a safe and much older version. Conversely with the epoch-epoch=XXXX.ckpt there are no 100% guarantees that it doesn't somewhat degrade your model somehow (a little bit only, I suppose). I don't know how important this really is, but I think it is not such a huge deal to use epoch-.. checkpoints a couple of times. Just, if you can avoid it, don't abort the training or wait for best.ckpt.

I highly recommend that you immediately export your very first checkpoint (& cancel training if necessary), then test it in nn~, generate your prior if desired (again only use first checkpoint to export it), and just test out if whatever tools you use actually work. Because a lot of stuff doesn't actually work and might never work. You don't want to train on "discrete" for 6 days and then notice it only produces silent output in nn~. Or that v3 config doesn't actually export. It is really important to test out the entire command chain before trying to produce anything of use. As a rule of thumb, mono is very well supported, but stereo is not really and sometimes creates issues, or doesn't allow you to progress. You certainly can create a normal 2 channel model though and prior (unknown if prior output is actually intelligible and it trains only on mono-audio, so not true stereo) that works in nn~. You can also use RAVE-Latent-Diffusion with 2 channel model via hack/fix. But as of right now, it will only work with mono wav files and produces mono output.

During training Tensorboard is very useful:

tensorboard --logdir=~/Violin_out_out/ --bind_all

In Tensorboard, you have three things are really important:

Frequently updated audio samples of your model (X seconds of original, then X seconds of model)
what phase your model is in and the total number of steps (phase 2 will show you "adversarial" graph)
The fidelity_95 curve which is should stabilize >500k between 3 (simple sounds) and 8 (complex sounds)

Rave always trains your model in two distinct phases:

Phase 1 is about teaching your model what the sounds are all about. In this phase, audio samples will always sound very very distorted and bad, stingy and offensive kind of bad. But you should very slowly notice very small improvements in how well the sound is reflected. With v2 the default length is 1M steps, with other configs like wasserstein the default is 200k steps. In this phase you should be able after a while to somewhat make out the sounds from the training data, somewhat as if listening to a very distorted analogue radio call. If you only hear totally pure noise or muted audio, then there is something wrong.

Phase 2, the adversarial phase, is when you will actually be able to notice much bigger improvements in audio quality. Although it will sound very noisy and bad for quite a while, if you reach some 500k steps of this phase it will sound better and better. After 1M steps or so, it might actually sound workable and good. People have recommended 3M steps total for good results, so 2M steps in this phase. If you run it longer, you have to watch out for bad effects such as overfitting. Make sure to compare your final model's output in Tensorboard as well as nn~ with various input wavs against a previous version that sounded almost as good. Also watch out for artifacts, sometimes those will be present from the beginning and never clear, so that's a bummer. I have often seen people ujse 6M total for best results.

As explained in the tutorial fidelity_95, states that the model believes to be able to explain 95% of the data with so many dimensions. It can fluctuate up and down in the beginning, but if it plummets below 3 and stays there for like 100k steps or more and until the end of phase 1, then it indicates that your model has degraded (unless your audio data is super simple). I have had that happen for example from 100k to 200k steps with extremely noisy mashup breakcore music and --config noise.

Otherwise the curves usually just go up and down with some smoothing and make less and less progress. Unless the phases switches (which is at 1M with default + v2), then the curve radically flips.

Generally speaking, the training works very reliable given that you had good input data. It is not as if you have to be afraid all the time that something is going wrong by chance. Probably whatever it is will just iron out over time.

my command chain from start to finish

Before installing rave with pip and using rave, make sure to use python-3.11 and pytorch-cuda, if your distribution offers multiple versions pytorch. There might also be multiple -cuda and -rocm versions for other APIs, but I think you only need pytorch for Rave. You can also just use conda on Linux, which installs custom python version and all the packages into a seperate environment. I am using 3.10.9 in conda currently because Archlinux uses 3.12 and it doesn't work.

First, gather your input data. Ideally you want very clean studio-level recordings of your sounds (no noise, echos, etc.) and you want lots and lots of data. A minimum of 2-3 hours has been recommended, but the more the better. In order to not run into IO slowdowns, use either a few GB less than RAM size or read from NVMe/SSD. If you train on just a few minutes of data, results might be way too poor and also your training might fail. Rave will learn any and all sounds from the data, including noise.

Do not use --lazy, simply convert your sound data to pcm_s16le, with either two channels or one channel as desired for the model.

2 channels ffmpeg prepping

IFS=$'\n'; for i in `find ./ -name '*.mp3' -maxdepth 1`; do ffmpeg -i "$i" -c:a pcm_s16le -ar 44100 -ac 2 -y "${i##*/}".wav && touch -r "$i" "${i##*/}".wav; done

For 1 channel downmix, change -ac 2 to -ac 1.

1 channel, left only:

IFS=$'\n'; for i in `find ./ -name '*.mp3' -maxdepth 1`; do ffmpeg -i "$i" -c:a pcm_s16le -filter_complex '[0:a]channelsplit=channel_layout=stereo:channels=FL[left]' -map '[left]' -ar 44100 -y "${i##*/}".wav && touch -r "$i" "${i##*/}".wav; done

Now put all the wav files into a new folder "./raw_wav_files/".

run rave preprocess

rave preprocess --input_path ./raw_wav_files/ --output_path ./output_pp/ --channels 2 --sampling_rate 44100

It is only possible to alter the sampling rate to 22050, whereas 44100 is the default. Change channels as desired. I now recommend against using num_signal. From what I understand now, num_signal is the raw sample length in which your data is chopped up. It is unknown how that alters the model's behavior. If you want to, it is mandatory to use (some sort of) power of two for this. The default value is 131072 which is 3 seconds. When you double this number, it halves the number of steps and hence doubles the length of your training. It also doubles the amount of VRAM required. This perhaps increases output quality and addresses issues such as longer-than-num_signal audio patterns being chopped up too much to be learned properly, or too much chopping resulting in a bad understanding. But in the end no one knows at this point. It could also mess things up like possibly altering batch size could. I have trained multiple times with increased num_signal (since I had long coherent sound patterns that last 10-20 seconds) with good results though. Maybe someone with a better understanding can clarify this more.

the actual training

rave train --config v2 --config wasserstein --override PHASE_1_DURATION=1000000 --db_path ./output_pp/ --out_path ./output/ --name Mymodel --gpu 0 --val_every 5000 --channels 2 --batch 8

--config v2: I highly recommend to use this, as it is the most trialed and tested config
--config wasserstein: this selects an alternative encoder that's supposedly better in quality (I suspect this could give you bad results with msprior and probably Rave prior also: unknown!)
--override PHASE_1_DURATION=1000000: The default is 200k for wasserstein (1M for v2), but I have seen people use more ... maybe it is better? like you can see, with this command you can override any sort of parameter from the config files
--val_every 5000: this is approx every 20 minutes (12 in phase 1 w/o wasserstein) to create checkpoints on 3090 (default 10000)
--n_signal XXXXX: this must match the num_signal parameter used in preprocessing, don't use: increases VRAM and training time
--config causal: lowers latency, but as per official tutorial it is confirmed to lower model quality
--batch 8: it is unknown/risky to change this and increasing/decreasing it doesn't really result any differences in speed. Only lower this value if your GPU is out of memory
--config noise: I used this on noisy music (which is sort of the purpose of this switch if I understand it right) and my model degraded - it is somewhat mysterious what this truly does and when it is proper to use it (so maybe it is better to abstain from using it at all)

Now it is paramount to understand, that you cannot wildly combine config parameters with each other, even if you have seen this done elsewhere, and it doesn't result in error messages. To be really sure, you have to actually open the config files in a text editor, and check if they contain conflicting information, or otherwise seem to make sense with each other. For example --config discrete overrides the encoder specified by --config wasserstein, so this will give you botched results, without being immediately obvious.

It is unfortunately not exactly documented on what configs work with each other and even what all the different configs do. There is for example a "discrete_v3". So this seems to suggest to me, that using just "discrete" with v3 can yield worse results, or is otherwise not functioning well, but with v2 there is no such issue. But it could also mean, that this is just the third version of the discrete config, and the second version of it was somehow trash and discarded. Someone in Discord, if I understood them correctly, also said that you can't combine discrete with v1/v2/v3 in general (even though people have done this in Discord and there are Colab notebooks which suggest it, even with wasserstein, which is quite certainly wrong). I think it is certainly so, that you can combine wasserstein with v2 and maybe v3, but then you have to be really really careful when adding other stuff on top of it. Check out the test_configs.py file to see what configs are certain to be safe to combine. This is not to say v3 wouldn't work with wasserstein because it isn't listed (?), but just that apparently this automated test doesn't account for it.

We have tried "discrete" and it always produces silent output (in Python script and nn~), no matter what you do and if you use it with msprior. Is this a bug, how does it work? The "discrete" encoder, it remains a mystery. In the tutorial it was confirmed that discrete is basically only good for msprior. But like I said, it was broken with msprior for us also. So I would say just don't use discrete.

In the tutorial it was explained, that wasserstein increases quality at the penality of decreased latent representation and generalization. So it doesn't seem like a good bargain to me.

Now the model will start training, and you see the steps in the progress bar per epoch. Check your progress in Tensorboard. After 5000 steps, it will give you a checkpoint that you can export, or use to resume training with --ckpt.

exporting the model

If you have not chosen to train mono audio, you have to add this information to the config.gin in the output/ path:

model.RAVE.n_channels = 2

To export your checkpoint, run the export command like this:

rave export --run output/Mymodel_5b61af7ec4/ --channels 2 --sr 44100

Now you can load your model into nn~ with Pure Data or use the following Python script (adapted from the 30m Rave demo video).

Unfortunately, rave generate currently neither works with mono nor stereo models.

python script

I have tried to fix this script to work with stereo, but somehow it only produces double-mono with my model (but in nn~ it does produce actual stereo). It doesn't seem to be that simple to do. At least though it doesn't fail with 2 channel model.

import torch
import librosa as li
import soundfile as sf
import argparse

parser = argparse.ArgumentParser(description='Example python script to generate & manipulate audio with RAVE.')
parser.add_argument('--model', type=str, help='The model file (needs non-streaming model?)')
parser.add_argument('--input', type=str, help='Input wav (please pre-convert with ffmpeg)')
parser.add_argument('--output', type=str, default="<<unset>>", help='Default = [input]_out.wav')
parser.add_argument('--duration', type=float, default=9000001.0, help='Cap input wav length')
args = parser.parse_args()

if not (args.model and args.input):
#    parser.print_help()
    parser.error('Needs more arguments.')
if args.output == "<<unset>>":
    args.output = args.input+"_out.wav"

torch.set_grad_enabled(False)
model = torch.jit.load(args.model).eval()
x = li.load(args.input, **mono=(True if model.n_channels == 1 else False)**, **sr=44100**, duration=args.duration)[0]
x = torch.from_numpy(x).**reshape(1,model.n_channels,-1)**
z = model.encode(x)

## skews latent dimension #6 from -2 to 2 - uncomment if desired
# z[:, 5] += torch.linspace(-2,2,z.shape[-1])

## interlaces data with zeros
# interlaced_z = torch.zeros(z.shape[0], z.shape[1], z.shape[2]*2, device=z.device, dtype=z.dtype)
# interlaced_z[:, :, ::2] = z
# z = interlaced_z

## elongate 2x
# z = z.unsqueeze(-1).expand(-1, -1, -1, 2).reshape(z.shape[0], z.shape[1], -1)

## skews latent dimension #4 by +2 and interlaces it
# zz = z.clone()
# zz[:, 3] += torch.linspace(2,2,z.shape[-1])
# z[:, :, 1::2] = zz[:, :, 1::2]

y = model.decode(z).detach().numpy().reshape(-1, model.n_channels)
sf.write(args.output, y, 44100)

python generate.py --model Mymodel_2ch_wsreal_5b61af7ec4/version_3/checkpoints/Mymodel_2ch_wsreal_5b61af7ec4.ts --input something_else.wav --duration 30 && mplayer something_else.wav_out.wav

You can simply copy & paste those latent alterations in other audio generation projects, like RAVE-Latent-Diffusion, right before the rave.decode(z) step.

compiling nn~

If you are on Windows, it ships precompiled.

Unfortunately I don't really recall much about compiling nn~, but I think it was a little bumpy. I think the first issue was not having the -cuda version of pytorch installed, and the second issue was that I had to specify where it was installed, like so:

TORCH_INSTALL_PREFIX="/usr" cmake ../src/ -DCMAKE_BUILD_TYPE=Release

Then you have to put it into the proper PureData directory:

cp ./frontend/puredata/nn_tilde/nn~.pd_linux ~/.local/lib/pd/extra/nn\~.pd_linux

using PureData

PureData doesn't seem as well supported. The Max software has a 30 day free trial, but it is Windows only.

This program might seem quite awkward to use at first but it is actually not that bad.

Go to File->Preferences->Edit P.. and enter a new search path. This path should contain all your stuff, like input wavs and the model.ts file. Now hit File->New.

The following things are not really obvious at first:

Press Ctrl + E to toggle between edit mode and normal mode
If you Put->Object, you have to write out the module name to load, followed by its parameters, it will tell you whether or not it failed in the initial "debug" window of Pd
Put->Message allows you to specify "message" type parameters to objects, but only if you click on them in normal mode (and they are connected to object)
There are all sorts of modules with weird names and mysterious functions that you may or may not find in the help files
dac~ is audio output and adc~ is input (probably your mic won't work because the API is very basic)
In order to play back a wav file, you have to create a readfs~ object and then connect two "messages" at the top ... 1. "open input.wav 1", 2. "1". Then in normal mode, you have to click them in that order. Unfortunately, it doesn't loop even though the second parameter to first message (1) should accomplish this.
Turn on audio with Media->DSP On

The rest should be much more obvious and you can see how to work with it from the Youtube video I posted in the beginning at the end, or other videos.

What I found most helpful so far is osc~, noise~, -~ and *~. For example feed noise~ to *~ on the left side and osc~ 0.5 to the right side, it will oscillate the noise with 0.5Hz. Then feed that signal to left side of another *~ and connect a slider on the right side. Right-click slider and set range between 0 and 1 => simple volume control. If you feed a signal to -~ on the left and a *~ with a slider on the right, it will subtract that slided signal from the other signal. Then there are filters like bp~ (bandpass). Simply connect the signal to left of bp~ and a slider (or Number, does the same thing) on the right, set it to like 0.02 and at the top another slider.

This is immediately useful to manipulate the latent space an input wav and should give you quite an interesting experience. But Pure Data is so much more capable of doing so much more stuff, you should really check out what else it can do.

When it comes to actually loading your model with nn~, do it as suggested in the docs with two objects like so: "nn~ mymodel.ts encode 40000" and "nn~ mymodel.ts decode 40000". Connect everything, feed it input, you should hear the output. Notice the 40000 value at the end, which is the buffer size. This buffer size is outrageously huge (many seconds). But what I have found is that with a low buffer and the normal buffer size, the output sounded very wobbly and distorted. So far I have never bothered to try to fix this. You should check as well if doing the same results in remarkable improvements and then lower it further such that it becomes more usable.

For some reason nn~ didn't run in GPU mode for me and it needs like Ryzen 5 5XXX at least to run somewhat well in CPU mode. I hacked the source code to bypass the GPU check, which is trivial and maybe not required for you, so I won't explain this further. Just be aware that it can run in GPU mode and that stuttering etc. in CPU mode is normal, if you don't have a beefy and new CPU.

Now the model does accept various "messages" which you can connect at the top left, as you can do with readsf~. This is important to keep in mind when dealing with the prior. You have to check the source code what those messages are (documentation lacking), some are sometimes visible in the demo videos and in some example screenshot somewhere. For msprior there is for example "set temperature XXX" and "set listen true", "set listen false". Any nn~ model should also accept "gpu true" or "set gpu" or something like this, but this did not work for me. I think it only works in the Max plugin.

Generating and using the prior

From my experience with simple v2 mono test run, I am fairly certain that you currently need to use msprior and that "rave train_prior" is abandoned/broken (or only works with plain v1?). "rave train_prior" didn't work with mono or stereo either way, not matter what I tried. But I was able to train msprior with a 2 channel model, when I converted the input audio in preprocessing to mono! It remains to be seen though if the output is intelligible. To reflect stereo sound properly, it must necessarily also be able to train on stereo, which is clearly not the case. There is also the question of whether or not using anything but v1 (like v2/v3 and wasserstein also) will influence prior training in a bad way. Other people said to me that their prior output also was not coherent and intelligible. But personally I have probably not let it run long enough to reach certain conclusions.

If you follow the documentation, the process should be very simple. The only issue with msprior docs is, that the config files don't match and it is not obvious what the new configs correspond to. I have just randomly picked rwkv for a short test, but the test more or less yielded garbage-ish results and then I stopped caring. What you probably wanted in the first place is encoder_decoder ... so maybe the next best thing is "modulated_alibi" (not tried it) or "rwkv_semantic" (fails with error) now? I don't know. I can only tell you that it fails if you don't supply a config parameter.

If I used "set listen" it only generated pure noise, it just seemed defective. Then I actually supplied --continuous when exporting as required, but it was still only some kind of deep-ish noise, until I figured out to use "set temperature 200" and then "set reset". But what it generated didn't sound much different than what it generated without the prior just from silence or a little noise input. Even worse it generated (very random and incoherent) noises at a rate about 10x faster than what I felt was desirable for my purposes, and lowering temperature didn't really improve this, and in the lower "7x too fast" range, it was only this deep noise again. I also didn't get any of the "semantic control" inputs, which I now assume to be only provided by "modulated_alibi" or "rwkv_semantic".

Like in the Github issue I mentioned in the beginning, there are many questions about how to use this stuff properly, and how it interacts with different configs. So it might or might not be basically garbage with this and that combo, I don't know. For example msprior complains (but doesn't fail) if you don't use a "discrete" model, and that it limits functionality / "pretrained_embedding". But what exactly does this mean now to the end result? From what I understand, "discrete" is kind of bad for quality (and we were only able to produce silent output with it, no matter if prior was used or what not) so I rather take my chances without it. Does the prior work better or worse with "causal", no idea.

Considering it didn't really turn out well, those would be the commands I used:

# use ffmpeg command to convert all your audio data to mono first!
rave export --run output/Mymodel_5b61af7ec4/version_0/checkpoints/epoch-epoch=XXXX.ckpt --nostreaming
msprior preprocess --audio raw_wav_mono/  --out_path Mymodel_prior/ --rave output/Mymodel_5b61af7ec4/version_0/checkpoints/Mymodel_2f16146f62.ts --num_secs 32 --num_tokens 256 # choose last 2 params (length & vocabulary size) to your liking (16 & 1024 are default)
msprior train  --config rwkv --db_path Mymodel_prior/ --name Mymodel output/Mymodel_5b61af7ec4/version_0/checkpoints/Mymodel_e18d54798e.ts --gpu 0 --val_every 512
msprior export --run runs/Mymodel_msprior/ --continuous # append this if not using "discrete"

As mentioned, it seems to me like the prior functionality is kind of neglected and it only really yields usable results with very specific not or not explicitly documented config combinations, maybe in both the model training, and the prior training. And maybe you have to make a lot of quality sacrifices for a prior to work (i.e. regress to plain v1-only), like you see in the demo video. The docs are very suggestive to me, that you basically need your rave model to train on "discrete" architecture to produce the more desirable and functional results. But whether or not that is really so, is rather ambiguous and not explicitly stated. "Discrete" doesn't seem to work at all for me and someone else also.

I hope such questions can be addressed better in the future by more documentation.

Rave-Latent-Diffusion

I have briefly tried RAVE-Latent-Diffusion (unconditional audio generation) and it worked for my v2 mono test model. But sadly it doesn't seem to support 2 channels (see Github issue). I am no longer really trying to fix it.

Commands used:

# convert all audio data to mono first using ffmpeg!
# 2 channel fix below will only make your model output double-sided mono!
python preprocess.py --rave_model output/Mymodel_5b61af7ec4/version_3/checkpoints/Mymodel_5b61af7ec4.ts --audio_folder Mymodel_mono/ --latent_length 2048 --sample_rate 44100 --latent_folder latent_out/
python train.py --name Mymodel --latent_folder latent_out/ --save_out_path latent_out_out/ --batch_size 32 --max_epochs 6250 --save_interval 250 # batch size is for 24GB VRAM, max_epochs must match 25000 / (32/8)
python generate.py --model_path latent_out_out/Mymodel_best_epoch1153_loss_0.9473531246185303.pt --rave_model output/Mymodel_5b61af7ec4/version_3/checkpoints/Mymodel_5b61af7ec4.ts --diffusion_steps 200 --seed 990021 --output_path ./ --latent_length $((2048*8)) --length_mult 1 --temperature 1.33333333 --seed_a 89098 --seed_b 54309 --lerp=True

Here is my fix to make it work with 2 channels. This will however only output double-sided (identical) mono, like the "python script". I don't understand why that is.

1. Replace "reshape(1, 1, -1)" at line 43:
x = torch.from_numpy(audio_data).reshape(1, rave.n_channels, -1)

Here is the final output. I put two versions on Youtube with different temperature. Please note that I simply ran the generator twice and then combined the mono audios to stereo.

ASMR Rapunzel model: (click image)

Model files: https://mega.nz/file/ZfI1WCjT#UAu4I5HM_YIhfVFICrgTpIGLllauAsfs-iT-plJJnVQ

video converter

Here is some bash mumbo-jumbo that I have used to convert Youtube videos. The idea is to pass the sound through a bunch of filters, namely amplification + limiters + high and lowpass, so it fits better to whatever sounds the model understands. The script is really ugly trash with lots of deficits. But hey, it works. Use AMP to make silent sounds more loud, VOL to lower volume and OUTLEVEL for the final volume level to the model. SPEED is pitch, not actually speed. It is fiddly to make this turn out right. You basically want to raise OUTLEVEL to 1.0 and AMP to 2-6, then find the right pitch. But you will probably hear the original sound "punching through" the model. So you have to adjust the OUTLEVEL again to something like 0.1-0.6. But not as low as that the model wouldn't reflect the sound anymore or only poorly so. In theory the limiter should mostly accomplish this automatically, but in doesn't actually work this way for whatever reason.

#!/bin/bash
MODEL="model.ts"
SPEED=0.8
VOL=1.0
AMP=4
OUTLEVEL=0.4

TOOO="" # 00:01:25"
FROMM="" # 00:00:25"
EXTRA="false"
DOUBLEDEEP="false"
HIGHPITCH="false"

INPUT="$1"

ffmpeg -i "$INPUT" `if [[ "$TOOO" =~ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] ]]; then echo "-to $TOOO"; fi` `if [[ "$FROMM" =~ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] ]]; then echo "-ss $FROMM"; fi` -acodec copy -vcodec copy -c:a pcm_s16le -ac 1 -ar 44100 -y /tmp/out0.wav
ffmpeg -i "$INPUT" `if [[ "$TOOO" =~ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] ]]; then echo "-to $TOOO"; fi` `if [[ "$FROMM" =~ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] ]]; then echo "-ss $FROMM"; fi` -vcodec copy -af volume=0.0 -y /tmp/out."${INPUT##*.}"

ffmpeg -i /tmp/out0.wav -af "alimiter=level_in=1:level_out=1,alimiter=level_in=1:level_out=1:limit=0.625:level=false,volume=$AMP,alimiter=level_in=1:level_out=$AMP:limit=0.625:level=false,alimiter=level_in=1:level_out=0.3:limit=0.625:level=false" -y /tmp/out00.wav

ffmpeg -i /tmp/out00.wav -c:a pcm_s16le -ac 1 -af asetrate=44100*$SPEED,aresample=44100,atempo=1/$SPEED -y /tmp/out1.wav 


ffmpeg -y -i /tmp/out1.wav -af highpass=f=100,highpass=f=200,highpass=f=200,highpass=f=300 /tmp/newout.wav

ffmpeg -i /tmp/out00.wav  -af volume="`echo "$VOL * 0.2" | bc --mathlib --quiet`" -y /tmp/out00t.wav
ffmpeg -i /tmp/newout.wav -af volume="$VOL" -y /tmp/out00tt.wav

ffmpeg -i /tmp/out00t.wav -i /tmp/out00tt.wav -filter_complex amix=inputs=2:duration=longest -y /tmp/outt.wav

# adds on top of other sound
if [[ "$DOUBLEDEEP" == "true" ]]; then
	ffmpeg -i /tmp/out1.wav -c:a pcm_s16le -ac 1 -af asetrate=44100*$SPEED,aresample=44100,atempo=1/$SPEED,volume="`echo "$VOL * 1.0" | bc --mathlib --quiet`" -y /tmp/out2.wav
	ffmpeg -i /tmp/out2.wav -c:a pcm_s16le -ac 1 -af asetrate=44100*0.5,aresample=44100,atempo=1/0.5,volume="`echo "$VOL * 1.0" | bc --mathlib --quiet`" -y /tmp/out21111.wav
	ffmpeg -i  /tmp/out2.wav -i /tmp/out21111.wav -filter_complex amix=inputs=2:duration=longest -y /tmp/outtddNUUU.wav
	ffmpeg -y -i /tmp/outtddNUUU.wav -af highpass=f=60,highpass=f=100,highpass=f=200,highpass=f=200,highpass=f=300 /tmp/newout_deep.wav
	ffmpeg -i /tmp/newout_deep.wav -i /tmp/outt.wav -filter_complex amix=inputs=2:duration=longest -y /tmp/outtdd.wav
	mv /tmp/outtdd.wav /tmp/outt.wav
fi

# adds on top of other sound
if [[ "$HIGHPITCH" == "true" ]]; then
	ffmpeg -i /tmp/out1.wav -c:a pcm_s16le -ac 1 -af asetrate=44100*2,aresample=44100,atempo=1/2.0,volume="`echo "$VOL * 0.05" | bc --mathlib --quiet`" -y /tmp/out2.wav
	ffmpeg -i /tmp/out2.wav -c:a pcm_s16le -ac 1 -af asetrate=44100*2,aresample=44100,atempo=1/2.0,volume="`echo "$VOL * 2.4" | bc --mathlib --quiet`" -y /tmp/out26.wav
	ffmpeg -i /tmp/out26.wav -c:a pcm_s16le -ac 1 -af asetrate=44100*2,aresample=44100,atempo=1/2.0,volume="`echo "$VOL * 4.4" | bc --mathlib --quiet`" -y /tmp/out266.wav
#	ffmpeg -i /tmp/out266.wav -c:a pcm_s16le -ac 1 -af asetrate=44100*2,aresample=44100,atempo=1/2.0,volume="`echo "$VOL * 1.4" | bc --mathlib --quiet`" -y /tmp/outtNU.wav
	ffmpeg -i /tmp/out266.wav -i /tmp/out26.wav -filter_complex amix=inputs=2:duration=longest -y /tmp/outtyes.wav

	ffmpeg -y -i /tmp/outtyes.wav -af highpass=f=600 /tmp/newout_deep.wav
	ffmpeg -i /tmp/newout_deep.wav -i /tmp/outt.wav -filter_complex amix=inputs=2:duration=longest -y /tmp/outtdd.wav
	mv /tmp/outtdd.wav /tmp/outt.wav
fi


ffmpeg -y -i /tmp/outt.wav -af "lowpass=f=20000,lowpass=f=15000,lowpass=f=12000,lowpass=f=10000,highpass=f=60,highpass=f=100,highpass=f=120,volume=5,alimiter=level_in=1:level_out=1:level=true,alimiter=level_in=1:level_out=$OUTLEVEL:level=false" /tmp/outt_final.wav

python gen.py --model "$MODEL" --input /tmp/outt_final.wav --output /tmp/newout2.wav 

# this feeds the model output again through the model a second time
if [[ "$EXTRA" == "true" ]]; then
#	ffmpeg -i /tmp/newout2.wav -c:a pcm_s16le -ac 1 -af asetrate=44100*$SPEED,aresample=44100,atempo=1/$SPEED -y /tmp/newout1.wav && cp /tmp/newout1.wav /tmp/newout2.wav
	ffmpeg -y -i /tmp/newout2.wav -af highpass=f=200,volume=0.6 /tmp/newout.wav 
#	ffmpeg -y -i /tmp/newout.wav -af highpass=f=100 /tmp/newout2.wav 
#	ffmpeg -y -i /tmp/newout2.wav -af highpass=f=200 /tmp/newout.wav 
	python gen.py --model "$MODEL" --input /tmp/newout.wav --output /tmp/newout2.wav 
fi

echo "+++ FINAL ENCODE"
ffmpeg -i /tmp/out."${INPUT##*.}" -i /tmp/newout2.wav -filter_complex amix=inputs=2:duration=longest -vcodec copy -y out."${INPUT##*.}"

echo out."${INPUT##*.}"

mplayer `bash convert.sh welding.webm | tail -n 1`

ASMR Rapunzel model:

welding_example.out.out.mp4

wood_turning_example.mp4

Well, it is a work in progress... :D

end

This is pretty much all I know so far. Please correct me if you find anything wrong or if you know something better.

Best of luck!

Federico8691 · 2024-03-23T10:07:34Z

Federico8691
Mar 23, 2024

Hi, many many thanks for your kind and very detailed review of your personal experience. It is a great pity that Ircam until now has never presented a decent and well organized documentation on this project. I asked developers many times clues, posting here, contacting the via mail for additional insights about configuration strategies etc.. but they are almost not supportive and as we can see they do not even answer the most basic questions over here. It is sad but it is the plain reality. So your input is highly apreciated, many many thanks, and let’s hope that in a near future things will change a bit from Ircam side if they intend to support this project for real, providing more resources and content to Rave and its team.

3 replies

ballerburg9005 Mar 23, 2024
Author

Thanks.

Considering this is not a commercial project, I think they have been doing a really great job and the documentation is quite good by any normal standard.

However since training the models costs so much time and money, even the tiniest issues become super-dramatic to the user, which is quite a unique situation.

I hope that more information from the community can be incorporated into the documentation in the future.

Federico8691 Mar 23, 2024

I understand of course. You know, I am no developer or programmer. I use tools from Ircam since 1997, and they have been always very well documented. Even today, I made a presentation at the Forum this week using Dicy2, I use Mubu and CataRT all the time. They are free as well, but the support and documentation is very well done. I am used to program my tools with MaxMsp or Pure Data, nevertheless I am a composer and teacher of electroacoustic music composition. So for user like us the absence of support, and well structured documentation makes the tool unusable for practical application. Furthermore I invested a lot of money in a linux workstation in order to deal with Rave. After the main developer quit Ircam, the comunication with the team has been quite problematic. I made an open request for video tutorial for installing Rave and training the models explaining the model and features, like Ircam does for most of its research software and max packages, ( i did it two years ago), until today we never got anything. So your contribution is really welcomed and ‘till today is the best documented userguide we have. Thanks a lot!

jreus Mar 24, 2024

Maybe worth mentioning is that Axel Chemla Romeu Santos gave a presentation at the IRCAM forum this past week giving an update on RAVE. Two of the main things he reported as coming soon were an updated RAVE VST and detailed documentation videos.

Federico8691 · 2024-03-24T20:35:13Z

Federico8691
Mar 24, 2024

Finally, good news. Many many thanks for the update. all the best *Prof. Federico Placidi* CAC and Live electronics www.slmc.it +39 06 48 700 17 whatsapp +39 3661019692

…

On Sun, Mar 24, 2024 at 9:27 PM jchai.me ***@***.***> wrote: Maybe worth mentioning is that Axel Chemla Romeu Santos gave a presentation at the IRCAM forum this past week giving an update on RAVE. Two of the main things he reported as coming soon were an updated RAVE VST and detailed documentation videos. — Reply to this email directly, view it on GitHub <#300 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AXYTIRYTBNQ6XZ72LD5FTZLYZ4ZJTAVCNFSM6AAAAABEPJVIOGVHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM4DQOJUHE3TQ> . You are receiving this because you commented.Message ID: ***@***.***>

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

My personal insights after weeks of using Rave #300

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 3 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

My personal insights after weeks of using Rave #300

ballerburg9005 Mar 10, 2024

what to expect, quality of output

hardware and time requirements, cost and recommendations

running on Windows

general procedure

my command chain from start to finish

2 channels ffmpeg prepping

1 channel, left only:

run rave preprocess

the actual training

exporting the model

python script

compiling nn~

using PureData

Generating and using the prior

Rave-Latent-Diffusion

video converter

end

Replies: 2 comments · 3 replies

Federico8691 Mar 23, 2024

ballerburg9005 Mar 23, 2024 Author

Federico8691 Mar 23, 2024

jreus Mar 24, 2024

Federico8691 Mar 24, 2024

ballerburg9005
Mar 10, 2024

Replies: 2 comments 3 replies

Federico8691
Mar 23, 2024

ballerburg9005 Mar 23, 2024
Author

Federico8691
Mar 24, 2024