TalkNet usage

The video tutorial is here in case you don't want to read this. Go ahead and watch it; otherwise, keep reading.

Okay, recently I found out about a new vocoder that was sent through Uberduck's Discord server, but some people are asking:

how?

Well, in this tutorial I'll show you how to train it and use it with a TalkNet model.

The reason I said "train" is that it will not work with Tacotron 2 models (sorry folks 😞).

Before we get started.

Alright, let's start with what we've learned so far. First you'll need to gather a dataset, and if you don't have one, what are you doing here!

And if you've finished collecting speech clips but haven't compressed them into a dataset yet, check here:

Alright, now that everything is out of the way, it's time for training!

Training time!

Alright, we need a Colab notebook for this, so I'll link it here:

Colab Link

And make sure you copy it to your Drive.

Alright, let's go through it one by one; there are a few steps to learn about training. There are also instructions in some of the cells, but I'll explain them briefly.

Step 1: Checking GPU

Yes, you've probably done this in the training section. You need to check which GPU your VM has been assigned, so remember:

  • P100

  • V100

  • and T4

are the only GPUs you should train on (if you're lucky). If you get a K80 or P4 instead, do a factory reset, like so: click Runtime, then Factory reset runtime.
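The notebook's cell does this check for you, but as a rough sketch in Python, it boils down to looking for one of the good GPU names in the nvidia-smi output (the example output string below is made up for illustration):

```python
import subprocess

# GPUs worth training on; K80 and P4 are too slow or run out of memory.
GOOD_GPUS = ("P100", "V100", "T4")

def gpu_is_good(smi_output: str) -> bool:
    """Return True if the nvidia-smi output mentions an acceptable GPU."""
    return any(name in smi_output for name in GOOD_GPUS)

# In Colab you would feed it the real output:
# smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
smi = "0  Tesla T4  Off"  # example line, not real nvidia-smi output
print(gpu_is_good(smi))  # True
```

If this prints False, that's when you go for the factory reset.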

Step 2: Logging in to your Google Drive

This requires access to your Google account. Don't worry, this Colab won't harm your Drive; it only saves things like checkpoints. It also needs to load a few files from there.

So go ahead and click play, log in to your Google account, copy the code, paste it in, and you're good to go:

Step 3: Importing dataset and configuring training data paths

Here's what you want to do:

Upload your dataset and text file to your Drive, and make sure the paths in the notebook match where they are on your Drive.

(Don't put your text file inside the zip; upload the zip and the text file separately to your Drive, or it won't work.)

"Dataset" means your zip/tar file, just like we made in the dataset compressing section. Again, if you haven't done that, go back to dataset compressing here.

train_filelist and val_filelist are your text files with the transcripts in them.

output_dir is the output folder for your character.
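For reference, those filelists are plain text with one clip per line: a relative wav path, a pipe character, then the transcript. The exact path prefix can vary with your setup, and the lines below are made-up examples, but parsing one is as simple as:

```python
# Sketch of a TalkNet-style filelist: "relative/path.wav|transcript".
# Filenames and transcripts here are invented for illustration.
lines = [
    "wavs/0.wav|Hello there, this is the first clip.",
    "wavs/1.wav|And this is the second one.",
]

def parse_filelist_line(line: str):
    """Split a filelist line into (wav path, transcript)."""
    path, transcript = line.split("|", 1)
    return path, transcript

for line in lines:
    path, text = parse_filelist_line(line)
    print(path, "->", text)
```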

Step 4: Downloading NVIDIA NeMo

This simply downloads NVIDIA NeMo. Ignore any red text saying something like a version is incompatible (I don't remember the exact wording, sadly).

Step 5: Dataset processing, part 1

Alright, now that that's done, it's time to let the notebook check everything in your dataset. If you get an error, make sure you check:

  • Your Drive locations

  • The transcripts in your text file

  • That your dataset is a zip file with a wavs folder inside, like so: dataset.zip --> wavs

Other than that, let's move on.
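If you want to sanity-check that zip layout yourself before uploading, here's a quick stdlib sketch; it builds a tiny in-memory example archive rather than touching your real dataset:

```python
import io
import zipfile

def has_wavs_folder(zip_bytes: bytes) -> bool:
    """Check that the archive contains a wavs/ folder, as the notebook expects."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return any(name.startswith("wavs/") for name in zf.namelist())

# Build a tiny in-memory example archive: dataset.zip --> wavs/0.wav
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("wavs/0.wav", b"RIFF")  # placeholder bytes, not real audio
print(has_wavs_folder(buf.getvalue()))  # True
```

For your real file, read its bytes from disk and pass them in the same way.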

Step 6: Dataset processing, part 2

Not really a training step, but it checks everything again, namely whether your wavs are mono and 22 kHz.
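That kind of check can be sketched with Python's stdlib wave module. This writes a dummy one-second silent clip in the expected format (mono, 16-bit, 22050 Hz) and then verifies it; the filename is just an example:

```python
import wave

def check_wav(path: str) -> bool:
    """Return True if the wav is mono, 16-bit, 22050 Hz."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getsampwidth() == 2
                and w.getframerate() == 22050)

# Write a short silent clip in the expected format to demonstrate.
with wave.open("clip.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples (2 bytes)
    w.setframerate(22050)  # 22.05 kHz
    w.writeframes(b"\x00\x00" * 22050)  # one second of silence

print(check_wav("clip.wav"))  # True
```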

Step 7: Train Duration Predictor

This will stop automatically, so you don't have to worry about it. Keep the batch size as it is (unless you have Colab Pro, in which case you can increase it if you want).

If you run into a CUDA out-of-memory error, reset the runtime from the Runtime menu, run step 3 again, and come back to step 7.

Step 8: Train Pitch Predictor

This also stops automatically. Again, same batch size rules, and if you hit a CUDA out-of-memory error, reset the runtime from the Runtime menu, run step 3 again, and come back to step 8.

Step 9: Train Spectrogram Generator

Alright, now this is where the real sh*t starts.

Here the only things we care about are epochs and batch_size.

You probably remember the advanced training section; if you know how to tune epochs, go ahead, and if not, keep it at 200.

Keep the batch size too, unless you have Colab Pro, in which case you can increase it.

But oh no, partway through step 9 we ran out of memory. What do you do? Correct:

Runtime --> Reset runtime (and not factory reset)

Then run step 3 again and go back to step 9.

Now click play and you should see something like this:

Alright, you don't have to worry about this one either; it stops automatically at epoch 200. If I remember correctly, val_loss should end up somewhere around 0.1 or 0.2.

Step 10: Generate GTA spectrograms

As it says here in colab page: "This will help HiFi-GAN learn what your TalkNet model sounds like."

That's pretty much it for this one. Just click play and we're moving on to the final step.

Step 11: Train HiFi-GAN

This is where you have to do it MANUALLY.

Now, calm down. It's really not that hard.

What we need to pay attention to is the step count. Go ahead, click play, and watch this:

Steps are the only thing you should be paying attention to. So, when do you stop?

You should stop somewhere around 2000 or 3000 steps (if you have the time, of course).

I know HiFi-GAN training is slow, but patience is the key.

Once you have 2k or more steps, you can stop training by clicking stop.

And... we are done.

Step 12: Package the models

Now we're on the final step: just name the character and it should create a zip file for it. By the way, clear out the trash on your Drive too. If you trained correctly, the notebook created checkpoints and other files, and those eat up your Drive's disk space, so every time you finish training, empty the trash.

Training done.

If you've done everything correctly, then congratulations, you've made your first TalkNet model. Now let's move on to the synthesize page.

Synthesize page

There should be a link to the synthesize notebook in the training Colab, but I'll link it here as well.

Colab page

Remember, copy it to your Drive.

If you haven't trained a TalkNet model yet, go back to the Training time section.

Alright, first things first.

Step 1, of course, is the GPU check. Again, we've been through this before, but if you get a bad GPU, go for a factory reset.

Step 2 downloads NVIDIA NeMo; quite obvious.

And step 3 is the main part; it loads a page. You'll see this main page here:

You can experiment with some of the pony models (like Twilight and Pinkie Pie); they both have very good, much cleaner datasets. But we came here for a custom model, so go for Custom model, and here's one thing you should do.

Go to your TalkNet folder on Drive, then your character's name, and you should see something like this:

The zip file is the only one you need, so right-click it and select Get link.

Then change Restricted to Anyone with the link, or it will not work.

Now copy the following part:

https://drive.google.com/file/d/*id*/view?usp=sharing

The *id* part is what you should copy and paste. So go ahead and do that, and we're on to the next step.
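If you'd rather not pick the id out of the link by hand, it can be pulled out with a quick regex sketch; the share link below is a made-up example, not a real file:

```python
import re

def drive_file_id(share_link: str) -> str:
    """Extract the file id from a drive.google.com/file/d/<id>/view link."""
    m = re.search(r"/file/d/([^/]+)/", share_link)
    if not m:
        raise ValueError("not a recognizable Drive file link")
    return m.group(1)

# Hypothetical example link, not a real file:
link = "https://drive.google.com/file/d/1AbCdEfGhIj/view?usp=sharing"
print(drive_file_id(link))  # 1AbCdEfGhIj
```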

Loading audio reference

Now, as far as I know, the audio reference requires the SAME FORMAT AS A TACOTRON AUDIO FILE: 22050 Hz sample rate, mono, 16-bit wav. If you didn't do the conversion, don't worry; it will convert automatically, since audio is loaded through ffmpeg.

You can do this in Audacity, but if you haven't used or learned Audacity yet, I recommend you check it out here:

Also, the audio should be at most 15 seconds. If you use 20 seconds, it sometimes won't work, or will only partially work.
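If your clip runs long, trimming it down to the 15-second cap can be done with the stdlib too. A rough sketch (the filenames are just examples; it builds a 20-second silent demo clip first):

```python
import wave

MAX_SECONDS = 15

def trim_wav(src: str, dst: str, max_seconds: int = MAX_SECONDS) -> None:
    """Copy src to dst, keeping at most max_seconds of audio."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        # Read no more frames than max_seconds worth of audio.
        frames = r.readframes(min(r.getnframes(),
                                  r.getframerate() * max_seconds))
    with wave.open(dst, "wb") as w:
        w.setparams(params)  # nframes is rewritten on close
        w.writeframes(frames)

# Demo: make a 20-second silent mono 16-bit 22050 Hz clip, then trim it.
with wave.open("long.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 22050 * 20)

trim_wav("long.wav", "short.wav")
with wave.open("short.wav", "rb") as w:
    print(w.getnframes() / w.getframerate())  # 15.0
```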

One more thing: your audio shouldn't be noisy or echoey (no reverb or delay, for example). You can clean it up with RX 7/8 if you have it:

Once you're done and have saved it as a wav file, go ahead and upload it to Google Colab (not Google Drive).

Once it's uploaded, click Update file list and your wav file should appear.

Once that's done, select it and type something in the output box.

If you have a spoken audio reference, type the same words the character spoke, or just copy from subtitles.

If you have a singing audio reference, type out what it sings yourself, or copy the lyrics.

Once that's done, click Synthesize and there you go (I can't embed example audio because GitBook doesn't let you play audio, smh).

If your character comes out too low-pitched, too high-pitched, or just off-pitch, here's what you want to do.

Check Set Pitch Multiplier and type the number it needs: 1 for low pitch and 2 for high pitch. You can also go for something like 1.5 or 2.3, but don't go too far, like 5 or 10, and don't use 0 either.

You can also turn off singing mode; it will then capture the timing but not the pitch itself.

And you're done.

CONGRATULATIONS!

Congratulations, you've made your very first TalkNet model and synthesized with it!

If you want to make more, sure, go for it; we won't force you to stop. Other than that:

Thanks for reading this guide.
