
suflaj t1_iykea48 wrote

Doesn't matter. Softmax generalizes the sigmoid to multiple classes: a 2-output softmax is equivalent to a 1-output sigmoid. For binary classification you can therefore use either 1 output and a sigmoid, or 2 outputs and a softmax. The only difference is that with a sigmoid you resolve the result as

is_fraud = result > 0.5

while with softmax you'd do

is_fraud = argmax(result) == 1
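
For concreteness, a minimal sketch of both resolutions (assuming PyTorch; the logit values are made up):

    import torch

    logit = torch.tensor([0.7])              # 1-output head
    is_fraud = torch.sigmoid(logit) > 0.5    # sigmoid + threshold

    logits = torch.tensor([0.2, 0.9])        # 2-output head
    is_fraud = torch.softmax(logits, dim=0).argmax() == 1
    # softmax is monotonic, so logits.argmax() == 1 gives the same answer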

suflaj t1_iyh3csl wrote

Again, for the 1000th time, NVLink is not necessary for multi-GPU training.

You will not need 64 lanes for 4 GPUs because the 4090 doesn't have enough bandwidth to fill them. 32 PCIe 4.0 lanes or 16 PCIe 5.0 lanes will be enough. This still just about requires a Threadripper, since 4090s are PCIe 4.0 cards.
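
Rough arithmetic behind that claim (the per-lane figures are approximations I'm assuming: roughly 2 GB/s for PCIe 4.0 and 4 GB/s for PCIe 5.0):

    gpus = 4
    pcie4_lanes, pcie5_lanes = 32, 16
    print(pcie4_lanes // gpus * 2)  # ~16 GB/s per GPU at x8 PCIe 4.0
    print(pcie5_lanes // gpus * 4)  # ~16 GB/s per GPU at x4 PCIe 5.0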

Your bigger issue is cooling. To run 4 GPUs you will need to water-cool them with at least 2 radiators, and you will need an especially large case to fit them.

But even if you do get the cooling, there is no way in hell you will find a consumer power supply that can power those cards simultaneously, meaning you will need to spend several thousand dollars on an industrial-grade power supply for your server.

Overall it would be best to get a single or dual GPU setup and spend the rest of the money on A100 compute when you actually need it.


suflaj t1_ixlvsbg wrote

The solution with 2 power supplies refers to the 2-computer solution. For a single 4 kW power supply you'll need to go beyond consumer products, into industrial power supplies for high-performance servers or supercomputers. At that point you're no longer building a PC, and I don't know how you'd handle it, sorry. IMO it's just better to build 2 machines.


suflaj t1_ix4bbi0 wrote

Points 3 and 4 in your case are probably intertwined, and likely the reason you are stuck. You should keep the learning rate constant across all layers, and at most freeze some of them if you're dealing with a big distribution shift when finetuning.

You should use warmup and a low learning rate (what that entails depends on the domain, but since music data is similar to text, that means a maximum learning rate of 1e-6 to 1e-5), and increase the batch size if you get stuck as training progresses.

Without warmup, your network will not converge.

With a high learning rate, it will likely diverge on a nasty sample, or even have its gradients explode. In practice, even with gradient clipping, your network might run in circles, depending on your samples.
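
For illustration, a minimal PyTorch sketch of linear warmup plus gradient clipping (model, loader, and all the numbers are placeholders of mine, not something from the thread):

    import torch

    # model and loader stand in for your network and data pipeline
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # peak LR in the 1e-6..1e-5 range
    warmup_steps = 1000
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
    )

    for batch in loader:
        loss = model(batch)  # placeholder forward pass returning the loss
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()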

Lowering the learning rate when stuck tends to hurt generalization, but increasing the batch size (even if you slightly increase the learning rate while you're at it) seems to fix the problem; you just have to find the right numbers. I work on text, so whenever I doubled the batch size, I increased the learning rate by a factor of the square or cube root of 2 to keep the "learning pressure" the same. YMMV.
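
To make that scaling rule concrete (the base numbers are made up):

    base_lr, base_batch = 1e-5, 32
    new_batch = base_batch * 2
    new_lr = base_lr * 2 ** 0.5   # or 2 ** (1 / 3) for the cube-root variant
    print(new_batch, new_lr)      # 64, ~1.41e-05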

EDIT: And as other people said, make sure you have a large enough dataset. Transformers have almost no inductive biases, meaning that they have to learn them from data. Unless your augmentations are really good, I wouldn't recommend even attempting to train a transformer without at least 100k-1mil unique samples. For the size you're mentioning, the model would ideally want 1-10mil samples for finetuning and 1-10bil for pretraining.


suflaj t1_iwyuyyo wrote

If your goal is to work on those things, you should look into getting a PhD: you'll need to be at a fairly large company to even have a chance of working on them, and the competition is fierce, so you need papers and a good word put in for you to push through.

In a year at that pace I assume you can cover deep learning from its beginnings up to 2018 or 2019 (5 hours every day is around 1825 hours a year, which amounts to around 150 papers read thoroughly). Andrew Ng's course is OK, but it doesn't come close to being enough for your aspirations. You'll probably need one more year of reading papers and experimenting after that to reach the state of the art.


suflaj t1_ivumz3h wrote

It has not been marketed as such because it's built on top of ASR. Hence, you search for ASR and then look at its features. It's the same as with object detection: if you need segmentation, you check whether a given detector does segmentation. A layman looking for a solution does not search for specific terms, and marketers know this.

Be that as it may, the answer remains the same: Google offers the most advanced and performant solution. It markets it as ASR, or as they call it, speech-to-text, with this so-called diarization being one of its features.
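
If someone wants to try it, here's a minimal sketch against the Google Cloud Speech-to-Text Python client (the file name and speaker counts are placeholders; assumes valid credentials):

    from google.cloud import speech

    client = speech.SpeechClient()

    with open("call.wav", "rb") as f:  # placeholder input file
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        diarization_config=speech.SpeakerDiarizationConfig(
            enable_speaker_diarization=True,
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )

    response = client.recognize(config=config, audio=audio)
    # the last result carries per-word speaker tags
    for word in response.results[-1].alternatives[0].words:
        print(word.speaker_tag, word.word)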


suflaj t1_ivuam4k wrote

You mean automatic speech recognition? Yeah, there are models for that. Google probably has the best proprietary one, but from what I understand it is still a work in progress, despite e.g. Whisper releasing recently.
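
For reference, running Whisper locally is a few lines with the open-source openai-whisper package (the file name here is a placeholder):

    import whisper

    model = whisper.load_model("base")     # larger checkpoints trade speed for accuracy
    result = model.transcribe("audio.mp3")
    print(result["text"])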
