
suflaj t1_iykea48 wrote

Doesn't matter. Softmax generalizes the sigmoid to multiple classes: a 2-output softmax is equivalent to a 1-output sigmoid. For binary classification you can therefore use either 1 output and a sigmoid, or 2 outputs and a softmax. The only difference is that with a sigmoid you resolve the result as

is_fraud = result > 0.5

while with softmax you'd do

is_fraud = argmax(result) == 1
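
For concreteness, a minimal sketch of both resolutions (assuming PyTorch; the logit values are made up):

    import torch

    logit = torch.tensor([0.7])              # 1-output head
    is_fraud = torch.sigmoid(logit) > 0.5    # sigmoid + threshold

    logits = torch.tensor([0.2, 0.9])        # 2-output head
    is_fraud = torch.softmax(logits, dim=0).argmax() == 1
    # softmax is monotonic, so logits.argmax() == 1 gives the same answer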

suflaj t1_iyh3csl wrote

Again, for the 1000th time, NVLink is not necessary for multi-GPU training.

You will not need 64 lanes for 4 GPUs because the 4090 doesn't have enough bandwidth to fill them. 32 PCIe 4.0 lanes or 16 PCIe 5.0 lanes will be enough. This still just about requires a Threadripper, since 4090s are PCIe 4.0 cards.
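
Rough arithmetic behind that claim (the per-lane figures are approximations I'm assuming: roughly 2 GB/s for PCIe 4.0 and 4 GB/s for PCIe 5.0):

    gpus = 4
    pcie4_lanes, pcie5_lanes = 32, 16
    print(pcie4_lanes // gpus * 2)  # ~16 GB/s per GPU at x8 PCIe 4.0
    print(pcie5_lanes // gpus * 4)  # ~16 GB/s per GPU at x4 PCIe 5.0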

Your bigger issue is cooling. To run 4 GPUs you will need to water-cool them with at least 2 radiators, and you will need an especially large case to fit them.

But even if you do get the cooling, there is no way in hell you will find a consumer power supply that can power those cards simultaneously, meaning you will need to spend several thousand dollars on an industrial-grade power supply for your server.

Overall it would be best to get a single or dual GPU setup and spend the rest of the money on A100 compute when you actually need it.


suflaj t1_ixlvsbg wrote

The solution with 2 power supplies refers to the 2-computer solution. For a single 4 kW power supply you'll need to go beyond consumer products, into industrial power supplies for high-performance servers or supercomputers. At that point you're no longer building a PC, and I don't know how you'd handle it, sorry. IMO it's just better to build 2 machines.


suflaj t1_ix4bbi0 wrote

Points 3 and 4 in your case are probably intertwined, and likely the reason you are stuck. You should keep the learning rate constant across all layers, and at most freeze some of them if you're dealing with a big distribution shift when finetuning.

You should use warmup and a low learning rate (what that entails depends on the domain, but since music data is similar to text, that means a maximum learning rate of 1e-6 to 1e-5), and increase the batch size if you get stuck as training progresses.

Without warmup, your network will not converge.

With a high learning rate, it will likely diverge on a nasty sample, or even have its gradients explode. In practice, even with gradient clipping, your network might run in circles, depending on your samples.
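
For illustration, a minimal PyTorch sketch of linear warmup plus gradient clipping (model, loader, and all the numbers are placeholders of mine, not something from the thread):

    import torch

    # model and loader stand in for your network and data pipeline
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # peak LR in the 1e-6..1e-5 range
    warmup_steps = 1000
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
    )

    for batch in loader:
        loss = model(batch)  # placeholder forward pass returning the loss
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()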

Lowering the learning rate when stuck tends to hurt generalization, but increasing the batch size (even if you slightly increase the learning rate while you're at it) seems to fix the problem; you just have to find the right numbers. I work on text, so whenever I doubled the batch size, I increased the learning rate by a factor of the square or cube root of 2 to keep the "learning pressure" the same. YMMV.
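
To make that scaling rule concrete (the base numbers are made up):

    base_lr, base_batch = 1e-5, 32
    new_batch = base_batch * 2
    new_lr = base_lr * 2 ** 0.5   # or 2 ** (1 / 3) for the cube-root variant
    print(new_batch, new_lr)      # 64, ~1.41e-05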

EDIT: And as other people said, make sure you have a large enough dataset. Transformers have almost no inductive biases, meaning that they have to learn them from data. Unless your augmentations are really good, I wouldn't recommend even attempting to train a transformer without at least 100k-1mil unique samples. For the size you're mentioning, the model would ideally want 1-10mil samples for finetuning and 1-10bil for pretraining.


suflaj t1_iwyuyyo wrote

If your goal is to work on those things, you should look into getting a PhD: you'll need to be at a fairly large company to even have a chance of working on them, and the competition is fierce, so you need papers and a good word put in for you to push through.

In a year at that pace I assume you can cover deep learning from its beginnings up to 2018 or 2019 (5 hours every day is around 1825 hours a year, which amounts to around 150 papers read thoroughly). Andrew Ng's course is OK, but it doesn't come close to being enough for your aspirations. You'll probably need one more year of reading papers and experimenting after that to reach the state of the art.


suflaj t1_ivumz3h wrote

It has not been marketed as such because it's built on top of ASR. Hence, you search for ASR and then look at its features. It's the same as with object detection: if you need segmentation, you check whether a given detector does segmentation. A layman looking for a solution does not search for specific terms, and marketers know this.

Be that as it may, the answer remains the same: Google offers the most advanced and performant solution. It markets it as ASR, or as they call it, speech-to-text, with this so-called diarization being one of its features.
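
If someone wants to try it, here's a minimal sketch against the Google Cloud Speech-to-Text Python client (the file name and speaker counts are placeholders; assumes valid credentials):

    from google.cloud import speech

    client = speech.SpeechClient()

    with open("call.wav", "rb") as f:  # placeholder input file
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        diarization_config=speech.SpeakerDiarizationConfig(
            enable_speaker_diarization=True,
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )

    response = client.recognize(config=config, audio=audio)
    # the last result carries per-word speaker tags
    for word in response.results[-1].alternatives[0].words:
        print(word.speaker_tag, word.word)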


suflaj t1_ivuam4k wrote

You mean automatic speech recognition? Yeah, there are models for that. Google probably has the best proprietary one, but from what I understand it is still a work in progress, despite e.g. Whisper releasing recently.
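
For reference, running Whisper locally is a few lines with the open-source openai-whisper package (the file name here is a placeholder):

    import whisper

    model = whisper.load_model("base")     # larger checkpoints trade speed for accuracy
    result = model.transcribe("audio.mp3")
    print(result["text"])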
