
suflaj t1_j0dt970 wrote

Not really. 950 is smaller than 1000, so not only are you destroying information, you are also potentially landing in a really bad local minimum.

When you add that intermediate layer, what you are essentially doing is randomly hashing your previous distribution. If that random hash kills the relations your model learned between the data, then of course it will not perform.

Now, because Xavier and Kaiming-He initializations aren't exactly designed to act as a universal random hash, they might not kill all your relations, but they are still random enough to have that potential, depending on the task and data. You might get lucky, but on average you almost never will.

If I were in your place, I would train with a linear warmup to a fairly large learning rate, say 10x higher than your previous maximum. This will make very bad weights shoot out of their bad minima once the LR reaches its peak, and hopefully you'll get better results once they settle down as the LR decays. Just make sure you clip your gradients so your weights don't go to NaN, because this is the equivalent of driving your car into a wall in the hope that the crash turns it into a Ferrari.
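For reference, a minimal sketch of that kind of schedule in PyTorch; the model, peak LR, step counts and batch are all hypothetical placeholders:

```python
import torch
from torch import nn

# Hypothetical stand-in for the network with the new bottleneck layer.
model = nn.Sequential(nn.Linear(1000, 950), nn.ReLU(), nn.Linear(950, 10))
# Base LR set to the new peak, ~10x the old maximum (placeholder value).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 1_000, 10_000

def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear warmup to the peak
    # linear decay back down after the peak
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    x, y = torch.randn(32, 1000), torch.randint(0, 10, (32,))  # dummy batch
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients so the large LR can't push the weights to NaN.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```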

As for how long you should train it... the best approach would be to add the layer without any nonlinearity and see how long you need to reach the original performance. Since there is no nonlinearity, the new network is just as expressive as the original. Once you have that number of epochs, add about 25% to it and train the version with the nonlinearity after the bottleneck for that long.
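A rough sketch of the two variants, assuming the hypothetical 1000 -> 950 bottleneck from above:

```python
from torch import nn

# Step 1: purely linear bottleneck, used only to find the epoch budget
# needed to recover the original performance.
linear_bottleneck = nn.Linear(1000, 950)

# Step 2: the version you actually keep, trained ~25% longer than step 1.
nonlinear_bottleneck = nn.Sequential(nn.Linear(1000, 950), nn.ReLU())
```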

5

suflaj t1_j0drngf wrote

It should, but how much is tough to say; it depends on the rest of the model and where this bottleneck is. If, say, you're doing this in the first layers, the whole model basically has to be retrained from scratch, and performance similar to the previous one is not guaranteed.

2

suflaj t1_j0cxviw wrote

The chassis is not the problem, the heat is.

Generally, anything above 2x 3090 will need to be underclocked or run in an open case to stay under 90°C.

I don't think a 4x 3090 rig is possible without water cooling, since even with riser cables and an open case the cards will be fairly close to one another. The cards will need to be heavily underclocked, you will need the best power supply on the market, and you would still risk shutdowns or even hardware failure if all 4 cards hit a transient spike at the same time. I would not risk it if you're building 2 rigs anyway; there is little benefit to a 4x + 2x configuration over a 3x + 3x configuration.

NVLink probably won't matter much, since your CPU will already be a bottleneck trying to feed 5.2 TB/s of data to your GPUs. But again, there are no benchmarks to show how much; maybe the gains from NVLink will be noticeable.

8

suflaj t1_iztjolh wrote

Then it's strange. Unless you're using a similarly sized student model, there is no reason why a no_grad teacher plus a student should be as resource intensive as a teacher with backprop.

As a rule of thumb, you should be using several times less memory. How much less are you using for the same batch size in your case?
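If you want a concrete number, something like this is enough to compare peak memory between the two setups; the model and batch below are hypothetical placeholders, and it assumes a CUDA GPU:

```python
import torch
from torch import nn

model = nn.Linear(4096, 4096).cuda()      # placeholder model
x = torch.randn(64, 4096, device="cuda")  # placeholder batch

torch.cuda.reset_peak_memory_stats()
model(x).sum().backward()                 # one forward/backward pass to measure
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```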

1

suflaj t1_izruvvi wrote

That makes no sense. Are you sure you're not doing backprop on the teacher model? It should be a lot less resource intensive.
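For comparison, here is a minimal sketch of keeping the teacher out of the backward pass; the models, temperature and data below are hypothetical placeholders:

```python
import torch
from torch import nn
import torch.nn.functional as F

teacher = nn.Linear(128, 10)   # placeholder teacher
student = nn.Linear(128, 10)   # placeholder student
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0                        # distillation temperature

teacher.eval()
x = torch.randn(32, 128)

with torch.no_grad():          # no graph (and no activations) kept for the teacher
    teacher_logits = teacher(x)

student_logits = student(x)    # the student keeps its graph as usual
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
loss.backward()
optimizer.step()
```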

Furthermore, check how you're distilling the model, i.e. which layers and with what weights. Generally, for transformer architectures, you distill the first (embedding) layer, the attention and hidden layers, and the final (prediction) layer. Distilling only the prediction layer works poorly.
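As a rough sketch of what that combination can look like (everything here is hypothetical: the output dictionaries, the equal student/teacher hidden sizes, and the uniform loss weights; in practice you usually need a projection for mismatched sizes, and an attention-map term can be added the same way):

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, T=2.0):
    # Embedding layer: MSE between the two embedding outputs.
    emb_loss = F.mse_loss(student_out["embeddings"], teacher_out["embeddings"])
    # Hidden layers: MSE per matched layer (assumes equal hidden sizes).
    hidden_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out["hidden_states"], teacher_out["hidden_states"])
    )
    # Prediction layer: temperature-scaled KL divergence on the logits.
    pred_loss = F.kl_div(
        F.log_softmax(student_out["logits"] / T, dim=-1),
        F.softmax(teacher_out["logits"] / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return emb_loss + hidden_loss + pred_loss
```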

2

suflaj t1_izorabe wrote

I don't think it's SWIN per se. I think the problem is the combination of the detectors (which take 5 feature maps at different levels of detail) being incompatible with the 4 transformer stages, which lack the spatial bias that convolutional networks provide, and the Tiny model simply being too small.

Other than that, pretraining (near-)SOTA models has been impractical for anyone other than big corporations for quite some time now. But you could always try asking your mentor for your uni's compute - my faculty offered GPUs ranging from 1080 Tis to A100s.

Although I don't see why you insist on pretraining SWIN - many SWIN models pretrained on ImageNet are already available, not only as part of MMCV but on Huggingface as well. So you just have to do the distillation part on some portion of the pretraining input distribution.
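For example, one ImageNet-pretrained checkpoint on the Huggingface Hub (the exact name is just one example, several sizes are available):

```python
from transformers import SwinModel

# Loads an ImageNet-pretrained Swin-Tiny backbone from the Huggingface Hub.
backbone = SwinModel.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
```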

3

suflaj t1_izoh23q wrote

As someone who tried finetuning SWIN as part of my graduate thesis, I will warn you not to expect good results from the Tiny version. No matter what detector I used, it performed worse than the ancient RetinaNet for some reason... Regression was near perfect, albeit with many duplicate detections, but classification was complete garbage, topping out at 0.45 mAP (whereas RetinaNet can get like 0.8 no problem).

So, take at least the small version.

5

suflaj t1_izihg01 wrote

This is only the code license for the open-source portion; since TensorRT as a whole is general, proprietary software, you also have to agree to its SDK license: https://docs.nvidia.com/deeplearning/tensorrt/sla/index.html

In there, ownership is phrased in such an ambiguous way that a company's legal team would probably never greenlight using it.

2

suflaj t1_izfw75x wrote

They do, but they use bigger registers, so ultimately, unless you can hand-optimize it to pack operations together, you will see no benefit from it. That would imply writing your own CUDA kernels, at the very least.

Furthermore, 8-bit is already often too small to be stable. Why go lower? If you want garbage outputs, you could always just fit the task on a smaller model. It's easier to cut the model size in half and use 8-bit, or cut it to a quarter and use 16-bit, than to make 4-bit or lower work.
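If you do want the smaller-model-plus-8-bit route with minimal effort, PyTorch's dynamic quantization is one starting point; the model below is a hypothetical placeholder, and this path targets CPU inference for Linear-style layers:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Stores nn.Linear weights in INT8 and quantizes activations on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```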

At this point in time, TensorRT seems to be the best you'll get for as little involvement as possible. Based on benchmarks, it also seems to outperform INT4 precision by a significant margin. The only drawback is its license, which implicitly prevents commercial use.

1

suflaj t1_iz11s02 wrote

Azure Databricks is Apache Spark-based, but it is made by Microsoft, which is obviously not Apache. Furthermore, Apache Spark itself is not comparable to Databricks, nor is it published under a copyleft license, so this again seems like product and ideology incompatibility rather than an objective reason.

2

suflaj t1_iz0vqid wrote

Except I am not comparing anything to communism, but summarizing Stallman's manifesto as an appeal to communism, which it is.

I asked why someone would consider it necessary for such software to be free because I thought the argument would be about some functionality that already exists as free software or something that was taken from free software.

Yet OP just copy-pasted an argument that doesn't hold outside of a utopian setting, from a person who no longer has a place in modern society due to his wrongdoings.

−6

suflaj t1_iyvwqn7 wrote

Reply to 4080 vs 3090 by simorgh12

The 4080 is slightly faster, but after the 4090, the 3090 is the best bang for the buck in DL: VRAM is invaluable, while raw performance generally is not.

12

suflaj t1_iyodcdi wrote

2x 4090 is the most money-efficient option if you have model parallelism for CV. For other tasks, or for vision transformers, it's probably a bad choice because of the low inter-GPU bandwidth.

The RTX A6000 will be better for deployment; if you're only planning on training your stuff, this is a non-factor. Note that it has similar or even lower bandwidth than a 4090, so there is little benefit besides power consumption, non-FP32 performance and a bigger chunk of VRAM.

So honestly it comes down to whether you want a local or a cloud setup. Personally, I'd go for 1x 4090 and spend the rest on cloud compute. If there is something you can't run on a single 4090, A100 compute will be both more money- and time-efficient.

3

suflaj t1_iym94jr wrote

> I probably should have specified that I'll do fine tuning, not training from scratch, if that makes any difference.

Unless you're freezing layers, it doesn't.

> I know it's a software feature, AFAIK pytorch supports it, right?

No. PyTorch supports Data Parallelism. To get pooling in the full sense, you need Model Parallelism, for which you'd have to write your own multi-GPU layers and a load-balancing heuristic.

Be that as it may, with PyTorch itself NVLink gets you less than 5% gains. That is obviously not worth it compared to the 30-90% gains from a 4090. You need stuff like Apex to see visible improvements, but those do not compare to generational leaps, nor do they parallelize the model for you (you still have to do it yourself). Apex's data parallelism is similar to PyTorch's anyway.

Once you parallelize your model, however, you're bound to be bottlenecked by bandwidth. This is the reason it's not done more often: it only makes sense if the model itself is very large, yet its gradients still fit in the pooled memory. NVLink provides only 300 GB/s of bandwidth in the best-case scenario, amounting to roughly 30% performance gains on bandwidth-bottlenecked tasks at best.
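To illustrate what that hand-rolled model parallelism looks like in its simplest form (this sketch assumes two GPUs, cuda:0 and cuda:1; the layer sizes are placeholders):

```python
import torch
from torch import nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: each half of the model lives on its own GPU."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # This device-to-device copy is exactly where PCIe/NVLink bandwidth
        # becomes the bottleneck.
        x = self.part2(x.to("cuda:1"))
        return x

model = TwoGPUModel()
out = model(torch.randn(8, 1024))
```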

1

suflaj t1_iym8sa5 wrote

NVLink itself does not pool memory. It just increases bandwidth. Memory pools are a software feature, partially made easier by NVLink.

> Could you elaborate?

Those models are trained with batch sizes that are too large to fit on any commercial GPU, meaning you will have to accumulate them either way.

1