
suflaj t1_j0dt970 wrote

Not really. 950 is smaller than 1000, so not only are you destroying information, you are also potentially landing in a really bad local minimum.

When you add that intermediate layer, what you are essentially doing is randomly hashing your previous distribution. If that random hash kills the relations your model learned between the data, then of course it will not perform.

Now, because Xavier and Kaiming-He initializations aren't exactly designed to act as a universal random hash, they might not kill all your relations, but they are still random enough to have that potential, depending on the task and data. You might get lucky, but on average you almost never will.

If I were in your place, I would train with a linear warmup to a fairly large learning rate, say 10x higher than your previous maximum. This will make very bad weights shoot out of their bad minima once the LR reaches its peak, and hopefully you'll get better results once they settle down as the LR decays. Just make sure you clip your gradients so your weights don't go to NaN, because this is the equivalent of driving your car into a wall in the hope that the crash turns it into a Ferrari.
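For reference, a minimal sketch of that kind of schedule in PyTorch; the model, peak LR, step counts and batch are all hypothetical placeholders:

```python
import torch
from torch import nn

# Hypothetical stand-in for the network with the new bottleneck layer.
model = nn.Sequential(nn.Linear(1000, 950), nn.ReLU(), nn.Linear(950, 10))
# Base LR set to the new peak, ~10x the old maximum (placeholder value).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 1_000, 10_000

def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear warmup to the peak
    # linear decay back down after the peak
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    x, y = torch.randn(32, 1000), torch.randint(0, 10, (32,))  # dummy batch
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients so the large LR can't push the weights to NaN.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```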

As for how long you should train it... the best approach would be to add the layer without any nonlinearity and see how long you need to reach the original performance. Since there is no nonlinearity, the new network is just as expressive as the original. Once you have that number of epochs, add about 25% to it and train the version with the nonlinearity after the bottleneck for that long.
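A rough sketch of the two variants, assuming the hypothetical 1000 -> 950 bottleneck from above:

```python
from torch import nn

# Step 1: purely linear bottleneck, used only to find the epoch budget
# needed to recover the original performance.
linear_bottleneck = nn.Linear(1000, 950)

# Step 2: the version you actually keep, trained ~25% longer than step 1.
nonlinear_bottleneck = nn.Sequential(nn.Linear(1000, 950), nn.ReLU())
```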

5

suflaj t1_j0drngf wrote

It should, but how much is tough to say; it depends on the rest of the model and where this bottleneck is. If, say, you're doing this in the first layers, the whole model basically has to be retrained from scratch, and performance similar to the previous one is not guaranteed.

2

suflaj t1_j0cxviw wrote

The chassis is not the problem, the heat is.

Generally, anything above 2x 3090 will need to be underclocked or run in an open case to stay under 90°C.

I don't think a 4x 3090 rig is possible without water cooling, since even with riser cables and an open case the cards will be fairly close to one another. The cards will need to be heavily underclocked, you will need the best power supply on the market, and you would still risk shutdowns or even hardware failure if all 4 cards hit a transient spike at the same time. I would not risk it if you're building 2 rigs anyway; there is little benefit to a 4x + 2x configuration over a 3x + 3x configuration.

NVLink probably won't matter much, since your CPU will already be a bottleneck trying to feed 5.2 TB/s of data to your GPUs. But again, there are no benchmarks to show how much; maybe the gains from NVLink will be noticeable.

8

suflaj t1_iztjolh wrote

Then it's strange. Unless you're using a similarly sized student model, there is no reason why a no_grad teacher plus a student should be as resource intensive as a teacher with backprop.

As a rule of thumb, you should be using several times less memory. How much less are you using for the same batch size in your case?
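If you want a concrete number, something like this is enough to compare peak memory between the two setups; the model and batch below are hypothetical placeholders, and it assumes a CUDA GPU:

```python
import torch
from torch import nn

model = nn.Linear(4096, 4096).cuda()      # placeholder model
x = torch.randn(64, 4096, device="cuda")  # placeholder batch

torch.cuda.reset_peak_memory_stats()
model(x).sum().backward()                 # one forward/backward pass to measure
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```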

1

suflaj t1_izruvvi wrote

That makes no sense. Are you sure you're not doing backprop on the teacher model? It should be a lot less resource intensive.
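For comparison, here is a minimal sketch of keeping the teacher out of the backward pass; the models, temperature and data below are hypothetical placeholders:

```python
import torch
from torch import nn
import torch.nn.functional as F

teacher = nn.Linear(128, 10)   # placeholder teacher
student = nn.Linear(128, 10)   # placeholder student
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0                        # distillation temperature

teacher.eval()
x = torch.randn(32, 128)

with torch.no_grad():          # no graph (and no activations) kept for the teacher
    teacher_logits = teacher(x)

student_logits = student(x)    # the student keeps its graph as usual
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
loss.backward()
optimizer.step()
```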

Furthermore, check how you're distilling the model, i.e. which layers and with what weights. Generally, for transformer architectures, you distill the first (embedding) layer, the attention and hidden layers, and the final (prediction) layer. Distilling only the prediction layer works poorly.
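As a rough sketch of what that combination can look like (everything here is hypothetical: the output dictionaries, the equal student/teacher hidden sizes, and the uniform loss weights; in practice you usually need a projection for mismatched sizes, and an attention-map term can be added the same way):

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, T=2.0):
    # Embedding layer: MSE between the two embedding outputs.
    emb_loss = F.mse_loss(student_out["embeddings"], teacher_out["embeddings"])
    # Hidden layers: MSE per matched layer (assumes equal hidden sizes).
    hidden_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out["hidden_states"], teacher_out["hidden_states"])
    )
    # Prediction layer: temperature-scaled KL divergence on the logits.
    pred_loss = F.kl_div(
        F.log_softmax(student_out["logits"] / T, dim=-1),
        F.softmax(teacher_out["logits"] / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return emb_loss + hidden_loss + pred_loss
```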

2

suflaj t1_izorabe wrote

I don't think it's SWIN per se. I think the problem is the combination of the detectors (which take 5 feature maps at different levels of detail) being incompatible with the 4 transformer stages, which lack the spatial bias that convolutional networks provide, and the Tiny model simply being too small.

Other than that, pretraining (near-)SOTA models has been impractical for anyone other than big corporations for quite some time now. But you could always try asking your mentor for your uni's compute - my faculty offered GPUs ranging from 1080 Tis to A100s.

Although I don't see why you insist on pretraining SWIN - many SWIN models pretrained on ImageNet are already available, not only as part of MMCV but on Huggingface as well. So you just have to do the distillation part on some portion of the pretraining input distribution.
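For example, one ImageNet-pretrained checkpoint on the Huggingface Hub (the exact name is just one example, several sizes are available):

```python
from transformers import SwinModel

# Loads an ImageNet-pretrained Swin-Tiny backbone from the Huggingface Hub.
backbone = SwinModel.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
```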

3

suflaj t1_izoh23q wrote

As someone who tried finetuning SWIN as part of my graduate thesis, I will warn you not to expect good results from the Tiny version. No matter what detector I used, it performed worse than the ancient RetinaNet for some reason... Regression was near perfect, albeit with many duplicate detections, but classification was complete garbage, topping out at 0.45 mAP (whereas RetinaNet can get like 0.8 no problem).

So, take at least the small version.

5

suflaj t1_izihg01 wrote

This is only the code license for the open-source portion; since TensorRT as a whole is general, proprietary software, you also have to agree to its SDK license: https://docs.nvidia.com/deeplearning/tensorrt/sla/index.html

In there, ownership is phrased in such an ambiguous way that a company's legal team would probably never greenlight using it.

2

suflaj t1_izfw75x wrote

They do, but they use bigger registers, so ultimately, unless you can hand-optimize it to pack operations together, you will see no benefit from it. That would imply writing your own CUDA kernels, at the very least.

Furthermore, 8-bit is already often too small to be stable. Why go lower? If you want garbage outputs, you could always just fit the task on a smaller model. It's easier to cut the model size in half and use 8-bit, or cut it to a quarter and use 16-bit, than to make 4-bit or lower work.
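If you do want the smaller-model-plus-8-bit route with minimal effort, PyTorch's dynamic quantization is one starting point; the model below is a hypothetical placeholder, and this path targets CPU inference for Linear-style layers:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Stores nn.Linear weights in INT8 and quantizes activations on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```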

At this point in time, TensorRT seems to be the best you'll get for as little involvement as possible. Based on benchmarks, it also seems to outperform INT4 precision by a significant margin. The only drawback is its license, which implicitly prevents commercial use.

1

suflaj t1_iz11s02 wrote

Azure Databricks is Apache Spark-based, but it is made by Microsoft, which is obviously not Apache. Furthermore, Apache Spark itself is not comparable to Databricks, nor is it published under a copyleft license, so this again seems like product and ideology incompatibility rather than an objective reason.

2

suflaj t1_iz0vqid wrote

Except I am not comparing anything to communism, but summarizing Stallman's manifesto as an appeal to communism, which it is.

I asked why someone would consider it necessary for such software to be free because I thought the argument would be about some functionality that already exists as free software or something that was taken from free software.

Yet OP just copy-pasted an argument that doesn't hold outside of a utopian setting, from a person who no longer has a place in modern society due to his wrongdoings.

−6

suflaj t1_iyvwqn7 wrote

Reply to 4080 vs 3090 by simorgh12

The 4080 is slightly faster, but after the 4090, the 3090 is the best bang for the buck in DL: VRAM is invaluable, while raw performance generally is not.

12

suflaj t1_iyodcdi wrote

2x 4090 is the most money-efficient option if you have model parallelism for CV. For other tasks, or for vision transformers, it's probably a bad choice because of the low inter-GPU bandwidth.

The RTX A6000 will be better for deployment; if you're only planning on training your stuff, this is a non-factor. Note that it has similar or even lower bandwidth than a 4090, so there is little benefit besides power consumption, non-FP32 performance and a bigger chunk of VRAM.

So honestly it comes down to whether you want a local or a cloud setup. Personally, I'd go for 1x 4090 and spend the rest on cloud compute. If there is something you can't run on a single 4090, A100 compute will be both more money- and time-efficient.

3

suflaj t1_iym94jr wrote

> I probably should have specified that I'll do fine tuning, not training from scratch, if that makes any difference.

Unless you're freezing layers, it doesn't.

> I know it's a software feature, AFAIK pytorch supports it, right?

No. PyTorch supports Data Parallelism. To get pooling in the full sense, you need Model Parallelism, for which you'd have to write your own multi-GPU layers and a load-balancing heuristic.

Be that as it may, with PyTorch itself NVLink gets you less than 5% gains. That is obviously not worth it compared to the 30-90% gains from a 4090. You need stuff like Apex to see visible improvements, but those do not compare to generational leaps, nor do they parallelize the model for you (you still have to do it yourself). Apex's data parallelism is similar to PyTorch's anyway.

Once you parallelize your model, however, you're bound to be bottlenecked by bandwidth. This is the reason it's not done more often: it only makes sense if the model itself is very large, yet its gradients still fit in the pooled memory. NVLink provides only 300 GB/s of bandwidth in the best-case scenario, amounting to roughly 30% performance gains on bandwidth-bottlenecked tasks at best.
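To illustrate what that hand-rolled model parallelism looks like in its simplest form (this sketch assumes two GPUs, cuda:0 and cuda:1; the layer sizes are placeholders):

```python
import torch
from torch import nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: each half of the model lives on its own GPU."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # This device-to-device copy is exactly where PCIe/NVLink bandwidth
        # becomes the bottleneck.
        x = self.part2(x.to("cuda:1"))
        return x

model = TwoGPUModel()
out = model(torch.randn(8, 1024))
```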

1

suflaj t1_iym8sa5 wrote

NVLink itself does not pool memory. It just increases bandwidth. Memory pools are a software feature, partially made easier by NVLink.

> Could you elaborate?

Those models are trained with batch sizes that are too large to fit on any commercial GPU, meaning you will have to accumulate them either way.

1