suflaj
suflaj t1_isbgg32 wrote
Probably because the startup overhead dominates the processing time. A 500-weight network is not really something you can apply to real life: modern neural networks have 100+ million parameters even on consumer hardware, and they are not trained on datasets that are considered solved.
suflaj t1_is9njd4 wrote
It's not legal. You can look for videos that are in the public domain or under permissive licenses, but I really doubt you're going to find 20-25 hours of the same "style" unless it's just real-life footage.
You can always take a camera, go outside, and record those videos yourself.
suflaj t1_is363fo wrote
Reply to comment by farmingvillein in [D] Wide Attention Is The Way Forward For Transformers by SuchOccasion457
I didn't mean that it is useless. I find it funny that someone would actually say that instead of "they perform roughly the same", especially since they do not show that the difference is statistically significant; we have seen your average BERT gain much more performance just by rerolling on a different seed.
suflaj t1_is327qu wrote
Comparing the title and that quote, I had to check that I'm not on a satire sub.
suflaj t1_irmvetb wrote
Reply to comment by zero_one_memrisor in 1080 vs 2060 for deeplearning by ccppoo0
There is no 16GB 3070, only 8 GB. The 16 GB one was a rumor.
suflaj t1_irmuzmw wrote
Reply to 1080 vs 2060 for deeplearning by ccppoo0
2060 if it's the 8 GB version, otherwise 1080. But obviously the best option would be to not waste your money and just buy cloud compute. With these cards going for $200-300, that is roughly 200-300 hours of A100 training, which is much faster and lets you train much larger models.
EDIT: I see you've gotten a $110 offer for a 1080; I'd say go with that. You'll be severely limited in what you can run, but after you learn your lesson you can still sell it for $100, or even more.
suflaj t1_irjffgj wrote
There are several reasons.
One is catastrophic forgetting. You can't just hope your model will always remember what it initially knew. Online training for GPT would imply relearning what it has already learned: it has to constantly rehearse at least the gist of its previous training data, because new data often changes old insights. Otherwise it is just finetuning, and in practice finetuning can hurt the model's general knowledge.
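To make the rehearsal point concrete, here is a minimal sketch of naive replay during finetuning (PyTorch; `new_loader` and `replay_loader` are hypothetical DataLoaders, and this illustrates the general idea rather than GPT's actual training recipe):

```python
import torch

def finetune_with_rehearsal(model, optimizer, loss_fn, new_loader, replay_loader):
    # Each update mixes a batch of new data with a batch replayed from the
    # original training data, so previously learned knowledge keeps being rehearsed.
    # Naive rehearsal sketch -- not how GPT is actually trained.
    model.train()
    for (new_x, new_y), (old_x, old_y) in zip(new_loader, replay_loader):
        x = torch.cat([new_x, old_x])
        y = torch.cat([new_y, old_y])
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```

Skipping the replay half is exactly what turns this into plain finetuning, with the forgetting risk described above.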
Another reason might be that the new data might not be useful. You have to understand that models as big as GPT-3 do not even go through their whole training set, just a small part of it, and they still generally have strong performance.
And finally, even if the new data were useful, there is no guarantee the model can make use of it at a given checkpoint (or from the start, even). The model might be too small, its architecture and the task it is trained on might be inadequate for the data, etc. Now, GPTs aren't too small, and they have an architecture very well suited to learning, but we also do not know to what extent we are utilizing their processing capabilities and memory. There isn't exactly a theoretical proof that we can do a lot more with them, let alone a procedure for how to do it.
So to conclude, the reason is simply that it has not been proven to be worth it. Small experiments prove almost nothing, and larger experiments would require resources, and therefore some promise of a payoff to acquire them (in a corporate setting).
suflaj t1_irj50vf wrote
Reply to comment by aWildTinoAppears in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
> Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.
Great. Now notice we are speaking of theory. In deep learning practice, trial and error is usually better than formally analyzing or optimizing something.
> They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.
Great. One thing to note: you are claiming that early stopping is good enough. I am claiming that, because of double descent and our incomplete understanding of it, you cannot make such claims. Those are just guesses, and not even well-informed ones.
To make such claims, the prerequisite would be to first prove (beyond a reasonable doubt) that your model does not exhibit overparametrization side effects. This would mean that, instead of early stopping, you run it for way longer than you intend to. THEN you can do these checkpointing optimizations, if it turns out you don't have to worry about it.
But usually it is enough to get it working well enough instead of formally optimizing the hyperparameters, because whatever optimization you do cannot account for unseen data. My point is not that this is better; it's that whatever you do, you are guessing, and you might as well take cheaper guesses if you're not interested in it being very robust.
> Moving goal posts again, also dd is from eoy 2019.
What do you mean moving goal posts again? 3 eternities refers to 6 years ago, i.e. 2016. That is the last time models were small enough for double descent to be basically undetectable, since Attention Is All You Need was released in June 2017 and had been worked on for quite some time before that. Double descent was formally described in 2019, yes. But the phenomenon it describes happened way before, and in my experience transformers were the first to exhibit it in pretraining. Maybe it was even more than 3+ eternities ago that we had models that experienced double descent; I have not been doing DL seriously for that long.
> I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.
You might have gotten the wrong person for the job, as we mostly do engineering, but love that you put in the effort to try and stalk me :)
Given that this has become personal, rather than sticking to the topic, I will not respond anymore either.
suflaj t1_irg8l9l wrote
Reply to comment by aWildTinoAppears in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
> The point here is that max training steps does not need to be a tuned hyperparamter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early stopping performance because of the checkpointing strategy.
My point is that you CANNOT guarantee your model is done learning. Again, I will say it, and please don't ignore it if you wish to discuss this further: double descent (or overparametrization side effects in general). Also, there are training setups where you cannot even process the whole dataset and are basically gambling that the dev set you chose is representative of whatever the model will be seeing in training. You can both overshoot and undershoot. This is not only about the number of steps, but also about the batch size and learning rate schedules.
> Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".
I was not saying that. What I was saying is that even the other hyperparameters might be wildly wrong. I take it you have worked with Adam-based optimizers. They generally do not care about hyperparameters in the training period where they are most effective, but other incorrect hyperparameters might have more severe consequences that you will simply not explore if you early stop. In the modern era, if you have a budget for hyperparameter optimization, you check for a number of steps well beyond what you intend to train, so early stopping has no place outside of very old models, 3+ eternities old. Those are nowadays a special case, given the sheer size of modern models.
> I said that this statement is incorrect "early stopping performance is not indicative of final performance".
And in doing so you have ignored a very prevalent counterexample, double descent. It is not rare (anymore), it is not made up, it is well documented, just poorly understood.
suflaj t1_irg1qkg wrote
Reply to comment by aWildTinoAppears in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
> reasonable here means "good enough relative to a network that trains for the max number of steps without hitting early stopping criterion".
First off, how would you know there is a max number of steps (before even knowing what it is)? There is previous work on certain architectures which can give you an idea of what a good number of steps is, but:
- there is no theoretical guarantee that is the optimum
- there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum
So this statement is ill-defined and, at best, an opinion.
> you realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc, right?
Yes. And it is at the same time an incomplete search, and it requires way more than a guess to determine how it should be done and to what extent. Generally there are a fair number of counterexamples where we didn't know a configuration was suboptimal until it was proven otherwise, most famously with double descent. This is something you can't just ignore, and it is a clear and very common example where any kind of early stopping and checkpointing will fail to find even a reasonably good configuration.
I feel like, as a researcher, you shouldn't double down on things you should know are neither conclusive nor built on strong fundamentals. Good enough does not cut it in the general case, especially given that modern models do not even go over the whole dataset in an epoch, and as such might be broken at any one of your checkpoints.
And given that you say you work for Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making educated guesses on the hyperparameters, rather than by thoroughly exploring them.
suflaj t1_irdvhm4 wrote
Reply to comment by aWildTinoAppears in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
You should probably read up on double descent and the lottery ticket hypothesis. Google engineers have been wrong plenty of times in their "hypotheses". Furthermore, you're referring to a system from 2017, so 2.5 eternities ago, when these phenomena were not even known.
Also, what does reasonable mean? I would argue that it highly depends on the other hyperparameters, the architecture of the model, and the data, and as such isn't generally applicable. It's about as reasonable as assuming 3e-4 is a good learning rate, yet there are plenty of counterexamples where the network doesn't converge with it, so it cannot generally be considered reasonable.
suflaj t1_ir9g78r wrote
Reply to comment by chatterbox272 in Time Complexity of Detach() in torch "[R]" by mishtimoi
It creates a new tensor object, but not a copy of the data in memory; it is just a view that shares the same underlying storage, i.e. it copies the reference.
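A quick way to verify this (a minimal PyTorch sketch):

```python
import torch

a = torch.randn(3, requires_grad=True)
b = a.detach()

# detach() returns a new tensor object, not a copy of the data:
print(b.data_ptr() == a.data_ptr())  # True, same underlying storage
print(b.requires_grad)               # False

# Because they share memory, mutating one is visible through the other.
with torch.no_grad():
    a[0] = 42.0
print(b[0])  # tensor(42.)
```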
suflaj t1_ir83jmj wrote
Reply to comment by onyx-zero-software in Time Complexity of Detach() in torch "[R]" by mishtimoi
I'm talking about detach. From what I could find online, the "copy" part is taking the tensor data and wrapping it in a new variable. This does not imply that an actual copy in memory happens, and from what I understand, to get a hard copy you have to clone the detached tensor.
If all OP does is detach tensors, then it's O(1). But we can't know that without further information, so I elaborated that it's likely closer to O(n) because I presume they might be doing something beyond detach.
suflaj t1_ir7re2g wrote
Reply to Time Complexity of Detach() in torch "[R]" by mishtimoi
Detach creates a new tensor. You need to set requires_grad to False. You will not see performance improvements as long as the earlier layers aren't frozen: you don't get to skip gradient updates, so if your first layers need gradients, you will need to calculate them all.
Detach is O(1) if you consider the copy an O(1) operation, but it's probably closer to O(n). I don't know to what extent PyTorch can optimize copying by making it in-place or lazy with views.
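A sketch of the distinction, and of what actually freezing the earlier layers looks like (PyTorch; the two-layer model here is a made-up example):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

h = model[0](x)
view = h.detach()  # cheap: new tensor object sharing h's storage
copy = h.clone()   # real copy: allocates and copies all n elements

# detach() alone does not skip gradient work for earlier layers; for that,
# the earlier parameters themselves must not require gradients.
for p in model[0].parameters():
    p.requires_grad_(False)

out = model(x)
out.sum().backward()  # no gradients are computed for model[0] anymore
```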
suflaj t1_ir4ow8t wrote
Reply to comment by red_dragon in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
Switch to SGD after 1 epoch or so.
But if they do worse than the baseline, something else is likely the problem. Adam(W) does not kill performance; it just, for some reason, isn't as effective at reaching the best final performance as simpler optimizers.
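A minimal sketch of that kind of optimizer switch (PyTorch; the model, loaders, and learning rates are placeholders, not recommendations):

```python
import torch

def train(model, train_loader, loss_fn, epochs=10):
    # Warm up with AdamW for the first epoch, then hand the same parameters
    # over to plain SGD with momentum for the rest of training.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for epoch in range(epochs):
        if epoch == 1:
            optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
```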
suflaj t1_ir2yl5r wrote
Reply to comment by Doppe1g4nger in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
Because early stopping performance is not indicative of final performance, even more so when using Adam(W).
I don't know why I have to repeat this: early stopping is analogous to fixing a hyperparameter to a constant. It doesn't matter whether you stop at N steps, at a plateau, or at an accuracy threshold. You can do it, but then it's not a thorough search.
You can consider it thorough if the number of steps is comparable to the number of steps you will use to actually train the model. You can even consider it thorough if you slightly increase the number of training steps for the final model, since effects related to overparametrization take a long time to converge.
As long as the increase in training steps is shorter than the time it takes for the side effects of overparametrization to converge, your results will be representative of the actual final training run. If it is longer, it's again anyone's game, only this time it's potentially even more dependent on initialization than any step before (but in reality those effects are not yet understood well enough to conclude anything relevant here).
Personally, if accounting for the side effects of overparametrization, I would not do hyperparameter tuning at all; instead I would just retrain from scratch several times with "good" hyperparameters for as long as it takes and play around with weight averaging schemes.
suflaj t1_ir26f5c wrote
Reply to comment by there_are_no_owls in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
What is there to explain? The statement is quite self-explanatory: by fixing the number of training steps you are not exploring other values of that hyperparameter. So it's as if you fixed any other hyperparameter to a constant; you're going to have an incomplete search.
However, a user usually has an idea of how many steps the training should take. So you don't do random or grid search on the number of steps; instead you fix it to the number you will need to complete the training for your final result.
If you wanted to fully search the hyperparameters, then you'd also do grid search on the number of steps. This shouldn't come as a surprise when, e.g., the XGBoost equivalent of training steps, the number of estimators, is one of the most important hyperparameters you do search over.
Where I work, we do this as the last step, isolated from other hyperparameters, but only to find out whether we need MORE training than we initially estimated. This is mostly done to account for the stochasticity of augmentations screwing up the model.
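For illustration, a hypothetical random-search space that treats the number of training steps as just another searched hyperparameter; all names and value ranges here are made up:

```python
import random

search_space = {
    "learning_rate":   lambda: 10 ** random.uniform(-5, -2),
    "batch_size":      lambda: random.choice([16, 32, 64, 128]),
    # The number of training steps is searched like any other hyperparameter,
    # analogous to n_estimators in XGBoost.
    "num_train_steps": lambda: random.choice([10_000, 30_000, 100_000, 300_000]),
}

def sample_config():
    return {name: sampler() for name, sampler in search_space.items()}

for trial in range(20):
    config = sample_config()
    print(config)  # in practice: score = train_and_evaluate(config), a hypothetical routine
```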
suflaj t1_ir1fjgt wrote
Reply to comment by neato5000 in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
In practice this is not true for modern DL models, especially those trained with modern optimization methods like Adam(W). Adam(W) can show optimal performance at the start, but then it's anyone's game until the end of training.
In other words, not only will the optimal hyperparameters probably be different, but because you need to switch to SGD to reach maximum performance, you will have to retune the hyperparameters you already accepted as optimal. Successful early training only somewhat guarantees you won't diverge; to end up with the best final weights you'll have to do an additional hyperparameter search (and there is no guarantee your early training checkpoint will lead you to the best weights in the end, either).
suflaj t1_ir1elya wrote
Reply to comment by bphase in [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
No, because the number of training steps is a hyperparameter itself.
suflaj t1_ir1ejki wrote
Reply to [D] How do you go about hyperparameter tuning when network takes a long time to train? by twocupv60
You should probably try to reduce your dataset size first and then tune hyperparameters with that.
What I would do is start with 100 randomly sampled examples. Train fully with that. Then double it, keeping the same hyperparameters, and see how the performance changes. You want to stop when the performance no longer changes significantly after doubling the data.
How much is significantly? Well, I would personally stop when doubling the data doesn't halve the test error. But that criterion is arbitrary, so YMMV, and you should adjust it based on how fast the performance changes. Think of what performance would be acceptable to an average person who is neither stupid nor informed enough to know your model could be much better. You just need enough data to consider your hyperparameters representative.
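A sketch of that doubling loop (`train_and_eval` is a hypothetical routine that trains fully on the subset and returns test error):

```python
import random

def find_sufficient_subset_size(dataset, train_and_eval, start=100):
    """Double a random subset until doubling no longer halves the test error."""
    size, prev_error = start, None
    while size <= len(dataset):
        subset = random.sample(dataset, size)
        error = train_and_eval(subset)  # hypothetical: full training run, returns test error
        if prev_error is not None and error > prev_error / 2:
            return size  # diminishing returns: this subset size is "enough"
        prev_error, size = error, size * 2
    return len(dataset)
```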
If you do not know how to tune that, then try clustering your data strictly. E.g., if you have text, you could divide it into 2-grams, use MinHashes, and then say the threshold for a cluster is 1% similarity. This will give you very few clusters, from which you can pick a representative and use it as a sample for your dev set.
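A minimal sketch of that kind of clustering with the `datasketch` package (assuming it is available; the 1% threshold and the word 2-gram shingling follow the idea above):

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for gram in zip(tokens, tokens[1:]):        # word 2-grams
        m.update(" ".join(gram).encode("utf8"))
    return m

def cluster_representatives(texts, threshold=0.01, num_perm=128):
    # A very loose threshold yields few, large clusters; keep one representative each.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    representatives = []
    for i, text in enumerate(texts):
        m = minhash_of(text, num_perm)
        if not lsh.query(m):                    # nothing similar seen yet -> new cluster
            representatives.append(text)
        lsh.insert(str(i), m)
    return representatives
```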
Once you reach those diminishing returns, search your hyperparameters randomly within a distribution and then train with those hyperparameters on the full dataset. Depending on the network, the diminishing-returns point will be anywhere from 1k samples (CV ResNets) to 100k samples (finetuning transformers).
suflaj t1_iqt971l wrote
Reply to comment by 029187 in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
Ah, sorry. Based on your responses I was convinced you were reading papers, so my response might have been overly aggressive due to the incredibly negative experience I have had while reading relevant DL papers. It truly feels like the only difference between a SOTA paper and a garbage paper is that the SOTA one somehow got to work on a specific machine, with a specific setup and a specific training run. And this spills over into the whole of DL.
Hopefully you will not have the misfortune of trying to replicate some of the papers that either don't have a repo linked or are not maintained by a large corporation; if you do, you might understand better what I meant.
suflaj t1_iqt876f wrote
Reply to [D] Gpu for machine translation by wrsage
To develop a translation system you will need loads of data. To process those loads of data, you will need loads of processing power. To get good results, you will need large models. Now, it depends on how much time you have. If you have a year, then a 2080 Ti will be fine; it can pretrain a BERT model in several months. If you have a month, you should probably consider a few 3090s. If you have a week, then renting 8xA100 rigs might be your best bet.
Overall, I'd first focus on collecting the data. 10,000 pages of data, assuming 50 sentences per page (and that is a generous guess), is nowhere near enough to develop a translation system. Aim for several tens of millions of sentences, ideally 100-200 million sentence pairs if you wish to outperform Google Translate, or consider developing a handmade tool instead.
The GPUs you mentioned will only be able to run LSTM-CNN models, which will never compete with Google Translate (which is a hand-tuned system accompanied by a transformer model). You need at least a 1080 Ti/2080 Ti, and even that is fairly weak; you'll need months to get anything out of it.
suflaj t1_iqt5hll wrote
Reply to comment by 029187 in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
>if there is a theoretical understanding of why the weights are not led to it via backprop
Not really. The intuition is that self-attention is a vastly different kernel from what FC layers can handle, especially because of the dot product, which I assume is the main culprit.
>This is an interesting take. Can you elaborate a bit?
I'm not sure how I could elaborate on this. If you read papers, you will see that most of the stuff in DL has barely any theoretical basis. On the topic of transformers, about the most theoretical part is the normalization of the self-attention scores (division by the square root of d_k). Everything else in the original paper is mostly shooting in the dark. It's even more of a joke when you realize that they didn't even check different seeds to realize the one in the original paper gave them fairly bad results.
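For reference, that normalization is the 1/sqrt(d_k) factor in scaled dot-product attention; a minimal single-head sketch:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k); dividing the scores by sqrt(d_k) is the
    # normalization mentioned above.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)   # shape (2, 5, 64)
```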
You can also check all the different transformer architectures that can't seem to converge into anything, since the foundation for them is so poor and unscientific, I'd dare say arbitrary. And then, just as you think you can get more hope from CNNs, which aren't so arbitrary, you're met with a slightly different residual block in ConvNeXt that supposedly gives you results comparable to vision transformers, yet there is barely any theoretical basis for it, mostly intuition.
suflaj t1_iqt1v5b wrote
Reply to comment by 029187 in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
> In principle, the DNN could also arrive at those functions.
With a very sparse, deep FC network, sure. But in practice it will not happen to the extent it does in self-attention. In practice it is hard to even reduce the number of transformer blocks and imitate them at certain checkpoints, let alone emulate self-attention with any number of FC layers in series.
You are completely disregarding that just because it is possible to define a mathematical approximation, it doesn't mean there is an algorithm that can consistently lead the weights to it. Certainly, in the case of self-attention, the landscape that optimization algorithms traverse is not really well behaved.
Theory mostly doesn't matter for deep learning because the subject is, for the most part, not even explainable by theory. It's a bunch of trial and error, especially with transformers. There are many parts of DL which, when applied in practice, contradict ML theory. Unless you have a proof that guarantees something, I'd recommend avoiding trying to apply theory without practice; it's probably a waste of time. There are many theoretical claims, like the universal approximation theorem, but they do not really hold up in practice for the resources we have.
suflaj t1_it4h2vs wrote
Reply to EMA / SWA / SAM by Ttttrrrroooowwww
I wouldn't use any of them because they don't seem to be worth it, and they're generally unproven on modern, relevant models. If I wanted to minimize variance, I'd just build an ensemble of models.
The best advice I can give you is to disregard older papers; model averaging is like a 4-year-old idea and doesn't seem to be used much in practice.