suflaj

suflaj t1_iv5prfi wrote

That's on you tbh

I don't think it's very scientific to judge properties of a card based on the whim of an unregulated agent. That way we could conclude that ancient cards, which are basically worthless, are the best.

But other than that, I don't think there's a single person who would have recommended anything other than a 3090 even before these benchmarks. It's the same situation as the 1080Ti vs the 2080Ti: the 3090 is just too good of a card.

0

suflaj t1_iv2z6fz wrote

Just a small correction: according to Geizhals, the RTX 4090 is 2030€ and the 3090 is 1150€. So the 4090 would at this point need to be around 176% as powerful. But the prices of the 4090 will fall and those of the 3090 will rise, so comparing on MSRP makes sense, since that is more stable than street prices.
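For reference, a quick sanity check of that ratio from the prices quoted above:

```python
# Back-of-envelope check of the price ratio quoted above (Geizhals street prices).
price_4090 = 2030  # EUR
price_3090 = 1150  # EUR

ratio = price_4090 / price_3090
print(f"4090 / 3090 price ratio: {ratio:.1%}")  # ~176.5%
```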

5

suflaj t1_iv1ftad wrote

Well, based on the complaint, they probably have a case. However, the solution to the problem may not really be feasible, since it would imply that Copilot also generates a disclaimer based on all the licenses used, so if a user then deletes that disclaimer, they are breaking the license.

Now, given that this may affect something like 100k repositories, that disclaimer file would have to be in the megabytes.
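To illustrate the scale (the per-repository attribution size is purely an assumption on my part; only the 100k repository count is from the thread):

```python
# Rough back-of-envelope estimate of the attribution file size.
repositories = 100_000
bytes_per_attribution = 100  # assumed: one short line with repo name, author and license

total_bytes = repositories * bytes_per_attribution
print(f"~{total_bytes / 1e6:.0f} MB of attributions")  # ~10 MB
```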

1

suflaj t1_iuyg2am wrote

Well, I couldn't understand what your task was, since you didn't say what it was until now.

Other than that, skimming through the paper it quite clearly says the following:

> Our present results do not indicate our procedure can generalize to motifs that are not present in the training set

Because what they're doing doesn't generalize, I think the starting assumption (that there will be improvements with a larger model) is wrong, and so the question is unnecessary... The issue is with the method or the data; they do not elaborate more than that.

2

suflaj t1_iuybshu wrote

That seems very bad. You want your train-dev-test splits to be different samples of the same distribution, so not very different sets.

Furthermore, if you're using the test set for model validation, that means you will have no dataset left to finally evaluate your model on. Reconsider your process.

Finally, again, I urge you to evaluate your model with an established evaluation metric for the task, not the loss you use to train it. What is the exact task?
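As a sketch of what a proper split looks like (toy data and scikit-learn used purely for illustration, not something from your setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=1_000, n_classes=2, random_state=42)

# 80% train, then split the remaining 20% in half: 10% dev, 10% test.
# All three sets are samples of the same distribution.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

# Tune on the dev set; touch the test set only once, for the final evaluation.
```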

2

suflaj t1_iuya2f9 wrote

It can be seen as an approximation of the variance between the noise and the noise predicted by the model, conditioned on some data.

If it's computed on the training set it is not even usable as a metric, and if it is not directly related to performance it is not a good metric. You want to see how the model behaves on unseen data.
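A minimal sketch of what that would look like, assuming a PyTorch-style denoiser `model(noisy, t)` that predicts the added noise (all names here are placeholders, not from your code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def held_out_noise_mse(model, val_loader, device="cpu"):
    """Mean squared error between the true added noise and the prediction, on unseen data."""
    model.eval()
    total, count = 0.0, 0
    for noisy, t, noise in val_loader:  # noised sample, timestep, the noise that was added
        noisy, t, noise = noisy.to(device), t.to(device), noise.to(device)
        pred = model(noisy, t)          # the model predicts the added noise
        total += F.mse_loss(pred, noise, reduction="sum").item()
        count += noise.numel()
    return total / count
```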

1

suflaj t1_ium3372 wrote

ML is easy since it's mostly on the CPU. DL still remains shit, unless your definition of prototyping is verifying that the shapes match and that the network can do backprop and save weights at the end of an epoch.

Things are not going to change fast unless Macs start coming with Nvidia CUDA-capable GPUs.
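For what it's worth, this is roughly what the situation looks like in PyTorch on a Mac (a sketch; MPS is Apple's Metal backend, not CUDA):

```python
import torch

# On Apple Silicon you get CPU and, at best, the MPS (Metal) backend;
# CUDA is simply not available, so CUDA-only code paths won't run.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # works for many ops, but coverage lags behind CUDA
else:
    device = torch.device("cpu")

print(f"Training on: {device}")
```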

8

suflaj t1_iuk2gfj wrote

Yeah, just experiment with it. Like I said, I would start with 4, then go higher or lower depending on your needs. I have personally not seen a temporally sensitive neural network go beyond 6 or 8 time points. As with anything, there are tradeoffs.

Although if you have x, y and c, you will be doing 3D convolutions, not 4D. A 4D convolution on 4D data is essentially a linear layer.
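As a sketch of what that means in practice (PyTorch, with illustrative sizes: c channels, 4 time points, and x, y spatial dims):

```python
import torch
import torch.nn as nn

# Data laid out as (batch, channels, time, height, width):
# c is the channel dimension, so the convolution slides over (time, x, y) only.
x = torch.randn(8, 3, 4, 64, 64)  # batch=8, c=3, t=4, x=64, y=64

conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(2, 3, 3), padding=(0, 1, 1))
out = conv(x)
print(out.shape)  # torch.Size([8, 16, 3, 64, 64])
```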

1

suflaj t1_iuji61k wrote

Probably not.

  • I am almost certain you don't have data that would take advantage of this dimensionality or the resources to process it
  • you can't accumulate so many features and remember all of them in recurrent models
  • I am almost certain you don't have the hardware to house such a large transformer model that could process it
  • I am almost certain you will not get a 365-day history of a sample during inference; 4 days seems more reasonable

1

suflaj t1_iujhefz wrote

I asked for the specific law so I could show you that it cannot apply to end-to-end encrypted systems, which either have partially destroyed information, or where the information that leaves the premises is not comprehensible to anything but the model, with formal proof that it is infeasible to crack.

These are all long-solved problems; the only hard part is doing hashing without losing too much information, or making the encryption compact enough to both fit into the model and be comprehensible to it.

2

suflaj t1_iuim5q2 wrote

It could, but doesn't have to. For temporal dimensions, 4 is very often seen, so you probably wanna start with that first, then see how it compares to 3 or 2.

Intuitively, I think 2 time points are useless. It's difficult to generalize something new from such a short relation. My intuition would be to sample t, t-1, t-2 and t-4, but I'd first confirm it's better than t, t-1, t-2 and t-3.
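A sketch of what those two sampling schemes look like on a 1D series (NumPy, purely illustrative):

```python
import numpy as np

series = np.arange(100)  # stand-in for any temporally ordered signal

def sample_window(series, t, offsets):
    """Gather the points at t + offset for each (non-positive) offset."""
    return series[[t + o for o in offsets]]

dense = sample_window(series, t=50, offsets=[0, -1, -2, -3])  # t, t-1, t-2, t-3
skip  = sample_window(series, t=50, offsets=[0, -1, -2, -4])  # t, t-1, t-2, t-4
print(dense, skip)  # [50 49 48 47] [50 49 48 46]
```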

1

suflaj t1_iudwxq9 wrote

Go for the 3090 unless they're the same price. The 3090Ti is slightly more performant (up to 10-15% faster in training, from our benchmarks), but it generates much more heat and noise and consumes way more power.

3

suflaj t1_iu4zaqo wrote

Well, then it's a matter of trust: every serious cloud provider has a privacy policy that claims nothing is logged. Of course, you don't have to trust this, but it is a liability for the cloud provider, so you get to shift the blame if something goes wrong. And I'd argue that for most companies the word of a cloud provider means more than your word, since they've got much to lose.

It's also standard practice to use end-to-end encryption, with some using end-to-end encrypted models. I don't really see how our company could handle personal data and retain samples in a GDPR-compliant way without proprietary models in the cloud.

2

suflaj t1_it686mk wrote

This would depend on whether or not you believe newer noisy data is more important. I would generally not use it, because it's not something you can guarantee on all data; it would have to be theoretically confirmed beforehand, which might be impossible for a given task.

If I wanted to reduce the noisiness of pseudo-labels, I would not want to introduce additional biases on the data itself, so I'd rather do sample selection, which seems to be what the newest papers suggest. Weight averaging introduces biases akin to what weight normalization techniques did, and those were partially abandoned in favour of different approaches, e.g. larger batch sizes, because they proved more robust and performant in practice as our models grew more different from the ML baselines we based our findings on.
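For context, "weight averaging" here refers to something like keeping an exponential moving average of the parameters (a minimal PyTorch-style sketch; the decay value is illustrative, not a recommendation):

```python
import copy
import torch

def make_ema(model):
    """Keep a frozen copy of the model whose weights are an EMA of the live weights."""
    ema = copy.deepcopy(model)
    for p in ema.parameters():
        p.requires_grad_(False)
    return ema

@torch.no_grad()
def update_ema(ema, model, decay=0.999):
    # ema_w <- decay * ema_w + (1 - decay) * live_w, applied after every optimizer step
    for ema_p, p in zip(ema.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```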

Now, if I wasn't aware of papers that came out this year, maybe I wouldn't be saying this. That's why I recommended you stick to newer papers, because problems are never really fully solved and newer solutions tend to make bigger strides than optimizing older ones.

1

suflaj t1_it66q7y wrote

Reply to comment by Lee8846 in EMA / SWA / SAM by Ttttrrrroooowwww

While it is true that the age of a method does not determine its value, the older a method is, the more likely it is that its performance gains have been surpassed by some other method or model.

Specifically, I do not see why I would use any weight averaging over a better model or training technique.

> In this case, an ensemble of models might not help.

Because you'd just use a bigger batch size.

1

suflaj t1_it4no84 wrote

If you have the means to record the dataset in-house, that's the best way. You can directly talk to the annotators and the subjects, you make sure that the data cannot be redistributed unless someone leaks it, and you will have a better grasp of the relevant privacy policies. It is also likely to be cheaper.

With external data it is almost impossible to prove you are allowed to have it, and this data can then just be resold to someone else, potentially a competitor.

8