suflaj

suflaj t1_iv5prfi wrote

That's on you tbh

I don't think it's very scientific to judge properties of a card based on the whim of an unregulated agent. That way we could conclude that ancient cards, which are basically worthless, are the best.

But other than that, I don't think there's a single person who would have recommended anything other than a 3090 even before these benchmarks. It's the same situation as the 1080Ti vs the 2080Ti: the 3090 is just too good of a card.

0

suflaj t1_iv2z6fz wrote

Just a small correction: according to Geizhals, the RTX 4090 is 2030€ and the 3090 is 1150€. So the 4090 would at this point need to be around 176% as powerful. But the prices of the 4090 will fall and those of the 3090 will rise, so comparing on MSRP makes sense, since that is more stable than street prices.
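For reference, a quick sanity check of that ratio from the prices quoted above:

```python
# Back-of-envelope check of the price ratio quoted above (Geizhals street prices).
price_4090 = 2030  # EUR
price_3090 = 1150  # EUR

ratio = price_4090 / price_3090
print(f"4090 / 3090 price ratio: {ratio:.1%}")  # ~176.5%
```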

5

suflaj t1_iv1ftad wrote

Well, based on the complaint, they probably have a case. However, the solution to the problem may not really be feasible, since it would imply that Copilot also generates a disclaimer based on all the licenses used, so if a user then deletes that disclaimer, they are breaking the license.

Now, given that this may affect something like 100k repositories, that disclaimer file would have to be in the megabytes.
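To illustrate the scale (the per-repository attribution size is purely an assumption on my part; only the 100k repository count is from the thread):

```python
# Rough back-of-envelope estimate of the attribution file size.
repositories = 100_000
bytes_per_attribution = 100  # assumed: one short line with repo name, author and license

total_bytes = repositories * bytes_per_attribution
print(f"~{total_bytes / 1e6:.0f} MB of attributions")  # ~10 MB
```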

1

suflaj t1_iuyg2am wrote

Well, I couldn't understand what your task was, since you didn't say what it was until now.

Other than that, skimming through the paper it quite clearly says the following:

> Our present results do not indicate our procedure can generalize to motifs that are not present in the training set

Because what they're doing doesn't generalize, I think the starting assumption (that there will be improvements with a larger model) is wrong, and so the question is unnecessary... The issue is with the method or the data; they do not elaborate more than that.

2

suflaj t1_iuybshu wrote

That seems very bad. You want your train-dev-test splits to be different samples of the same distribution, so not very different sets.

Furthermore, if you're using the test set for model validation, that means you will have no dataset left to finally evaluate your model on. Reconsider your process.

Finally, again, I urge you to evaluate your model with an established evaluation metric for the task, not the loss you use to train it. What is the exact task?
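As a sketch of what a proper split looks like (toy data and scikit-learn used purely for illustration, not something from your setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=1_000, n_classes=2, random_state=42)

# 80% train, then split the remaining 20% in half: 10% dev, 10% test.
# All three sets are samples of the same distribution.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

# Tune on the dev set; touch the test set only once, for the final evaluation.
```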

2

suflaj t1_iuya2f9 wrote

It can be seen as an approximation of the variance between the noise and the noise predicted by the model, conditioned on some data.

If it's computed on the training set it is not even usable as a metric, and if it is not directly related to performance it is not a good metric. You want to see how the model behaves on unseen data.
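A minimal sketch of what that would look like, assuming a PyTorch-style denoiser `model(noisy, t)` that predicts the added noise (all names here are placeholders, not from your code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def held_out_noise_mse(model, val_loader, device="cpu"):
    """Mean squared error between the true added noise and the prediction, on unseen data."""
    model.eval()
    total, count = 0.0, 0
    for noisy, t, noise in val_loader:  # noised sample, timestep, the noise that was added
        noisy, t, noise = noisy.to(device), t.to(device), noise.to(device)
        pred = model(noisy, t)          # the model predicts the added noise
        total += F.mse_loss(pred, noise, reduction="sum").item()
        count += noise.numel()
    return total / count
```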

1

suflaj t1_ium3372 wrote

ML is easy since it's mostly on the CPU. DL still remains shit, unless your definition of prototyping is verifying that the shapes match and that the network can do backprop and save weights at the end of an epoch.

Things are not going to change fast unless Macs start coming with Nvidia CUDA-capable GPUs.
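For what it's worth, this is roughly what the situation looks like in PyTorch on a Mac (a sketch; MPS is Apple's Metal backend, not CUDA):

```python
import torch

# On Apple Silicon you get CPU and, at best, the MPS (Metal) backend;
# CUDA is simply not available, so CUDA-only code paths won't run.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # works for many ops, but coverage lags behind CUDA
else:
    device = torch.device("cpu")

print(f"Training on: {device}")
```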

8

suflaj t1_iuk2gfj wrote

Yeah, just experiment with it. Like I said, I would start with 4, then go higher or lower depending on your needs. I have personally not seen a temporally sensitive neural network go beyond 6 or 8 time points. As with anything, there are tradeoffs.

Although if you have x, y and c, you will be doing 3D convolutions, not 4D. A 4D convolution on 4D data is essentially a linear layer.
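As a sketch of what that means in practice (PyTorch, with illustrative sizes: c channels, 4 time points, and x, y spatial dims):

```python
import torch
import torch.nn as nn

# Data laid out as (batch, channels, time, height, width):
# c is the channel dimension, so the convolution slides over (time, x, y) only.
x = torch.randn(8, 3, 4, 64, 64)  # batch=8, c=3, t=4, x=64, y=64

conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(2, 3, 3), padding=(0, 1, 1))
out = conv(x)
print(out.shape)  # torch.Size([8, 16, 3, 64, 64])
```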

1

suflaj t1_iuji61k wrote

Probably not.

  • I am almost certain you don't have data that would take advantage of this dimensionality or the resources to process it
  • you can't accumulate so many features and remember all of them in recurrent models
  • I am almost certain you don't have the hardware to house such a large transformer model that could process it
  • I am almost certain you will not get a 365-day history of a sample during inference; 4 days seems more reasonable

1

suflaj t1_iujhefz wrote

I asked for the specific law so I could show you that it cannot apply to end-to-end encrypted systems, which either have partially destroyed information, or where the information that leaves the premises is not comprehensible to anything but the model, with formal proof that it is infeasible to crack.

These are all long-solved problems; the only hard part is doing hashing without losing too much information, or making the encryption compact enough to both fit into the model and be comprehensible to it.

2

suflaj t1_iuim5q2 wrote

It could, but doesn't have to. For temporal dimensions, 4 is very often seen, so you probably wanna start with that first, then see how it compares to 3 or 2.

Intuitively, I think 2 time points are useless. It's difficult to generalize something new from such a short relation. My intuition would be to sample t, t-1, t-2 and t-4, but I'd first confirm it's better than t, t-1, t-2 and t-3.
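A sketch of what those two sampling schemes look like on a 1D series (NumPy, purely illustrative):

```python
import numpy as np

series = np.arange(100)  # stand-in for any temporally ordered signal

def sample_window(series, t, offsets):
    """Gather the points at t + offset for each (non-positive) offset."""
    return series[[t + o for o in offsets]]

dense = sample_window(series, t=50, offsets=[0, -1, -2, -3])  # t, t-1, t-2, t-3
skip  = sample_window(series, t=50, offsets=[0, -1, -2, -4])  # t, t-1, t-2, t-4
print(dense, skip)  # [50 49 48 47] [50 49 48 46]
```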

1

suflaj t1_iudwxq9 wrote

Go for the 3090 unless they're the same price. The 3090Ti is slightly more performant (up to 10-15% faster in training, from our benchmarks), but it generates much more heat and noise and consumes way more power.

3

suflaj t1_iu4zaqo wrote

Well, then it's a matter of trust: every serious cloud provider has a privacy policy that claims nothing is logged. Of course, you don't have to trust this, but it is a liability for the cloud provider, so you get to shift the blame if something goes wrong. And I'd argue that for most companies the word of a cloud provider means more than your word, since they've got much to lose.

It's also standard practice to use end-to-end encryption, with some using end-to-end encrypted models. I don't really see how our company could handle personal data and retain samples in a GDPR-compliant way without proprietary models in the cloud.

2

suflaj t1_it686mk wrote

This would depend on whether or not you believe newer noisy data is more important. I would generally not use it, because it's not something you can guarantee on all data; it would have to be theoretically confirmed beforehand, which might be impossible for a given task.

If I wanted to reduce the noisiness of pseudo-labels, I would not want to introduce additional biases on the data itself, so I'd rather do sample selection, which seems to be what the newest papers suggest. Weight averaging introduces biases akin to what weight normalization techniques did, and those were partially abandoned in favour of different approaches, e.g. larger batch sizes, because they proved more robust and performant in practice as our models grew more different from the ML baselines we based our findings on.
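For context, "weight averaging" here refers to something like keeping an exponential moving average of the parameters (a minimal PyTorch-style sketch; the decay value is illustrative, not a recommendation):

```python
import copy
import torch

def make_ema(model):
    """Keep a frozen copy of the model whose weights are an EMA of the live weights."""
    ema = copy.deepcopy(model)
    for p in ema.parameters():
        p.requires_grad_(False)
    return ema

@torch.no_grad()
def update_ema(ema, model, decay=0.999):
    # ema_w <- decay * ema_w + (1 - decay) * live_w, applied after every optimizer step
    for ema_p, p in zip(ema.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```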

Now, if I wasn't aware of papers that came out this year, maybe I wouldn't be saying this. That's why I recommended you stick to newer papers, because problems are never really fully solved and newer solutions tend to make bigger strides than optimizing older ones.

1

suflaj t1_it66q7y wrote

Reply to comment by Lee8846 in EMA / SWA / SAM by Ttttrrrroooowwww

While it is true that the age of a method does not determine its value, the older a method is, the more likely it is that its performance gains have been surpassed by some other method or model.

Specifically, I do not see why I would use any weight averaging over a better model or training technique.

> In this case, an ensemble of models might not help.

Because you'd just use a bigger batch size.

1

suflaj t1_it4no84 wrote

If you have the means to record the dataset in-house, that's the best way. You can directly talk to the annotators and the subjects, you make sure that the data cannot be redistributed unless someone leaks it, and you will have a better grasp of the relevant privacy policies. It is also likely to be cheaper.

With external data it is almost impossible to prove you are allowed to have it, and this data can then just be resold to someone else, potentially a competitor.

8