Recent comments in /f/MachineLearning

machineko t1_jbu36nu wrote

How long is your text? If you are working with short sentences, try fine-tuning RoBERTa on your labeled dataset for classification. If you don't have a labeled dataset, you'll need to use zero- or few-shot learning on a larger model. I'd start with a smaller LLM like GPT-J and play with prompts on a free playground like this (you can select GPT-J) until you find something that works well.
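
If you go the fine-tuning route, something like this is a reasonable starting point (a minimal sketch with Hugging Face transformers; the dataset, number of labels, and hyperparameters are placeholders for your own):

```python
# Minimal RoBERTa fine-tuning sketch with Hugging Face transformers.
# The dataset, number of labels, and hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # swap in your own labeled dataset
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

args = TrainingArguments(output_dir="roberta-clf",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)

Trainer(model=model, args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"]).train()
```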

1

pyepyepie t1_jbu3245 wrote

I think you misunderstood my comment. What I'm saying is that since you have no way to measure how well UMAP worked or how much of the data's variance this plot captures, the fact that it "seems similar" means nothing (I'm really not an expert on it, so feel free to correct me if I've got it wrong). Additionally, I'm not sure how balanced the dataset you used for classification is, or whether sentence embeddings are even the right approach for that specific task.

It might be the case, for example, that the OpenAI embeddings + the FFW network classify the data perfectly, or as well as anything could, because the dataset is very imbalanced and the annotation is imperfect or the categories are very similar. In that case, 89% vs. 91% could be a huge difference. In fact, for some datasets a "majority classifier" would already yield high accuracy, so I would start by reporting precision & recall.
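
Something quick like this already tells you much more than raw accuracy (a sketch; `y_true` and `y_pred` are placeholders for your labels and predictions):

```python
# Sketch: compare your classifier against a majority-class baseline and report
# per-class precision/recall. y_true and y_pred are placeholders.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

y_true = np.array([0, 0, 0, 0, 1, 0, 2, 0])  # imbalanced toy labels
y_pred = np.array([0, 0, 0, 1, 1, 0, 2, 0])  # your model's predictions

# Majority-class baseline: always predicts the most frequent label.
X_dummy = np.zeros((len(y_true), 1))
baseline_pred = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true).predict(X_dummy)

print("Majority baseline:")
print(classification_report(y_true, baseline_pred, zero_division=0))
print("Model:")
print(classification_report(y_true, y_pred, zero_division=0))
```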

Again, I don't want to be "the negative guy," but there are serious flaws that keep me from drawing any conclusion from this (I find the project very important and interesting). Could you release the data from your experiments (vectors, dataset) so other people (myself possibly included) can look into it more deeply?

7

Simusid OP t1_jbu2n5w wrote

That was not the point at all.

Continuing the cat analogy: I have two different cameras. I take 20,000 pictures of the same cats with both, so I have two datasets of 20,000 cat pictures. Is one dataset superior to the other? I build a model on each that tries to recognize cats and see whether the "quality" of one dataset is better than the other.

In this case, the OpenAI dataset appears to be slightly better.

5

Simusid OP t1_jbu229y wrote

Regarding the plot, the intent was not to measure anything, nor to identify any specific differences. UMAP is an important tool for humans to get a sense of what is going on at a high level. I think if you ever use a UMAP plot for analytic results, you're using it incorrectly.

At a high level I wanted to see if there were very distinct clusters or amorphous overlapping blobs and to see if one embedding was very distinct. I think these UMAPs clearly show good and similar clustering.
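
For reference, this is the kind of eyeball check I mean (a sketch with random placeholder data, not the exact code):

```python
# Sketch (not the exact code): side-by-side UMAP projections of two embedding
# sets, just to eyeball cluster structure. Data here are random placeholders.
import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1000, 1536))  # placeholder for OpenAI embeddings
emb_b = rng.normal(size=(1000, 384))   # placeholder for SentenceTransformer embeddings
labels = rng.integers(0, 4, 1000)      # placeholder class labels

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, (emb_a, emb_b), ("OpenAI", "SentenceTransformer")):
    proj = umap.UMAP(random_state=42).fit_transform(emb)
    ax.scatter(proj[:, 0], proj[:, 1], c=labels, s=3)
    ax.set_title(title)
plt.show()
```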

Regarding the classification task: again, this is a notional task, not an attempt to solve a concrete problem. The goal was to use nearly identical models with both sets of embeddings and see whether there were consistent differences. There were. The OpenAI models marginally outperform the SentenceTransformer models every single time (several hundred runs with various hyperparameters). Whether it's a "carefully chosen" task or not is immaterial. In this case, "carefully chosen" means softmax classification accuracy over the 4 labels in the curated dataset.
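
Roughly this kind of setup, run once per embedding set (a sketch, not the exact code; dimensions and hyperparameters are illustrative):

```python
# Sketch (not the exact code): a small softmax classification head trained on
# precomputed embeddings. Run the same routine on each embedding set and
# compare held-out accuracy. Dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

def train_head(embeddings, labels, num_classes=4, epochs=50):
    """embeddings: (N, D) float tensor, labels: (N,) long tensor."""
    model = nn.Sequential(
        nn.Linear(embeddings.shape[1], 256),
        nn.ReLU(),
        nn.Linear(256, num_classes),  # softmax is folded into CrossEntropyLoss
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(embeddings), labels).backward()
        opt.step()
    acc = (model(embeddings).argmax(dim=1) == labels).float().mean().item()
    return model, acc

# Toy usage with random "embeddings" standing in for the real ones:
emb = torch.randn(200, 768)
y = torch.randint(0, 4, (200,))
_, train_acc = train_head(emb, y)
print(train_acc)
```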

7

shahaff32 t1_jbu0zmr wrote

In our research we ran into issues with Lightning. It is especially annoying when designing non-trivial layers or optimizers, and it is much harder to convert the code back to pure PyTorch.

For example, in a recent project, Lightning ran the forward-backward pass twice on each batch because we used a combination of two optimizers for a specific reason. We are now rewriting everything without Lightning.
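
For comparison, in plain PyTorch the same two-optimizer setup can share a single forward/backward pass per batch, roughly like this (a sketch, not our actual code):

```python
# Sketch (not our actual training loop): two optimizers over different
# parameter groups sharing a single forward/backward pass per batch in plain PyTorch.
import torch
import torch.nn as nn

backbone = nn.Linear(16, 32)
head = nn.Linear(32, 4)
model = nn.Sequential(backbone, nn.ReLU(), head)
loss_fn = nn.CrossEntropyLoss()

opt_a = torch.optim.SGD(backbone.parameters(), lr=1e-2)
opt_b = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))

opt_a.zero_grad()
opt_b.zero_grad()
loss = loss_fn(model(x), y)  # one forward pass
loss.backward()              # one backward pass
opt_a.step()
opt_b.step()
```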

14

Simusid OP t1_jbu0bkv wrote

>We do but is it what embedding actually provide or rather some kind of distance between items,

A single embedding is a single vector encoding a single sentence. To identify a relationship between sentences, you need to compare vectors; typically this is done with cosine distance between them. The expectation is that if you have a collection of sentences that all talk about cats, the vectors that represent them will sit in a nearby neighborhood of the metric space.
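
For example (a sketch using sentence-transformers; the model name is just one common choice):

```python
# Sketch: cosine similarity between sentence embeddings.
# The model name is just one common SentenceTransformer choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "My cat sleeps on the couch all day.",
    "The kitten curled up on the sofa.",
    "The stock market fell sharply today.",
]
emb = model.encode(sentences, convert_to_tensor=True)

# The two cat sentences should sit in a nearby neighborhood (high similarity).
print(util.cos_sim(emb[0], emb[1]))  # high
print(util.cos_sim(emb[0], emb[2]))  # lower
```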

2

science-raven OP t1_jbtzsn5 wrote

Thanks for the info. The robot has an onboard map of the entire garden that is accurate to a couple of inches, plus an ultrasound ping sensor, so it can arrive anywhere without heavy processing.

I found a visual servoing demo from 2015 running on a Raspberry Pi, and today's Pi 4 is about four times faster than that. How can it fail at accurate placement when all the objects in the garden are static and it has unlimited time to process the most difficult tasks?

1

utopiah t1_jbtx8iv wrote

> What we want is a model that can represent the "semantic content" or idea behind a sentence

We do, but is that what embeddings actually provide, or rather some kind of distance between items, i.e. how they might or might not relate to each other? I'm not sure that would be sufficient for most people as the "idea" behind a sentence, just relatedness. I'm not saying it's not useful, but I'm arguing against the semantic aspect here, at least from my understanding of that explanation.

2

rshah4 t1_jbtsl7o wrote

Also, I'm not sure about a more recent comparison, but Nils Reimers also tried to empirically analyze OpenAI's embeddings here: https://twitter.com/Nils_Reimers/status/1487014195568775173

He found across 14 datasets that the OpenAI 175B model is actually worse than a tiny MiniLM 22M parameter model that can run in your browser.

8

pyepyepie t1_jbtsc3n wrote

Your plot doesn't mean much: with UMAP you can't even measure the explained variance, and the real differences can be more nuanced than what the projection shows. I would evaluate with a semantic similarity or ranking task instead.
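
For example, something along these lines on an STS-style dataset (a sketch with random placeholder data):

```python
# Sketch: score embeddings on a semantic-similarity set by correlating cosine
# similarities with human judgments. All inputs here are random placeholders.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(100, 384))    # embeddings of the first sentence in each pair
emb_b = rng.normal(size=(100, 384))    # embeddings of the second sentence in each pair
human = rng.uniform(0, 5, 100)         # human similarity scores, e.g. STS labels

cos = np.sum(emb_a * emb_b, axis=1) / (
    np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
)
rho, _ = spearmanr(cos, human)
print("Spearman correlation:", rho)
```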

For the "91% vs 89%" - you need to pick the classification task very carefully if you don't describe what it was then it also literally means nothing.

That being said, thanks for the efforts.

39