Discussion about this post

Jack:

This is an interesting approach. However, I feel it could be improved by computing cosine similarity between the tweet text and the labels themselves, instead of label to label:

1. Embed the tweet text and find its nearest neighbor among the existing class labels. If the best match is above a cosine-similarity threshold, reuse that existing label. If nothing clears the threshold, generate a new label and store it in a vector DB (see the sketch below).

2. Repeat for each incoming tweet.

This way you are comparing the source text directly to the labels. Comparing label to label is noisier, since a tweet's label is only a limited representation of the underlying tweet. Why not just use the tweet itself?
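A minimal sketch of that loop, assuming sentence-transformers for the embeddings and a plain Python list standing in for a real vector DB. `generate_label` is a stand-in for whatever LLM call produces a label, and the threshold value is illustrative, not from the post:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
SIM_THRESHOLD = 0.75  # illustrative; tune on held-out data

label_texts: list[str] = []        # the growing label set
label_vecs: list[np.ndarray] = []  # embeddings of those labels

def assign_label(tweet: str, generate_label) -> str:
    """generate_label: hypothetical LLM call, tweet -> short label string."""
    v = model.encode(tweet, normalize_embeddings=True)
    if label_vecs:
        # cosine similarity reduces to a dot product on normalized vectors
        sims = np.stack(label_vecs) @ v
        best = int(np.argmax(sims))
        if sims[best] >= SIM_THRESHOLD:
            return label_texts[best]  # close enough: reuse the existing label
    # nothing close enough: mint a new label and store its embedding
    new_label = generate_label(tweet)
    label_texts.append(new_label)
    label_vecs.append(model.encode(new_label, normalize_embeddings=True))
    return new_label
```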

A more stable (and classic) approach is to just embed all the text, cluster, and then create a label for each cluster.
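For comparison, a minimal sketch of that pipeline under the same assumptions (sentence-transformers plus scikit-learn's KMeans; `generate_cluster_label` is again a hypothetical LLM call that names a cluster from a few exemplar tweets):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def label_by_clustering(tweets: list[str], k: int,
                        generate_cluster_label) -> dict[int, str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = model.encode(tweets, normalize_embeddings=True)
    km = KMeans(n_clusters=k, random_state=0).fit(vecs)
    cluster_labels = {}
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # name each cluster from the tweets nearest its centroid
        dists = np.linalg.norm(vecs[members] - km.cluster_centers_[c], axis=1)
        exemplars = [tweets[i] for i in members[np.argsort(dists)][:5]]
        cluster_labels[c] = generate_cluster_label(exemplars)
    return cluster_labels
```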

What are the advantages of your approach over the traditional clustering method? To me, it seems your method adds run-to-run variance (something you aimed to reduce) without much obvious benefit. Maybe I'm missing something?

Sujeet Pillai:

Do you feed the clustered labels back into the prompt, to align the LLM's labeling with the existing set of clustered labels where possible?
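(Not from the post, just a sketch of what that feedback might look like: surface the current label set in the prompt so the model prefers reusing an existing label. The wording is illustrative.)

```python
def build_prompt(tweet: str, existing_labels: list[str]) -> str:
    # illustrative phrasing; the point is to show the model the label set
    label_list = "\n".join(f"- {l}" for l in existing_labels)
    return (
        "Classify the tweet below. If one of these existing labels fits, "
        "return it verbatim; otherwise propose a concise new label.\n\n"
        f"Existing labels:\n{label_list}\n\n"
        f"Tweet: {tweet}\nLabel:"
    )
```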
