Hybrid Filtering Approach
Based on the findings of the previous section, I decided to implement a hybrid approach that uses Finlang for initial filtering and GPT-3.5-turbo for secondary verification. This hybrid strategy consists of two steps:
- Embedding-based pre-filtering: Compute embeddings for all points and calculate cosine similarities. Points scoring above approximately 0.60 similarity are flagged as potential duplicates.
- GPT-based verification: Points identified as potential duplicates undergo further evaluation by GPT-3.5-turbo in batches to confirm whether they truly represent identical semantic meaning.
This approach keeps the process efficient and cost-effective: the quick, inexpensive embedding calculations narrow down the candidates, and GPT’s nuanced semantic evaluation is only spent on the points that were flagged.
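The pre-filtering step can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: it assumes the embeddings are already available as plain lists of floats, and the names `cosine_similarity`, `flag_candidate_pairs`, and `batched` are hypothetical helpers introduced here. The GPT-3.5-turbo call itself is left out; the sketch only shows how flagged pairs could be grouped into batches before being sent for verification.

```python
import math
from itertools import combinations

# Approximate cutoff taken from the description above.
SIMILARITY_THRESHOLD = 0.60


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def flag_candidate_pairs(embeddings, threshold=SIMILARITY_THRESHOLD):
    """Return (i, j, score) for every pair scoring above the threshold."""
    flagged = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score > threshold:
            flagged.append((i, j, score))
    return flagged


def batched(items, batch_size):
    """Split the flagged pairs into fixed-size batches for verification."""
    return [items[k:k + batch_size] for k in range(0, len(items), batch_size)]


# Toy vectors standing in for real embeddings (hypothetical data).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
candidates = flag_candidate_pairs(toy_embeddings)
batches = batched(candidates, batch_size=20)
```

Comparing every pair is quadratic in the number of points, which is acceptable for a sketch but would likely need a vectorized or indexed approach for larger collections.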
I am going to store the vector embedding of every point I save to the database (that is, every point that isn’t a duplicate) in a database column on the individual Point, so I don’t have to recompute it the next time points are compared.
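One possible way to persist the embeddings is sketched below. It assumes SQLite and a JSON-encoded text column purely for illustration; the table name `points`, its columns, and the helper functions are hypothetical and not taken from the actual schema.

```python
import json
import sqlite3


def create_points_table(conn):
    """Create a hypothetical points table with an embedding column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points ("
        "id INTEGER PRIMARY KEY, "
        "text TEXT NOT NULL, "
        "embedding TEXT NOT NULL)"  # embedding stored as a JSON array
    )


def save_point(conn, text, embedding):
    """Insert a non-duplicate point together with its embedding."""
    cur = conn.execute(
        "INSERT INTO points (text, embedding) VALUES (?, ?)",
        (text, json.dumps(embedding)),
    )
    conn.commit()
    return cur.lastrowid


def load_embeddings(conn):
    """Load all stored embeddings, keyed by point id, without recomputing."""
    rows = conn.execute(
        "SELECT id, embedding FROM points ORDER BY id"
    ).fetchall()
    return {point_id: json.loads(raw) for point_id, raw in rows}


# Usage with an in-memory database and a toy embedding (hypothetical data).
conn = sqlite3.connect(":memory:")
create_points_table(conn)
point_id = save_point(conn, "Example point", [0.1, 0.2, 0.3])
stored = load_embeddings(conn)
```

When a new point arrives, only its own embedding would need to be computed; the stored vectors can be loaded and compared against it directly.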