The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
One issue I always run into when implementing these approaches is the embedding model's context window being too small to represent what I need.
For example, in this project, looking at the generation of training data [1], it seems what's actually being embedded is a string concatenated from each book's reviews, title, description, etc. [2]. With max_seq_length set to 200, wouldn't books with long reviews end up with their description text never being encoded? Wouldn't that result in queries failing to match potentially similar descriptions when the reviews are topically dissimilar (e.g., discussing the author's style, the book's flow, etc. instead of the plot)?
[1] https://github.com/veekaybee/viberary/blob/main/src/model/ge...
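The truncation concern above can be sketched with a toy example (a crude whitespace "tokenizer" stands in for the model's real one, and the review/description strings are made up for illustration):

```python
# Hypothetical illustration: if fields are concatenated review-first and the
# encoder truncates at max_seq_length tokens, the description may never be
# seen by the model at all.

MAX_SEQ_LENGTH = 200  # same limit as discussed above

def truncate_tokens(text: str, max_len: int = MAX_SEQ_LENGTH) -> list[str]:
    """Crude whitespace 'tokenizer' standing in for the model's real one."""
    return text.split()[:max_len]

long_review = "great pacing loved the style " * 60   # ~300 "tokens"
description = "a detective hunts a serial killer in 1890s Chicago"

concatenated = long_review + " " + description
kept = truncate_tokens(concatenated)

# Every description token falls past the 200-token cutoff, so none survive.
print("description encoded:", "detective" in kept)  # → description encoded: False
```

A query like "detective novel in Chicago" would then be matched only against the review text, never the description.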
e5-mistral is essentially a distillation from GPT-4 into a smaller model. You can see here https://github.com/microsoft/unilm/blob/16da2f193b9c1dab0a69... that they actually use custom prompts for each dataset being tested.
The question is: if you haven't seen the task before, what's a good prompt to prepend for your task?
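The per-task prompting pattern looks roughly like this (the "Instruct:/Query:" format follows e5-mistral's model card; the task descriptions and dataset keys here are hypothetical examples, not taken from the linked file):

```python
# Sketch of per-task instruction prompts for an instruction-tuned embedding
# model. Each benchmark dataset gets its own instruction string; for an
# unseen task you are left guessing which phrasing the model was tuned on.

def build_query(task_description: str, query: str) -> str:
    """Prepend a one-line task instruction, e5-mistral style."""
    return f"Instruct: {task_description}\nQuery: {query}"

# Hypothetical per-dataset instructions (illustrative only).
task_prompts = {
    "nq": "Given a question, retrieve Wikipedia passages that answer it",
    "books": "Given a query, retrieve relevant book descriptions",
}

print(build_query(task_prompts["books"], "detective novel set in Chicago"))
```

Since retrieval quality shifts with the instruction wording, a model whose prompts were tuned per benchmark dataset can look stronger on that benchmark than it is on a genuinely new task.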
IMO e5-mistral is overfit to MTEB