Semantic Search Engine
Semantic Search Engine for News Articles Using Sentence Embeddings
Client
Self Development
Year
2025
Category
Natural Language Processing (NLP)
Service
Automation & Efficiency Gains

Tools / Languages Used
- Python
- sentence-transformers (for embeddings)
- FAISS or Chroma (for efficient similarity search)
- pandas, numpy (data handling)
- streamlit or gradio (for interactive demo)
- Optional: Hugging Face Datasets API for loading text corpora
Technical Skills
- Text embedding generation using pretrained transformer models (Sentence-BERT)
- Vector similarity search (cosine similarity, FAISS index)
- Data preprocessing and normalization
- Building scalable text retrieval systems
- Model evaluation through relevance ranking metrics (precision@k, recall@k)
- Deployment of interactive NLP applications
Soft Skills
- Problem framing: Defined “relevance” and designed evaluation to measure it.
- Systems thinking: Architected a modular retrieval system separating embedding, storage, and query layers.
- Communication: Simplified complex NLP concepts (embeddings, vector similarity) for non-technical audiences.
- User experience: Designed a clean, intuitive search interface for demonstrations.
Step 1: Exploratory Data Analysis
- Collected ~10,000 news articles from a public dataset (e.g., AG News or BBC News).
- Inspected categories, word counts, and content diversity.
- Cleaned and normalized text (removed HTML, punctuation, and stopwords).
- Sample insights:
  - Articles from “Technology” and “Business” often overlapped semantically (shared topics like startups, AI, innovation).
  - Keyword-based search struggled with contextual matches (e.g., “AI chip” vs “semiconductor company”).
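The cleaning step above can be sketched as a small normalization function. This is an illustrative sketch, not the project's actual code: the `clean_text` name, the regex rules, and the tiny stopword set are assumptions (a real pipeline would use a fuller stopword list, e.g. NLTK's).

```python
import re

# Illustrative stopword list; a real pipeline would use a fuller set (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def clean_text(raw: str) -> str:
    """Strip HTML tags and punctuation, lowercase, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", raw)                # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # drop punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("<p>The AI chip market is growing!</p>"))  # → "ai chip market growing"
```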
Step 2: Solution Design
- Used Sentence-BERT (all-MiniLM-L6-v2) to create 384-dimensional sentence embeddings for each article.
- Stored embeddings in a FAISS index for efficient nearest-neighbor search.
- For a user query:
  - Encode the query into an embedding.
  - Compute cosine similarity with all article vectors.
  - Return the top-k most semantically similar results.
- Compared performance with simple TF-IDF cosine similarity to show the power of embeddings.
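The query path above boils down to a normalized dot product followed by a top-k sort. The sketch below shows that core logic in NumPy, with random vectors standing in for the 384-dimensional Sentence-BERT embeddings; in the real system FAISS (e.g. an inner-product index over normalized vectors) handles this at scale. The `top_k` function name and the toy data are assumptions for illustration.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most cosine-similar to the query."""
    # Normalize so the dot product equals cosine similarity
    # (equivalent to FAISS IndexFlatIP over L2-normalized vectors).
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(-sims)[:k]

# Toy 384-d embeddings standing in for Sentence-BERT output.
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384))
query = docs[42] + 0.1 * rng.normal(size=384)  # near-duplicate of document 42
print(top_k(query, docs))  # document 42 should rank first
```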
Step 3: Model Assessment
- Defined evaluation metric: Precision@k — proportion of retrieved articles that share the correct category with the query.
- Compared:
- TF-IDF similarity: 0.58 precision@5
- Sentence-BERT similarity: 0.82 precision@5
- Qualitative analysis showed the embedding model retrieved conceptually relevant articles even with different phrasing (e.g., query “spacecraft launch” matched “NASA delays rocket mission”).
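Precision@k as defined above (the share of top-k results whose category matches the query's) can be sketched in a few lines; the `precision_at_k` name and the toy labels are illustrative, not the project's actual evaluation code.

```python
def precision_at_k(retrieved_cats: list[str], query_cat: str, k: int = 5) -> float:
    """Fraction of the top-k retrieved articles sharing the query's category."""
    return sum(c == query_cat for c in retrieved_cats[:k]) / k

# Toy example: 3 of the top 5 results share the query's "tech" category.
print(precision_at_k(["tech", "tech", "business", "tech", "sport"], "tech"))  # → 0.6
```

In the real evaluation this is averaged over many held-out queries, which is how the 0.58 (TF-IDF) vs 0.82 (Sentence-BERT) figures were obtained.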
Step 4: Results / How It’s Used
Built for self-development; no further deployments are associated with this project.

