Semantic Similarity Estimator

Vector embeddings are the mathematical structures that allow LLMs to represent meaning as numbers. Words, phrases, and passages are positioned in high-dimensional space so that similar concepts cluster together.

This tool started as a way to make that concept tangible: vector embeddings are easier to grasp by playing with them than by reading about them. Paste in some keywords or two passages of text, and the tool embeds them with Google Gemini’s embedding model and returns a similarity score showing how closely related they are semantically. Use Keywords mode to compare up to 10 terms at once in a color-coded matrix, or Documents mode for a head-to-head score between two longer texts.

📊 Text Similarity Analyzer

Powered by Google Gemini! Compare the semantic similarity between keywords, phrases, paragraphs, or entire documents using state-of-the-art AI embeddings.

🔑 Google Gemini API Configuration

Get a FREE API key at Google AI Studio. Free tier: 1,500 requests per day!

📊 Understanding Similarity Scores

Cosine similarity ranges from -1 to 1:

  • 0.9–1.0: Nearly identical meaning
  • 0.7–0.9: Very similar concepts
  • 0.4–0.7: Moderately related
  • Below 0.4: Different or unrelated
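
The bands above can be written as a small lookup. This is a minimal sketch rather than the tool's own code: the function name `describe_score` is hypothetical, and since the listed ranges share their endpoints, assigning a boundary value such as 0.9 to the higher band is an assumption.

```python
def describe_score(score: float) -> str:
    """Map a cosine similarity score to the descriptive bands listed above."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    # Assumption: a value exactly on a boundary belongs to the higher band.
    if score >= 0.9:
        return "Nearly identical meaning"
    if score >= 0.7:
        return "Very similar concepts"
    if score >= 0.4:
        return "Moderately related"
    return "Different or unrelated"
```

For example, `describe_score(0.82)` falls in the "Very similar concepts" band.
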

What is a vector embedding?

A vector embedding is a list of numbers that represents the meaning of a piece of text. Language models learn to place similar concepts close together in that numerical space, which is what makes similarity measurement possible.

What does the similarity score actually mean?

Scores range from -1 to 1. Values close to 1 mean the two texts are semantically close; values near 0 mean they share little meaning; negative values mean the two vectors point in opposing directions, which is rare for text embeddings and does not reliably signal opposite meanings. In practice most comparisons fall between 0.3 and 0.95.
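
To make the -1 to 1 range concrete, here is a minimal sketch of how a cosine similarity score can be computed from two vectors. The function name `cosine_similarity` and the two-dimensional inputs in the usage note are illustrative assumptions; real embeddings have far more dimensions, and this is not necessarily how the tool computes its score.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With toy inputs, vectors pointing the same way, such as `[1, 2]` and `[2, 4]`, score close to 1; perpendicular vectors such as `[1, 0]` and `[0, 1]` score 0; and vectors pointing in opposite directions, such as `[1, 2]` and `[-1, -2]`, score close to -1.
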

What's the difference between Keywords mode and Documents mode?

Keywords mode generates a separate embedding for each term and shows every pairwise similarity in a matrix. Documents mode embeds each passage as a single vector and returns one score, which better reflects whole-text meaning.
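
The matrix of pairwise scores can be sketched as follows. The three-dimensional vectors in `toy_embeddings` are made up for illustration; in the tool they would come from the embedding model. The helper names `cosine` and `similarity_matrix` are likewise hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def similarity_matrix(vectors):
    """Score every vector against every other; the result is symmetric."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Hypothetical toy vectors, not real model output.
toy_embeddings = {
    "doctor": [0.9, 0.1, 0.0],
    "physician": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.2, 0.9],
}

terms = list(toy_embeddings)
matrix = similarity_matrix([toy_embeddings[t] for t in terms])

# Print one row per term, rounded for readability.
for term, row in zip(terms, matrix):
    print(term.ljust(10), " ".join(f"{score:5.2f}" for score in row))
```

The diagonal is always 1 because each term is compared with itself. The two-document comparison is the single-pair case of the same calculation: one call to `cosine` on the two document vectors.
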

Why do two very different-looking texts sometimes score higher than expected?

Embeddings capture semantic field, not just vocabulary. A text about a "physician" and one about a "doctor" will score very high even with zero word overlap.

What are the limitations of cosine similarity as a metric?

It measures the angle between two vectors, not their magnitude, so it captures directional similarity well but can miss differences in specificity or topic depth between texts of very different lengths.
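
A small sketch of what "angle, not magnitude" means in practice: scaling a vector changes its length but not its direction, so cosine similarity cannot tell the two apart, while Euclidean distance can. The vectors and the helper name `cosine` are illustrative assumptions, not output from any embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical vectors: `scaled` points the same way as `base` but is 5x longer.
base = [0.2, 0.4, 0.4]
scaled = [5 * x for x in base]

angle_score = cosine(base, scaled)   # depends on direction only
distance = math.dist(base, scaled)   # Euclidean distance, sensitive to magnitude

print(f"cosine similarity: {angle_score:.3f}")
print(f"euclidean distance: {distance:.3f}")
```

Cosine similarity treats the two vectors as essentially identical, while the Euclidean distance between them is clearly non-zero. Any information carried only by vector length is invisible to the score.
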