
Similarity Search

A Node.js module that performs word-order-independent similarity search on strings.

This module is built as a native addon that uses C code for fast similarity computations. Its similarity metric combines fuzzy matching, prefix matching, and word-level comparisons to find matches regardless of word order.

Installation

npm install similarity-search

Dependencies

  • Node.js (with node-gyp for building native addons)
  • nan (^2.22.2)
  • node-addon-api (^6.0.0)

Usage

const SimilaritySearch = require('similarity-search');

// Create a new search index with default capacity (500)
const index = new SimilaritySearch();

// Add strings to the index
index.addString('bio bizz');
index.addString('lightmix bizz btio substrate');
index.addString('bizz bio mix light');

// Add multiple strings at once
index.addStrings([
  'plant growth bio formula',
  'garden soil substrate'
]);

// Search the index with a query and similarity cutoff
const results = index.search('bio bizz', 0.2);

// Display results
results.forEach(match => {
  console.log(`${match.similarity.toFixed(2)}: ${match.string}`);
});

API

new SimilaritySearch([capacity])

Creates a new search index.

  • capacity (optional): Initial capacity for the index. Default: 500.
  • Returns: A new SimilaritySearch instance.

addString(str)

Adds a string to the index.

  • str: The string to add.
  • Returns: Boolean indicating success (true if successful, false otherwise).

addStrings(strings)

Adds multiple strings to the index.

  • strings: Array of strings to add.
  • Returns: Boolean indicating whether every string was added (true if all adds succeeded, false if any failed).

search(query, [cutoff])

Searches the index for strings similar to the query.

  • query: The search query.
  • cutoff (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
  • Returns: Array of matching results, sorted by similarity (descending). Each result is an object with:
    • string: The matching string
    • similarity: The similarity score (0.0 to 1.0)
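
Because results come back sorted by similarity in descending order, taking the best few matches is a matter of slicing the array. The sketch below shows one way to do that. The helper name topMatches and its limit parameter are assumptions made for illustration and are not part of the module; the helper relies only on the documented search(query, cutoff) call and the documented result shape.

```javascript
// Hypothetical helper (not part of similarity-search): returns at most
// `limit` results for `query`, with scores rounded to two decimals.
// `index` can be any object exposing search(query, cutoff) as documented.
function topMatches(index, query, cutoff = 0.2, limit = 5) {
  const results = index.search(query, cutoff);
  // search() is documented to return results sorted by similarity
  // (descending), so the first `limit` entries are the best ones.
  return results.slice(0, limit).map(({ string, similarity }) => ({
    string,
    similarity: Number(similarity.toFixed(2)),
  }));
}
```

With an index built as in the Usage section, topMatches(index, 'bio bizz', 0.2, 3) would return up to three of the closest strings.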

size()

Gets the number of strings in the index.

  • Returns: Number of strings in the index.

Helper Functions

SimilaritySearch.createTestIndex([size])

Creates a test index with random data.

  • size (optional): Number of strings to generate. Default: 500.
  • Returns: A new SimilaritySearch instance with random data.
  • Note: The first 5 strings are fixed test cases, followed by randomly generated strings.

SimilaritySearch.benchmark(index, queries, [cutoff])

Benchmarks the search performance.

  • index: The index to benchmark.
  • queries: Array of search queries.
  • cutoff (optional): Similarity threshold. Default: 0.2.
  • Returns: Array of benchmark results, each containing:
    • query: The search query
    • matches: Number of matches found
    • timeMs: Search time in milliseconds
    • topResults: Top 5 matching results
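
To make the result shape concrete, here is a minimal sketch of a timing loop that produces objects with the same four fields. It is not the module's implementation of benchmark, which may measure differently; the function name timeQueries is an assumption, and the only module behavior it relies on is the documented search(query, cutoff) call.

```javascript
const { performance } = require('perf_hooks');

// Hypothetical stand-in for illustration only: times each query against
// any object exposing search(query, cutoff) and reports the documented
// fields (query, matches, timeMs, topResults).
function timeQueries(index, queries, cutoff = 0.2) {
  return queries.map(query => {
    const start = performance.now();
    const results = index.search(query, cutoff);
    const timeMs = performance.now() - start;
    return {
      query,
      matches: results.length,          // number of matches found
      timeMs,                           // elapsed wall-clock time in ms
      topResults: results.slice(0, 5),  // top 5 matching results
    };
  });
}
```

A single timed run is noisy, so for comparisons it is usually worth repeating each query several times and averaging.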

How It Works

The similarity search uses a multi-stage matching algorithm:

  1. Word-level Matching: The algorithm first splits both the query and target strings into words.

  2. Word Similarity Calculation: For each word pair, similarity is calculated using:

    • Levenshtein distance for fuzzy matching
    • Special handling for short words (words of 3 characters or fewer require an exact match)
    • Prefix matching for significantly different length words
    • Length-based similarity adjustments

  3. Overall Similarity Score: The final similarity score is a weighted combination of:

    • Word match score (70% weight): Percentage of query words that have a good match
    • Average word similarity (30% weight): Average similarity of the best matching word pairs

This approach provides robust matching that:

  • Handles typos and slight variations in words
  • Requires exact matches for short words to avoid false positives
  • Recognizes prefix matches (e.g., "bio" matches "biology")
  • Considers both word presence and character-level similarity
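
The stages described above can be sketched in plain JavaScript. This is an illustration under stated assumptions, not the native C implementation: the 70/30 weights and the 3-character rule come from the description, while the prefix score (0.8), the minimum length gap for a prefix match (3), the "good match" threshold (0.7), and the order in which the rules are applied are all guesses chosen to make the example concrete.

```javascript
// Assumed constants; the real addon may use different values.
const PREFIX_SCORE = 0.8;   // score given to a prefix match
const MIN_PREFIX_GAP = 3;   // "significantly different length"
const GOOD_MATCH = 0.7;     // threshold for counting a word as matched

// Standard Levenshtein edit distance using two rolling rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity of a single word pair, in the range 0..1.
function wordSimilarity(a, b) {
  if (a === b) return 1;
  const [shorter, longer] = a.length <= b.length ? [a, b] : [b, a];
  // Prefix rule, checked first so that "bio" can still match "biology".
  if (longer.length - shorter.length >= MIN_PREFIX_GAP && longer.startsWith(shorter)) {
    return PREFIX_SCORE;
  }
  // Short words otherwise require an exact match.
  if (shorter.length <= 3) return 0;
  // Fuzzy match: edit distance normalized by the longer word's length.
  return 1 - levenshtein(a, b) / longer.length;
}

// Overall score: 70% share of query words with a good match,
// 30% average similarity of each query word's best pairing.
function similarity(query, target) {
  const split = s => s.toLowerCase().split(/\s+/).filter(Boolean);
  const queryWords = split(query);
  const targetWords = split(target);
  if (queryWords.length === 0 || targetWords.length === 0) return 0;
  const best = queryWords.map(q => Math.max(...targetWords.map(t => wordSimilarity(q, t))));
  const matchScore = best.filter(s => s >= GOOD_MATCH).length / best.length;
  const avgSimilarity = best.reduce((sum, s) => sum + s, 0) / best.length;
  return 0.7 * matchScore + 0.3 * avgSimilarity;
}
```

Because each query word is compared against every target word and only the best pairing is kept, similarity('bizz bio', 'bio bizz') and similarity('bio bizz', 'bio bizz') give the same score in this sketch, which is the word-order independence the module is built around.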