Similarity Search

A Node.js module that performs word-order-independent similarity search on strings.

The module is built as a native addon, with the similarity computations written in C for speed. It scores candidates by the Jaccard similarity between word sets, so matches are found regardless of word order.

Installation

npm install

Usage

const SimilaritySearch = require('./index');

// Create a new search index with default capacity (500)
const index = new SimilaritySearch();

// Add strings to the index
index.addString('bio bizz');
index.addString('lightmix bizz btio substrate');
index.addString('bizz bio mix light');

// Add multiple strings at once
index.addStrings([
  'plant growth bio formula',
  'garden soil substrate'
]);

// Search the index with a query and similarity cutoff
const results = index.search('bio bizz', 0.2);

// Display results
results.forEach(match => {
  console.log(`${match.similarity.toFixed(2)}: ${match.string}`);
});

API

new SimilaritySearch([capacity])

Creates a new search index.

  • capacity (optional): Initial capacity for the index. Default: 500.

addString(str)

Adds a string to the index.

  • str: The string to add.
  • Returns: Boolean indicating success.

addStrings(strings)

Adds multiple strings to the index.

  • strings: Array of strings to add.
  • Returns: Boolean indicating if all adds were successful.

search(query, [cutoff])

Searches the index for strings similar to the query.

  • query: The search query.
  • cutoff (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
  • Returns: Array of match objects, sorted by similarity (descending). As in the usage example, each match exposes the matched text as string and its score as similarity.
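
To make the cutoff and ordering concrete, here is a minimal pure-JavaScript sketch of a search that behaves as described above. It is an illustration, not the addon's code: the names referenceSearch and wordSet are hypothetical, whitespace tokenization is assumed, and treating the cutoff as inclusive (similarity >= cutoff) is also an assumption, since the boundary behavior is not documented here.

```javascript
// Illustrative sketch only; not the native implementation.
function wordSet(str) {
  // Assumption: words are whitespace-separated and compared case-sensitively.
  return new Set(str.split(/\s+/).filter(Boolean));
}

function referenceSearch(strings, query, cutoff = 0.2) {
  const queryWords = wordSet(query);
  const matches = [];
  for (const string of strings) {
    const words = wordSet(string);
    // Count the words present in both the query and the candidate.
    let shared = 0;
    for (const w of queryWords) {
      if (words.has(w)) shared += 1;
    }
    const union = queryWords.size + words.size - shared;
    const similarity = union === 0 ? 0 : shared / union;
    // Assumption: the cutoff is inclusive.
    if (similarity >= cutoff) matches.push({ string, similarity });
  }
  // Highest similarity first.
  return matches.sort((x, y) => y.similarity - x.similarity);
}

const sketchResults = referenceSearch(
  ['bizz bio mix light', 'garden soil substrate', 'bio bizz'],
  'bio bizz',
  0.2
);
console.log(sketchResults);
```

In this run, 'garden soil substrate' shares no words with the query and is dropped, while the two remaining strings come back with the exact match first.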

size()

Gets the number of strings in the index.

  • Returns: Number of strings in the index.

Helper Functions

SimilaritySearch.createTestIndex([size])

Creates a test index with random data.

  • size (optional): Number of strings to generate. Default: 500.
  • Returns: A new SimilaritySearch instance with random data.

SimilaritySearch.benchmark(index, queries, [cutoff])

Benchmarks the search performance.

  • index: The index to benchmark.
  • queries: Array of search queries.
  • cutoff (optional): Similarity threshold. Default: 0.2.
  • Returns: Benchmark results.
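
The shape of the benchmark results is not documented here. For readers who want to time searches by hand, the following is a rough sketch of one way to do it. The helper timeQueries and the fields it returns are hypothetical and are not part of this module's API; a stand-in search function is used so the example runs without the addon.

```javascript
// Hypothetical helper; not part of the module's API.
function timeQueries(searchFn, queries) {
  const start = process.hrtime.bigint();
  let totalMatches = 0;
  for (const query of queries) {
    totalMatches += searchFn(query).length;
  }
  // Convert elapsed nanoseconds to milliseconds.
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return {
    queries: queries.length,
    totalMatches,
    elapsedMs,
    avgMsPerQuery: queries.length ? elapsedMs / queries.length : 0,
  };
}

// Stand-in search function (substring match) so the sketch is self-contained.
const corpus = ['bio bizz', 'bizz bio mix light'];
const fakeSearch = (query) => corpus.filter((s) => s.includes(query));
console.log(timeQueries(fakeSearch, ['bio', 'bizz', 'soil']));
```

With the real module, searchFn could be a wrapper such as (q) => index.search(q, 0.2).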

How It Works

The similarity search uses Jaccard similarity between word sets:

similarity = (number of unique words shared by both strings) / (number of unique words appearing in either string)

This means word order doesn't matter: "bio bizz" matches "bizz bio" with a similarity of 1.0 (100%).
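
As a rough illustration of the formula above, the following pure-JavaScript sketch computes Jaccard similarity over whitespace-separated words. It is not the addon's C implementation: the function name jaccardSimilarity and the tokenization rules (split on whitespace, case-sensitive, two empty strings score 0) are assumptions made for the example.

```javascript
// Illustrative sketch only; the native addon may tokenize and score differently.
function jaccardSimilarity(a, b) {
  // Assumption: words are separated by whitespace and compared case-sensitively.
  const wordsA = new Set(a.split(/\s+/).filter(Boolean));
  const wordsB = new Set(b.split(/\s+/).filter(Boolean));

  // Count the words that appear in both sets.
  let shared = 0;
  for (const word of wordsA) {
    if (wordsB.has(word)) shared += 1;
  }

  // Size of the union: |A| + |B| - |A intersect B|.
  const union = wordsA.size + wordsB.size - shared;
  return union === 0 ? 0 : shared / union;
}

console.log(jaccardSimilarity('bio bizz', 'bizz bio'));           // 1
console.log(jaccardSimilarity('bio bizz', 'bizz bio mix light')); // 0.5
```

In the second call, both query words appear in the candidate, but the candidate adds two more unique words, so the score is 2 / 4.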

Building

To rebuild the native addon:

npm install

Testing

Run the test script:

npm test