seb fb8f61f868 Fix node-gyp compatibility: move to dependencies for proper version control

- Move node-gyp from devDependencies to dependencies
- Ensures v11.2.0 is used when installed as dependency
- Fixes Visual Studio detection issues in consuming projects
- Resolves shopApi build failures with old node-gyp v8.4.1

2025-06-27 03:37:33 +02:00

.vscode

genesis

2025-04-18 08:22:35 +02:00

.gitignore

Update .gitignore to ignore npm lock file and tidy entry

2025-06-23 07:09:30 +02:00

binding.gyp

genesis

2025-04-18 08:22:35 +02:00

index.js

genesis

2025-04-18 08:22:35 +02:00

package.json

Fix node-gyp compatibility: move to dependencies for proper version control

2025-06-27 03:37:33 +02:00

README.md

Update README.md to enhance documentation on similarity metrics and usage. Add dependencies section and clarify return values for methods. Modify test.js to reflect updated search scenarios and improve clarity in search results.

2025-04-18 11:03:07 +02:00

similarity_search_addon.cc

Enhance search index handling by returning an empty array for no results instead of throwing an error. Improve memory management in free_words function by checking for NULL before freeing. Update search_index to properly return NULL when no results are found.

2025-06-27 03:15:45 +02:00

similarity_search.c

2025-06-27 03:15:45 +02:00

similarity_search.h

Increase default capacity in SearchIndexWrapper and enhance similarity calculation in calculate_similarity function to boost similarity score when all query words are found. Update MAX_WORDS and MAX_STRING_LEN definitions for improved handling.

2025-04-18 09:16:26 +02:00

test.js

2025-04-18 11:03:07 +02:00

README.md

Similarity Search

A Node.js module that performs word-order-independent similarity search on strings.

This module is built as a native addon that uses C code for fast similarity computations. Its similarity metric combines fuzzy matching, prefix matching, and word-level comparisons to find matches regardless of word order.

Installation

npm install similarity-search

Dependencies

Node.js (with node-gyp for building native addons)
nan (^2.22.2)
node-addon-api (^6.0.0)

Usage

const SimilaritySearch = require('similarity-search');

// Create a new search index with default capacity (500)
const index = new SimilaritySearch();

// Add strings to the index
index.addString('bio bizz');
index.addString('lightmix bizz btio substrate');
index.addString('bizz bio mix light');

// Add multiple strings at once
index.addStrings([
  'plant growth bio formula',
  'garden soil substrate'
]);

// Search the index with a query and similarity cutoff
const results = index.search('bio bizz', 0.2);

// Display results
results.forEach(match => {
  console.log(`${match.similarity.toFixed(2)}: ${match.string}`);
});

API

`new SimilaritySearch([capacity])`

Creates a new search index.

capacity (optional): Initial capacity for the index. Default: 500.
Returns: A new SimilaritySearch instance.

`addString(str)`

Adds a string to the index.

str: The string to add.
Returns: Boolean, true if the string was added and false otherwise.

`addStrings(strings)`

Adds multiple strings to the index.

strings: Array of strings to add.
Returns: Boolean, true if every string was added and false if any add failed.

`search(query, [cutoff])`

Searches the index for strings similar to the query.

query: The search query.
cutoff (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
Returns: Array of matching results, sorted by similarity (descending). The array is empty when nothing matches. Each result is an object with:
- string: The matching string
- similarity: The similarity score (0.0 to 1.0)
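
The cutoff and ordering of `search` results can be illustrated without the native addon. The snippet below is a minimal sketch, not the addon's code: `selectMatches` is a hypothetical helper, the scores are made-up values, and treating the cutoff as inclusive is an assumption.

```javascript
// Hypothetical helper, not part of the module's API.
// Keeps candidates whose score meets the cutoff and sorts them best-first.
function selectMatches(scored, cutoff = 0.2) {
  return scored
    .filter(({ similarity }) => similarity >= cutoff) // inclusive cutoff is an assumption
    .sort((a, b) => b.similarity - a.similarity);
}

// Made-up scores in the same { string, similarity } shape that search() returns.
const scored = [
  { string: 'garden soil substrate', similarity: 0.0 },
  { string: 'bizz bio mix light', similarity: 0.85 },
  { string: 'bio bizz', similarity: 1.0 },
];

const selected = selectMatches(scored, 0.2);
selected.forEach(match => {
  console.log(`${match.similarity.toFixed(2)}: ${match.string}`);
});
```

Raising the cutoff narrows the result list; a cutoff above every score leaves an empty array.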

`size()`

Gets the number of strings in the index.

Returns: Number of strings in the index.

Helper Functions

`SimilaritySearch.createTestIndex([size])`

Creates a test index with random data.

size (optional): Number of strings to generate. Default: 500.
Returns: A new SimilaritySearch instance with random data.
Note: The first 5 strings are fixed test cases, followed by randomly generated strings.

`SimilaritySearch.benchmark(index, queries, [cutoff])`

Benchmarks the search performance.

index: The index to benchmark.
queries: Array of search queries.
cutoff (optional): Similarity threshold. Default: 0.2.
Returns: Array of benchmark results, each containing:
- query: The search query
- matches: Number of matches found
- timeMs: Search time in milliseconds
- topResults: Top 5 matching results
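
To make the result shape concrete, here is a rough sketch of a benchmark loop that produces the same fields. It is an illustration under stated assumptions, not the module's implementation: `benchmarkSketch` and `fakeIndex` are hypothetical names, and the stand-in index uses a toy exact-word scorer instead of the native metric.

```javascript
// Hypothetical stand-in for a real index: any object with search(query, cutoff).
const fakeIndex = {
  strings: ['bio bizz', 'bizz bio mix light', 'garden soil substrate'],
  search(query, cutoff) {
    // Toy scorer: fraction of query words found verbatim in the target.
    const words = query.split(/\s+/).filter(Boolean);
    return this.strings
      .map(string => {
        const target = new Set(string.split(/\s+/));
        const hits = words.filter(w => target.has(w)).length;
        return { string, similarity: words.length ? hits / words.length : 0 };
      })
      .filter(r => r.similarity >= cutoff)
      .sort((a, b) => b.similarity - a.similarity);
  },
};

// Times each query and reports it in the shape described above.
// Uses the global performance.now(), available in recent Node.js versions.
function benchmarkSketch(index, queries, cutoff = 0.2) {
  return queries.map(query => {
    const start = performance.now();
    const results = index.search(query, cutoff);
    const timeMs = performance.now() - start;
    return { query, matches: results.length, timeMs, topResults: results.slice(0, 5) };
  });
}

const report = benchmarkSketch(fakeIndex, ['bio bizz', 'soil']);
report.forEach(r => console.log(`${r.query}: ${r.matches} matches in ${r.timeMs.toFixed(3)} ms`));
```

Timings from a single run are noisy; repeating each query several times and averaging gives steadier numbers.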

How It Works

The similarity search uses a multi-stage matching algorithm:

1. Word-level Matching: The algorithm first splits both the query and target strings into words.
2. Word Similarity Calculation: For each word pair, similarity is calculated using:
   - Levenshtein distance for fuzzy matching
   - Special handling for short words (3 characters or fewer require an exact match)
   - Prefix matching for words of significantly different lengths
   - Length-based similarity adjustments
3. Overall Similarity Score: The final similarity score is a weighted combination of:
   - Word match score (70% weight): Percentage of query words that have a good match
   - Average word similarity (30% weight): Average similarity of the best matching word pairs
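
The stages above can be sketched in plain JavaScript. This is an illustrative approximation, not the addon's C code: only the 70/30 weights and the 3-character rule come from this README. The `GOOD_MATCH` and `PREFIX_SCORE` values, the order in which the rules are applied, and averaging over all query words are assumptions, and the length-based adjustments are left out.

```javascript
// Classic dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

const SHORT_WORD_LEN = 3; // from the README: short words need an exact match
const GOOD_MATCH = 0.7;   // assumed threshold for a "good" word match
const PREFIX_SCORE = 0.8; // assumed score for a prefix match

function wordSimilarity(q, t) {
  if (q === t) return 1;
  // Prefix rule, e.g. "bio" vs "biology" (assumed to run before the short-word rule).
  if (t.startsWith(q) || q.startsWith(t)) return PREFIX_SCORE;
  // Short words get no fuzzy matching.
  if (Math.min(q.length, t.length) <= SHORT_WORD_LEN) return 0;
  return 1 - levenshtein(q, t) / Math.max(q.length, t.length);
}

function similarity(query, target) {
  const qWords = query.split(/\s+/).filter(Boolean);
  const tWords = target.split(/\s+/).filter(Boolean);
  if (qWords.length === 0 || tWords.length === 0) return 0;
  // Best-matching target word for each query word, regardless of position.
  const best = qWords.map(q => Math.max(...tWords.map(t => wordSimilarity(q, t))));
  const matchScore = best.filter(s => s >= GOOD_MATCH).length / best.length;
  const avgSimilarity = best.reduce((sum, s) => sum + s, 0) / best.length;
  return 0.7 * matchScore + 0.3 * avgSimilarity;
}

console.log(similarity('bio bizz', 'bizz bio mix light').toFixed(2));
console.log(similarity('bio bizz', 'garden soil substrate').toFixed(2));
```

Because each query word is compared against every target word and only the best pairing is kept, reordering the words of either string leaves the score unchanged.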

This approach provides matching that:

- Handles typos and slight variations in words
- Requires exact matches for short words to avoid false positives
- Recognizes prefix matches (e.g., "bio" matches "biology")
- Considers both word presence and character-level similarity