# Similarity Search

A Node.js module that performs word order independent similarity search on strings. This module is built as a native addon that uses C code for fast similarity computations. It uses a sophisticated similarity metric that combines fuzzy matching, prefix matching, and word-level comparisons to find matches regardless of word order.

## Installation

```bash
npm install similarity-search
```

## Dependencies

- Node.js (with node-gyp for building native addons)
- nan (^2.22.2)
- node-addon-api (^6.0.0)

## Usage

```javascript
const SimilaritySearch = require('similarity-search');

// Create a new search index with default capacity (500)
const index = new SimilaritySearch();

// Add strings to the index
index.addString('bio bizz');
index.addString('lightmix bizz btio substrate');
index.addString('bizz bio mix light');

// Add multiple strings at once
index.addStrings([
  'plant growth bio formula',
  'garden soil substrate'
]);

// Search the index with a query and similarity cutoff
const results = index.search('bio bizz', 0.2);

// Display results
results.forEach(match => {
  console.log(`${match.similarity.toFixed(2)}: ${match.string}`);
});
```

## API

### `new SimilaritySearch([capacity])`

Creates a new search index.

- `capacity` (optional): Initial capacity for the index. Default: 500.
- Returns: A new SimilaritySearch instance.

### `addString(str)`

Adds a string to the index.

- `str`: The string to add.
- Returns: Boolean indicating success (true if successful, false otherwise).

### `addStrings(strings)`

Adds multiple strings to the index.

- `strings`: Array of strings to add.
- Returns: Boolean indicating if all adds were successful (true if all successful, false if any failed).

### `search(query, [cutoff])`

Searches the index for strings similar to the query.

- `query`: The search query.
- `cutoff` (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
- Returns: Array of matching results, sorted by similarity (descending).
  Each result is an object with:
  - `string`: The matching string
  - `similarity`: The similarity score (0.0 to 1.0)

### `size()`

Gets the number of strings in the index.

- Returns: Number of strings in the index.

## Helper Functions

### `SimilaritySearch.createTestIndex([size])`

Creates a test index with random data.

- `size` (optional): Number of strings to generate. Default: 500.
- Returns: A new SimilaritySearch instance with random data.
- Note: The first 5 strings are fixed test cases, followed by randomly generated strings.

### `SimilaritySearch.benchmark(index, queries, [cutoff])`

Benchmarks the search performance.

- `index`: The index to benchmark.
- `queries`: Array of search queries.
- `cutoff` (optional): Similarity threshold. Default: 0.2.
- Returns: Array of benchmark results, each containing:
  - `query`: The search query
  - `matches`: Number of matches found
  - `timeMs`: Search time in milliseconds
  - `topResults`: Top 5 matching results

## How It Works

The similarity search uses a sophisticated multi-stage matching algorithm:

1. **Word-level Matching**: The algorithm first splits both the query and target strings into words.

2. **Word Similarity Calculation**: For each word pair, similarity is calculated using:
   - Levenshtein distance for fuzzy matching
   - Special handling for short words (3 chars or less require exact match)
   - Prefix matching for significantly different length words
   - Length-based similarity adjustments

3. **Overall Similarity Score**: The final similarity score is a weighted combination of:
   - Word match score (70% weight): Percentage of query words that have a good match
   - Average word similarity (30% weight): Average similarity of the best matching word pairs

This approach provides robust matching that:

- Handles typos and slight variations in words
- Requires exact matches for short words to avoid false positives
- Recognizes prefix matches (e.g., "bio" matches "biology")
- Considers both word presence and character-level similarity
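
### Scoring sketch

The stages described above can be approximated in plain JavaScript. This is only a rough sketch, not the C code the addon runs. The names `GOOD_MATCH`, `PREFIX_SCORE`, `levenshtein`, `wordSimilarity`, and `similarity` are hypothetical, and so are the threshold values, the lowercasing step, and the choice to check prefixes before applying the short-word rule. Only the 70/30 weighting and the 3-character rule come from the description above.

```javascript
// Illustrative sketch only: constants and rule ordering are assumptions,
// not values taken from the native addon.
const GOOD_MATCH = 0.7;   // assumed threshold for a "good" word match
const PREFIX_SCORE = 0.8; // assumed score awarded to a prefix match

// Classic two-row Levenshtein edit distance.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity of two single words, in the range 0.0 to 1.0.
function wordSimilarity(a, b) {
  if (a === b) return 1;
  const [shorter, longer] = a.length <= b.length ? [a, b] : [b, a];
  // Prefix match when the lengths differ significantly (assumed: at least 2x).
  if (longer.length >= shorter.length * 2 && longer.startsWith(shorter)) {
    return PREFIX_SCORE;
  }
  // Words of 3 characters or fewer must match exactly.
  if (shorter.length <= 3) return 0;
  // Fuzzy match: edit distance normalized by the longer word's length.
  return 1 - levenshtein(a, b) / longer.length;
}

// Word-order-independent similarity between a query and a target string.
function similarity(query, target) {
  // Lowercasing is an assumption; the addon may be case-sensitive.
  const qWords = query.toLowerCase().split(/\s+/).filter(Boolean);
  const tWords = target.toLowerCase().split(/\s+/).filter(Boolean);
  if (qWords.length === 0 || tWords.length === 0) return 0;

  let goodMatches = 0;
  let bestTotal = 0;
  for (const q of qWords) {
    // Best-scoring target word for this query word, wherever it appears.
    const best = Math.max(...tWords.map(t => wordSimilarity(q, t)));
    if (best >= GOOD_MATCH) goodMatches++;
    bestTotal += best;
  }
  // 70% word match score, 30% average best word similarity.
  return 0.7 * (goodMatches / qWords.length) + 0.3 * (bestTotal / qWords.length);
}

console.log(similarity('bio bizz', 'bizz bio mix light').toFixed(2)); // prints "1.00"
```

Because each query word is compared against every target word and only the best score is kept, the position of a word in either string does not affect the result. Scores from this sketch will not necessarily match the values returned by `search()`.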