diff --git a/README.md b/README.md
index 10309d2..e063086 100644
--- a/README.md
+++ b/README.md
@@ -2,18 +2,24 @@
 
 A Node.js module that performs word order independent similarity search on strings.
 
-This module is built as a native addon that uses C code for fast similarity computations. It uses Jaccard similarity between word sets to find matches regardless of word order.
+This module is built as a native addon that uses C code for fast similarity computations. It uses a sophisticated similarity metric that combines fuzzy matching, prefix matching, and word-level comparisons to find matches regardless of word order.
 
 ## Installation
 
 ```bash
-npm install
+npm install similarity-search
 ```
 
+## Dependencies
+
+- Node.js (with node-gyp for building native addons)
+- nan (^2.22.2)
+- node-addon-api (^6.0.0)
+
 ## Usage
 
 ```javascript
-const SimilaritySearch = require('./index');
+const SimilaritySearch = require('similarity-search');
 
 // Create a new search index with default capacity (500)
 const index = new SimilaritySearch();
@@ -45,20 +51,21 @@ results.forEach(match => {
 Creates a new search index.
 
 - `capacity` (optional): Initial capacity for the index. Default: 500.
+- Returns: A new SimilaritySearch instance.
 
 ### `addString(str)`
 
 Adds a string to the index.
 
 - `str`: The string to add.
-- Returns: Boolean indicating success.
+- Returns: Boolean indicating success (true if successful, false otherwise).
 
 ### `addStrings(strings)`
 
 Adds multiple strings to the index.
 
 - `strings`: Array of strings to add.
-- Returns: Boolean indicating if all adds were successful.
+- Returns: Boolean indicating if all adds were successful (true if all successful, false if any failed).
 
 ### `search(query, [cutoff])`
 
@@ -66,7 +73,9 @@ Searches the index for strings similar to the query.
 
 - `query`: The search query.
 - `cutoff` (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
-- Returns: Array of matching results, sorted by similarity (descending).
+- Returns: Array of matching results, sorted by similarity (descending). Each result is an object with:
+  - `string`: The matching string
+  - `similarity`: The similarity score (0.0 to 1.0)
 
 ### `size()`
 
@@ -82,6 +91,7 @@ Creates a test index with random data.
 
 - `size` (optional): Number of strings to generate. Default: 500.
 - Returns: A new SimilaritySearch instance with random data.
+- Note: The first 5 strings are fixed test cases, followed by randomly generated strings.
 
 ### `SimilaritySearch.benchmark(index, queries, [cutoff])`
 
@@ -90,30 +100,31 @@ Benchmarks the search performance.
 - `index`: The index to benchmark.
 - `queries`: Array of search queries.
 - `cutoff` (optional): Similarity threshold. Default: 0.2.
-- Returns: Benchmark results.
+- Returns: Array of benchmark results, each containing:
+  - `query`: The search query
+  - `matches`: Number of matches found
+  - `timeMs`: Search time in milliseconds
+  - `topResults`: Top 5 matching results
 
 ## How It Works
 
-The similarity search uses Jaccard similarity between word sets:
+The similarity search uses a sophisticated multi-stage matching algorithm:
 
-```
-similarity = (number of matching words) / (total unique words)
-```
+1. **Word-level Matching**: The algorithm first splits both the query and target strings into words.
 
-This means word order doesn't matter - "bio bizz" will match with "bizz bio" with 100% similarity.
+2. **Word Similarity Calculation**: For each word pair, similarity is calculated using:
+   - Levenshtein distance for fuzzy matching
+   - Special handling for short words (3 chars or less require exact match)
+   - Prefix matching for significantly different length words
+   - Length-based similarity adjustments
 
-## Building
+3. **Overall Similarity Score**: The final similarity score is a weighted combination of:
+   - Word match score (70% weight): Percentage of query words that have a good match
+   - Average word similarity (30% weight): Average similarity of the best matching word pairs
 
-To rebuild the native addon:
+This approach provides robust matching that:
+- Handles typos and slight variations in words
+- Requires exact matches for short words to avoid false positives
+- Recognizes prefix matches (e.g., "bio" matches "biology")
+- Considers both word presence and character-level similarity
 
-```bash
-npm install
-```
-
-## Testing
-
-Run the test script:
-
-```bash
-npm test
-```
\ No newline at end of file
diff --git a/test.js b/test.js
index d652b0d..f0d7ccc 100644
--- a/test.js
+++ b/test.js
@@ -44,7 +44,7 @@ customIndex.addString('bizz bio mix light');
 customIndex.addStrings([
   'plant growth bio formula',
   'garden soil substrate',
-  'plagron light mix',
+  'plagron lightmix',
   'Anesia Seeds Imperium X Auto 10',
   'anesi'
 ]);
@@ -57,8 +57,8 @@ const results = customIndex.search('amnesia haze', 0.1);
 results.forEach(match => {
   console.log(` ${match.similarity.toFixed(2)}: ${match.string}`);
 });
-console.log('\nSearching with higher similarity threshold (0.1) for "lightmix":');
-const results2 = customIndex.search('lightmix', 0.1);
+console.log('\nSearching with higher similarity threshold (0.1) for "mix light":');
+const results2 = customIndex.search('mix light', 0.1);
 results2.forEach(match => {
   console.log(` ${match.similarity.toFixed(2)}: ${match.string}`);
 });
\ No newline at end of file
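
Review note: the "How It Works" text added to README.md describes the scoring only in prose, so a rough JavaScript sketch of one possible reading may help when checking the test.js changes. This is not the addon's C implementation. The README does not give a "good match" threshold or a prefix score, so `GOOD_MATCH` and `PREFIX_SCORE` are hypothetical values. Lowercasing is also an assumption, as is checking prefixes before the short-word rule. The "length-based similarity adjustments" are left out because the README gives no detail on them.

```javascript
// Hypothetical constants: the README does not state these values.
const GOOD_MATCH = 0.7;    // assumed threshold for a "good" word match
const PREFIX_SCORE = 0.8;  // assumed score when one word is a prefix of the other
const SHORT_WORD_MAX = 3;  // from the README: words of 3 chars or less need an exact match

// Standard dynamic-programming Levenshtein distance, keeping two rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity of two single words, in the range 0.0 to 1.0.
function wordSimilarity(a, b) {
  if (a === b) return 1;
  const [shorter, longer] = a.length <= b.length ? [a, b] : [b, a];
  // Assumed ordering: prefix check first, so "bio" can match "biology".
  if (shorter.length > 0 && longer.startsWith(shorter)) return PREFIX_SCORE;
  // Short words get no fuzzy matching.
  if (shorter.length <= SHORT_WORD_MAX) return 0;
  return 1 - levenshtein(a, b) / longer.length;
}

// Assumed tokenization: lowercase, split on whitespace.
function tokenize(s) {
  return s.toLowerCase().split(/\s+/).filter(Boolean);
}

// 70% share of query words with a good match, 30% average best-pair similarity.
function similarity(query, target) {
  const qWords = tokenize(query);
  const tWords = tokenize(target);
  if (qWords.length === 0 || tWords.length === 0) return 0;
  let good = 0;
  let total = 0;
  for (const q of qWords) {
    const best = Math.max(...tWords.map(t => wordSimilarity(q, t)));
    if (best >= GOOD_MATCH) good++;
    total += best;
  }
  return 0.7 * (good / qWords.length) + 0.3 * (total / qWords.length);
}

console.log(similarity('mix light', 'bizz bio mix light').toFixed(2)); // 1.00
console.log(similarity('mix light', 'plagron lightmix').toFixed(2));   // 0.47
```

Under these assumptions, the new `'mix light'` query still matches `'plagron lightmix'` through the prefix rule on "light", while "mix" contributes nothing because it is a short word with no exact match. The scores printed by the real addon may differ, since its thresholds and adjustments are not documented in this diff.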