Update README.md to document the similarity metric and usage in more detail. Add a Dependencies section and clarify the return values of each method. Update test.js so the indexed strings and search query reflect the new search scenarios.

2025-04-18 11:03:07 +02:00
parent 92a7bad2b6
commit 0dd17b794f
2 changed files with 39 additions and 28 deletions
--- a/README.md
+++ b/README.md
@@ -2,18 +2,24 @@

 A Node.js module that performs word order independent similarity search on strings.

-This module is built as a native addon that uses C code for fast similarity computations. It uses Jaccard similarity between word sets to find matches regardless of word order.
+This module is built as a native addon that uses C code for fast similarity computations. Its similarity metric combines fuzzy matching (Levenshtein distance), prefix matching, and word-level comparisons to find matches regardless of word order.

 ## Installation

 ```bash
-npm install
+npm install similarity-search
 ```

+## Dependencies
+
+- Node.js (with node-gyp for building native addons)
+- nan (^2.22.2)
+- node-addon-api (^6.0.0)
+
 ## Usage

 ```javascript
-const SimilaritySearch = require('./index');
+const SimilaritySearch = require('similarity-search');

 // Create a new search index with default capacity (500)
 const index = new SimilaritySearch();
@@ -45,20 +51,21 @@ results.forEach(match => {
 Creates a new search index.

 - `capacity` (optional): Initial capacity for the index. Default: 500.
+- Returns: A new SimilaritySearch instance.

 ### `addString(str)`

 Adds a string to the index.

 - `str`: The string to add.
- Returns: Boolean indicating success.
+- Returns: Boolean indicating success (true if successful, false otherwise).

 ### `addStrings(strings)`

 Adds multiple strings to the index.

 - `strings`: Array of strings to add.
- Returns: Boolean indicating if all adds were successful.
+- Returns: Boolean indicating if all adds were successful (true if all successful, false if any failed).

 ### `search(query, [cutoff])`

@@ -66,7 +73,9 @@ Searches the index for strings similar to the query.

 - `query`: The search query.
 - `cutoff` (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
- Returns: Array of matching results, sorted by similarity (descending).
+- Returns: Array of matching results, sorted by similarity (descending). Each result is an object with:
+  - `string`: The matching string
+  - `similarity`: The similarity score (0.0 to 1.0)

 ### `size()`

@@ -82,6 +91,7 @@ Creates a test index with random data.

 - `size` (optional): Number of strings to generate. Default: 500.
 - Returns: A new SimilaritySearch instance with random data.
+- Note: The first 5 strings are fixed test cases, followed by randomly generated strings.

 ### `SimilaritySearch.benchmark(index, queries, [cutoff])`

@@ -90,30 +100,31 @@ Benchmarks the search performance.
 - `index`: The index to benchmark.
 - `queries`: Array of search queries.
 - `cutoff` (optional): Similarity threshold. Default: 0.2.
- Returns: Benchmark results.
+- Returns: Array of benchmark results, each containing:
+  - `query`: The search query
+  - `matches`: Number of matches found
+  - `timeMs`: Search time in milliseconds
+  - `topResults`: Top 5 matching results

 ## How It Works

-The similarity search uses Jaccard similarity between word sets:
+The similarity search uses a multi-stage matching algorithm:

-```
-similarity = (number of matching words) / (total unique words)
-```
+1. **Word-level Matching**: The algorithm first splits both the query and target strings into words.

-This means word order doesn't matter - "bio bizz" will match with "bizz bio" with 100% similarity.
+2. **Word Similarity Calculation**: For each word pair, similarity is calculated using:
+   - Levenshtein distance for fuzzy matching
+   - Exact matching for short words (3 characters or fewer)
+   - Prefix matching for words of significantly different lengths
+   - Length-based similarity adjustments

-## Building
+3. **Overall Similarity Score**: The final similarity score is a weighted combination of:
+   - Word match score (70% weight): Percentage of query words that have a good match
+   - Average word similarity (30% weight): Average similarity of the best matching word pairs

-To rebuild the native addon:
+This approach provides robust matching that:
+- Handles typos and slight variations in words
+- Requires exact matches for short words to avoid false positives
+- Recognizes prefix matches (e.g., "bio" matches "biology")
+- Considers both word presence and character-level similarity

-```bash
-npm install
-```
-
-## Testing
-
-Run the test script:
-
-```bash
-npm test
-``` 
--- a/test.js
+++ b/test.js
@@ -44,7 +44,7 @@ customIndex.addString('bizz bio mix light');
 customIndex.addStrings([
  'plant growth bio formula',
  'garden soil substrate',
-  'plagron light mix',
+  'plagron lightmix',
  'Anesia Seeds Imperium X Auto 10',
  'anesi'
 ]);
@@ -57,8 +57,8 @@ const results = customIndex.search('amnesia haze', 0.1);
 results.forEach(match => {
  console.log(`  ${match.similarity.toFixed(2)}: ${match.string}`);
 }); 
-console.log('\nSearching with higher similarity threshold (0.1) for "lightmix":');
-const results2 = customIndex.search('lightmix', 0.1);
+console.log('\nSearching with similarity threshold 0.1 for "mix light":');
+const results2 = customIndex.search('mix light', 0.1);
 results2.forEach(match => {
  console.log(`  ${match.similarity.toFixed(2)}: ${match.string}`);
 });
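
The scoring described in the new "How It Works" section of the README can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not the C implementation in the addon: the function names `levenshtein`, `wordSimilarity`, and `similarity`, the 0.7 "good match" threshold, the 3-character length gap for prefix matching, the prefix score, and the order in which the prefix and short-word rules are checked are all hypothetical. Only the 70%/30% weighting and the general rules come from the README text.

```javascript
// Hypothetical sketch of the README's scoring; the native addon may differ.

// Levenshtein distance using two rolling rows of the DP table.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity of two non-empty words, in the range [0, 1].
function wordSimilarity(q, t) {
  if (q === t) return 1;
  const [short, long] = q.length <= t.length ? [q, t] : [t, q];
  // Assumption: the prefix rule is checked first, so "bio" can match "biology".
  if (long.length - short.length >= 3 && long.startsWith(short)) {
    return Math.max(0.5, short.length / long.length);
  }
  // Short words (3 characters or fewer) otherwise require an exact match.
  if (short.length <= 3) return 0;
  // Fuzzy match: edit distance normalized by the longer word's length.
  return 1 - levenshtein(q, t) / long.length;
}

// Overall score: 70% share of well-matched query words, 30% average best similarity.
function similarity(query, target, goodMatch = 0.7) {
  const words = (s) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const qWords = words(query);
  const tWords = words(target);
  if (qWords.length === 0 || tWords.length === 0) return 0;

  let matched = 0;
  let total = 0;
  for (const q of qWords) {
    const best = Math.max(...tWords.map((t) => wordSimilarity(q, t)));
    if (best >= goodMatch) matched++;
    total += best;
  }
  return 0.7 * (matched / qWords.length) + 0.3 * (total / qWords.length);
}
```

With this sketch, `similarity('mix light', 'bizz bio mix light')` scores every query word as an exact match regardless of word order, while `similarity('mix light', 'plagron lightmix')` depends entirely on the assumed prefix and fuzzy rules, which is the kind of difference the updated test.js scenario exercises.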