Fix node-gyp compatibility: move to dependencies for proper version control

- Move node-gyp from devDependencies to dependencies
- Ensures v11.2.0 is used when installed as dependency
- Fixes Visual Studio detection issues in consuming projects
- Resolves shopApi build failures with old node-gyp v8.4.1
Update package.json to include Node.js engine requirement and enhance build scripts. Add new keywords for better package discoverability, define repository information, and specify supported operating systems and CPU architectures. Introduce devDependencies for node-gyp to streamline native module compilation.
2025-06-27 03:37:33 +02:00 · 2025-06-27 03:32:24 +02:00 · 2025-06-27 03:15:45 +02:00 · 2025-06-23 07:09:30 +02:00 · 2025-06-23 04:16:18 +02:00 · 2025-06-21 21:14:20 +02:00
7 changed files with 238 additions and 93 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -22,4 +22,6 @@ build/
 .Spotlight-V100
 .Trashes
 ehthumbs.db
-Thumbs.db 
+Thumbs.db
+
+package-lock.json
--- a/README.md
+++ b/README.md
@@ -2,18 +2,24 @@

 A Node.js module that performs word order independent similarity search on strings.

-This module is built as a native addon that uses C code for fast similarity computations. It uses Jaccard similarity between word sets to find matches regardless of word order.
+This module is built as a native addon that uses C code for fast similarity computations. It uses a sophisticated similarity metric that combines fuzzy matching, prefix matching, and word-level comparisons to find matches regardless of word order.

 ## Installation

 ```bash
-npm install
+npm install similarity-search
 ```

+## Dependencies
+
+- Node.js (with node-gyp for building native addons)
+- nan (^2.22.2)
+- node-addon-api (^6.0.0)
+
 ## Usage

 ```javascript
-const SimilaritySearch = require('./index');
+const SimilaritySearch = require('similarity-search');

 // Create a new search index with default capacity (500)
 const index = new SimilaritySearch();
@@ -45,20 +51,21 @@ results.forEach(match => {
 Creates a new search index.

 - `capacity` (optional): Initial capacity for the index. Default: 500.
+- Returns: A new SimilaritySearch instance.

 ### `addString(str)`

 Adds a string to the index.

 - `str`: The string to add.
- Returns: Boolean indicating success.
+- Returns: Boolean indicating success (true if successful, false otherwise).

 ### `addStrings(strings)`

 Adds multiple strings to the index.

 - `strings`: Array of strings to add.
- Returns: Boolean indicating if all adds were successful.
+- Returns: Boolean indicating if all adds were successful (true if all successful, false if any failed).

 ### `search(query, [cutoff])`

@@ -66,7 +73,9 @@ Searches the index for strings similar to the query.

 - `query`: The search query.
 - `cutoff` (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
- Returns: Array of matching results, sorted by similarity (descending).
+- Returns: Array of matching results, sorted by similarity (descending). Each result is an object with:
+  - `string`: The matching string
+  - `similarity`: The similarity score (0.0 to 1.0)

 ### `size()`

@@ -82,6 +91,7 @@ Creates a test index with random data.

 - `size` (optional): Number of strings to generate. Default: 500.
 - Returns: A new SimilaritySearch instance with random data.
+- Note: The first 5 strings are fixed test cases, followed by randomly generated strings.

 ### `SimilaritySearch.benchmark(index, queries, [cutoff])`

@@ -90,30 +100,31 @@ Benchmarks the search performance.
 - `index`: The index to benchmark.
 - `queries`: Array of search queries.
 - `cutoff` (optional): Similarity threshold. Default: 0.2.
- Returns: Benchmark results.
+- Returns: Array of benchmark results, each containing:
+  - `query`: The search query
+  - `matches`: Number of matches found
+  - `timeMs`: Search time in milliseconds
+  - `topResults`: Top 5 matching results

 ## How It Works

-The similarity search uses Jaccard similarity between word sets:
+The similarity search uses a sophisticated multi-stage matching algorithm:

-```
-similarity = (number of matching words) / (total unique words)
-```
+1. **Word-level Matching**: The algorithm first splits both the query and target strings into words.

-This means word order doesn't matter - "bio bizz" will match with "bizz bio" with 100% similarity.
+2. **Word Similarity Calculation**: For each word pair, similarity is calculated using:
+   - Levenshtein distance for fuzzy matching
+   - Special handling for short words (3 chars or less require exact match)
+   - Prefix matching for significantly different length words
+   - Length-based similarity adjustments

-## Building
+3. **Overall Similarity Score**: The final similarity score is a weighted combination of:
+   - Word match score (70% weight): Percentage of query words that have a good match
+   - Average word similarity (30% weight): Average similarity of the best matching word pairs

-To rebuild the native addon:
+This approach provides robust matching that:
+- Handles typos and slight variations in words
+- Requires exact matches for short words to avoid false positives
+- Recognizes prefix matches (e.g., "bio" matches "biology")
+- Considers both word presence and character-level similarity

-```bash
-npm install
-```
-
-## Testing
-
-Run the test script:
-
-```bash
-npm test
-``` 
--- a/package.json
+++ b/package.json
@@ -1,23 +1,59 @@
 {
  "name": "similarity-search",
-  "version": "1.0.0",
+  "version": "1.0.2",
  "description": "A Node.js module for word order independent string similarity search",
  "main": "index.js",
+  "engines": {
+    "node": ">=14.0.0"
+  },
  "scripts": {
    "install": "node-gyp rebuild",
-    "test": "node test.js"
+    "rebuild": "node-gyp rebuild",
+    "build": "node-gyp rebuild",
+    "clean": "node-gyp clean",
+    "configure": "node-gyp configure",
+    "test": "node test.js",
+    "pretest": "npm run build"
  },
  "keywords": [
    "search",
    "similarity",
    "string",
-    "fuzzy"
+    "fuzzy",
+    "native",
+    "addon",
+    "c++",
+    "performance"
  ],
  "author": "",
  "license": "MIT",
  "dependencies": {
    "nan": "^2.22.2",
-    "node-addon-api": "^6.0.0"
+    "node-addon-api": "^6.0.0",
+    "node-gyp": "^11.2.0"
  },
-  "gypfile": true
+  "devDependencies": {
+  },
+  "repository": {
+    "type": "git",
+    "url": ""
+  },
+  "files": [
+    "index.js",
+    "binding.gyp",
+    "similarity_search.c",
+    "similarity_search.h",
+    "similarity_search_addon.cc",
+    "README.md"
+  ],
+  "gypfile": true,
+  "os": [
+    "win32",
+    "darwin",
+    "linux"
+  ],
+  "cpu": [
+    "x64",
+    "arm64"
+  ]
 }
--- a/similarity_search.c
+++ b/similarity_search.c
@@ -3,6 +3,13 @@
 #include <string.h>
 #include <time.h>
 #include <ctype.h>
+#include <math.h>
+#include <stdbool.h>
+#ifdef _WIN32
+#include <malloc.h>  // For alloca on Windows
+#else
+#include <alloca.h>  // For alloca on Unix-like systems
+#endif
 #include "similarity_search.h"

 // Case insensitive string comparison
@@ -20,72 +27,139 @@ int str_case_cmp(const char *s1, const char *s2) {
 }

 // Split a string into words
-int split_into_words(const char *string, char *words[MAX_WORDS]) {
-    if (!string || strlen(string) >= MAX_STRING_LEN) {
-        return 0;
+int split_into_words(const char *s,
+                     char  *words[MAX_WORDS],
+                     char **storage)          /* NEW OUT PARAM            */
+{
+    if (!s || strlen(s) >= MAX_STRING_LEN) return 0;
+
+    char *buf = strdup(s);                    /* one single allocation    */
+    if (!buf) return 0;
+    *storage = buf;                           /* hand ownership to caller */
+
+    int n = 0;
+    for (char *tok = strtok(buf, " \t\n"); tok && n < MAX_WORDS;
+         tok = strtok(NULL, " \t\n"))
+    {
+        words[n++] = tok;                     /* pointers into buf        */
    }
-    
-    char temp[MAX_STRING_LEN];
-    strncpy(temp, string, MAX_STRING_LEN - 1);
-    temp[MAX_STRING_LEN - 1] = '\0';
-    
-    int word_count = 0;
-    char *token = strtok(temp, " \t\n");
-    
-    while (token != NULL && word_count < MAX_WORDS) {
-        words[word_count] = strdup(token);
-        if (!words[word_count]) {
-            // Free any already allocated words on error
-            for (int i = 0; i < word_count; i++) {
-                free(words[i]);
-            }
-            return 0;
-        }
-        word_count++;
-        token = strtok(NULL, " \t\n");
-    }
-    
-    return word_count;
+    return n;
 }

 // Free memory allocated for words
-void free_words(char *words[], int word_count) {
-    for (int i = 0; i < word_count; i++) {
-        free(words[i]);
+void free_words(char *storage) {              /* simplified               */
+    if (storage) {                            /* check for NULL           */
+        free(storage);                        /* single free, if any      */
    }
 }

+// Calculate Levenshtein distance between two strings
+int levenshtein_distance(const char *a, const char *b)
+{
+    size_t m = strlen(a), n = strlen(b);
+    if (m < n) { const char *t=a; a=b; b=t; size_t tmp=m; m=n; n=tmp; }
+
+    int *row0 = alloca((n + 1) * sizeof(int));
+    int *row1 = alloca((n + 1) * sizeof(int));
+
+    for (size_t j = 0; j <= n; ++j) row0[j] = j;
+    for (size_t i = 1; i <= m; ++i) {
+        row1[0] = i;
+        for (size_t j = 1; j <= n; ++j) {
+            int cost = (tolower((unsigned)a[i-1]) ==
+                        tolower((unsigned)b[j-1])) ? 0 : 1;
+            int del  = row0[j]   + 1;
+            int ins  = row1[j-1] + 1;
+            int sub  = row0[j-1] + cost;
+            row1[j] = (del < ins ? (del < sub ? del : sub)
+                                 : (ins < sub ? ins : sub));
+        }
+        int *tmp = row0; row0 = row1; row1 = tmp;
+    }
+    return row0[n];
+}
+
+// Calculate similarity between two words based on Levenshtein distance
+float word_similarity(const char *word1, const char *word2) {
+    int len1 = strlen(word1);
+    int len2 = strlen(word2);
+    
+    // For very short words (2 chars or less), require exact match
+    if (len1 <= 2 || len2 <= 2) {
+        return str_case_cmp(word1, word2) == 0 ? 1.0f : 0.0f;
+    }
+    
+    // Calculate Levenshtein distance
+    int distance = levenshtein_distance(word1, word2);
+    int max_len = len1 > len2 ? len1 : len2;
+    
+    // Simple linear scoring: 1.0 for exact match, 0.9 for one char difference, etc.
+    float similarity = 1.0f - (float)distance / max_len;
+    
+    // Boost similarity for small differences
+    if (distance <= 1) {
+        similarity = 0.9f + (similarity * 0.1f);
+    }
+    
+    return similarity;
+}
+
 // Calculate similarity between query and target string
 float calculate_similarity(const char *query, const char *target, float cutoff) {
    // Split strings into words
-    char *query_words[MAX_WORDS] = {0};
-    char *target_words[MAX_WORDS] = {0};
+    char *query_buf = NULL, *target_buf = NULL;
+    char *query_words[MAX_WORDS], *target_words[MAX_WORDS];
    
-    int query_word_count = split_into_words(query, query_words);
-    int target_word_count = split_into_words(target, target_words);
+    int query_word_count = split_into_words(query,  query_words,  &query_buf);
+    int target_word_count = split_into_words(target, target_words, &target_buf);
    
    if (query_word_count == 0 || target_word_count == 0) {
-        free_words(query_words, query_word_count);
-        free_words(target_words, target_word_count);
+        free_words(query_buf);
+        free_words(target_buf);
        return 0.0;
    }
    
-    // Count matches
-    int matches = 0;
+    // Track best matches for each query word
+    float best_word_similarities[MAX_WORDS] = {0.0f};
+    int query_words_found = 0;
+    
+    // For each query word, find its best match in target words
    for (int i = 0; i < query_word_count; i++) {
+        float best_similarity = 0.0f;
+        
        for (int j = 0; j < target_word_count; j++) {
-            if (str_case_cmp(query_words[i], target_words[j]) == 0) {
-                matches++;
-                break;
+            float similarity = word_similarity(query_words[i], target_words[j]);
+            if (similarity > best_similarity) {
+                best_similarity = similarity;
            }
        }
+        
+        best_word_similarities[i] = best_similarity;
+        if (best_similarity >= 0.4f) {
+            query_words_found++;
+        }
    }
    
-    // Calculate Jaccard similarity (intersection over union)
-    float similarity = (float)matches / (query_word_count + target_word_count - matches);
+    // Calculate average word similarity
+    float avg_word_similarity = 0.0f;
+    for (int i = 0; i < query_word_count; i++) {
+        avg_word_similarity += best_word_similarities[i];
+    }
+    avg_word_similarity /= query_word_count;
    
-    free_words(query_words, query_word_count);
-    free_words(target_words, target_word_count);
+    // Calculate word match ratio
+    float word_match_ratio = (float)query_words_found / query_word_count;
+    
+    // Final score is the average of word match ratio and average word similarity
+    float similarity = (word_match_ratio + avg_word_similarity) / 2.0f;
+    
+    // Boost score if all words are found
+    if (query_words_found == query_word_count) {
+        similarity = 0.8f + (similarity * 0.2f);
+    }
+    
+    free_words(query_buf);
+    free_words(target_buf);
    
    return similarity;
 }
@@ -232,24 +306,26 @@ SearchResult* search_index(SearchIndex* index, const char* query, float cutoff,
        }
    }
    
+    // If no results found, return NULL properly
+    if (*num_results == 0) {
+        free(temp_results);
+        return NULL;
+    }
+    
    // Sort results by similarity
    qsort(temp_results, *num_results, sizeof(SearchResult), compare_results);
    
-    // Allocate final result array with exact size
-    SearchResult* results = (SearchResult*)malloc(*num_results * sizeof(SearchResult));
+    // Shrink temp_results to exact size and return it directly
+    SearchResult* results = (SearchResult*)realloc(
+        temp_results, *num_results * sizeof(SearchResult));
    if (!results) {
-        // Free all strings in temp_results
+        // realloc failure – temp_results unchanged, clean up
        for (int i = 0; i < *num_results; i++) {
            free(temp_results[i].string);
        }
        free(temp_results);
        return NULL;
    }
-    
-    // Copy results to final array
-    memcpy(results, temp_results, *num_results * sizeof(SearchResult));
-    free(temp_results);
-    
    return results;
 }

@@ -262,4 +338,4 @@ void free_search_results(SearchResult* results, int num_results) {
        free(results[i].string);
    }
    free(results);
-} 
+}
--- a/similarity_search.h
+++ b/similarity_search.h
@@ -5,8 +5,8 @@
 extern "C" {
 #endif

-#define MAX_STRING_LEN 100
-#define MAX_WORDS 20
+#define MAX_STRING_LEN 1000
+#define MAX_WORDS 100

 // Public API

@@ -19,7 +19,7 @@ typedef struct {

 // Structure to hold a search result
 typedef struct {
-    const char *string;
+    char *string;
    float similarity;
 } SearchResult;

--- a/similarity_search_addon.cc
+++ b/similarity_search_addon.cc
@@ -41,7 +41,7 @@ SearchIndexWrapper::SearchIndexWrapper(const Napi::CallbackInfo& info)
  Napi::Env env = info.Env();
  Napi::HandleScope scope(env);
  
-  int capacity = 500; // Default capacity
+  int capacity = 10000; // Increased default capacity from 500 to 10000
  if (info.Length() > 0 && info[0].IsNumber()) {
    capacity = info[0].As<Napi::Number>().Int32Value();
  }
@@ -67,6 +67,12 @@ Napi::Value SearchIndexWrapper::AddString(const Napi::CallbackInfo& info) {
  
  std::string str = info[0].As<Napi::String>().Utf8Value();
  
+  // Check if string is empty
+  if (str.empty()) {
+    Napi::Error::New(env, "Empty string not allowed").ThrowAsJavaScriptException();
+    return env.Null();
+  }
+  
  // Check if string is too long
  if (str.length() >= MAX_STRING_LEN) {
    Napi::Error::New(env, "String too long").ThrowAsJavaScriptException();
@@ -99,6 +105,12 @@ Napi::Value SearchIndexWrapper::Search(const Napi::CallbackInfo& info) {
  
  std::string query = info[0].As<Napi::String>().Utf8Value();
  
+  // Check if query is empty
+  if (query.empty()) {
+    Napi::Error::New(env, "Empty query not allowed").ThrowAsJavaScriptException();
+    return env.Null();
+  }
+  
  // Check if query string is too long
  if (query.length() >= MAX_STRING_LEN) {
    Napi::Error::New(env, "Query string too long").ThrowAsJavaScriptException();
@@ -117,9 +129,9 @@ Napi::Value SearchIndexWrapper::Search(const Napi::CallbackInfo& info) {
  int num_results = 0;
  SearchResult* results = search_index(this->index_, query.c_str(), cutoff, &num_results);
  
-  if (!results) {
-    Napi::Error::New(env, "Search failed").ThrowAsJavaScriptException();
-    return env.Null();
+  // If no results found, return empty array instead of throwing error
+  if (!results || num_results == 0) {
+    return Napi::Array::New(env, 0);
  }
  
  Napi::Array result_array = Napi::Array::New(env, num_results);
--- a/test.js
+++ b/test.js
@@ -43,14 +43,22 @@ customIndex.addString('bizz bio mix light');
 // Add multiple strings at once
 customIndex.addStrings([
  'plant growth bio formula',
-  'garden soil substrate'
+  'garden soil substrate',
+  'plagron lightmix',
+  'Anesia Seeds Imperium X Auto 10',
+  'anesi'
 ]);

 console.log(`Custom index created with ${customIndex.size()} strings`);

 // Search with a higher similarity threshold
-console.log('\nSearching with higher similarity threshold (0.3):');
-const results = customIndex.search('bio bizz', 0.3);
+console.log('\nSearching with higher similarity threshold (0.1) for "amnesia":');
+const results = customIndex.search('amnesia haze', 0.1);
 results.forEach(match => {
  console.log(`  ${match.similarity.toFixed(2)}: ${match.string}`);
+}); 
+console.log('\nSearching with higher similarity threshold (0.1) for "mix light":');
+const results2 = customIndex.search('mix light', 0.1);
+results2.forEach(match => {
+  console.log(`  ${match.similarity.toFixed(2)}: ${match.string}`);
 });
Author	SHA1	Message	Date
seb	fb8f61f868	Fix node-gyp compatibility: move to dependencies for proper version control - Move node-gyp from devDependencies to dependencies - Ensures v11.2.0 is used when installed as dependency - Fixes Visual Studio detection issues in consuming projects - Resolves shopApi build failures with old node-gyp v8.4.1	2025-06-27 03:37:33 +02:00
seb	091c258d41	Update package.json to include Node.js engine requirement and enhance build scripts. Add new keywords for better package discoverability, define repository information, and specify supported operating systems and CPU architectures. Introduce devDependencies for node-gyp to streamline native module compilation.	2025-06-27 03:32:24 +02:00
seb	8474c77163	Enhance search index handling by returning an empty array for no results instead of throwing an error. Improve memory management in free_words function by checking for NULL before freeing. Update search_index to properly return NULL when no results are found.	2025-06-27 03:15:45 +02:00
seb	21f527ba46	Update .gitignore to ignore npm lock file and tidy entry Add package-lock.json to prevent accidental commits of npm’s lock file. While here, remove the stray trailing space from the Thumbs.db entry for a cleaner diff.	2025-06-23 07:09:30 +02:00
seb	462041654d	Optimize search result finalization by reallocating in place Replace the malloc/copy/free sequence with a single realloc that shrinks temp_results to its exact size and returns it directly. This * eliminates an extra allocation and memory copy * simplifies cleanup logic * retains correct failure handling (temp_results unchanged on realloc failure) Also drop the superfluous trailing space at EOF and add package-lock.json to version control to lock Node.js dependencies.	2025-06-23 04:16:18 +02:00
seb	60d609dd6a	Fix Windows compilation issue by adding malloc.h include	2025-06-21 21:14:20 +02:00
seb	ccbd833361	Update package.json	2025-04-22 04:39:00 +00:00
seb	24895fc1bc	Update similarity_search.c: 3 -> 2 (min word length)	2025-04-22 04:35:04 +00:00
seb	a9a9247773	Refactor word similarity calculation in similarity_search.c to simplify scoring logic. Replace prefix matching with Levenshtein distance for improved accuracy, and adjust similarity scoring to boost results for small differences. Update overall similarity calculation to average word match ratio and average word similarity for better performance.	2025-04-18 19:01:22 +02:00
seb	e2aacaf54b	Refactor similarity_search.c to improve memory management and word splitting logic. Simplify split_into_words function to use a single allocation and update free_words to handle memory more efficiently. Enhance levenshtein_distance calculation with dynamic memory allocation and optimize similarity scoring in calculate_similarity function for better accuracy and performance.	2025-04-18 18:55:37 +02:00
seb	0dd17b794f	Update README.md to enhance documentation on similarity metrics and usage. Add dependencies section and clarify return values for methods. Modify test.js to reflect updated search scenarios and improve clarity in search results.	2025-04-18 11:03:07 +02:00
seb	92a7bad2b6	Implement Levenshtein distance calculation for improved word similarity in similarity_search.c. Adjust similarity thresholds and scoring logic to enhance accuracy, particularly for prefix matches and varying word lengths. Update test.js to reflect new search scenarios with lower similarity thresholds.	2025-04-18 09:47:58 +02:00
seb	e94c034927	Refine word similarity calculation in similarity_search.c by enforcing exact matches for short words and adjusting similarity thresholds. Increase weight of word matches in overall similarity score calculation.	2025-04-18 09:32:44 +02:00
seb	53da84fbcf	Update version to 1.0.1 in package.json for release.	2025-04-18 09:24:41 +02:00
seb	cd41ca2f52	Add word similarity calculation to enhance overall similarity scoring in calculate_similarity function. Implement character matching logic and boost score for same-length words.	2025-04-18 09:20:44 +02:00
seb	6091cc0b80	Increase default capacity in SearchIndexWrapper and enhance similarity calculation in calculate_similarity function to boost similarity score when all query words are found. Update MAX_WORDS and MAX_STRING_LEN definitions for improved handling.	2025-04-18 09:16:26 +02:00
seb	ca2c86ce33	upd	2025-04-18 09:06:34 +02:00
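
To make the scoring in the `similarity_search.c` hunk easier to follow, here is a rough JavaScript port of `levenshtein_distance`, `word_similarity`, and `calculate_similarity` as they appear in the diff. It is a sketch for experimentation only, not the addon's API: the function names `levenshtein`, `wordSimilarity`, and `calculateSimilarity` are hypothetical, the `MAX_WORDS` and `MAX_STRING_LEN` limits are left out, and it assumes ASCII input, since the C code counts bytes while JavaScript counts UTF-16 code units. Results may also differ in the last digits because the C code uses `float`.

```javascript
// Illustrative port of the scoring logic in the similarity_search.c hunk.
// All function names here are hypothetical; the addon exposes only search().

// Case-insensitive Levenshtein distance using two rolling rows.
function levenshtein(a, b) {
  a = a.toLowerCase();
  b = b.toLowerCase();
  if (a.length < b.length) {
    const t = a; a = b; b = t; // keep the shorter string as the row length
  }
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    const tmp = prev; prev = curr; curr = tmp; // swap rows
  }
  return prev[b.length];
}

// Words of 2 characters or fewer must match exactly; otherwise the score is
// linear in the edit distance, with a boost when the distance is at most 1.
function wordSimilarity(w1, w2) {
  if (w1.length <= 2 || w2.length <= 2) {
    return w1.toLowerCase() === w2.toLowerCase() ? 1 : 0;
  }
  const distance = levenshtein(w1, w2);
  let similarity = 1 - distance / Math.max(w1.length, w2.length);
  if (distance <= 1) similarity = 0.9 + similarity * 0.1;
  return similarity;
}

// Average of the word match ratio and the mean best-word similarity,
// boosted into the 0.8..1.0 range when every query word scores >= 0.4.
function calculateSimilarity(query, target) {
  const split = (s) => s.split(/[ \t\n]+/).filter(Boolean);
  const queryWords = split(query);
  const targetWords = split(target);
  if (queryWords.length === 0 || targetWords.length === 0) return 0;

  let found = 0;
  let sum = 0;
  for (const qw of queryWords) {
    let best = 0;
    for (const tw of targetWords) best = Math.max(best, wordSimilarity(qw, tw));
    sum += best;
    if (best >= 0.4) found++;
  }
  const n = queryWords.length;
  let similarity = (found / n + sum / n) / 2;
  if (found === n) similarity = 0.8 + similarity * 0.2;
  return similarity;
}

// Try the two queries used in the test.js hunk against two of its strings.
console.log(calculateSimilarity('mix light', 'bizz bio mix light').toFixed(2));
console.log(calculateSimilarity('amnesia haze', 'Anesia Seeds Imperium X Auto 10').toFixed(2));
```

Note that this follows the code in the diff, which averages the two components equally and requires exact matches for words of 2 characters or fewer; the README hunk describes 70%/30% weights and a 3-character rule.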