Compare commits
17 Commits: da5e7476a0...master

| Author | SHA1 | Date |
|---|---|---|
| | fb8f61f868 | |
| | 091c258d41 | |
| | 8474c77163 | |
| | 21f527ba46 | |
| | 462041654d | |
| | 60d609dd6a | |
| | ccbd833361 | |
| | 24895fc1bc | |
| | a9a9247773 | |
| | e2aacaf54b | |
| | 0dd17b794f | |
| | 92a7bad2b6 | |
| | e94c034927 | |
| | 53da84fbcf | |
| | cd41ca2f52 | |
| | 6091cc0b80 | |
| | ca2c86ce33 | |

4
.gitignore
vendored
@@ -22,4 +22,6 @@ build/
 .Spotlight-V100
 .Trashes
 ehthumbs.db
-Thumbs.db
+Thumbs.db
+
+package-lock.json
61
README.md
@@ -2,18 +2,24 @@

 A Node.js module that performs word order independent similarity search on strings.

-This module is built as a native addon that uses C code for fast similarity computations. It uses Jaccard similarity between word sets to find matches regardless of word order.
+This module is built as a native addon that uses C code for fast similarity computations. It uses a sophisticated similarity metric that combines fuzzy matching, prefix matching, and word-level comparisons to find matches regardless of word order.

 ## Installation

 ```bash
-npm install
+npm install similarity-search
 ```

+## Dependencies
+
+- Node.js (with node-gyp for building native addons)
+- nan (^2.22.2)
+- node-addon-api (^6.0.0)
+
 ## Usage

 ```javascript
-const SimilaritySearch = require('./index');
+const SimilaritySearch = require('similarity-search');

 // Create a new search index with default capacity (500)
 const index = new SimilaritySearch();
@@ -45,20 +51,21 @@ results.forEach(match => {
 Creates a new search index.

 - `capacity` (optional): Initial capacity for the index. Default: 500.
+- Returns: A new SimilaritySearch instance.

 ### `addString(str)`

 Adds a string to the index.

 - `str`: The string to add.
-- Returns: Boolean indicating success.
+- Returns: Boolean indicating success (true if successful, false otherwise).

 ### `addStrings(strings)`

 Adds multiple strings to the index.

 - `strings`: Array of strings to add.
-- Returns: Boolean indicating if all adds were successful.
+- Returns: Boolean indicating if all adds were successful (true if all successful, false if any failed).

 ### `search(query, [cutoff])`

@@ -66,7 +73,9 @@ Searches the index for strings similar to the query.

 - `query`: The search query.
 - `cutoff` (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
-- Returns: Array of matching results, sorted by similarity (descending).
+- Returns: Array of matching results, sorted by similarity (descending). Each result is an object with:
+  - `string`: The matching string
+  - `similarity`: The similarity score (0.0 to 1.0)

 ### `size()`

@@ -82,6 +91,7 @@ Creates a test index with random data.

 - `size` (optional): Number of strings to generate. Default: 500.
 - Returns: A new SimilaritySearch instance with random data.
+- Note: The first 5 strings are fixed test cases, followed by randomly generated strings.

 ### `SimilaritySearch.benchmark(index, queries, [cutoff])`

@@ -90,30 +100,31 @@ Benchmarks the search performance.
 - `index`: The index to benchmark.
 - `queries`: Array of search queries.
 - `cutoff` (optional): Similarity threshold. Default: 0.2.
-- Returns: Benchmark results.
+- Returns: Array of benchmark results, each containing:
+  - `query`: The search query
+  - `matches`: Number of matches found
+  - `timeMs`: Search time in milliseconds
+  - `topResults`: Top 5 matching results

 ## How It Works

-The similarity search uses Jaccard similarity between word sets:
+The similarity search uses a sophisticated multi-stage matching algorithm:

-```
-similarity = (number of matching words) / (total unique words)
-```
+1. **Word-level Matching**: The algorithm first splits both the query and target strings into words.

-This means word order doesn't matter - "bio bizz" will match with "bizz bio" with 100% similarity.
+2. **Word Similarity Calculation**: For each word pair, similarity is calculated using:
+   - Levenshtein distance for fuzzy matching
+   - Special handling for short words (3 chars or less require exact match)
+   - Prefix matching for significantly different length words
+   - Length-based similarity adjustments

-## Building
+3. **Overall Similarity Score**: The final similarity score is a weighted combination of:
+   - Word match score (70% weight): Percentage of query words that have a good match
+   - Average word similarity (30% weight): Average similarity of the best matching word pairs

-To rebuild the native addon:
+This approach provides robust matching that:
+- Handles typos and slight variations in words
+- Requires exact matches for short words to avoid false positives
+- Recognizes prefix matches (e.g., "bio" matches "biology")
+- Considers both word presence and character-level similarity

-```bash
-npm install
-```
-
-## Testing
-
-Run the test script:
-
-```bash
-npm test
-```
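
The weighted combination described in the new "How It Works" text can be illustrated with a small pure-JavaScript sketch. It is not the addon's code: `combineScores` is a hypothetical helper, and the 0.4 "good match" threshold is an assumption, since the README text does not state one. Note that the C hunk in this comparison averages the two terms equally and then applies a boost, so the numbers below follow the README wording only.

```javascript
// Hypothetical helper: combines per-word scores the way the README describes.
// bestSimilarities[i] is the best similarity (0.0 to 1.0) found for query word i.
// goodMatchThreshold is an assumed value; the README does not specify it.
function combineScores(bestSimilarities, goodMatchThreshold = 0.4) {
  if (bestSimilarities.length === 0) return 0;
  const matched = bestSimilarities.filter(s => s >= goodMatchThreshold).length;
  const wordMatchScore = matched / bestSimilarities.length;
  const avgSimilarity =
    bestSimilarities.reduce((sum, s) => sum + s, 0) / bestSimilarities.length;
  // 70% weight on the share of matched words, 30% on the average similarity.
  return 0.7 * wordMatchScore + 0.3 * avgSimilarity;
}

console.log(combineScores([1.0, 1.0]).toFixed(2)); // prints 1.00
console.log(combineScores([1.0, 0.0]).toFixed(2)); // prints 0.50
```

With two query words of which only one is found exactly, both terms are 0.5, so the combined score is 0.5 regardless of the weights.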
46
package.json
@@ -1,23 +1,59 @@
 {
   "name": "similarity-search",
-  "version": "1.0.0",
+  "version": "1.0.2",
   "description": "A Node.js module for word order independent string similarity search",
   "main": "index.js",
+  "engines": {
+    "node": ">=14.0.0"
+  },
   "scripts": {
     "install": "node-gyp rebuild",
-    "test": "node test.js"
+    "rebuild": "node-gyp rebuild",
+    "build": "node-gyp rebuild",
+    "clean": "node-gyp clean",
+    "configure": "node-gyp configure",
+    "test": "node test.js",
+    "pretest": "npm run build"
   },
   "keywords": [
     "search",
     "similarity",
     "string",
-    "fuzzy"
+    "fuzzy",
+    "native",
+    "addon",
+    "c++",
+    "performance"
   ],
   "author": "",
   "license": "MIT",
   "dependencies": {
     "nan": "^2.22.2",
-    "node-addon-api": "^6.0.0"
+    "node-addon-api": "^6.0.0",
+    "node-gyp": "^11.2.0"
   },
-  "gypfile": true
+  "devDependencies": {
+  },
+  "repository": {
+    "type": "git",
+    "url": ""
+  },
+  "files": [
+    "index.js",
+    "binding.gyp",
+    "similarity_search.c",
+    "similarity_search.h",
+    "similarity_search_addon.cc",
+    "README.md"
+  ],
+  "gypfile": true,
+  "os": [
+    "win32",
+    "darwin",
+    "linux"
+  ],
+  "cpu": [
+    "x64",
+    "arm64"
+  ]
 }
@@ -3,6 +3,13 @@
 #include <string.h>
 #include <time.h>
 #include <ctype.h>
+#include <math.h>
+#include <stdbool.h>
+#ifdef _WIN32
+#include <malloc.h> // For alloca on Windows
+#else
+#include <alloca.h> // For alloca on Unix-like systems
+#endif
 #include "similarity_search.h"

 // Case insensitive string comparison
@@ -20,72 +27,139 @@ int str_case_cmp(const char *s1, const char *s2) {
 }

 // Split a string into words
-int split_into_words(const char *string, char *words[MAX_WORDS]) {
-    if (!string || strlen(string) >= MAX_STRING_LEN) {
-        return 0;
+int split_into_words(const char *s,
+                     char *words[MAX_WORDS],
+                     char **storage) /* NEW OUT PARAM */
+{
+    if (!s || strlen(s) >= MAX_STRING_LEN) return 0;
+
+    char *buf = strdup(s); /* one single allocation */
+    if (!buf) return 0;
+    *storage = buf; /* hand ownership to caller */
+
+    int n = 0;
+    for (char *tok = strtok(buf, " \t\n"); tok && n < MAX_WORDS;
+         tok = strtok(NULL, " \t\n"))
+    {
+        words[n++] = tok; /* pointers into buf */
     }

-    char temp[MAX_STRING_LEN];
-    strncpy(temp, string, MAX_STRING_LEN - 1);
-    temp[MAX_STRING_LEN - 1] = '\0';
-
-    int word_count = 0;
-    char *token = strtok(temp, " \t\n");
-
-    while (token != NULL && word_count < MAX_WORDS) {
-        words[word_count] = strdup(token);
-        if (!words[word_count]) {
-            // Free any already allocated words on error
-            for (int i = 0; i < word_count; i++) {
-                free(words[i]);
-            }
-            return 0;
-        }
-        word_count++;
-        token = strtok(NULL, " \t\n");
-    }
-
-    return word_count;
+    return n;
 }

 // Free memory allocated for words
-void free_words(char *words[], int word_count) {
-    for (int i = 0; i < word_count; i++) {
-        free(words[i]);
+void free_words(char *storage) { /* simplified */
+    if (storage) { /* check for NULL */
+        free(storage); /* single free, if any */
     }
 }

+// Calculate Levenshtein distance between two strings
+int levenshtein_distance(const char *a, const char *b)
+{
+    size_t m = strlen(a), n = strlen(b);
+    if (m < n) { const char *t=a; a=b; b=t; size_t tmp=m; m=n; n=tmp; }
+
+    int *row0 = alloca((n + 1) * sizeof(int));
+    int *row1 = alloca((n + 1) * sizeof(int));
+
+    for (size_t j = 0; j <= n; ++j) row0[j] = j;
+    for (size_t i = 1; i <= m; ++i) {
+        row1[0] = i;
+        for (size_t j = 1; j <= n; ++j) {
+            int cost = (tolower((unsigned)a[i-1]) ==
+                        tolower((unsigned)b[j-1])) ? 0 : 1;
+            int del = row0[j] + 1;
+            int ins = row1[j-1] + 1;
+            int sub = row0[j-1] + cost;
+            row1[j] = (del < ins ? (del < sub ? del : sub)
+                                 : (ins < sub ? ins : sub));
+        }
+        int *tmp = row0; row0 = row1; row1 = tmp;
+    }
+    return row0[n];
+}
+
+// Calculate similarity between two words based on Levenshtein distance
+float word_similarity(const char *word1, const char *word2) {
+    int len1 = strlen(word1);
+    int len2 = strlen(word2);
+
+    // For very short words (2 chars or less), require exact match
+    if (len1 <= 2 || len2 <= 2) {
+        return str_case_cmp(word1, word2) == 0 ? 1.0f : 0.0f;
+    }
+
+    // Calculate Levenshtein distance
+    int distance = levenshtein_distance(word1, word2);
+    int max_len = len1 > len2 ? len1 : len2;
+
+    // Simple linear scoring: 1.0 for exact match, 0.9 for one char difference, etc.
+    float similarity = 1.0f - (float)distance / max_len;
+
+    // Boost similarity for small differences
+    if (distance <= 1) {
+        similarity = 0.9f + (similarity * 0.1f);
+    }
+
+    return similarity;
+}
+
 // Calculate similarity between query and target string
 float calculate_similarity(const char *query, const char *target, float cutoff) {
     // Split strings into words
-    char *query_words[MAX_WORDS] = {0};
-    char *target_words[MAX_WORDS] = {0};
+    char *query_buf = NULL, *target_buf = NULL;
+    char *query_words[MAX_WORDS], *target_words[MAX_WORDS];

-    int query_word_count = split_into_words(query, query_words);
-    int target_word_count = split_into_words(target, target_words);
+    int query_word_count = split_into_words(query, query_words, &query_buf);
+    int target_word_count = split_into_words(target, target_words, &target_buf);

     if (query_word_count == 0 || target_word_count == 0) {
-        free_words(query_words, query_word_count);
-        free_words(target_words, target_word_count);
+        free_words(query_buf);
+        free_words(target_buf);
         return 0.0;
     }

-    // Count matches
-    int matches = 0;
+    // Track best matches for each query word
+    float best_word_similarities[MAX_WORDS] = {0.0f};
+    int query_words_found = 0;
+
+    // For each query word, find its best match in target words
     for (int i = 0; i < query_word_count; i++) {
+        float best_similarity = 0.0f;
+
         for (int j = 0; j < target_word_count; j++) {
-            if (str_case_cmp(query_words[i], target_words[j]) == 0) {
-                matches++;
-                break;
+            float similarity = word_similarity(query_words[i], target_words[j]);
+            if (similarity > best_similarity) {
+                best_similarity = similarity;
             }
         }
+
+        best_word_similarities[i] = best_similarity;
+        if (best_similarity >= 0.4f) {
+            query_words_found++;
+        }
     }

-    // Calculate Jaccard similarity (intersection over union)
-    float similarity = (float)matches / (query_word_count + target_word_count - matches);
+    // Calculate average word similarity
+    float avg_word_similarity = 0.0f;
+    for (int i = 0; i < query_word_count; i++) {
+        avg_word_similarity += best_word_similarities[i];
+    }
+    avg_word_similarity /= query_word_count;

-    free_words(query_words, query_word_count);
-    free_words(target_words, target_word_count);
+    // Calculate word match ratio
+    float word_match_ratio = (float)query_words_found / query_word_count;
+
+    // Final score is the average of word match ratio and average word similarity
+    float similarity = (word_match_ratio + avg_word_similarity) / 2.0f;
+
+    // Boost score if all words are found
+    if (query_words_found == query_word_count) {
+        similarity = 0.8f + (similarity * 0.2f);
+    }
+
+    free_words(query_buf);
+    free_words(target_buf);

     return similarity;
 }
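
To make the new scoring in the hunk above easier to follow, here is a rough JavaScript port of `levenshtein_distance`, `word_similarity`, and `calculate_similarity`. It is a sketch for illustration, not part of the changeset. The function names are made up for this example, `toLowerCase()` stands in for the per-character `tolower` calls, and the `MAX_WORDS` / `MAX_STRING_LEN` limits are not modeled.

```javascript
// Two-row Levenshtein distance, compared case-insensitively.
function levenshtein(a, b) {
  a = a.toLowerCase();
  b = b.toLowerCase();
  if (a.length < b.length) [a, b] = [b, a]; // keep the shorter string as the row
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    [prev, curr] = [curr, prev]; // reuse the two rows
  }
  return prev[b.length];
}

// Word-level score: exact match required for words of 2 chars or less,
// otherwise a linear score with a boost when the distance is at most 1.
function wordSimilarity(w1, w2) {
  if (w1.length <= 2 || w2.length <= 2) {
    return w1.toLowerCase() === w2.toLowerCase() ? 1 : 0;
  }
  const distance = levenshtein(w1, w2);
  let similarity = 1 - distance / Math.max(w1.length, w2.length);
  if (distance <= 1) similarity = 0.9 + similarity * 0.1;
  return similarity;
}

// Overall score: average of the match ratio and the mean best-word similarity,
// boosted into the 0.8 to 1.0 range when every query word has a match >= 0.4.
function calculateSimilarity(query, target) {
  const split = s => s.split(/[ \t\n]+/).filter(Boolean);
  const q = split(query);
  const t = split(target);
  if (q.length === 0 || t.length === 0) return 0;
  const best = q.map(qw => Math.max(...t.map(tw => wordSimilarity(qw, tw))));
  const found = best.filter(s => s >= 0.4).length;
  const avg = best.reduce((sum, s) => sum + s, 0) / q.length;
  let similarity = (found / q.length + avg) / 2;
  if (found === q.length) similarity = 0.8 + similarity * 0.2;
  return similarity;
}

console.log(calculateSimilarity('bio bizz', 'bizz bio').toFixed(2)); // prints 1.00
console.log(calculateSimilarity('amnesia haze', 'Anesia Seeds Imperium X Auto 10'));
```

Because each query word is scored against its best counterpart in the target, word order has no effect on the result, which is why the reordered query in the first call still scores 1.00.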
@@ -232,24 +306,26 @@ SearchResult* search_index(SearchIndex* index, const char* query, float cutoff,
         }
     }

+    // If no results found, return NULL properly
+    if (*num_results == 0) {
+        free(temp_results);
+        return NULL;
+    }
+
     // Sort results by similarity
     qsort(temp_results, *num_results, sizeof(SearchResult), compare_results);

-    // Allocate final result array with exact size
-    SearchResult* results = (SearchResult*)malloc(*num_results * sizeof(SearchResult));
+    // Shrink temp_results to exact size and return it directly
+    SearchResult* results = (SearchResult*)realloc(
+        temp_results, *num_results * sizeof(SearchResult));
     if (!results) {
-        // Free all strings in temp_results
+        // realloc failure – temp_results unchanged, clean up
         for (int i = 0; i < *num_results; i++) {
             free(temp_results[i].string);
         }
         free(temp_results);
         return NULL;
     }

-    // Copy results to final array
-    memcpy(results, temp_results, *num_results * sizeof(SearchResult));
-    free(temp_results);
-
     return results;
 }

@@ -262,4 +338,4 @@ void free_search_results(SearchResult* results, int num_results) {
         free(results[i].string);
     }
     free(results);
-}
+}
@@ -5,8 +5,8 @@
 extern "C" {
 #endif

-#define MAX_STRING_LEN 100
-#define MAX_WORDS 20
+#define MAX_STRING_LEN 1000
+#define MAX_WORDS 100

 // Public API

@@ -19,7 +19,7 @@ typedef struct {

 // Structure to hold a search result
 typedef struct {
-    const char *string;
+    char *string;
     float similarity;
 } SearchResult;

@@ -41,7 +41,7 @@ SearchIndexWrapper::SearchIndexWrapper(const Napi::CallbackInfo& info)
     Napi::Env env = info.Env();
     Napi::HandleScope scope(env);

-    int capacity = 500; // Default capacity
+    int capacity = 10000; // Increased default capacity from 500 to 10000
     if (info.Length() > 0 && info[0].IsNumber()) {
         capacity = info[0].As<Napi::Number>().Int32Value();
     }
@@ -67,6 +67,12 @@ Napi::Value SearchIndexWrapper::AddString(const Napi::CallbackInfo& info) {

     std::string str = info[0].As<Napi::String>().Utf8Value();

+    // Check if string is empty
+    if (str.empty()) {
+        Napi::Error::New(env, "Empty string not allowed").ThrowAsJavaScriptException();
+        return env.Null();
+    }
+
     // Check if string is too long
     if (str.length() >= MAX_STRING_LEN) {
         Napi::Error::New(env, "String too long").ThrowAsJavaScriptException();
@@ -99,6 +105,12 @@ Napi::Value SearchIndexWrapper::Search(const Napi::CallbackInfo& info) {

     std::string query = info[0].As<Napi::String>().Utf8Value();

+    // Check if query is empty
+    if (query.empty()) {
+        Napi::Error::New(env, "Empty query not allowed").ThrowAsJavaScriptException();
+        return env.Null();
+    }
+
     // Check if query string is too long
     if (query.length() >= MAX_STRING_LEN) {
         Napi::Error::New(env, "Query string too long").ThrowAsJavaScriptException();
@@ -117,9 +129,9 @@ Napi::Value SearchIndexWrapper::Search(const Napi::CallbackInfo& info) {
     int num_results = 0;
     SearchResult* results = search_index(this->index_, query.c_str(), cutoff, &num_results);

-    if (!results) {
-        Napi::Error::New(env, "Search failed").ThrowAsJavaScriptException();
-        return env.Null();
+    // If no results found, return empty array instead of throwing error
+    if (!results || num_results == 0) {
+        return Napi::Array::New(env, 0);
     }

     Napi::Array result_array = Napi::Array::New(env, num_results);
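
From the JavaScript side, the change above means a search with no matches now yields an empty array rather than an exception. The sketch below mimics that contract in plain JavaScript under stated assumptions: `searchSketch` and `wordOverlap` are hypothetical stand-ins (the real scoring lives in the C code), and treating the cutoff as inclusive is an assumption, because the comparison inside `search_index` is not part of this diff.

```javascript
// Hypothetical stand-in for the addon's search(): score every string,
// keep those at or above the cutoff (assumed inclusive), sort descending.
function searchSketch(strings, query, cutoff, score) {
  return strings
    .map(string => ({ string, similarity: score(query, string) }))
    .filter(result => result.similarity >= cutoff)
    .sort((x, y) => y.similarity - x.similarity);
}

// Toy scorer for the demo only: share of query words present in the candidate.
function wordOverlap(query, candidate) {
  const words = new Set(candidate.toLowerCase().split(/\s+/));
  const queryWords = query.toLowerCase().split(/\s+/);
  return queryWords.filter(w => words.has(w)).length / queryWords.length;
}

const corpus = ['bizz bio mix light', 'garden soil substrate', 'bio formula'];
console.log(searchSketch(corpus, 'bio bizz', 0.2, wordOverlap));
console.log(searchSketch(corpus, 'amnesia', 0.2, wordOverlap)); // prints []
```

Callers can therefore iterate over the result with `forEach` without a try/catch or a null check, as the updated test script does.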
14
test.js
@@ -43,14 +43,22 @@ customIndex.addString('bizz bio mix light');
 // Add multiple strings at once
 customIndex.addStrings([
   'plant growth bio formula',
-  'garden soil substrate'
+  'garden soil substrate',
+  'plagron lightmix',
+  'Anesia Seeds Imperium X Auto 10',
+  'anesi'
 ]);

 console.log(`Custom index created with ${customIndex.size()} strings`);

 // Search with a higher similarity threshold
-console.log('\nSearching with higher similarity threshold (0.3):');
-const results = customIndex.search('bio bizz', 0.3);
+console.log('\nSearching with higher similarity threshold (0.1) for "amnesia":');
+const results = customIndex.search('amnesia haze', 0.1);
 results.forEach(match => {
   console.log(` ${match.similarity.toFixed(2)}: ${match.string}`);
 });
+console.log('\nSearching with higher similarity threshold (0.1) for "mix light":');
+const results2 = customIndex.search('mix light', 0.1);
+results2.forEach(match => {
+  console.log(` ${match.similarity.toFixed(2)}: ${match.string}`);
+});