# Similarity Search

A Node.js module that performs word-order-independent similarity search on strings.

This module is built as a native addon that uses C code for fast similarity computations. Its similarity metric combines fuzzy matching, prefix matching, and word-level comparisons to find matches regardless of word order.

## Installation

```bash
npm install similarity-search
```

## Dependencies

- Node.js (with node-gyp for building native addons)
- nan (^2.22.2)
- node-addon-api (^6.0.0)

## Usage

```javascript
const SimilaritySearch = require('similarity-search');

// Create a new search index with default capacity (500)
const index = new SimilaritySearch();

// Add strings to the index
index.addString('bio bizz');
index.addString('lightmix bizz btio substrate');
index.addString('bizz bio mix light');

// Add multiple strings at once
index.addStrings([
  'plant growth bio formula',
  'garden soil substrate'
]);

// Search the index with a query and similarity cutoff
const results = index.search('bio bizz', 0.2);

// Display results
results.forEach(match => {
  console.log(`${match.similarity.toFixed(2)}: ${match.string}`);
});
```

## API

### `new SimilaritySearch([capacity])`

Creates a new search index.

- `capacity` (optional): Initial capacity for the index. Default: 500.
- Returns: A new SimilaritySearch instance.

### `addString(str)`

Adds a string to the index.

- `str`: The string to add.
- Returns: Boolean indicating success (`true` if the string was added, `false` otherwise).

### `addStrings(strings)`

Adds multiple strings to the index.

- `strings`: Array of strings to add.
- Returns: Boolean indicating whether every add succeeded (`true` if all succeeded, `false` if any failed).

### `search(query, [cutoff])`

Searches the index for strings similar to the query.

- `query`: The search query.
- `cutoff` (optional): Similarity threshold between 0.0 and 1.0. Default: 0.2.
- Returns: Array of matching results, sorted by similarity (descending). Each result is an object with:
  - `string`: The matching string
  - `similarity`: The similarity score (0.0 to 1.0)

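Because results are documented as sorted by descending similarity, a caller can tighten the threshold or keep only the top few matches without searching again. The following is a minimal sketch of that post-processing; `takeTopMatches` is a hypothetical helper that is not part of this module, and the sample array is made-up data shaped like the documented return value.

```javascript
// Hypothetical helper (not part of the module): keep at most `limit` results
// whose similarity is at least `minSimilarity`. Relies on the input already
// being sorted by similarity in descending order.
function takeTopMatches(results, minSimilarity, limit) {
  return results
    .filter(match => match.similarity >= minSimilarity)
    .slice(0, limit);
}

// Made-up sample shaped like the documented return value of search();
// the scores are illustrative, not real output of the module.
const sampleResults = [
  { string: 'bio bizz', similarity: 1.0 },
  { string: 'bizz bio mix light', similarity: 0.8 },
  { string: 'plant growth bio formula', similarity: 0.3 }
];

const top = takeTopMatches(sampleResults, 0.5, 2);
console.log(top.map(match => match.string));
```

In real use, `sampleResults` would be replaced by the array returned from `index.search(query, cutoff)`.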
### `size()`

Gets the number of strings in the index.

- Returns: Number of strings in the index.

## Helper Functions

### `SimilaritySearch.createTestIndex([size])`

Creates a test index with random data.

- `size` (optional): Number of strings to generate. Default: 500.
- Returns: A new SimilaritySearch instance with random data.
- Note: The first 5 strings are fixed test cases, followed by randomly generated strings.

### `SimilaritySearch.benchmark(index, queries, [cutoff])`

Benchmarks the search performance.

- `index`: The index to benchmark.
- `queries`: Array of search queries.
- `cutoff` (optional): Similarity threshold. Default: 0.2.
- Returns: Array of benchmark results, each containing:
  - `query`: The search query
  - `matches`: Number of matches found
  - `timeMs`: Search time in milliseconds
  - `topResults`: Top 5 matching results

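To make the shape of the benchmark output concrete, here is a rough sketch of a loop that produces the same fields. It is an assumption about how such a helper could be written, not the module's actual implementation; `benchmarkSketch` and `stubIndex` are hypothetical names, and the stub stands in for a real index so the example runs without the native addon.

```javascript
const { performance } = require('perf_hooks');

// Hypothetical re-implementation of a benchmark loop; the module's own
// SimilaritySearch.benchmark may work differently internally.
function benchmarkSketch(index, queries, cutoff = 0.2) {
  return queries.map(query => {
    const start = performance.now();
    const results = index.search(query, cutoff);
    const timeMs = performance.now() - start;
    return {
      query,
      matches: results.length,
      timeMs,
      topResults: results.slice(0, 5)
    };
  });
}

// Stand-in index (exact matches only, ignores the cutoff) so the sketch
// is runnable on its own. Any object with a search(query, cutoff) method works.
const stubIndex = {
  strings: ['bio bizz', 'garden soil substrate'],
  search(query, cutoff) {
    return this.strings
      .filter(s => s === query)
      .map(s => ({ string: s, similarity: 1.0 }));
  }
};

const report = benchmarkSketch(stubIndex, ['bio bizz', 'missing']);
report.forEach(r => {
  console.log(`${r.query}: ${r.matches} match(es) in ${r.timeMs.toFixed(3)} ms`);
});
```

A single timed call per query is noisy; repeating each query several times and averaging would likely give steadier numbers.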
## How It Works

The similarity search uses a multi-stage matching algorithm:

1. **Word-level Matching**: The algorithm first splits both the query and target strings into words.

2. **Word Similarity Calculation**: For each word pair, similarity is calculated using:
   - Levenshtein distance for fuzzy matching
   - Special handling for short words (words of 3 characters or fewer require an exact match)
   - Prefix matching for words of significantly different lengths
   - Length-based similarity adjustments

3. **Overall Similarity Score**: The final similarity score is a weighted combination of:
   - Word match score (70% weight): Percentage of query words that have a good match
   - Average word similarity (30% weight): Average similarity of the best matching word pairs

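The steps above can be approximated in plain JavaScript. This is a simplified sketch, not the addon's C implementation: it covers Levenshtein matching, the short-word rule, and the 70/30 weighting, but leaves out prefix matching and length-based adjustments. The `GOOD_MATCH` threshold, the whitespace tokenization, the lowercasing, and the normalization by the longer word's length are all assumptions, since the source does not specify them.

```javascript
// Classic dynamic-programming Levenshtein edit distance (two-row variant).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed threshold: the source does not say what counts as a "good" match.
const GOOD_MATCH = 0.7;

// Similarity of two words in [0, 1].
function wordSimilarity(a, b) {
  if (a === b) return 1;
  // Short words (3 characters or fewer) must match exactly.
  if (a.length <= 3 || b.length <= 3) return 0;
  // Assumed normalization: edit distance relative to the longer word.
  return 1 - levenshtein(a, b) / Math.max(a.length, b.length);
}

// Overall score: 70% share of query words with a good match,
// 30% average similarity of each query word's best match.
function similarity(query, target) {
  const queryWords = query.toLowerCase().split(/\s+/).filter(Boolean);
  const targetWords = target.toLowerCase().split(/\s+/).filter(Boolean);
  if (queryWords.length === 0 || targetWords.length === 0) return 0;

  // Best-scoring target word for each query word. Word positions are
  // ignored, which is what makes the score independent of word order.
  const best = queryWords.map(q =>
    Math.max(...targetWords.map(t => wordSimilarity(q, t)))
  );
  const matchScore = best.filter(s => s >= GOOD_MATCH).length / best.length;
  const avgSimilarity = best.reduce((sum, s) => sum + s, 0) / best.length;
  return 0.7 * matchScore + 0.3 * avgSimilarity;
}

console.log(similarity('bio bizz', 'bizz bio mix light'));
console.log(similarity('bio bizz', 'garden soil substrate'));
```

In this sketch the reordered string scores far higher than the unrelated one, because every query word finds an exact counterpart regardless of position. The real module's scores may differ, since it also applies the prefix and length rules omitted here.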
This approach provides robust matching that:

- Handles typos and slight variations in words
- Requires exact matches for short words to avoid false positives
- Recognizes prefix matches (e.g., "bio" matches "biology")
- Considers both word presence and character-level similarity