Faster Fuzzy Matching at Scale: Why I Built go-fuzz

Date Posted: 2025-06-11
Author: gruvsaliens.eth
Read Time: 7 min
Tags:
go
fuzzy-matching
deduplication
clustering

Introduction & Motivation

go-fuzz is a modern, high-performance Go library for fuzzy string matching, deduplication, and clustering. It's designed for typo-tolerant grouping at scale, making it possible to efficiently process millions of records with accuracy and speed.

I built go-fuzz because existing solutions, especially in Python, were too slow, too memory-hungry, or simply not designed for large-scale, production-grade deduplication and clustering.

Problems with Existing Solutions

Python's fuzzywuzzy and rapidfuzz libraries are great for small datasets, but they struggle with performance and memory usage when you need to deduplicate or cluster millions of records. Even with multiprocessing, Python's GIL and memory overhead made it hard to scale. Meanwhile, the most popular Go package for fuzzy matching was deprecated and unmaintained, lacking support for modern features and flexible input/output.

from fuzzywuzzy import fuzz
# This works, but is slow for millions of records
score = fuzz.ratio('alice@example.com', 'al1ce@example.com')

Python is easy, but not always fast enough for big data.

Building go-fuzz: A Go-Powered Engine

To solve these issues, I built go-fuzz: a fast, modular Go library for fuzzy string matching, deduplication, and grouping. It's designed for typo-tolerant clustering (e.g., deduplicating millions of emails by fuzzy prefix) with a focus on scalability, extensibility, and ease of use. The core is written in Go for maximum performance and concurrency.

package main

import (
    "github.com/gruvsaliens/go-fuzz/fuzzy"
)

func main() {
    // Configure the input source, grouping thresholds, and parallelism.
    opts, err := fuzzy.NewOptionsBuilder().
        WithInput("emails.ndjson").
        WithGroupConfig(fuzzy.GroupConfig{
            MaxDistance:    2,
            MinConfidence:  80,
            FuzzyAlgorithm: fuzzy.FuzzyLevenshtein,
        }).
        WithBatchSize(10000).
        WithMaxWorkers(8).
        Build()
    if err != nil {
        panic(err)
    }

    // Run the clustering engine with the configured options.
    fuzzy.WraithFuzzyWuzzyEngine(opts)
}

go-fuzz can process millions of records in parallel, with minimal memory overhead.

Key Features of go-fuzz

  • Fuzzy Algorithms: Levenshtein, Jaro-Winkler, Damerau-Levenshtein
  • Confidence & Prefix: Dynamic thresholds and prefix handling
  • Search Engine: Fast approximate search backed by a BK-tree
  • Performance: Batch processing and multi-core support
  • I/O: Flexible input and output: files, slices, Postgres, MySQL
  • Metadata: Rich group metadata and logging
  • Configuration: Configurable via API or CLI
  • Extensibility: Pluggable design built for large-scale deduplication
  • Monitoring: Advanced logging, progress reporting, and error handling

Flexible Inputs & Outputs

go-fuzz is designed to be flexible and easy to integrate into any data pipeline. You can provide input as a file (NDJSON, JSON, TXT), a Go slice, or directly from a database (Postgres, MySQL) using built-in connectors. Output can be written to files, writers, or streamed to callbacks for further processing.

File input: Use a local file (NDJSON, JSON, or TXT) as your data source.

opts := fuzzy.NewOptionsBuilder().WithInput("emails.ndjson")

Read millions of records directly from a file.

Slice input: Pass a Go slice of strings for in-memory processing.

opts := fuzzy.NewOptionsBuilder().WithInput([]string{"alice@example.com", "bob@example.com"})

Process data already loaded in memory.

Postgres input: Use a Postgres database as your input source with a built-in connector.

opts := fuzzy.NewOptionsBuilder().WithInput(fuzzy.PostgresInputConfig{/* ... */})

Stream data from your Postgres database in batches.

MySQL input: Use a MySQL database as your input source with a built-in connector.

opts := fuzzy.NewOptionsBuilder().WithInput(fuzzy.MySQLInputConfig{/* ... */})

Stream data from your MySQL database in batches.

Algorithmic Details & Multi-Pass Flow

go-fuzz uses a multi-pass approach to maximize both speed and accuracy. Each pass is designed to reduce the search space and improve clustering quality:

Pass 1: Prefix Grouping — Quickly groups records by exact or near-exact prefixes to reduce the search space for fuzzy matching. This is a fast O(n) operation that enables efficient downstream processing.
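
To make Pass 1 concrete, here is a minimal sketch of prefix bucketing in plain Go. The groupByPrefix helper is illustrative only, not go-fuzz's actual internals, and it slices by bytes, which is fine for ASCII identifiers like emails:

func groupByPrefix(records []string, prefixLen int) map[string][]string {
    // One map insert per record keeps the whole pass O(n).
    groups := make(map[string][]string, len(records))
    for _, r := range records {
        key := r
        if len(r) > prefixLen {
            key = r[:prefixLen]
        }
        groups[key] = append(groups[key], r)
    }
    return groups
}

Each bucket is then small enough for the more expensive fuzzy pass that follows.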

Pass 2: Intra-Group Fuzzy Matching — Within each prefix group, fuzzy matching algorithms (Levenshtein, Jaro-Winkler, Damerau-Levenshtein) are applied to cluster similar records. This pass uses the selected algorithm and confidence thresholds to form tight clusters, reducing false positives.

Pass 3: Cross-Group Merging — Optionally, go-fuzz can perform a final pass to merge clusters across groups using a more relaxed threshold or a different algorithm, catching edge cases and maximizing deduplication. This is especially useful for large, noisy datasets.
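
As an illustration of Pass 3, the sketch below greedily merges clusters whose representative strings fall within a relaxed threshold. The cluster type and merge strategy here are hypothetical, not go-fuzz's actual logic:

type cluster struct {
    rep     string   // representative string for the cluster
    members []string // all records assigned to it
}

// mergeClusters folds each cluster into the first existing cluster whose
// representative is within maxDist; otherwise it stays separate. The
// pairwise scan is O(k^2) in the number of clusters, which is cheap
// because k is far smaller than the record count by this point.
func mergeClusters(clusters []cluster, maxDist int, dist func(a, b string) int) []cluster {
    var merged []cluster
    for _, c := range clusters {
        placed := false
        for i := range merged {
            if dist(c.rep, merged[i].rep) <= maxDist {
                merged[i].members = append(merged[i].members, c.members...)
                placed = true
                break
            }
        }
        if !placed {
            merged = append(merged, c)
        }
    }
    return merged
}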

Levenshtein Distance: Measures the minimum number of single-character edits needed to change one string into another. Time complexity: O(m*n), where m and n are the lengths of the two strings. Great for general typos and misspellings.

score := fuzzy.Levenshtein("alice@example.com", "al1ce@example.com") // 1 (single substitution: 'i' -> '1')

Levenshtein is ideal for general typos and misspellings.
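
For reference, the classic dynamic-programming formulation fits in a few lines of plain Go. This is a standalone sketch (Go 1.21+ for the min builtin), independent of fuzzy.Levenshtein's actual implementation:

// levenshtein computes edit distance with two rolling rows of the DP
// table: O(m*n) time, O(n) space. Byte-based, which is fine for ASCII.
func levenshtein(a, b string) int {
    prev := make([]int, len(b)+1)
    curr := make([]int, len(b)+1)
    for j := range prev {
        prev[j] = j
    }
    for i := 1; i <= len(a); i++ {
        curr[0] = i
        for j := 1; j <= len(b); j++ {
            cost := 1
            if a[i-1] == b[j-1] {
                cost = 0
            }
            curr[j] = min(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
        }
        prev, curr = curr, prev
    }
    return prev[len(b)]
}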

Jaro-Winkler: Focuses on the number and order of matching characters, giving more weight to a shared prefix. Time complexity: O(m*n) in the worst case, though the bounded match window keeps it fast in practice on short strings. Useful for short strings and names.

score := fuzzy.JaroWinkler("alice", "al1ce") // ≈0.89

Jaro-Winkler is fast and great for short strings and names.
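
The standard algorithm is short enough to sketch in full. Again, this is a standalone illustration (Go 1.21+ for the min/max builtins), not go-fuzz's internal code:

// jaroWinkler returns a similarity in [0, 1].
func jaroWinkler(s1, s2 string) float64 {
    if s1 == s2 {
        return 1.0
    }
    if len(s1) == 0 || len(s2) == 0 {
        return 0.0
    }
    // Characters match only within a window around the same position.
    window := max(len(s1), len(s2))/2 - 1
    if window < 0 {
        window = 0
    }
    used1 := make([]bool, len(s1))
    used2 := make([]bool, len(s2))
    matches := 0
    for i := 0; i < len(s1); i++ {
        lo, hi := max(0, i-window), min(len(s2)-1, i+window)
        for j := lo; j <= hi; j++ {
            if !used2[j] && s1[i] == s2[j] {
                used1[i], used2[j] = true, true
                matches++
                break
            }
        }
    }
    if matches == 0 {
        return 0.0
    }
    // Half the out-of-order matched characters count as transpositions.
    t, k := 0, 0
    for i := 0; i < len(s1); i++ {
        if !used1[i] {
            continue
        }
        for !used2[k] {
            k++
        }
        if s1[i] != s2[k] {
            t++
        }
        k++
    }
    m := float64(matches)
    jaro := (m/float64(len(s1)) + m/float64(len(s2)) + (m-float64(t)/2)/m) / 3
    // Winkler bonus: reward a shared prefix of up to four characters.
    prefix := 0
    for prefix < min(len(s1), len(s2), 4) && s1[prefix] == s2[prefix] {
        prefix++
    }
    return jaro + float64(prefix)*0.1*(1-jaro)
}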

Damerau-Levenshtein: Like Levenshtein, but also counts transpositions (swapping two adjacent characters) as a single edit. Time complexity: O(m*n). Especially useful for common human typos.

score := fuzzy.DamerauLevenshtein("alcie", "alice") // 1 (single adjacent transposition)

Damerau-Levenshtein is best for catching transposition errors.
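
The optimal string alignment variant of Damerau-Levenshtein adds just one case to the Levenshtein recurrence. A standalone sketch (the library may implement the unrestricted variant instead):

// damerauOSA counts an adjacent transposition, e.g. "alcie" -> "alice",
// as a single edit on top of the usual insert/delete/substitute costs.
func damerauOSA(a, b string) int {
    d := make([][]int, len(a)+1)
    for i := range d {
        d[i] = make([]int, len(b)+1)
        d[i][0] = i
    }
    for j := 0; j <= len(b); j++ {
        d[0][j] = j
    }
    for i := 1; i <= len(a); i++ {
        for j := 1; j <= len(b); j++ {
            cost := 1
            if a[i-1] == b[j-1] {
                cost = 0
            }
            d[i][j] = min(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
            if i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
                d[i][j] = min(d[i][j], d[i-2][j-2]+1) // transposition
            }
        }
    }
    return d[len(a)][len(b)]
}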

Choose the algorithm that best fits your data and error patterns. Multiple passes allow for both precision and recall in deduplication.

BK-Trees: Fast Approximate Search

A brute-force approach to fuzzy matching is O(n^2) comparisons and quickly becomes infeasible for large datasets. go-fuzz uses a BK-tree (Burkhard-Keller tree) to index strings by edit distance. Because edit distance is a metric, the triangle inequality lets a query for all strings within a given threshold prune entire subtrees, reducing the number of comparisons by orders of magnitude; for tight thresholds, search cost approaches logarithmic rather than linear. This makes BK-trees ideal for large-scale deduplication.

tree := fuzzy.NewBKTree(fuzzy.Levenshtein)
tree.Add("alice@example.com")
matches := tree.Search("al1ce@example.com", 2)

BK-trees make it possible to find near-duplicates without comparing against every record.
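
A complete BK-tree is compact enough to sketch end to end. This is an illustrative implementation over any metric distance function, not go-fuzz's internal one; pairing it with the levenshtein sketch above gives a working near-duplicate index:

type bkNode struct {
    word     string
    children map[int]*bkNode // keyed by edit distance to this node's word
}

type BKTree struct {
    root *bkNode
    dist func(a, b string) int
}

func NewBKTree(dist func(a, b string) int) *BKTree {
    return &BKTree{dist: dist}
}

func (t *BKTree) Add(word string) {
    if t.root == nil {
        t.root = &bkNode{word: word, children: map[int]*bkNode{}}
        return
    }
    node := t.root
    for {
        d := t.dist(word, node.word)
        if d == 0 {
            return // exact duplicate, already indexed
        }
        child, ok := node.children[d]
        if !ok {
            node.children[d] = &bkNode{word: word, children: map[int]*bkNode{}}
            return
        }
        node = child
    }
}

// Search returns all indexed words within maxDist of query. The triangle
// inequality guarantees matches can only live in subtrees whose edge
// distance lies in [d-maxDist, d+maxDist], so everything else is pruned.
func (t *BKTree) Search(query string, maxDist int) []string {
    var out []string
    var walk func(n *bkNode)
    walk = func(n *bkNode) {
        if n == nil {
            return
        }
        d := t.dist(query, n.word)
        if d <= maxDist {
            out = append(out, n.word)
        }
        for edge, child := range n.children {
            if edge >= d-maxDist && edge <= d+maxDist {
                walk(child)
            }
        }
    }
    walk(t.root)
    return out
}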

Clustering, Chunking, and Parallelism

To further scale fuzzy matching, go-fuzz divides the input into chunks and processes them in parallel using goroutines. Each chunk is clustered independently, and then results are merged. This chunking approach allows us to fully utilize multi-core CPUs and keep memory usage predictable, even for massive datasets. Clustering and chunking reduce the effective problem size per worker, improving cache locality and throughput.

emails := loadEmails()
chunks := fuzzy.Chunk(emails, 10000)

// Buffered so no worker blocks waiting on the merge loop below.
results := make(chan []fuzzy.EmailGroup, len(chunks))

// Cluster each chunk independently in its own goroutine.
for _, chunk := range chunks {
    go func(c []string) {
        results <- fuzzy.GroupEmailsByPrefix(c, cfg, true)
    }(chunk)
}

// Merge results from all goroutines.
var merged []fuzzy.EmailGroup
for range chunks {
    groups := <-results
    merged = append(merged, groups...)
}

Chunking + goroutines = scalable, fast, and memory-efficient processing.

Memory Management & Efficiency

go-fuzz is built for efficiency. By chunking data, using streaming I/O, and keeping allocations tight, it keeps memory usage predictable and low, even for huge datasets. Intermediate results are processed and discarded as soon as possible, and parallelism ensures that no single core or memory region becomes a bottleneck. Batch processing and streaming allow deduplication of millions of records without running out of memory.

// Example: processing in batches to limit memory
batchSize := 10000
for batch := range fuzzy.BatchReader("emails.ndjson", batchSize) {
    groups := fuzzy.GroupEmailsByPrefix(batch, cfg, true)
    // process groups...
}

Batching and streaming keep memory usage low and predictable.

Python vs Go: Real-World Performance

In benchmarks, go-fuzz outperforms Python libraries by 10-50x on large datasets. Go's concurrency and low memory usage make it possible to cluster millions of emails in minutes, not hours. Plus, the API is designed to be as easy to use as Python's, but with the power of Go under the hood.

// Example: grouping emails streamed from Postgres
cfg := fuzzy.PostgresInputConfig{
    Host:      "localhost",
    Port:      5432,
    User:      "myuser",
    Password:  "mypassword",
    DBName:    "mydb",
    Table:     "emails",
    Column:    "email",
    BatchSize: 100000,
}

opts, err := fuzzy.NewOptionsBuilder().
    WithInput(cfg).
    WithGroupConfig(fuzzy.GroupConfig{ /* ... */ }).
    WithBatchSize(100000).
    WithMaxWorkers(8).
    Build()
if err != nil {
    panic(err)
}

fuzzy.WraithFuzzyWuzzyEngine(opts)

go-fuzz: scalable, fast, and production-ready.

Why Not Use Existing Go Packages?

Before building go-fuzz, I explored existing Go fuzzy matching libraries. Unfortunately, the most popular Go package for fuzzy matching was deprecated and no longer maintained, lacking support for modern features, performance optimizations, and flexible input/output options. This motivated me to create go-fuzz as a modern, production-ready, and highly customizable alternative—supporting more algorithms, richer configuration, and seamless integration with files, slices, and databases.

go-fuzz is a next-generation replacement for deprecated Go fuzzy matching libraries.

Conclusion & Key Takeaways

Conclusion:

go-fuzz is a modern, production-ready solution for large-scale fuzzy matching, deduplication, and clustering. By combining Go's concurrency primitives (goroutines, channels), efficient data structures (BK-trees), and chunked clustering, it dramatically improves speed, scalability, and memory efficiency compared to both Python and legacy Go libraries. The right architecture and algorithms can turn a multi-hour job into a multi-minute one, making typo-tolerant deduplication and clustering possible at true production scale.

Key Takeaways:

Ecosystem: The Go ecosystem needed a robust, production-grade fuzzy matcher. go-fuzz is a complete redesign—modern, reliable, and built for scale.

Customizable & Flexible: Real-world data is unpredictable. go-fuzz lets you pick algorithms, set thresholds, and connect to files, slices, or databases like Postgres/MySQL. Flexibility is essential for integrating fuzzy matching into diverse pipelines.

Performance: Brute-force fuzzy matching is O(n²) and doesn't scale. By using BK-trees and clustering, go-fuzz achieves O(n log n) or better. The right data structures make large-scale deduplication practical.

Concurrency: Go's goroutines and chunking enable true parallelism. I've seen multi-million record jobs finish in minutes, not hours. Concurrency is a must for high-throughput data processing.

Memory Efficiency: go-fuzz processes data in batches and streams results, keeping memory usage low and predictable. Efficient memory management is critical for stability at scale.

Production-Ready: go-fuzz includes detailed logging, progress tracking, and robust error handling. Production tools need transparency and reliability, not just speed.

Please Note: This package will remain private until it meets my standards for release.