Skip to content

Latest commit

 

History

History
194 lines (135 loc) · 7.94 KB

README.md

File metadata and controls

194 lines (135 loc) · 7.94 KB

Welcome to simhilarity

Simhilarity is a gem for matching up text strings that are similar but not identical. Here is how it works:

  1. Normalize strings. Downcase, remove non-alpha, etc:

    normalize("Hello,  WORLD!") => "hello world"
  2. Calculate ngrams from strings. Specifically, it creates bigrams (2 character ngrams) and also creates an ngram for each sequence of digits in the string:

                         # bigrams                        # digits
    ngrams("hi 123") => ["hi", "i ", " 1", "12", "23"] + ["123"]
  3. Calculate frequency of ngrams in the corpus.

  4. Select pairs of strings that might be matches. These are called candidates, and there are a few different ways they are chosen (see options). Simhilarity will try to pick the best method based on the size of your data set.

  5. Score candidates by measuring ngram overlap (with frequency weighting), using the dice coefficient.

  6. For each input string, return the match with the highest score.

Here is output from a sample run:

score   needle                                      haystack
1.000   Night Heron 19                              Night Heron 19
1.000   103 Oceanwood                               103 Oceanwood
0.987   Sea Crest 1504                              1504 Sea Crest
0.986   Twin Oaks 189                               189 Twin Oaks
0.981   Sea Crest 1205                              1205 Sea Crest
0.980   Sea Crest 2411                              2411 Sea Crest
0.972   Sea Crest 3405                              3405 Sea Crest
0.968   Barrington Arms 504                         504 Barrington Arms
0.964   Windsor Place 503                           503 Windsor Place
0.951   1802 Bluff Villas - Hilton Head Island      1802 Bluff Villas
0.943   3221 Villamare - Hilton Head Island         3221 Villamare
0.941   134 Shorewood - Hilton Head Island          134 Shorewood
0.900   1 Quail Street                              1 Quail
0.894   2 Quail Street                              2 Quail
0.823   Windsor II 2315                             2315 Windsor Place II
0.736   Beachside Tennis 12                         12 Beachside
0.732   16 Piping Plover - Hilton Head Island       16 Piping Plover
0.460   7 Quail                                     7 QUAIL/126 Dune Lane
0.379   11 Battery                                  11 Gunnery

Note that the final match has the lowest score, and is incorrect!

Usage

simhilarity executable

The gem includes an executable called simhilarity. For example:

$ simhilarity needles.txt haystack.txt
score,needle,haystack
0.900,1 Quail Street,1 Quail
1.000,103 Oceanwood,103 Oceanwood
...

It will print out the best matches between needle and haystack in CSV format. Use simhilarity --verbose to look at pretty progress bars while it's running. Use --candidates to customize the candidates selection method, which will dramatically affect performance for large data sets.

Simhilarity::Matcher

To use simhilarity from code:

require "rubygems"
require "simhilarity"

haystack = ["Black Sabbath", "Led Zeppelin", "The Doors",
            "The Beatles", "Neil Young"]
needles = ["blak sabbath", "doors", "neal yung"]

matcher = Simhilarity::Matcher.new
matcher.haystack = haystack
matcher.matches(needles).each do |needle, haystack, score|
  printf "%.2f %s / %s\n", score, needle, haystack
end

# 0.90 blak sabbath / Black Sabbath
# 0.77 doors / The Doors
# 0.67 neal yung / Neil Young

By default, simhilarity assumes that needles and haystack are arrays of strings. To use something else, set reader to a proc that converts your opaque objects into strings. See options.

Benchmarks

When looking at simhilarity's speed, there are two important aspects to consider:

  • picking candidates - how long does it take to pick decent candidates out of all the potential string pairs?
  • matching - once candidates are identified, how long does it take to score them?

Picking Candidates

There are three different methods for picking candidates - see options for a detailed explanation. Here are some numbers from my i5 3ghz, for a test dataset consisting of 500 needles and 10,000 haystacks.

method      time   candidates returned
simhash 5   4s     3,500
simhash 6   7s     5,000
simhash 7   9s     10,000   (this is the default)
simhash 8   12s    25,000
simhash 9   13s    60,000

ngrams 5    45s    1,000,000
ngrams 4    45s    1,500,000
ngrams 3    40s    2,000,000

all         4s   5,000,000

Matching

Once candidates are identified, the string pairs are scored and winners are picked out. Scoring is O(n). On my i5 3ghz:

candidates   time
50,000       1s
1,000,000    35s
5,000,000    140s

Options

There are a few ways to configure simhilarity:

  • candidates - controls how candidates are picked from the complete set of all string pairs. We want to avoid looking at all string pairs, because that's quite expensive for large datasets. On the other hand, if we examine too few we might miss some of the best matches. A conundrum. There are three different settings:

    :simhash - generate a weighted simhash for each string, then iterate the needles and look for "nearby" haystack simhashes using a bktree. Simhashes are compared using the hamming distance. If the hamming distance between the simhashes <= #simhash_max_hamming, the pair becomes a candidate. The default max hamming distance is 7.

    :ngrams - for each pair of strings, count the number of ngrams they have in common. If the overlap is >= #ngram_overlaps, the pair becomes a candidate. The default minimum number of overlaps is 3.

    :all - all pairs are examined. This is completely braindead and very slow for large datasets.

    Simhash works great, but there's no reason not to use :ngrams or even :all for small data sets. In fact, that's what simhilarity does by default - if you use a small dataset (needle * haystack < 200,000) it defaults to :all, otherwise it uses :simhash. Some examples:

    # defaults to :all or :simhash based on data set size
    matcher = Simhilarity::Matcher.new
    
    # use :simhash, custom max_hamming
    matcher = Simhilarity::Matcher.new
    matcher.candidates = :simhash
    matcher.simhash_max_hamming =  8
    
    # use :ngrams, custom overlaps
    matcher = Simhilarity::Matcher.new
    matcher.candidates = :ngrams
    matcher.ngram_overlaps = 4

    or:

    $ simhilarity --candidates simhash   needles.txt haystack.txt
    $ simhilarity --candidates simhash=8 needles.txt haystack.txt
    $ simhilarity --candidates ngrams    needles.txt haystack.txt
    $ simhilarity --candidates ngrams=4  needles.txt haystack.txt
    
  • reader - proc for converting your opaque objects into strings. Set this to use something other than strings for source data. For example, if you want to match author names between ActiveRecord book objects:

    matcher.reader = lambda { |i| i.author }
    matcher.haystack = haystack
    matcher.matches(needles)
  • normalizer - proc for normalizing incoming strings. The default normalizer downcases, removes non-alphas, and strips whitespace.

  • ngrammer - proc for converting normalized strings into ngrams. The default ngrammer pulls out bigrams and runs of digits, which is perfect for matching names and addresses.

  • scorer - proc for scoring candidates. By default, measures ngram overlap (with frequency weighting), using the dice coefficient.

  • verbose - if true, show progress while simhilarity is working. Great for the impatient. Use --verbose from the command line.

Changelog

Master (unreleased)

  • Works with Ruby 2.0 - thanks @abscondment!
  • Travis
  • Accessor for options instead of a gigantic hash