Lepszy algorytm rankingu podobieństwa dla ciągów o zmiennej długości

152

Szukam algorytmu podobieństwa ciągów, który daje lepsze wyniki na ciągach o zmiennej długości niż te, które są zwykle sugerowane (odległość levenshteina, soundex itp.).

Na przykład,

Biorąc pod uwagę ciąg A: "Robert",

Następnie ciąg B: „Amy Robertson”

byłby lepszy niż

Ciąg C: „Richard”

Ponadto najlepiej, aby algorytm ten był agnostyczny językowo (działa również w językach innych niż angielski).

— marzagao
źródło

podobnie w .net: stackoverflow.com/questions/83777/…

— Ciro Santilli 郝海东冠状病六四事件法轮功

Zobacz także: Współczynnik kostki

— avid_useR

155

Simon White z Catalysoft napisał artykuł o bardzo sprytnym algorytmie porównującym sąsiadujące pary znaków, który sprawdza się naprawdę dobrze w moich celach:

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon ma wersję algorytmu w języku Java, a poniżej napisałem jego wersję PL / Ruby (zaczerpniętą ze zwykłej wersji ruby wykonanej w komentarzu na forum pokrewnego przez Marka Wong-VanHarena), aby móc go używać w zapytaniach dotyczących PostgreSQL:

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '

str1.downcase! 
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
  |pair| pair.include? " "}
str2.downcase! 
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
  |pair| pair.include? " "}
union = pairs1.size + pairs2.size 
intersection = 0 
pairs1.each do |p1| 
  0.upto(pairs2.size-1) do |i| 
    if p1 == pairs2[i] 
      intersection += 1 
      pairs2.slice!(i) 
      break 
    end 
  end 
end 
(2.0 * intersection) / union

' LANGUAGE 'plruby';

Działa jak marzenie!

— marzagao
źródło

32

Znalazłeś odpowiedź i napisałeś to w 4 minuty? Imponujący!

— Matt J

28

Przygotowałem odpowiedź po kilku badaniach i wdrożeniach. Umieściłem to tutaj z korzyścią dla każdego, kto szuka w SO praktycznej odpowiedzi za pomocą alternatywnego algorytmu, ponieważ większość odpowiedzi w powiązanych pytaniach wydaje się kręcić wokół levenshteina lub soundexu.

— marzagao

18

Właśnie to, czego szukałem. Wyjdziesz za mnie?

— BlackTea

6

@JasonSundram ma rację - w rzeczywistości jest to dobrze znany współczynnik kości w bigramach na poziomie postaci, jak pisze autor w „dodatku” (na dole strony).

— Fred Foo

4

Zwraca to „wynik” 1 (dopasowanie 100%) podczas porównywania ciągów znaków, które mają jedną izolowaną literę jako różnicę, jak w tym przykładzie: string_similarity("vitamin B", "vitamin C") #=> 1czy istnieje łatwy sposób zapobiegania tego typu zachowaniom?

— MrYoshiji

77

odpowiedź marzagao jest świetna. Przekonwertowałem go na C #, więc pomyślałem, że opublikuję go tutaj:

Pastebin Link

/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }
}

— Michael La Voie
źródło

2

+100 Gdybym mógł, właśnie uratowałeś mi kumpla z ciężkiej pracy! Twoje zdrowie.

— vvohra87

1

Bardzo dobrze! Jedyna sugestia, jaką mam, to uczynić z tego rozszerzenie.

— Levitikon,

+1! Świetnie, że działa, z niewielkimi modyfikacjami również dla Javy. I wydaje się, że zwraca lepsze odpowiedzi niż Levenshtein.

— Xyene,

1

Dodałem wersję konwertującą to na metodę rozszerzenia poniżej. Dzięki za oryginalną wersję i niesamowite tłumaczenie.

— Frank Rundatz,

@Michael La Voie Dziękuję, to bardzo miłe! Chociaż mały problem z (2.0 * intersection) / union- otrzymuję Double.NaN podczas porównywania dwóch pustych ciągów.

— Vojtěch Dohnal

41

Oto kolejna wersja odpowiedzi marzagao , ta napisana w Pythonie:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in list(range(len(s) - 1))]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    """
    Run a test using the example taken from:
    http://www.catalysoft.com/articles/StrikeAMatch.html
    """
    w1 = 'Healed'
    words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']

    for w2 in words:
        print('Healed --- ' + w2)
        print(string_similarity(w1, w2))
        print()

— John Rutledge
źródło

2

Istnieje mały błąd w string_similarity, gdy w słowie występują zduplikowane ngramy, co skutkuje wynikiem> 1 dla identycznych ciągów. Dodanie „przerwy” po „hit_count + = 1” rozwiązuje ten problem.

— jbaiter

1

@jbaiter: Dobry chwyt. Zmieniłem to, aby odzwierciedlić Twoje zmiany.

— John Rutledge,

3

W artykule Simona White'a mówi: „Zauważ, że za każdym razem, gdy zostanie znalezione dopasowanie, ta para znaków jest usuwana z drugiej listy tablic, aby uniemożliwić nam wielokrotne dopasowywanie do tej samej pary znaków. przeciwko 'GG'.) „Zmieniłbym to stwierdzenie, aby powiedzieć, że dałoby to dopasowanie lepsze niż idealne. Bez uwzględnienia tego wydaje się również, że ten algorytm nie jest przechodni (podobieństwo (x, y) = / = podobieństwo (y, x)). Dodanie pairs2.remove (y) po linii hit_count + = 1 rozwiązuje problem.

— NinjaMeTimbers,

17

Oto moja implementacja w PHP sugerowanego algorytmu StrikeAMatch autorstwa Simona White'a. zalety (jak jest napisane w linku) to:

Prawdziwe odzwierciedlenie podobieństwa leksykalnego - łańcuchy z niewielkimi różnicami należy uznać za podobne. W szczególności znaczne nakładanie się podciągów powinno wskazywać na wysoki poziom podobieństwa między ciągami.
Odporność na zmiany kolejności słów - dwa ciągi zawierające te same słowa, ale w innej kolejności, należy uznać za podobne. Z drugiej strony, jeśli jeden ciąg jest tylko przypadkowym anagramem znaków zawartych w drugim, to (zwykle) należy go rozpoznać jako niepodobne.
Niezależność językowa - algorytm powinien działać nie tylko w języku angielskim, ale w wielu różnych językach.

<?php
/**
 * LetterPairSimilarity algorithm implementation in PHP
 * @author Igal Alkon
 * @link http://www.catalysoft.com/articles/StrikeAMatch.html
 */
class LetterPairSimilarity
{
    /**
     * @param $str
     * @return mixed
     */
    private function wordLetterPairs($str)
    {
        $allPairs = array();

        // Tokenize the string and put the tokens/words into an array

        $words = explode(' ', $str);

        // For each word
        for ($w = 0; $w < count($words); $w++)
        {
            // Find the pairs of characters
            $pairsInWord = $this->letterPairs($words[$w]);

            for ($p = 0; $p < count($pairsInWord); $p++)
            {
                $allPairs[] = $pairsInWord[$p];
            }
        }

        return $allPairs;
    }

    /**
     * @param $str
     * @return array
     */
    private function letterPairs($str)
    {
        $numPairs = mb_strlen($str)-1;
        $pairs = array();

        for ($i = 0; $i < $numPairs; $i++)
        {
            $pairs[$i] = mb_substr($str,$i,2);
        }

        return $pairs;
    }

    /**
     * @param $str1
     * @param $str2
     * @return float
     */
    public function compareStrings($str1, $str2)
    {
        $pairs1 = $this->wordLetterPairs(strtoupper($str1));
        $pairs2 = $this->wordLetterPairs(strtoupper($str2));

        $intersection = 0;

        $union = count($pairs1) + count($pairs2);

        for ($i=0; $i < count($pairs1); $i++)
        {
            $pair1 = $pairs1[$i];

            $pairs2 = array_values($pairs2);
            for($j = 0; $j < count($pairs2); $j++)
            {
                $pair2 = $pairs2[$j];
                if ($pair1 === $pair2)
                {
                    $intersection++;
                    unset($pairs2[$j]);
                    break;
                }
            }
        }

        return (2.0*$intersection)/$union;
    }
}

— Igal Alkon
źródło

17

Krótsza wersja odpowiedzi Johna Rutledge'a :

def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return {s[i:i+2] for i in xrange(len(s) - 1)}

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))

— kwant
źródło

Nawet intersectionzmienna jest marnotrawstwem linii.

— Chibueze Opata

14

Ta dyskusja była bardzo pomocna, dzięki. Przekonwertowałem algorytm na VBA do użytku z Excelem i napisałem kilka wersji funkcji arkusza, jedna do prostego porównania pary ciągów, a druga do porównania jednego ciągu z zakresem / tablicą ciągów. Wersja strSimLookup zwraca ostatnie najlepsze dopasowanie jako ciąg znaków, indeks tablicy lub metrykę podobieństwa.

Ta implementacja daje te same wyniki, co przykład Amazona na stronie internetowej Simona White'a z kilkoma drobnymi wyjątkami dotyczącymi meczów o niskiej punktacji; Nie jestem pewien, gdzie wkrada się różnica, może to być funkcja Split VBA, ale nie badałem, ponieważ działa dobrze do moich celów.

'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source:   http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010

Option Explicit

Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection

    Set sPairs1 = New Collection
    Set sPairs2 = New Collection

    WordLetterPairs str1, sPairs1
    WordLetterPairs str2, sPairs2

    stringSimilarity = SimilarityMetric(sPairs1, sPairs2)

    Set sPairs1 = Nothing
    Set sPairs2 = Nothing

End Function

Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim arrOut As Variant
    Dim l As Long, j As Long

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    l = rRng.Count
    ReDim arrOut(1 To l)
    For j = 1 To l
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(j)), sPairs2
        arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
        Set sPairs2 = Nothing
    Next j

    strSimA = Application.Transpose(arrOut)

End Function

Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1           : returns the index of the best matching string
' returnType = 2           : returns the similarity metric

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim metric, bestMetric As Double
    Dim i, iBest As Long
    Const RETURN_STRING As Integer = 0
    Const RETURN_INDEX As Integer = 1
    Const RETURN_METRIC As Integer = 2

    If IsMissing(returnType) Then returnType = RETURN_STRING

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    bestMetric = -1
    iBest = -1

    For i = 1 To rRng.Count
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(i)), sPairs2
        metric = SimilarityMetric(sPairs1, sPairs2)
        If metric > bestMetric Then
            bestMetric = metric
            iBest = i
        End If
        Set sPairs2 = Nothing
    Next i

    If iBest = -1 Then
        strSimLookup = CVErr(xlErrValue)
        Exit Function
    End If

    Select Case returnType
    Case RETURN_STRING
        strSimLookup = CStr(rRng(iBest))
    Case RETURN_INDEX
        strSimLookup = iBest
    Case Else
        strSimLookup = bestMetric
    End Select

End Function

Public Function strSim(str1 As String, str2 As String) As Variant
    Dim ilen, iLen1, ilen2 As Integer

    iLen1 = Len(str1)
    ilen2 = Len(str2)

    If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1

    strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))

End Function

Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl

    Dim Words() As String
    Dim word, nPairs, pair As Integer

    Words = Split(str)

    If UBound(Words) < 0 Then
        Set pairColl = Nothing
        Exit Sub
    End If

    For word = 0 To UBound(Words)
        nPairs = Len(Words(word)) - 1
        If nPairs > 0 Then
            For pair = 1 To nPairs
                pairColl.Add Mid(Words(word), pair, 2)
            Next pair
        End If
    Next word

End Sub

Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else

    Dim Intersect As Double
    Dim Union As Double
    Dim i, j As Long

    If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
        SimilarityMetric = CVErr(xlErrNA)
        Exit Function
    End If

    Union = sPairs1.Count + sPairs2.Count
    Intersect = 0

    For i = 1 To sPairs1.Count
        For j = 1 To sPairs2.Count
            If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
                Intersect = Intersect + 1
                sPairs2.Remove j
                Exit For
            End If
        Next j
    Next i

    SimilarityMetric = (2 * Intersect) / Union

End Function

— bchatham
źródło

@bchatham Wygląda to na niezwykle przydatne, ale jestem nowy w VBA i nieco zakwestionowany przez kod. Czy jest możliwe przesłanie pliku Excel, który wykorzystuje Twój wkład? Do moich celów mam nadzieję, że użyję go do dopasowania podobnych imion z jednej kolumny w programie Excel z około 1000 wpisów (fragment tutaj: dropbox.com/s/ofdliln9zxgi882/first-names-excerpt.xlsx ). Następnie użyję dopasowań jako synonimów w wyszukiwaniu osób. (zobacz także softwarerecs.stackexchange.com/questions/38227/… )

— bjornte

12

Przepraszam, odpowiedź nie została wymyślona przez autora. Jest to dobrze znany algorytm, który po raz pierwszy pojawił się w firmie Digital Equipment Corporation i jest często nazywany gontem.

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf

10

Przetłumaczyłem algorytm Simona White'a na PL / pgSQL. To mój wkład.

<!-- language: lang-sql -->

create or replace function spt1.letterpairs(in p_str varchar) 
returns varchar  as 
$$
declare

    v_numpairs integer := length(p_str)-1;
    v_pairs varchar[];

begin

    for i in 1 .. v_numpairs loop
        v_pairs[i] := substr(p_str, i, 2);
    end loop;

    return v_pairs;

end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.wordletterpairs(in p_str varchar) 
returns varchar as
$$
declare
    v_allpairs varchar[];
    v_words varchar[];
    v_pairsinword varchar[];
begin
    v_words := regexp_split_to_array(p_str, '[[:space:]]');

    for i in 1 .. array_length(v_words, 1) loop
        v_pairsinword := spt1.letterpairs(v_words[i]);

        if v_pairsinword is not null then
            for j in 1 .. array_length(v_pairsinword, 1) loop
                v_allpairs := v_allpairs || v_pairsinword[j];
            end loop;
        end if;

    end loop;


    return v_allpairs;
end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY)
returns anyarray as 
$$
    select array(select unnest($1) intersect select unnest($2))
$$ language 'sql';

--===================================================================

create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar)
returns float as
$$
declare
    v_pairs1 varchar[];
    v_pairs2 varchar[];
    v_intersection integer;
    v_union integer;
begin
    v_pairs1 := wordletterpairs(upper(p_str1));
    v_pairs2 := wordletterpairs(upper(p_str2));
    v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); 

    v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1);

    return (2.0 * v_intersection / v_union);
end;
$$ language 'plpgsql';

— fabiolimace
źródło

Działa na moim PostgreSQL, który nie obsługuje plików plruby! Dziękuję Ci!

— hostnik

Dziękuję Ci! Jak byś to zrobił w Oracle SQL?

— olovholm

Ten port jest nieprawidłowy. Dokładny ciąg nie zwraca 1.

— Brandon Wigfield

9

Wskaźniki podobieństwa ciągów zawierają przegląd wielu różnych wskaźników używanych do porównywania ciągów ( przegląd zawiera również Wikipedia ). Wiele z tych metryk jest zaimplementowanych w bibliotece simmetrics .

Jeszcze innym przykładem metryki nieuwzględnionej w tym przeglądzie jest na przykład odległość kompresji (próba przybliżenia złożoności Kołmogorowa ), której można użyć do tekstów nieco dłuższych niż ten, który przedstawiłeś.

Możesz również rozważyć przyjrzenie się znacznie szerszemu tematowi przetwarzania języka naturalnego . Te pakiety R mogą szybko rozpocząć pracę (lub przynajmniej dać kilka pomysłów).

I ostatnia edycja - poszukaj innych pytań na ten temat w SO, jest sporo powiązanych.

— Anonimowy
źródło

9

Szybsza wersja algorytmu PHP:

/**
 *
 * @param $str
 * @return mixed
 */
private static function wordLetterPairs ($str)
{
    $allPairs = array();

    // Tokenize the string and put the tokens/words into an array

    $words = explode(' ', $str);

    // For each word
    for ($w = 0; $w < count($words); $w ++) {
        // Find the pairs of characters
        $pairsInWord = self::letterPairs($words[$w]);

        for ($p = 0; $p < count($pairsInWord); $p ++) {
            $allPairs[$pairsInWord[$p]] = $pairsInWord[$p];
        }
    }

    return array_values($allPairs);
}

/**
 *
 * @param $str
 * @return array
 */
private static function letterPairs ($str)
{
    $numPairs = mb_strlen($str) - 1;
    $pairs = array();

    for ($i = 0; $i < $numPairs; $i ++) {
        $pairs[$i] = mb_substr($str, $i, 2);
    }

    return $pairs;
}

/**
 *
 * @param $str1
 * @param $str2
 * @return float
 */
public static function compareStrings ($str1, $str2)
{
    $pairs1 = self::wordLetterPairs(mb_strtolower($str1));
    $pairs2 = self::wordLetterPairs(mb_strtolower($str2));


    $union = count($pairs1) + count($pairs2);

    $intersection = count(array_intersect($pairs1, $pairs2));

    return (2.0 * $intersection) / $union;
}

Dla danych, które miałem (około 2300 porównań), miałem czas działania 0,58 sekundy z roztworem Igal Alkon w porównaniu z 0,35 sekundy z moim.

— Andy
źródło

9

Wersja w pięknej Scali:

  def pairDistance(s1: String, s2: String): Double = {

    def strToPairs(s: String, acc: List[String]): List[String] = {
      if (s.size < 2) acc
      else strToPairs(s.drop(1),
        if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2)))
    }

    val lst1 = strToPairs(s1.toUpperCase, List())
    val lst2 = strToPairs(s2.toUpperCase, List())

    (2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size)

  }

— Ignobilis
źródło

6

Oto wersja R:

get_bigrams <- function(str)
{
  lstr = tolower(str)
  bigramlst = list()
  for(i in 1:(nchar(str)-1))
  {
    bigramlst[[i]] = substr(str, i, i+1)
  }
  return(bigramlst)
}

str_similarity <- function(str1, str2)
{
   pairs1 = get_bigrams(str1)
   pairs2 = get_bigrams(str2)
   unionlen  = length(pairs1) + length(pairs2)
   hit_count = 0
   for(x in 1:length(pairs1)){
        for(y in 1:length(pairs2)){
            if (pairs1[[x]] == pairs2[[y]])
                hit_count = hit_count + 1
        }
   }
   return ((2.0 * hit_count) / unionlen)
}

— Rainer
źródło

Ten algorytm jest lepszy, ale dość powolny w przypadku dużych danych. To znaczy, jeśli trzeba porównać 10000 słów z 15000 innymi słowami, jest to zbyt wolne. Czy możemy zwiększyć jego wydajność pod względem szybkości?

— indra_patil,

6

Publikowanie odpowiedzi marzagao w C99, zainspirowane tymi algorytmami

double dice_match(const char *string1, const char *string2) {

    //check fast cases
    if (((string1 != NULL) && (string1[0] == '\0')) || 
        ((string2 != NULL) && (string2[0] == '\0'))) {
        return 0;
    }
    if (string1 == string2) {
        return 1;
    }

    size_t strlen1 = strlen(string1);
    size_t strlen2 = strlen(string2);
    if (strlen1 < 2 || strlen2 < 2) {
        return 0;
    }

    size_t length1 = strlen1 - 1;
    size_t length2 = strlen2 - 1;

    double matches = 0;
    int i = 0, j = 0;

    //get bigrams and compare
    while (i < length1 && j < length2) {
        char a[3] = {string1[i], string1[i + 1], '\0'};
        char b[3] = {string2[j], string2[j + 1], '\0'};
        int cmp = strcmpi(a, b);
        if (cmp == 0) {
            matches += 2;
        }
        i++;
        j++;
    }

    return matches / (length1 + length2);
}

Niektóre testy oparte na oryginalnym artykule :

#include <stdio.h>

void article_test1() {
    char *string1 = "FRANCE";
    char *string2 = "FRENCH";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}


void article_test2() {
    printf("====%s====\n", __func__);
    char *string = "Healed";
    char *ss[] = {"Heard", "Healthy", "Help",
                  "Herded", "Sealed", "Sold"};
    int correct[] = {44, 55, 25, 40, 80, 0};
    for (int i = 0; i < 6; ++i) {
        printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]);
    }
}

void multicase_test() {
    char *string1 = "FRaNcE";
    char *string2 = "fREnCh";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);

}

void gg_test() {
    char *string1 = "GG";
    char *string2 = "GGGGG";
    printf("====%s====\n", __func__);
    printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100);
}


int main() {
    article_test1();
    article_test2();
    multicase_test();
    gg_test();

    return 0;
}

— skrypty
źródło

5

Opierając się na niesamowitej wersji C # Michaela La Voie, zgodnie z prośbą o uczynienie z niej metody rozszerzającej, oto co wymyśliłem. Główną zaletą zrobienia tego w ten sposób jest to, że możesz sortować listę ogólną według dopasowania procentowego. Załóżmy na przykład, że w obiekcie znajduje się pole tekstowe o nazwie „Miasto”. Użytkownik wyszukuje „Chester”, a Ty chcesz zwrócić wyniki w malejącej kolejności. Na przykład chcesz, aby dosłowne dopasowania Chester były wyświetlane przed Rochester. Aby to zrobić, dodaj dwie nowe właściwości do swojego obiektu:

    public string SearchText { get; set; }
    public double PercentMatch
    {
        get
        {
            return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper());
        }
    }

Następnie dla każdego obiektu ustaw SearchText na to, czego szukał użytkownik. Następnie możesz je łatwo posortować za pomocą czegoś takiego:

    zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);

Oto niewielka modyfikacja, aby uczynić go metodą rozszerzenia:

    /// <summary>
    /// This class implements string comparison algorithm
    /// based on character pair similarity
    /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
    /// </summary>
    public static double PercentMatchTo(this string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static  string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }

— Frank Rundatz
źródło

Myślę, że lepiej byłoby, gdybyś użył bool isCaseSensitive z domyślną wartością false - nawet jeśli to prawda, implementacja jest znacznie czystsza

— Jordan

5

Moja implementacja JavaScript pobiera ciąg lub tablicę ciągów i opcjonalną podłogę (domyślna podłoga to 0,5). Jeśli przekażesz mu łańcuch, zwróci wartość true lub false w zależności od tego, czy wynik podobieństwa ciągu jest większy lub równy podłodze. Jeśli przekażesz mu tablicę ciągów, zwróci tablicę tych ciągów, których wynik podobieństwa jest większy lub równy podłodze, posortowanych według wyniku.

Przykłady:

'Healed'.fuzzy('Sealed');      // returns true
'Healed'.fuzzy('Help');        // returns false
'Healed'.fuzzy('Help', 0.25);  // returns true

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']);
// returns ["Sealed", "Healthy"]

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0);
// returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]

Oto ona:

(function(){
  var default_floor = 0.5;

  function pairs(str){
    var pairs = []
      , length = str.length - 1
      , pair;
    str = str.toLowerCase();
    for(var i = 0; i < length; i++){
      pair = str.substr(i, 2);
      if(!/\s/.test(pair)){
        pairs.push(pair);
      }
    }
    return pairs;
  }

  function similarity(pairs1, pairs2){
    var union = pairs1.length + pairs2.length
      , hits = 0;

    for(var i = 0; i < pairs1.length; i++){
      for(var j = 0; j < pairs2.length; j++){
        if(pairs1[i] == pairs2[j]){
          pairs2.splice(j--, 1);
          hits++;
          break;
        }
      }
    }
    return 2*hits/union || 0;
  }

  String.prototype.fuzzy = function(strings, floor){
    var str1 = this
      , pairs1 = pairs(this);

    floor = typeof floor == 'number' ? floor : default_floor;

    if(typeof(strings) == 'string'){
      return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase();
    }else if(strings instanceof Array){
      var scores = {};

      strings.map(function(str2){
        scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase());
      });

      return strings.filter(function(str){
        return scores[str] >= floor;
      }).sort(function(a, b){
        return scores[b] - scores[a];
      });
    }
  };
})();

— gruppler
źródło

1

Błąd / literówka! for(var j = 0; j < pairs1.length; j++){powinno byćfor(var j = 0; j < pairs2.length; j++){

— Searle

3

Algorytm współczynnika Dice (odpowiedź Simona White'a / marzagao) jest zaimplementowany w Ruby w metodzie pair_distance_similar w amatch gem

https://github.com/flori/amatch

Ten klejnot zawiera również implementacje szeregu algorytmów przybliżonego dopasowywania i porównywania ciągów: odległość edycji Levenshteina, odległość edycji sprzedawcy, odległość Hamminga, najdłuższa wspólna długość podciągu, najdłuższa wspólna długość podciągu, metryka odległości pary, metryka Jaro-Winklera .

— s01ipsist
źródło

2

Wersja Haskell - nie krępuj się sugerować zmian, ponieważ nie zrobiłem zbyt wiele Haskell.

import Data.Char
import Data.List

-- Convert a string into words, then get the pairs of words from that phrase
wordLetterPairs :: String -> [String]
wordLetterPairs s1 = concat $ map pairs $ words s1

-- Converts a String into a list of letter pairs.
pairs :: String -> [String]
pairs [] = []
pairs (x:[]) = []
pairs (x:ys) = [x, head ys]:(pairs ys)

-- Calculates the match rating for two strings
matchRating :: String -> String -> Double
matchRating s1 s2 = (numberOfMatches * 2) / totalLength
  where pairsS1 = wordLetterPairs $ map toLower s1
        pairsS2 = wordLetterPairs $ map toLower s2
        numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2
        totalLength = fromIntegral $ length pairsS1 + length pairsS2

— matthewpalmer
źródło

2

Clojure:

(require '[clojure.set :refer [intersection]])

(defn bigrams [s]
  (->> (split s #"\s+")
       (mapcat #(partition 2 1 %))
       (set)))

(defn string-similarity [a b]
  (let [a-pairs (bigrams a)
        b-pairs (bigrams b)
        total-count (+ (count a-pairs) (count b-pairs))
        match-count (count (intersection a-pairs b-pairs))
        similarity (/ (* 2 match-count) total-count)]
    similarity))

— Shaun Lebron
źródło

1

A co z odległością Levenshteina podzieloną przez długość pierwszej struny (lub alternatywnie podzieloną przez moją min / max / średnią długość obu strun)? Jak na razie to działa.

— tehvan
źródło

Jednak cytując inny post na ten temat, to, co zwraca, jest często „nieregularne”. „Echo” klasyfikuje jako bardzo podobne do „psa”.

— Xyene,

@Nox: Część tej odpowiedzi „podzielona przez długość pierwszego ciągu” jest znacząca. Działa to również lepiej niż chwalony algorytm Dice'a dla literówek i błędów transpozycji, a nawet typowych koniugacji (rozważ na przykład porównanie „pływać” i „pływać”).

— Logan Pickup

1

Hej ludzie, wypróbowałem to w javascript, ale jestem w tym nowy, ktoś zna szybsze sposoby na zrobienie tego?

function get_bigrams(string) {
    // Takes a string and returns a list of bigrams
    var s = string.toLowerCase();
    var v = new Array(s.length-1);
    for (i = 0; i< v.length; i++){
        v[i] =s.slice(i,i+2);
    }
    return v;
}

function string_similarity(str1, str2){
    /*
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    */
    var pairs1 = get_bigrams(str1);
    var pairs2 = get_bigrams(str2);
    var union = pairs1.length + pairs2.length;
    var hit_count = 0;
    for (x in pairs1){
        for (y in pairs2){
            if (pairs1[x] == pairs2[y]){
                hit_count++;
            }
        }
    }
    return ((2.0 * hit_count) / union);
}


var w1 = 'Healed';
var word =['Heard','Healthy','Help','Herded','Sealed','Sold']
for (w2 in word){
    console.log('Healed --- ' + word[w2])
    console.log(string_similarity(w1,word[w2]));
}

— user779420
źródło

Ta implementacja jest nieprawidłowa. Funkcja bigram przerywa wprowadzanie długości 0. Metoda string_similarity nie działa poprawnie w drugiej pętli, co może prowadzić do wielokrotnego zliczania par, co prowadzi do zwracanej wartości przekraczającej 100%. Zapomniałeś również o zadeklarowaniu xi y, i nie powinieneś przechodzić przez pętle za pomocą for..in..pętli ( for(..;..;..)zamiast tego użyj ).

— Rob W.

1

Oto kolejna wersja podobieństwa oparta na indeksie Sørensena – Dice (odpowiedź marzagao), ta napisana w C ++ 11:

/*
 * Similarity based in Sørensen–Dice index.
 *
 * Returns the Similarity between _str1 and _str2.
 */
double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) {
    // Base case: if some string is empty.
    if (_str1.empty() || _str2.empty()) {
        return 1.0;
    }

    auto str1 = upper_string(_str1);
    auto str2 = upper_string(_str2);

    // Base case: if the strings are equals.
    if (str1 == str2) {
        return 0.0;
    }

    // Base case: if some string does not have bigrams.
    if (str1.size() < 2 || str2.size() < 2) {
        return 1.0;
    }

    // Extract bigrams from str1
    auto num_pairs1 = str1.size() - 1;
    std::unordered_set<std::string> str1_bigrams;
    str1_bigrams.reserve(num_pairs1);
    for (unsigned i = 0; i < num_pairs1; ++i) {
        str1_bigrams.insert(str1.substr(i, 2));
    }

    // Extract bigrams from str2
    auto num_pairs2 = str2.size() - 1;
    std::unordered_set<std::string> str2_bigrams;
    str2_bigrams.reserve(num_pairs2);
    for (unsigned int i = 0; i < num_pairs2; ++i) {
        str2_bigrams.insert(str2.substr(i, 2));
    }

    // Find the intersection between the two sets.
    int intersection = 0;
    if (str1_bigrams.size() < str2_bigrams.size()) {
        const auto it_e = str2_bigrams.end();
        for (const auto& bigram : str1_bigrams) {
            intersection += str2_bigrams.find(bigram) != it_e;
        }
    } else {
        const auto it_e = str1_bigrams.end();
        for (const auto& bigram : str2_bigrams) {
            intersection += str1_bigrams.find(bigram) != it_e;
        }
    }

    // Returns similarity coefficient.
    return (2.0 * intersection) / (num_pairs1 + num_pairs2);
}

— chema989
źródło

1

Szukałem czystej rubinowej implementacji algorytmu, na który wskazuje odpowiedź @ marzagao. Niestety link wskazany przez @marzagao jest uszkodzony. W odpowiedzi @ s01ipsist wskazał zestaw ruby gem, w którym implementacja nie jest w czystym rubinie. Poszukałem więc trochę i znalazłem gem fuzzy_match, który ma czystą implementację ruby (choć ten klejnot używa amatch) tutaj . Mam nadzieję, że to pomoże komuś takiemu jak ja.

— Engr. Hasanuzzaman Sumon
źródło