Znajdowanie interesujących anagramów

31

Powiedzieć, że i są dwa ciągi o tej samej długości. Anagramming dwóch ciągów bijective mapowania tak, że dla każdego . $a_1a_2\ldots a_n$ $b_1b_2\ldots b_n$ $p:[1\ldots n]\to[1\ldots n]$ $a_i = b_{p(i)}$ $i$

Może być więcej niż jedno anagramowanie dla tej samej pary ciągów. Na przykład, jeśli `abcab` mamy i , między innymi. $a=$ $b=$ cabab $p_1[1,2,3,4,5]\to[4,5,1,2,3]$ $p_2[1,2,3,4,5] \to [2,5,1,4,3]$

Powiemy, że waga anagramowania jest liczbą cięć, które należy wykonać w pierwszym ciągu, aby uzyskać kawałki, które można zmienić, aby uzyskać drugi ciąg. Formalnie jest to liczba wartości dla których . Oznacza to, że jest to liczba punktów, w których jest nie zwiększają dokładnie 1. Dla przykładu, i , ponieważ cięcia raz, do fragmentów i i cięcia czterech razy, na pięć części. $w(p)$ $p$ $i\in[1\ldots n-1]$ $p(i)+1\ne p(i+1)$ $p$ $w(p_1) = 1$ $w(p_2) = 4$ $p_1$ 1234512345 $p_2$ 12345

Załóżmy, że istnieje anagramowanie dla dwóch ciągów i . W takim razie co najmniej jedno anagramowanie musi mieć najmniejszą wagę. Powiedzmy, że ten jest najlżejszy . (Może być wiele najlżejszych anagramów; nie obchodzi mnie to, ponieważ interesują mnie tylko ciężary.) $a$ $b$

Pytanie

Chcę algorytmu, który, biorąc pod uwagę dwa ciągi, dla których istnieje anagramowanie, wydajnie daje dokładną wagę najlżejszego anagramowania dwóch ciągów. W porządku, jeśli algorytm daje najlżejsze anagramowanie, ale nie musi.

Generowanie wszystkich anagramowań i ich ważenie jest dość prostą sprawą, ale może być ich wiele, więc wolałbym metodę, która bezpośrednio wyszukuje anagramy.

Motywacja

Powód, dla którego ten problem jest interesujący, jest następujący. Bardzo łatwo jest zmusić komputer do przeszukiwania słownika i znajdowania anagramów, par słów zawierających dokładnie te same litery. Ale wiele wyprodukowanych anagramów jest nieciekawych. Na przykład najdłuższe przykłady, które można znaleźć w drugim międzynarodowym słowniku Webstera, to:

cholecystoduodenostomy
duodenocholecystostomy

Problem powinien być jasne: są nieciekawe, ponieważ przyznać bardzo lekki anagramming które po prostu zamienia cholecysto, duedenoi stomysekcje, na wadze 2. Z drugiej strony, to znacznie krótszy przykładem jest znacznie bardziej zaskakujące i interesujące:

linia brzegowa
przekroju

Tutaj najlżejsze anagramowanie ma wagę 8.

Mam program, który używa tej metody do znajdowania interesujących anagramów, a mianowicie tych, dla których wszystkie anagramy mają dużą wagę. Ale robi to, generując i ważąc wszystkie możliwe anagramy, co jest powolne.

— Mark Dominus
źródło

Właśnie z ciekawości, jak znaleźć pary anagramów? Czy poszukujesz

użyciem siły brute we wszystkich słowach tej samej długości?

O (n^{2})

$\mathcal O(n^2)$

— Pedro

4

Nie, oczywiście nie. Konwertujesz każde słowo na postać kanoniczną, która ma te same litery w kolejności alfabetycznej. (Na przykład kanoniczna forma cholecystoduodenostomyto ccddeehlmnooooossttuyy.) Dwa słowa są anagramami, jeśli tylko wtedy, gdy mają taką samą kanoniczną formę. Przechowujesz słowa w tablicy skrótów, wpisując ich kanoniczne formy, a gdy znajdziesz kolizję, masz anagram.

— Mark Dominus

Mam teraz na swoim blogu wiele mniej lub bardziej powiązanych informacji na ten temat: (α) (β) (γ) (δ)

— Mark Dominus,

21

Ten problem jest znany jako „problem minimalnego wspólnego ciągu łańcuchowego”. (Dokładniej, odpowiedź w problemie minimalnego wspólnego ciągu łańcuchowego jest równa odpowiedzi w twoim problemie plus 1). Niestety, jest NP-trudna, nawet z ograniczeniem, że każda litera występuje co najwyżej dwa razy w każdym z ciągów wejściowych, o czym świadczą Goldstein, Kilman i Zheng [GKZ05]. Oznacza to, że nie istnieje algorytm wielomianowy, chyba że P = NP. (Oczywiście, jeśli każda litera występuje najwyżej raz, problem jest trywialny, ponieważ istnieje tylko jedno anagramowanie).

Z drugiej strony, ci sami autorzy [GKZ05] podają algorytm aproksymacji czasu wielomianowego 1,1037 z tym samym ograniczeniem. („Algorytm aproksymacji 1,1037 ” oznacza algorytm, który może nie wysyłać poprawnej odpowiedzi A, ale gwarantuje uzyskanie wartości B takiej, że A ≤ B ≤ 1,1037 A. ) Dają również algorytm 4-aproksymacji w czasie liniowym pod słabsze ograniczenie, że każda litera występuje najwyżej trzy razy w każdym z ciągów wejściowych.

[GKZ05] Avraham Goldstein, Petr Kolman i Jie Zheng. Problem z minimalnym wspólnym podziałem łańcucha: Twardość i przybliżenia. Electronic Journal of Combinatorics , 12, artykuł R50, 2005. http://www.combinatorics.org/ojs/index.php/eljc/article/view/v12i1r50

— Tsuyoshi Ito
źródło

Opublikowałem kontynuację .

— Mark Dominus

9

Jest to kontynuacja powyższej odpowiedzi Tsuyoshi Ito , podsumowująca najistotniejszą część cytowanego przez niego artykułu GKZ05 .

Artykuł potwierdza redukcję problemu maksymalnego zestawu niezależnego ( MIS ). Sporządzić wykres , którego wierzchołki są pary w taki sposób, i . Połącz wierzchołki i (gdzie ) za pomocą krawędzi, gdy nie jest możliwe, aby anagramowanie odwzorowało wszystkie $G$ $(i, j)$ $a_i = b_j$ $a_{i+1} = b_{j+1}$ $(i, j)$ $(k, \ell)$ $i≤k$ i i oraz . Jest to łatwe do wykrycia; takie mapowanie jest niemożliwe dokładnie wtedy, gdy spełniony jest jeden z następujących warunków: $i\mapsto j$ $i+1\mapsto j+1$ $k\mapsto\ell$ $k+1\mapsto\ell+1$

i $i=k$ $j\ne\ell$
i $i+1=k$ $j+1\ne\ell$
$i+1<k$ and $\{j, j+1\}$ is disjoint from $\{\ell, \ell+1\}$

Say the resulting graph $G$ has a maximal independent set of size $s$ . Then the minimum anagramming weight is exactly $n-s-1$ , where $n$ is the length of the strings $a$ and $b$ . (The converse holds also: a low-weight anagramming translates directly into a large MIS for $G$ . For details, see pp. 4–5 of the paper.)

yttrioustouristyouriououriri $s=2$ y|t|t|ri|ou|st|ou|ri|s|t|y

Z drugiej strony, rozważ derateri treader. Tym razem wykres ma trzy wierzchołki:

DErater + treaDEr
dERater + treadER
deratER + treadER

2 and 3 are incompatible, and 1 and 3 are incompatible, but 1 and 2 are compatible. So the unique MIS has size $s=2$ and contains vertices 1 and 2. The corresponding anagramming of weight 7-2-1=4 is der|a|t|e|r ↔ t|r|e|a|der.

— Mark Dominus
źródło

2

Thank you for the follow-up post, but this is not a proof of NP-completeness of your problem. To prove the NP-completeness of your problem, you have to reduce some known NP-complete problem to your problem, and that is Theorem 2.2 of [GKZ05]. What you presented here (Lemma 1.1 of [GKZ05]) is a reduction in the opposite direction.

— Tsuyoshi Ito

This is a nice reformulation. A trivial change that's a minor simplification conceptually (at least for me): instead of drawing edges between pairs that are incompatible and asking for the maximum independent set, we can draw edges between pairs that are compatible, and ask for the maximum clique. (I find it easier to think about "what's the maximum number of pairs that we can keep together".)

— ShreevatsaR

2

It doesn't cover the exact algorithm you had in mind (which Tsuyoshi Ito's answer does), but trying to get at the underlying problem of finding "interesting" anagrams...

My first thought was to use some variation on edit-distance, where the atomic changes are weighted according to their "interestingness" rather than the usual "difficulty" or "confusability" weightings. Of course, it seems unlikely that you can efficiently encode the really interesting transformations in this way, since they're likely to be non-local and hence run into the NP-complete issues of MIS, etc.

So, second thought would be to construct a letter-to-letter alignment between the words (à la machine translation alignments), and then to score the alignments themselves for "interestingness" (e.g., counting the alignments that take adjacent letters to non-adjacent letters, or how many alignments each alignment crosses, etc; and then combine them all via loglinear model or such).

Third idea is to completely abandon looking at the structure of the anagramming itself, and instead look at the semantics of the words. Often what makes an anagram "interesting" is the incongruity between the meanings of the words involved. So try something like computing their distance in WordNet, or similar.

— wren romano
źródło

0

The problem can be phrased in terms of permutation groups.

Now a permutation group contains all the "anagram moves", both primitive (swapping two letters) and composite of sequences of primitive moves. It seems that you are interested in only a subset of the possible permutations. I will attempt to define these.

First, recall the notation for permutations, namely, the so-called cycle notation:

$()$ means no permutation.
$(1)$ means 1 is swapped with 1, which is also no permutation.
$(12)$ means 1 and 2 are swapped.
$(123)$ means 1 replaces 2 which replaces 3 which replaces 1 (a rotation).
and so one

These simple 'cycles' are composed to describe more complex permutations.

It seems that the moves you are interested in are (for a word of length $n$ ):

swaps of pairs of single characters: these are the swaps such as $(12)$
swaps of pairs of 2 consecutive characters: these are permutations of the form $(a\ b)(a+1\ b+1)$ , where $a>0$ and $b<a+1$ and $b+1\le n$
...
swaps of pairs of n consecutive characters: these are permutations of the form $(a\ b)(a+1\ b+1)\cdots(a+i-1\ b+i-1)$ where $a>0$ , $a+i-1\le b$ , and $b+i-1\le n$ .

These moves form the basis for your algorithm. What you are interested in is finding the smallest sequence of these moves to move from one word to the next.

I don't know any algorithm for computing this, apart from the brute force search, but at least now there is a clearer (I hope) description of what the primitive moves are. (And maybe some group theorist among us can point to an appropriate algorithm.)

— Dave Clarke
źródło

1

Thanks. Perhaps I'm being pessimistic, but it seems to me that this approach is going to be difficult. I don't think a group-theoretic approach will yield fruit unless we first find out what permutation group is of interest, and that varies depending on the input strings. I think efficient representation of finite groups is an extremely deep and rich problem. But I'd like to be mistaken.

— Mark Dominus

1

“What you are interested in is finding the smallest sequence of these moves to move from one word to the next.” I do not think that this is correct. For example, if n=4, the swap (1 2) has weight 2, but the swap (2 3) has weight 3. Your way of counting does not distinguish these two.

— Tsuyoshi Ito

I answered late at night. I didn't understand the weight measure correctly. In fact, I don't understand it now. I though you wanted to allow moves of blocks of letters, which is why I went to all the trouble of defining these primitives. My answer might provide inspiration, so I'll leave it, even though it is wrong.

— Dave Clarke

0

For cholecystoduodenostomy/duodenocholecystostome, I notice that if you assigned a number to each character describing how much it was moved as a delta, you would have some thing like 7 7's, then 8 -7s, then 6 0's. That is not right because some chars may have been repeated (the second c only moved forward 2, not back 7) etc but still very "run length encodable" because you see the same deltas in a row.

Compare to coastline/sectional, where you see something like (+2)(+5)(+5)(-3)(-1)(+3)....much less "run length encodable".

Perhaps the randomness of the deltas could give you a "score" as to how interesting the anagram is?

— Dan Gelder
źródło