* [PATCH] Fix deletion of last character in levenshtein distance @ 2008-11-18 18:53 Samuel Tardieu 2008-11-18 20:23 ` Matthieu Moy 2008-11-19 0:53 ` Johannes Schindelin 0 siblings, 2 replies; 12+ messages in thread From: Samuel Tardieu @ 2008-11-18 18:53 UTC (permalink / raw) To: git Without this change, "git tags" will not suggest "git tag" (it will only suggest "git status"), and "git statusx" will not suggest anything. Signed-off-by: Samuel Tardieu <sam@rfc1149.net> --- levenshtein.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/levenshtein.c b/levenshtein.c index db52f2c..98fea72 100644 --- a/levenshtein.c +++ b/levenshtein.c @@ -25,7 +25,7 @@ int levenshtein(const char *string1, const char *string2, row2[j + 1] > row0[j - 1] + w) row2[j + 1] = row0[j - 1] + w; /* deletion */ - if (j + 1 < len2 && row2[j + 1] > row1[j + 1] + d) + if (row2[j + 1] > row1[j + 1] + d) row2[j + 1] = row1[j + 1] + d; /* insertion */ if (row2[j + 1] > row2[j] + a) ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix deletion of last character in levenshtein distance 2008-11-18 18:53 [PATCH] Fix deletion of last character in levenshtein distance Samuel Tardieu @ 2008-11-18 20:23 ` Matthieu Moy 2008-11-19 0:53 ` Johannes Schindelin 1 sibling, 0 replies; 12+ messages in thread From: Matthieu Moy @ 2008-11-18 20:23 UTC (permalink / raw) To: Samuel Tardieu; +Cc: git Samuel Tardieu <sam@rfc1149.net> writes: > Without this change, "git tags" will not suggest "git tag" > (it will only suggest "git status"), and "git statusx" will > not suggest anything. Tested and approved, but I didn't check the code. Thanks, -- Matthieu ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix deletion of last character in levenshtein distance 2008-11-18 18:53 [PATCH] Fix deletion of last character in levenshtein distance Samuel Tardieu 2008-11-18 20:23 ` Matthieu Moy @ 2008-11-19 0:53 ` Johannes Schindelin 2008-11-19 8:42 ` Samuel Tardieu 1 sibling, 1 reply; 12+ messages in thread From: Johannes Schindelin @ 2008-11-19 0:53 UTC (permalink / raw) To: Samuel Tardieu; +Cc: git Hi, On Tue, 18 Nov 2008, Samuel Tardieu wrote: > diff --git a/levenshtein.c b/levenshtein.c > index db52f2c..98fea72 100644 > --- a/levenshtein.c > +++ b/levenshtein.c > @@ -25,7 +25,7 @@ int levenshtein(const char *string1, const char *string2, > row2[j + 1] > row0[j - 1] + w) > row2[j + 1] = row0[j - 1] + w; > /* deletion */ > - if (j + 1 < len2 && row2[j + 1] > row1[j + 1] + d) > + if (row2[j + 1] > row1[j + 1] + d) I do not understand: does row2 have more entries than len2? In any case, you will _have_ to guard against accessing elements outside the reserved memory. You'll have to be more convincing to make me agree that this is a good change, and I am pretty certain that other people are less familiar with that particular part of Git's source code than me. Ciao, Dscho ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix deletion of last character in levenshtein distance 2008-11-19 0:53 ` Johannes Schindelin @ 2008-11-19 8:42 ` Samuel Tardieu 2008-11-19 9:57 ` Johannes Schindelin 0 siblings, 1 reply; 12+ messages in thread From: Samuel Tardieu @ 2008-11-19 8:42 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git * Johannes Schindelin <Johannes.Schindelin@gmx.de> [2008-11-19 01:53:45 +0100] | Hi, | | On Tue, 18 Nov 2008, Samuel Tardieu wrote: | | > diff --git a/levenshtein.c b/levenshtein.c | > index db52f2c..98fea72 100644 | > --- a/levenshtein.c | > +++ b/levenshtein.c | > @@ -25,7 +25,7 @@ int levenshtein(const char *string1, const char *string2, | > row2[j + 1] > row0[j - 1] + w) | > row2[j + 1] = row0[j - 1] + w; | > /* deletion */ | > - if (j + 1 < len2 && row2[j + 1] > row1[j + 1] + d) | > + if (row2[j + 1] > row1[j + 1] + d) | | I do not understand: does row2 have more entries than len2? Yes it does: int *row2 = xmalloc(sizeof(int) * (len2 + 1)); | In any case, you will _have_ to guard against accessing elements | outside the reserved memory. Why would that be needed? j belongs to [0, len2[, so j+1 is always in [0, len2+1[ which is ok for both row2 and row1. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix deletion of last character in levenshtein distance 2008-11-19 8:42 ` Samuel Tardieu @ 2008-11-19 9:57 ` Johannes Schindelin 2008-11-19 13:13 ` Junio C Hamano 0 siblings, 1 reply; 12+ messages in thread From: Johannes Schindelin @ 2008-11-19 9:57 UTC (permalink / raw) To: Samuel Tardieu; +Cc: git Hi, On Wed, 19 Nov 2008, Samuel Tardieu wrote: > * Johannes Schindelin <Johannes.Schindelin@gmx.de> [2008-11-19 01:53:45 +0100] > > | Hi, > | > | On Tue, 18 Nov 2008, Samuel Tardieu wrote: > | > | > diff --git a/levenshtein.c b/levenshtein.c > | > index db52f2c..98fea72 100644 > | > --- a/levenshtein.c > | > +++ b/levenshtein.c > | > @@ -25,7 +25,7 @@ int levenshtein(const char *string1, const char *string2, > | > row2[j + 1] > row0[j - 1] + w) > | > row2[j + 1] = row0[j - 1] + w; > | > /* deletion */ > | > - if (j + 1 < len2 && row2[j + 1] > row1[j + 1] + d) > | > + if (row2[j + 1] > row1[j + 1] + d) > | > | I do not understand: does row2 have more entries than len2? > > Yes it does: int *row2 = xmalloc(sizeof(int) * (len2 + 1)); > > | In any case, you will _have_ to guard against accessing elements > | outside the reserved memory. > > Why would that be needed? j belongs to [0, len2[, so j+1 is always > in [0, len2+1[ which is ok for both row2 and row1. Okay, I understand now, _after_ having looked at the original levenshtein.c. IOW you could have made my task of reviewing your patch much easier. Anyway, here is my Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de> Thanks for the bugfix, Dscho ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Fix deletion of last character in levenshtein distance 2008-11-19 9:57 ` Johannes Schindelin @ 2008-11-19 13:13 ` Junio C Hamano 2008-11-20 12:00 ` [PATCH] Document levenshtein.c Johannes Schindelin 0 siblings, 1 reply; 12+ messages in thread From: Junio C Hamano @ 2008-11-19 13:13 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Samuel Tardieu, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > Okay, I understand now, _after_ having looked at the original > levenshtein.c. > > IOW you could have made my task of reviewing your patch much easier. > > Anyway, here is my > > Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de> > > Thanks for the bugfix, In other words, even the original author's head exploded without looking at extra context lines around the patch. It is a sure sign that the original implementation was too scantily described, and that the fix was not explained well in the proposed commit log message (i.e. in what corner cases the original was bad in what way, and how the patch fixes it). I shouldn't have to decipher the original and the fixed version with pencil and paper when re-reviewing Dscho's Ack. ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH] Document levenshtein.c 2008-11-19 13:13 ` Junio C Hamano @ 2008-11-20 12:00 ` Johannes Schindelin 2008-11-20 12:00 ` Samuel Tardieu 0 siblings, 1 reply; 12+ messages in thread From: Johannes Schindelin @ 2008-11-20 12:00 UTC (permalink / raw) To: Junio C Hamano; +Cc: Samuel Tardieu, git Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- On Wed, 19 Nov 2008, Junio C Hamano wrote: > It is a sure sign that the original implementation was too > scantily described, and that the fix was not explained well in the > proposed commit log message (i.e. in what corner cases the original > was bad in what way, and how the patch fixes it). How about this? levenshtein.c | 31 +++++++++++++++++++++++++++++++ 1 files changed, 31 insertions(+), 0 deletions(-) diff --git a/levenshtein.c b/levenshtein.c index db52f2c..298907a 100644 --- a/levenshtein.c +++ b/levenshtein.c @@ -1,6 +1,39 @@ #include "cache.h" #include "levenshtein.h" +/* + * This function implements the Damerau-Levenshtein algorithm to + * calculate a distance between strings. + * + * The idea is to build a distance matrix for the substrings of both + * strings. To avoid a large space complexity, only the last three rows + * are kept in memory (if swaps had the same or higher cost as one deletion + * plus one insertion, only two rows would be needed). + * + * At any stage, "i + 1" denotes the length of the current substring of + * string1 that the distance is calculated for (likewise "j + 1" for + * string2). + * + * row2 holds the current row, row1 the previous row (i.e. for the substring + * of string1 of length "i"), and row0 the row before that. + * + * In other words, at the start of the big loop, row1[j + 1] contains the + * Damerau-Levenshtein distance between the substring of string1 of length + * "i" and the substring of string2 of length "j + 1". + * + * All the big loop does is determine the partial minimum-cost paths. + * + * It does so by calculating the costs of the path ending in characters + * i (in string1) and j (in string2), respectively, given that the last + * operation is a substition, a swap, a deletion, or an insertion. + * + * This implementation allows the costs to be weighted: + * + * - w (as in "sWap") + * - s (as in "Substition") + * - a (for insertion, AKA "Add") + * - d (as in "Deletion") + */ int levenshtein(const char *string1, const char *string2, int w, int s, int a, int d) { -- 1.6.0.2.763.g72663 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] Document levenshtein.c 2008-11-20 12:00 ` [PATCH] Document levenshtein.c Johannes Schindelin @ 2008-11-20 12:00 ` Samuel Tardieu 2008-11-20 13:27 ` [PATCH v2] " Johannes Schindelin 0 siblings, 1 reply; 12+ messages in thread From: Samuel Tardieu @ 2008-11-20 12:00 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, git * Johannes Schindelin <Johannes.Schindelin@gmx.de> [2008-11-20 13:00:35 +0100] | How about this? I think it still lacks a note about what "deletion" and "insertion" means (is that a character deleted from string1 to obtain string2 or the reverse?). In most implementation, you use the same cost for insertion and deletion so the function is symetrical, but this implementation is more powerful. ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2] Document levenshtein.c 2008-11-20 12:00 ` Samuel Tardieu @ 2008-11-20 13:27 ` Johannes Schindelin 2008-11-20 17:21 ` Jon Loeliger 0 siblings, 1 reply; 12+ messages in thread From: Johannes Schindelin @ 2008-11-20 13:27 UTC (permalink / raw) To: Samuel Tardieu; +Cc: Junio C Hamano, git Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- On Thu, 20 Nov 2008, Samuel Tardieu wrote: > * Johannes Schindelin <Johannes.Schindelin@gmx.de> [2008-11-20 > 13:00:35 +0100] > > | How about this? > > I think it still lacks a note about what "deletion" and > "insertion" means (is that a character deleted from string1 to obtain > string2 or the reverse?). In most implementation, you use the same > cost for insertion and deletion so the function is symetrical, but > this implementation is more powerful. Second paragraph and last sentence were added. levenshtein.c | 37 +++++++++++++++++++++++++++++++++++++ 1 files changed, 37 insertions(+), 0 deletions(-) diff --git a/levenshtein.c b/levenshtein.c index db52f2c..ebef34b 100644 --- a/levenshtein.c +++ b/levenshtein.c @@ -1,6 +1,43 @@ #include "cache.h" #include "levenshtein.h" +/* + * This function implements the Damerau-Levenshtein algorithm to + * calculate a distance between strings. + * + * Basically, it says how many letters need to be swapped, substituted, + * deleted from, or added to string1, at least, to get string2. + * + * The idea is to build a distance matrix for the substrings of both + * strings. To avoid a large space complexity, only the last three rows + * are kept in memory (if swaps had the same or higher cost as one deletion + * plus one insertion, only two rows would be needed). + * + * At any stage, "i + 1" denotes the length of the current substring of + * string1 that the distance is calculated for. + * + * row2 holds the current row, row1 the previous row (i.e. for the substring + * of string1 of length "i"), and row0 the row before that. + * + * In other words, at the start of the big loop, row2[j + 1] contains the + * Damerau-Levenshtein distance between the substring of string1 of length + * "i" and the substring of string2 of length "j + 1". + * + * All the big loop does is determine the partial minimum-cost paths. + * + * It does so by calculating the costs of the path ending in characters + * i (in string1) and j (in string2), respectively, given that the last + * operation is a substition, a swap, a deletion, or an insertion. + * + * This implementation allows the costs to be weighted: + * + * - w (as in "sWap") + * - s (as in "Substition") + * - a (for insertion, AKA "Add") + * - d (as in "Deletion") + * + * Note that this algorithm calculates a distance _iff_ d == a. + */ int levenshtein(const char *string1, const char *string2, int w, int s, int a, int d) { -- 1.6.0.2.763.g72663 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] Document levenshtein.c 2008-11-20 13:27 ` [PATCH v2] " Johannes Schindelin @ 2008-11-20 17:21 ` Jon Loeliger 2008-11-20 17:48 ` Sverre Rabbelier 2008-11-20 18:31 ` Johannes Schindelin 0 siblings, 2 replies; 12+ messages in thread From: Jon Loeliger @ 2008-11-20 17:21 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Samuel Tardieu, Junio C Hamano, Git List On Thu, 2008-11-20 at 14:27 +0100, Johannes Schindelin wrote: > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> > + * This implementation allows the costs to be weighted: > + * > + * - w (as in "sWap") > + * - s (as in "Substition") > + * - a (for insertion, AKA "Add") > + * - d (as in "Deletion") > + * Were these supposed to be examples or definitions? The first looks like a definition by example. I'm not sure what "Substition" is besides a misspelling. Is it the definition "Substitution"? Or was it an example "Substitition" poorly spelled? The final two look like straight definitions. Thanks, jdl ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] Document levenshtein.c 2008-11-20 17:21 ` Jon Loeliger @ 2008-11-20 17:48 ` Sverre Rabbelier 2008-11-20 18:31 ` Johannes Schindelin 1 sibling, 0 replies; 12+ messages in thread From: Sverre Rabbelier @ 2008-11-20 17:48 UTC (permalink / raw) To: Jon Loeliger Cc: Johannes Schindelin, Samuel Tardieu, Junio C Hamano, Git List On Thu, Nov 20, 2008 at 18:21, Jon Loeliger <jdl@freescale.com> wrote: > Were these supposed to be examples or definitions? > The first looks like a definition by example. > I'm not sure what "Substition" is besides a misspelling. > Is it the definition "Substitution"? Or was it an > example "Substitition" poorly spelled? > The final two look like straight definitions. Err, I'm pretty sure it's documenting the parameters? -- Cheers, Sverre Rabbelier ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] Document levenshtein.c 2008-11-20 17:21 ` Jon Loeliger 2008-11-20 17:48 ` Sverre Rabbelier @ 2008-11-20 18:31 ` Johannes Schindelin 1 sibling, 0 replies; 12+ messages in thread From: Johannes Schindelin @ 2008-11-20 18:31 UTC (permalink / raw) To: Jon Loeliger; +Cc: Samuel Tardieu, Junio C Hamano, Git List Hi, On Thu, 20 Nov 2008, Jon Loeliger wrote: > On Thu, 2008-11-20 at 14:27 +0100, Johannes Schindelin wrote: > > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> > > > + * This implementation allows the costs to be weighted: > > + * > > + * - w (as in "sWap") > > + * - s (as in "Substition") > > + * - a (for insertion, AKA "Add") > > + * - d (as in "Deletion") > > + * > > I'm not sure what "Substition" is besides a misspelling. It is a msipeling. Thanks, Dscho ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-11-20 18:24 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-11-18 18:53 [PATCH] Fix deletion of last character in levenshtein distance Samuel Tardieu 2008-11-18 20:23 ` Matthieu Moy 2008-11-19 0:53 ` Johannes Schindelin 2008-11-19 8:42 ` Samuel Tardieu 2008-11-19 9:57 ` Johannes Schindelin 2008-11-19 13:13 ` Junio C Hamano 2008-11-20 12:00 ` [PATCH] Document levenshtein.c Johannes Schindelin 2008-11-20 12:00 ` Samuel Tardieu 2008-11-20 13:27 ` [PATCH v2] " Johannes Schindelin 2008-11-20 17:21 ` Jon Loeliger 2008-11-20 17:48 ` Sverre Rabbelier 2008-11-20 18:31 ` Johannes Schindelin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).