[PATCH] help.autocorrect: do not run a command if the command given is junk

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] help.autocorrect: do not run a command if the command given is junk
@ 2009-12-14 13:03 Johannes Sixt
  2009-12-14 13:29 ` Alex Riesen
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Johannes Sixt @ 2009-12-14 13:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Git Mailing List, Alex Riesen

From: Johannes Sixt <j6t@kdbg.org>

If a given command is not found, then help.c tries to guess which one the
user could have meant. If help.autocorrect is 0 or unset, then a list of
suggestions is given as long as the dissimilarity between the given command
and the candidates is not excessively high. But if help.autocorrect was
non-zero (i.e., a delay after which the command is run automatically), the
latter restriction on dissimilarity was not obeyed.

In my case, this happened:

 $ git ..daab02
 WARNING: You called a Git command named '..daab02', which does not exist.
 Continuing under the assumption that you meant 'read-tree'
 in 4.0 seconds automatically...

The similarity limit that this patch introduces is already used a few lines
later where the list of suggested commands is printed.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 help.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/help.c b/help.c
index e8db31f..db888cf 100644
--- a/help.c
+++ b/help.c
@@ -331,7 +331,7 @@ const char *help_unknown_cmd(const char *cmd)
 	n = 1;
 	while (n < main_cmds.cnt && best_similarity == main_cmds.names[n]->len)
 		++n;
-	if (autocorrect && n == 1) {
+	if (autocorrect && n == 1 && best_similarity < 6) {
 		const char *assumed = main_cmds.names[0]->name;
 		main_cmds.names[0] = NULL;
 		clean_cmdnames(&main_cmds);
-- 
1.6.6.rc1.46.g1635

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command  given is junk
  2009-12-14 13:03 [PATCH] help.autocorrect: do not run a command if the command given is junk Johannes Sixt
@ 2009-12-14 13:29 ` Alex Riesen
  2009-12-14 14:17   ` Johannes Schindelin
  2009-12-14 14:16 ` Johannes Schindelin
  2009-12-14 20:08 ` Junio C Hamano
  2 siblings, 1 reply; 11+ messages in thread
From: Alex Riesen @ 2009-12-14 13:29 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Junio C Hamano, Johannes Schindelin, Git Mailing List

On Mon, Dec 14, 2009 at 14:03, Johannes Sixt <j.sixt@viscovery.net> wrote:
> From: Johannes Sixt <j6t@kdbg.org>
>
> If a given command is not found, then help.c tries to guess which one the
> user could have meant. If help.autocorrect is 0 or unset, then a list of
> suggestions is given as long as the dissimilarity between the given command
> and the candidates is not excessively high. But if help.autocorrect was
> non-zero (i.e., a delay after which the command is run automatically), the
> latter restriction on dissimilarity was not obeyed.
>
> In my case, this happened:
>
>  $ git ..daab02
>  WARNING: You called a Git command named '..daab02', which does not exist.
>  Continuing under the assumption that you meant 'read-tree'
>  in 4.0 seconds automatically...
>
> The similarity limit that this patch introduces is already used a few lines
> later where the list of suggested commands is printed.

Yes, sure. We probably just missed that (I don't use autocorrect myself
apart from the testing. I assume Johannes doesn't, too)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command given is junk
  2009-12-14 13:03 [PATCH] help.autocorrect: do not run a command if the command given is junk Johannes Sixt
  2009-12-14 13:29 ` Alex Riesen
@ 2009-12-14 14:16 ` Johannes Schindelin
  2009-12-14 20:08 ` Junio C Hamano
  2 siblings, 0 replies; 11+ messages in thread
From: Johannes Schindelin @ 2009-12-14 14:16 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Junio C Hamano, Git Mailing List, Alex Riesen

Hi,

On Mon, 14 Dec 2009, Johannes Sixt wrote:

> From: Johannes Sixt <j6t@kdbg.org>
> 
> If a given command is not found, then help.c tries to guess which one the
> user could have meant. If help.autocorrect is 0 or unset, then a list of
> suggestions is given as long as the dissimilarity between the given command
> and the candidates is not excessively high. But if help.autocorrect was
> non-zero (i.e., a delay after which the command is run automatically), the
> latter restriction on dissimilarity was not obeyed.
> 
> In my case, this happened:
> 
>  $ git ..daab02
>  WARNING: You called a Git command named '..daab02', which does not exist.
>  Continuing under the assumption that you meant 'read-tree'
>  in 4.0 seconds automatically...
> 
> The similarity limit that this patch introduces is already used a few lines
> later where the list of suggested commands is printed.
> 
> Signed-off-by: Johannes Sixt <j6t@kdbg.org>
> ---

Obvious ACK from me.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command  given is junk
  2009-12-14 13:29 ` Alex Riesen
@ 2009-12-14 14:17   ` Johannes Schindelin
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Schindelin @ 2009-12-14 14:17 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Johannes Sixt, Junio C Hamano, Git Mailing List

Hi,

On Mon, 14 Dec 2009, Alex Riesen wrote:

> I don't use autocorrect myself apart from the testing. I assume Johannes 
> doesn't, too

As a matter of fact, I do.  But my common mistakes are not in forgetting 
the subcommand, rather in mispelign them.  So I never hit the problem.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command given is junk
  2009-12-14 13:03 [PATCH] help.autocorrect: do not run a command if the command given is junk Johannes Sixt
  2009-12-14 13:29 ` Alex Riesen
  2009-12-14 14:16 ` Johannes Schindelin
@ 2009-12-14 20:08 ` Junio C Hamano
  2009-12-14 21:09   ` Johannes Schindelin
  2009-12-14 21:55   ` Johannes Sixt
  2 siblings, 2 replies; 11+ messages in thread
From: Junio C Hamano @ 2009-12-14 20:08 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Johannes Schindelin, Git Mailing List, Alex Riesen

Johannes Sixt <j.sixt@viscovery.net> writes:

> From: Johannes Sixt <j6t@kdbg.org>
>
> If a given command is not found, then help.c tries to guess which one the
> user could have meant. If help.autocorrect is 0 or unset, then a list of
> suggestions is given as long as the dissimilarity between the given command
> and the candidates is not excessively high. But if help.autocorrect was
> non-zero (i.e., a delay after which the command is run automatically), the
> latter restriction on dissimilarity was not obeyed.
>
> In my case, this happened:
>
>  $ git ..daab02
>  WARNING: You called a Git command named '..daab02', which does not exist.
>  Continuing under the assumption that you meant 'read-tree'
>  in 4.0 seconds automatically...
>
> The similarity limit that this patch introduces is already used a few lines
> later where the list of suggested commands is printed.
>
> Signed-off-by: Johannes Sixt <j6t@kdbg.org>
> ---
>  help.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)

Thanks.  Will apply to 'maint'.

But I am curious about and would prefer to see the story behind '6'
someday.

This is not entirely a new problem that you are introducing, as the same
comparison-with-six appears in a later part of the code, but this patch
duplicates the magic number whose two instances need to match, so in that
sense it makes it more necessary than before to document the choice of
magic number somewhere in a comment in the code.

Did 8af84da (git wrapper: DWIM mistyped commands, 2008-08-31) made up just
a random number out of thin-air?  What bad things would happen if we used
'600' (or '1') instead of '6'?  What kind of correlation does the cut-off
value we use here should have with some intrinsic numbers your git
installation has (e.g. perhaps "must be at least 80% of average length of
available commands" or something like that)?

In the meantime, I think squashing the following in would help us keep the
two magic numbers in sync.

diff --git a/help.c b/help.c
index db888cf..fbf80d9 100644
--- a/help.c
+++ b/help.c
@@ -297,6 +297,9 @@ static void add_cmd_list(struct cmdnames *cmds, struct cmdnames *old)
 	old->names = NULL;
 }
 
+/* how did we decide this is a good cutoff??? */
+#define SIMILAR_ENOUGH(x) ((x) < 6)
+
 const char *help_unknown_cmd(const char *cmd)
 {
 	int i, n, best_similarity = 0;
@@ -331,7 +334,7 @@ const char *help_unknown_cmd(const char *cmd)
 	n = 1;
 	while (n < main_cmds.cnt && best_similarity == main_cmds.names[n]->len)
 		++n;
-	if (autocorrect && n == 1 && best_similarity < 6) {
+	if (autocorrect && n == 1 && SIMILAR_ENOUGH(best_similarity)) {
 		const char *assumed = main_cmds.names[0]->name;
 		main_cmds.names[0] = NULL;
 		clean_cmdnames(&main_cmds);
@@ -349,7 +352,7 @@ const char *help_unknown_cmd(const char *cmd)
 
 	fprintf(stderr, "git: '%s' is not a git-command. See 'git --help'.\n", cmd);
 
-	if (best_similarity < 6) {
+	if (SIMILAR_ENOUGH(best_similarity)) {
 		fprintf(stderr, "\nDid you mean %s?\n",
 			n < 2 ? "this": "one of these");
 

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command given is junk
  2009-12-14 20:08 ` Junio C Hamano
@ 2009-12-14 21:09   ` Johannes Schindelin
  2009-12-14 21:47     ` Junio C Hamano
  2009-12-14 21:55   ` Johannes Sixt
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2009-12-14 21:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Sixt, Git Mailing List, Alex Riesen

Hi,

On Mon, 14 Dec 2009, Junio C Hamano wrote:

> I am curious about and would prefer to see the story behind '6' someday.

The '6' as a cut-off to the levenshtein distances we list when 
autocorrecting was derived in a totally scientific manner:

1) first I implemented Levenshtein-Damerau with a configurable weight of 
   neighbor flips ("switches"), substitutions, additions and deletions,

2) next I patched the code to sort the availablecommands by their distance 
   to the mispelt command,

3) as this lists way too much, I implemented a cut-off that was 
   configurable by an environment variable (without any safety checks, as 
   I did not plan to release that code anyway),

4) now comes the totally, unbelievably cunningly scientific part: I did a 
   self-experiment!  I deliberately mispelt commands in a totally random 
   manner!

5) then I changed the code to actually output the distances so I could 
   determine a cut-off that makes sense with my type of tyops,

6) after about 15 tries of deliberate mistakes (mostly doing what I 
   usually do, something like "git pull" and "git log" or something like 
   that, but watching TV, chatting on the phone _and_ cleaning the dishes 
   at the same time), I found that 5 was too low and 7 too large.

The number '6' happily coincided with the number of steps I needed to come 
up with the number.  You see?  The _perfect_ way to determine a completely 
arbitrary number.

Actually, you probably see that I just made up that number and tested a 
few times, and it seemed to work reasonably well.

FWIW almost the same procedure led to the weights 0, 2, 1 and 4 that you 
see in help.c.  The weights are basically factors with which mistakes are 
punished: if you just confuse two adjacent letters, such as "psuh" instead 
of "push" (which can be quite common if you use two hands, one on the left 
side, and one on the right side of the keyboard, with an en-US layout so 
many of us use, myself included) it costs 0.

If you write a different character than what you intended, the cost is 2.  
The idea behind it is that you're more likely to miss a key than to hit 
the wrong key.  With the laptop I am typing this email on, it is 
particularly likely that I miss a key, because there are certain 
key combinations where only the first key triggers an input event, but the 
second only triggers an input event when it is _released_ after the first 
one.  So when I type "er" real fast and happen to release the "e" key 
after the "r" key, no "r" appears on my screen.

Okay, so the weight for adding a character must be smaller than 
substituting a character, but why is the cost for deletion so high?  
Well, I really rarely type unnecessary characters (except when writing to 
the Git mailing list, arguably) so those costs must be substantially 
higher than for typing the wrong character.

My original plan was to log all my tyops into a log file and analyze those 
errors later, but then my initial 0, 2, 1, 4 and 6 constants worked well 
enough for me that I did not bother.

Satisfied?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command given is junk
  2009-12-14 21:09   ` Johannes Schindelin
@ 2009-12-14 21:47     ` Junio C Hamano
  2009-12-14 22:04       ` Johannes Schindelin
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2009-12-14 21:47 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Johannes Sixt, Git Mailing List, Alex Riesen

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Satisfied?

Very much.

    FWIW almost the same procedure led to the weights 0, 2, 1 and 4 that you 
    see in help.c.  The weights are basically factors with which mistakes are 
    punished: if you just confuse two adjacent letters, such as "psuh" instead 
    of "push" (which can be quite common if you use two hands, one on the left 
    side, and one on the right side of the keyboard, with an en-US layout so 
    many of us use, myself included) it costs 0.

    If you write a different character than what you intended, the cost is 2.  
    The idea behind it is that you're more likely to miss a key than to hit 
    the wrong key.  With the laptop I am typing this email on, it is 
    particularly likely that I miss a key, because there are certain 
    key combinations where only the first key triggers an input event, but the 
    second only triggers an input event when it is _released_ after the first 
    one.  So when I type "er" real fast and happen to release the "e" key 
    after the "r" key, no "r" appears on my screen.

    Okay, so the weight for adding a character must be smaller than 
    substituting a character, but why is the cost for deletion so high?  
    Well, I really rarely type unnecessary characters (except when writing to 
    the Git mailing list, arguably) so those costs must be substantially 
    higher than for typing the wrong character.

These are actually very good justifications in the sense that people who
might want to tweak the heuristics can see the reason behind the current
choice and agree or disagree with it.

I somehow suspect that a good mathematician can come up with a rationale
for 6 after the fact that sounds convincing, along the lines of "the
average length of commands being N, and levenshtein penalties being
<0,2,1,4>, you can insert X mistaken keystroke and/or omit Y mistaken
keystroke per every correct keystroke without exceeding this value 6, and
the percentage X and/or Y represents is not too low to be practical but low
enough to reject false positives".

In any case, I'll further squash in the following.  Thanks for an amusing
explanation ;-).

diff --git a/help.c b/help.c
index fbf80d9..de1e2ea 100644
--- a/help.c
+++ b/help.c
@@ -297,7 +297,7 @@ static void add_cmd_list(struct cmdnames *cmds, struct cmdnames *old)
 	old->names = NULL;
 }
 
-/* how did we decide this is a good cutoff??? */
+/* An empirically derived magic number */
 #define SIMILAR_ENOUGH(x) ((x) < 6)
 
 const char *help_unknown_cmd(const char *cmd)

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command given is junk
  2009-12-14 20:08 ` Junio C Hamano
  2009-12-14 21:09   ` Johannes Schindelin
@ 2009-12-14 21:55   ` Johannes Sixt
  2009-12-14 23:39     ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Sixt @ 2009-12-14 21:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Git Mailing List, Alex Riesen

On Montag, 14. Dezember 2009, Junio C Hamano wrote:
> In the meantime, I think squashing the following in would help us keep the
> two magic numbers in sync.

I do not think that keeping the numbers in sync is necessary. For example, the 
similarity requirement for commands that run automatically could be stricter 
than for the list of suggestions. Then it would be possible that a unique 
best candidate is not good enough to be run automatically; there would only 
be a list of suggestions.

-- Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command given is junk
  2009-12-14 21:47     ` Junio C Hamano
@ 2009-12-14 22:04       ` Johannes Schindelin
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Schindelin @ 2009-12-14 22:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Sixt, Git Mailing List, Alex Riesen

Hi,

On Mon, 14 Dec 2009, Junio C Hamano wrote:

> I somehow suspect that a good mathematician can come up with a rationale 
> for 6 after the fact that sounds convincing, along the lines of "the 
> average length of commands being N, and levenshtein penalties being 
> <0,2,1,4>, you can insert X mistaken keystroke and/or omit Y mistaken 
> keystroke per every correct keystroke without exceeding this value 6, 
> and the percentage X and/or Y represents is not too low to be practical 
> but low enough to reject false positives".

Being a mathematician, I was tempted to invent such a reasoning in 
hindsight.

But I decided to be truthful instead,
Dscho

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command given is junk
  2009-12-14 21:55   ` Johannes Sixt
@ 2009-12-14 23:39     ` Junio C Hamano
  2009-12-15  7:57       ` Johannes Sixt
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2009-12-14 23:39 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Junio C Hamano, Johannes Schindelin, Git Mailing List,
	Alex Riesen

Johannes Sixt <j.sixt@viscovery.net> writes:

> On Montag, 14. Dezember 2009, Junio C Hamano wrote:
>> In the meantime, I think squashing the following in would help us keep the
>> two magic numbers in sync.
>
> I do not think that keeping the numbers in sync is necessary. For example, the 
> similarity requirement for commands that run automatically could be stricter 
> than for the list of suggestions. Then it would be possible that a unique 
> best candidate is not good enough to be run automatically; there would only 
> be a list of suggestions.

Well thought out.  Would you want to reroll a patch with two symbolic
constants then?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] help.autocorrect: do not run a command if the command given is junk
  2009-12-14 23:39     ` Junio C Hamano
@ 2009-12-15  7:57       ` Johannes Sixt
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Sixt @ 2009-12-15  7:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Git Mailing List, Alex Riesen

Junio C Hamano schrieb:
> Johannes Sixt <j.sixt@viscovery.net> writes:
> 
>> On Montag, 14. Dezember 2009, Junio C Hamano wrote:
>>> In the meantime, I think squashing the following in would help us keep the
>>> two magic numbers in sync.
>> I do not think that keeping the numbers in sync is necessary. For example, the 
>> similarity requirement for commands that run automatically could be stricter 
>> than for the list of suggestions. Then it would be possible that a unique 
>> best candidate is not good enough to be run automatically; there would only 
>> be a list of suggestions.
> 
> Well thought out.  Would you want to reroll a patch with two symbolic
> constants then?

I briefly looked into it, but, no, I don't want to reroll the patch. Not
only would the change be larger than I first thought, but I would also
have to find a mis-typed command where a stricter limit makes a difference
*and* where it makes sense that the guessed command is not run
automatically. Moreover, I would also have to *find* a suitable new
similarity limit. Not something I want to do now.

Please take my original patch and squash in your suggested changes. Here
it is for your convenience with an updated commit message (only the
last paragraph changed).

-- Hannes

--- 8< ---
From: Johannes Sixt <j6t@kdbg.org>
Subject: [PATCH] help.autocorrect: do not run a command if the command given is junk

If a given command is not found, then help.c tries to guess which one the
user could have meant. If help.autocorrect is 0 or unset, then a list of
suggestions is given as long as the dissimilarity between the given command
and the candidates is not excessively high. But if help.autocorrect was
non-zero (i.e., a delay after which the command is run automatically), the
latter restriction on dissimilarity was not obeyed.

In my case, this happened:

 $ git ..daab02
 WARNING: You called a Git command named '..daab02', which does not exist.
 Continuing under the assumption that you meant 'read-tree'
 in 4.0 seconds automatically...

The patch reuses the similarity limit that is also applied when the list of
suggested commands is printed.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 help.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/help.c b/help.c
index e8db31f..9da97d7 100644
--- a/help.c
+++ b/help.c
@@ -297,6 +297,9 @@ static void add_cmd_list(struct cmdnames *cmds, struct cmdnames *old)
 	old->names = NULL;
 }

+/* An empirically derived magic number */
+#define SIMILAR_ENOUGH(x) ((x) < 6)
+
 const char *help_unknown_cmd(const char *cmd)
 {
 	int i, n, best_similarity = 0;
@@ -331,7 +334,7 @@ const char *help_unknown_cmd(const char *cmd)
 	n = 1;
 	while (n < main_cmds.cnt && best_similarity == main_cmds.names[n]->len)
 		++n;
-	if (autocorrect && n == 1) {
+	if (autocorrect && n == 1 && SIMILAR_ENOUGH(best_similarity)) {
 		const char *assumed = main_cmds.names[0]->name;
 		main_cmds.names[0] = NULL;
 		clean_cmdnames(&main_cmds);
@@ -349,7 +352,7 @@ const char *help_unknown_cmd(const char *cmd)

 	fprintf(stderr, "git: '%s' is not a git-command. See 'git --help'.\n", cmd);

-	if (best_similarity < 6) {
+	if (SIMILAR_ENOUGH(best_similarity)) {
 		fprintf(stderr, "\nDid you mean %s?\n",
 			n < 2 ? "this": "one of these");

-- 
1.6.6.rc1.46.g1635

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-12-15  7:57 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-12-14 13:03 [PATCH] help.autocorrect: do not run a command if the command given is junk Johannes Sixt
2009-12-14 13:29 ` Alex Riesen
2009-12-14 14:17   ` Johannes Schindelin
2009-12-14 14:16 ` Johannes Schindelin
2009-12-14 20:08 ` Junio C Hamano
2009-12-14 21:09   ` Johannes Schindelin
2009-12-14 21:47     ` Junio C Hamano
2009-12-14 22:04       ` Johannes Schindelin
2009-12-14 21:55   ` Johannes Sixt
2009-12-14 23:39     ` Junio C Hamano
2009-12-15  7:57       ` Johannes Sixt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).