[PATH/RFC] parse-options: report invalid UTF-8 switches

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATH/RFC] parse-options: report invalid UTF-8 switches
@ 2013-02-11 13:34 Erik Faye-Lund
  2013-02-11 13:43 ` Matthieu Moy
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Erik Faye-Lund @ 2013-02-11 13:34 UTC (permalink / raw)
  To: git; +Cc: Erik Faye-Lund

Even though parse-options doesn't support UTF-8 switches (which
makes sense; non-ascii switches would be difficult to enter on
some keyboard layouts), it can be useful to report incorrectly
entered UTF-8 switches to make the output somewhat less ugly
for those of us with keyboard layouts with UTF-8 characters on
it.

Make the reporting code grok UTF-8 in the option sequence, and
write a variable-width output sequence.

Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
---
As being both clumsy and Norwegian, I some times to enter the
Norwegian bizarro-letters ('æ', 'ø' and 'å') instead of the
correct ones when entering command-line options.

However, since git only looks at one byte at the time for
short-options, it ends up reporting a partial UTF-8 sequence
in such cases, leading to corruption of the output.

The "real fix" would probably be to add proper multi-byte
support to the short-option parser, but this serves little
purpose in Git; we don't internationalize the command-line
switches.

So perhaps this is a suitable band-aid instead?

 parse-options.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/parse-options.c b/parse-options.c
index 67e98a6..20dc742 100644
--- a/parse-options.c
+++ b/parse-options.c
@@ -3,6 +3,7 @@
 #include "cache.h"
 #include "commit.h"
 #include "color.h"
+#include "utf8.h"

 static int parse_options_usage(struct parse_opt_ctx_t *ctx,
 			       const char * const *usagestr,
@@ -462,7 +463,9 @@ int parse_options(int argc, const char **argv, const char *prefix,
 		if (ctx.argv[0][1] == '-') {
 			error("unknown option `%s'", ctx.argv[0] + 2);
 		} else {
-			error("unknown switch `%c'", *ctx.opt);
+			const char *next = ctx.opt;
+			utf8_width(&next, NULL);
+			error("unknown switch `%.*s'", (int)(next - ctx.opt), ctx.opt);
 		}
 		usage_with_options(usagestr, options);
 	}
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 13:34 [PATH/RFC] parse-options: report invalid UTF-8 switches Erik Faye-Lund
@ 2013-02-11 13:43 ` Matthieu Moy
  2013-02-11 13:57   ` Erik Faye-Lund
  2013-02-11 16:28 ` Torsten Bögershausen
  2013-02-11 17:07 ` Junio C Hamano
  2 siblings, 1 reply; 13+ messages in thread
From: Matthieu Moy @ 2013-02-11 13:43 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: git

Erik Faye-Lund <kusmabite@gmail.com> writes:

> --- a/parse-options.c
> +++ b/parse-options.c
> @@ -3,6 +3,7 @@
>  #include "cache.h"
>  #include "commit.h"
>  #include "color.h"
> +#include "utf8.h"
>  
>  static int parse_options_usage(struct parse_opt_ctx_t *ctx,
>  			       const char * const *usagestr,
> @@ -462,7 +463,9 @@ int parse_options(int argc, const char **argv, const char *prefix,
>  		if (ctx.argv[0][1] == '-') {
>  			error("unknown option `%s'", ctx.argv[0] + 2);
>  		} else {
> -			error("unknown switch `%c'", *ctx.opt);
> +			const char *next = ctx.opt;
> +			utf8_width(&next, NULL);
> +			error("unknown switch `%.*s'", (int)(next - ctx.opt), ctx.opt);
>  		}
>  		usage_with_options(usagestr, options);
>  	}

You should be careful with the case where the user has a non-UTF8
environment, and entered a non-ascii sequence. I can see two cases:

1) The non-ascii sequence is valid UTF-8, then I guess your patch would
   show two characters instead of one. Not really correct, but not really
   serious either.

2) The non-ascii sequence is NOT valid UTF-8, then if I read correctly
   (I didn't test) utf8_width would set next to NULL, and then you are
   in big trouble.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 13:43 ` Matthieu Moy
@ 2013-02-11 13:57   ` Erik Faye-Lund
  2013-02-11 14:05     ` Matthieu Moy
  0 siblings, 1 reply; 13+ messages in thread
From: Erik Faye-Lund @ 2013-02-11 13:57 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git

On Mon, Feb 11, 2013 at 2:43 PM, Matthieu Moy
<Matthieu.Moy@grenoble-inp.fr> wrote:
> Erik Faye-Lund <kusmabite@gmail.com> writes:
>
>> --- a/parse-options.c
>> +++ b/parse-options.c
>> @@ -3,6 +3,7 @@
>>  #include "cache.h"
>>  #include "commit.h"
>>  #include "color.h"
>> +#include "utf8.h"
>>
>>  static int parse_options_usage(struct parse_opt_ctx_t *ctx,
>>                              const char * const *usagestr,
>> @@ -462,7 +463,9 @@ int parse_options(int argc, const char **argv, const char *prefix,
>>               if (ctx.argv[0][1] == '-') {
>>                       error("unknown option `%s'", ctx.argv[0] + 2);
>>               } else {
>> -                     error("unknown switch `%c'", *ctx.opt);
>> +                     const char *next = ctx.opt;
>> +                     utf8_width(&next, NULL);
>> +                     error("unknown switch `%.*s'", (int)(next - ctx.opt), ctx.opt);
>>               }
>>               usage_with_options(usagestr, options);
>>       }
>
> You should be careful with the case where the user has a non-UTF8
> environment, and entered a non-ascii sequence. I can see two cases:
>
> 1) The non-ascii sequence is valid UTF-8, then I guess your patch would
>    show two characters instead of one. Not really correct, but not really
>    serious either.

Hm. So we would end up trading some form of corruption for some other.
Not the biggest problem in the world, but perhaps there's a way of
fixing it?

I'm not entirely sure how to correctly know what encoding stdin is
supposed to be. On Windows, that's easy; it's UTF-16, we re-encode it
to UTF-8 on startup in Git for Windows. But on other platforms, I have
no clue.

But isn't UTF-8 constructed to be very unlikely to clash with existing
encodings? If so, I could add a case for non-ascii and non-UTF-8, that
simply writes the byte as a hex-tuple?

> 2) The non-ascii sequence is NOT valid UTF-8, then if I read correctly
>    (I didn't test) utf8_width would set next to NULL, and then you are
>    in big trouble.

Outch. Yeah, you are right; this is not good at all :)

But I guess the solution above should fix this as well, no?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 13:57   ` Erik Faye-Lund
@ 2013-02-11 14:05     ` Matthieu Moy
  2013-02-11 14:27       ` Erik Faye-Lund
  0 siblings, 1 reply; 13+ messages in thread
From: Matthieu Moy @ 2013-02-11 14:05 UTC (permalink / raw)
  To: kusmabite; +Cc: git

Erik Faye-Lund <kusmabite@gmail.com> writes:

> But isn't UTF-8 constructed to be very unlikely to clash with existing
> encodings? If so, I could add a case for non-ascii and non-UTF-8, that
> simply writes the byte as a hex-tuple?

If it's non-ascii and non-UTF-8, I think you'd want to display the byte
as it is, because this is how it was entered. IOW, I'd say we should
keep the current behavior in this case.

>> 2) The non-ascii sequence is NOT valid UTF-8, then if I read correctly
>>    (I didn't test) utf8_width would set next to NULL, and then you are
>>    in big trouble.
>
> Outch. Yeah, you are right; this is not good at all :)
>
> But I guess the solution above should fix this as well, no?

It should, yes.

Of course, there's still the case where the user entered "git -Ã©" as a
Ã followed by a © in a latin-1 environment, but as you said, it's
unlikely enough ;-).

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 14:05     ` Matthieu Moy
@ 2013-02-11 14:27       ` Erik Faye-Lund
  0 siblings, 0 replies; 13+ messages in thread
From: Erik Faye-Lund @ 2013-02-11 14:27 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git

On Mon, Feb 11, 2013 at 3:05 PM, Matthieu Moy
<Matthieu.Moy@grenoble-inp.fr> wrote:
> Erik Faye-Lund <kusmabite@gmail.com> writes:
>
>> But isn't UTF-8 constructed to be very unlikely to clash with existing
>> encodings? If so, I could add a case for non-ascii and non-UTF-8, that
>> simply writes the byte as a hex-tuple?
>
> If it's non-ascii and non-UTF-8, I think you'd want to display the byte
> as it is, because this is how it was entered. IOW, I'd say we should
> keep the current behavior in this case.
>

Yes, you are of course right. We should detect UTF-8, and only in
those cases do anything special. Because the likely alternatives are
other 8-byte encodings (which the terminal already should grok, since
the user was able to input it), or other multi-byte sequences (which
already is broken, and is tricky to handle). So at least we'd only
break in very unlikely cases.

But, I wonder, could mbrlen be used to detect the length instead? It
consults LC_CTYPE to find out what encoding to use, which seems like
it might give the correct answer in all non-corrupted cases... I'm far
from an expert on UNIX-internationalization, though. And this approach
is likely to break on Windows, but I suspect that we can perform some
well-placed hack for it, as we already know that we're doing UTF-8
there.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 13:34 [PATH/RFC] parse-options: report invalid UTF-8 switches Erik Faye-Lund
  2013-02-11 13:43 ` Matthieu Moy
@ 2013-02-11 16:28 ` Torsten Bögershausen
  2013-02-11 16:36   ` Erik Faye-Lund
  2013-02-11 17:07 ` Junio C Hamano
  2 siblings, 1 reply; 13+ messages in thread
From: Torsten Bögershausen @ 2013-02-11 16:28 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: git

On 11.02.13 14:34, Erik Faye-Lund wrote:
> Even though parse-options doesn't support UTF-8 switches (which
> makes sense; non-ascii switches would be difficult to enter on
> some keyboard layouts), it can be useful to report incorrectly
> entered UTF-8 switches to make the output somewhat less ugly
> for those of us with keyboard layouts with UTF-8 characters on
> it.
> 
> Make the reporting code grok UTF-8 in the option sequence, and
> write a variable-width output sequence.
> 
> Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
> ---
> As being both clumsy and Norwegian, I some times to enter the
> Norwegian bizarro-letters ('æ', 'ø' and 'å') instead of the
> correct ones when entering command-line options.
> 
> However, since git only looks at one byte at the time for
> short-options, it ends up reporting a partial UTF-8 sequence
> in such cases, leading to corruption of the output.
> 
> The "real fix" would probably be to add proper multi-byte
> support to the short-option parser, but this serves little
> purpose in Git; we don't internationalize the command-line
> switches.
> 
> So perhaps this is a suitable band-aid instead?
> 
>  parse-options.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/parse-options.c b/parse-options.c
> index 67e98a6..20dc742 100644
> --- a/parse-options.c
> +++ b/parse-options.c
> @@ -3,6 +3,7 @@
>  #include "cache.h"
>  #include "commit.h"
>  #include "color.h"
> +#include "utf8.h"
>  
>  static int parse_options_usage(struct parse_opt_ctx_t *ctx,
>  			       const char * const *usagestr,
> @@ -462,7 +463,9 @@ int parse_options(int argc, const char **argv, const char *prefix,
>  		if (ctx.argv[0][1] == '-') {
>  			error("unknown option `%s'", ctx.argv[0] + 2);
>  		} else {
> -			error("unknown switch `%c'", *ctx.opt);
> +			const char *next = ctx.opt;
> +			utf8_width(&next, NULL);
> +			error("unknown switch `%.*s'", (int)(next - ctx.opt), ctx.opt);
>  		}
>  		usage_with_options(usagestr, options);
>  	}
> 
Would the following do the trick?

diff --git a/parse-options.c b/parse-options.c
index c1c66bd..f800552 100644
--- a/parse-options.c
+++ b/parse-options.c
@@ -471,7 +471,7 @@ int parse_options(int argc, const char **argv, const char *prefix,
                if (ctx.argv[0][1] == '-') {
                        error("unknown option `%s'", ctx.argv[0] + 2);
                } else {
-                       error("unknown switch `%c'", *ctx.opt);
+                       error("unknown switch `%s'", ctx.opt);
                }

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 16:28 ` Torsten Bögershausen
@ 2013-02-11 16:36   ` Erik Faye-Lund
  2013-02-11 17:04     ` Torsten Bögershausen
  0 siblings, 1 reply; 13+ messages in thread
From: Erik Faye-Lund @ 2013-02-11 16:36 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git

On Mon, Feb 11, 2013 at 5:28 PM, Torsten Bögershausen <tboegi@web.de> wrote:
> On 11.02.13 14:34, Erik Faye-Lund wrote:
>> Even though parse-options doesn't support UTF-8 switches (which
>> makes sense; non-ascii switches would be difficult to enter on
>> some keyboard layouts), it can be useful to report incorrectly
>> entered UTF-8 switches to make the output somewhat less ugly
>> for those of us with keyboard layouts with UTF-8 characters on
>> it.
>>
>> Make the reporting code grok UTF-8 in the option sequence, and
>> write a variable-width output sequence.
>>
>> Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
>> ---
>> As being both clumsy and Norwegian, I some times to enter the
>> Norwegian bizarro-letters ('æ', 'ø' and 'å') instead of the
>> correct ones when entering command-line options.
>>
>> However, since git only looks at one byte at the time for
>> short-options, it ends up reporting a partial UTF-8 sequence
>> in such cases, leading to corruption of the output.
>>
>> The "real fix" would probably be to add proper multi-byte
>> support to the short-option parser, but this serves little
>> purpose in Git; we don't internationalize the command-line
>> switches.
>>
>> So perhaps this is a suitable band-aid instead?
>>
>>  parse-options.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/parse-options.c b/parse-options.c
>> index 67e98a6..20dc742 100644
>> --- a/parse-options.c
>> +++ b/parse-options.c
>> @@ -3,6 +3,7 @@
>>  #include "cache.h"
>>  #include "commit.h"
>>  #include "color.h"
>> +#include "utf8.h"
>>
>>  static int parse_options_usage(struct parse_opt_ctx_t *ctx,
>>                              const char * const *usagestr,
>> @@ -462,7 +463,9 @@ int parse_options(int argc, const char **argv, const char *prefix,
>>               if (ctx.argv[0][1] == '-') {
>>                       error("unknown option `%s'", ctx.argv[0] + 2);
>>               } else {
>> -                     error("unknown switch `%c'", *ctx.opt);
>> +                     const char *next = ctx.opt;
>> +                     utf8_width(&next, NULL);
>> +                     error("unknown switch `%.*s'", (int)(next - ctx.opt), ctx.opt);
>>               }
>>               usage_with_options(usagestr, options);
>>       }
>>
> Would the following do the trick?
>
> diff --git a/parse-options.c b/parse-options.c
> index c1c66bd..f800552 100644
> --- a/parse-options.c
> +++ b/parse-options.c
> @@ -471,7 +471,7 @@ int parse_options(int argc, const char **argv, const char *prefix,
>                 if (ctx.argv[0][1] == '-') {
>                         error("unknown option `%s'", ctx.argv[0] + 2);
>                 } else {
> -                       error("unknown switch `%c'", *ctx.opt);
> +                       error("unknown switch `%s'", ctx.opt);
>                 }
>
>

Nope; that would print the rest of the option-string, in cases of "git
<command> -abcd".

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 16:36   ` Erik Faye-Lund
@ 2013-02-11 17:04     ` Torsten Bögershausen
  0 siblings, 0 replies; 13+ messages in thread
From: Torsten Bögershausen @ 2013-02-11 17:04 UTC (permalink / raw)
  To: kusmabite; +Cc: Torsten Bögershausen, git

On 11.02.13 17:36, Erik Faye-Lund wrote:
> On Mon, Feb 11, 2013 at 5:28 PM, Torsten Bögershausen <tboegi@web.de> wrote:
>> On 11.02.13 14:34, Erik Faye-Lund wrote:
>>> Even though parse-options doesn't support UTF-8 switches (which
>>> makes sense; non-ascii switches would be difficult to enter on
>>> some keyboard layouts), it can be useful to report incorrectly
>>> entered UTF-8 switches to make the output somewhat less ugly
>>> for those of us with keyboard layouts with UTF-8 characters on
>>> it.
>>>
>>> Make the reporting code grok UTF-8 in the option sequence, and
>>> write a variable-width output sequence.
>>>
>>> Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
>>> ---
>>> As being both clumsy and Norwegian, I some times to enter the
>>> Norwegian bizarro-letters ('æ', 'ø' and 'å') instead of the
>>> correct ones when entering command-line options.
>>>
>>> However, since git only looks at one byte at the time for
>>> short-options, it ends up reporting a partial UTF-8 sequence
>>> in such cases, leading to corruption of the output.
>>>
>>> The "real fix" would probably be to add proper multi-byte
>>> support to the short-option parser, but this serves little
>>> purpose in Git; we don't internationalize the command-line
>>> switches.
>>>
>>> So perhaps this is a suitable band-aid instead?
>>>
>>>  parse-options.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/parse-options.c b/parse-options.c
>>> index 67e98a6..20dc742 100644
>>> --- a/parse-options.c
>>> +++ b/parse-options.c
>>> @@ -3,6 +3,7 @@
>>>  #include "cache.h"
>>>  #include "commit.h"
>>>  #include "color.h"
>>> +#include "utf8.h"
>>>
>>>  static int parse_options_usage(struct parse_opt_ctx_t *ctx,
>>>                              const char * const *usagestr,
>>> @@ -462,7 +463,9 @@ int parse_options(int argc, const char **argv, const char *prefix,
>>>               if (ctx.argv[0][1] == '-') {
>>>                       error("unknown option `%s'", ctx.argv[0] + 2);
>>>               } else {
>>> -                     error("unknown switch `%c'", *ctx.opt);
>>> +                     const char *next = ctx.opt;
>>> +                     utf8_width(&next, NULL);
>>> +                     error("unknown switch `%.*s'", (int)(next - ctx.opt), ctx.opt);
>>>               }
>>>               usage_with_options(usagestr, options);
>>>       }
>>>
>> Would the following do the trick?
>>
>> diff --git a/parse-options.c b/parse-options.c
>> index c1c66bd..f800552 100644
>> --- a/parse-options.c
>> +++ b/parse-options.c
>> @@ -471,7 +471,7 @@ int parse_options(int argc, const char **argv, const char *prefix,
>>                 if (ctx.argv[0][1] == '-') {
>>                         error("unknown option `%s'", ctx.argv[0] + 2);
>>                 } else {
>> -                       error("unknown switch `%c'", *ctx.opt);
>> +                       error("unknown switch `%s'", ctx.opt);
>>                 }
>>
>>
> Nope; that would print the rest of the option-string, in cases of "git
> <command> -abcd".
Ok, may be pick_one_utf8_char() is a better choice than simply assuming ASCII.

We can make a guess, if it is utf-8, we use it. If not, assume ASCII.
Just thinking loud (the "if" could be written shorter using the "?" operator)

                } else {
                       const char *start = ctx.opt;
                       unsigned c = pick_one_utf8_char(&start, NULL);
                       if (!c)
                         c =  *ctx.opt;
                       error("unknown switch `%c'", c);
                }

 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 13:34 [PATH/RFC] parse-options: report invalid UTF-8 switches Erik Faye-Lund
  2013-02-11 13:43 ` Matthieu Moy
  2013-02-11 16:28 ` Torsten Bögershausen
@ 2013-02-11 17:07 ` Junio C Hamano
  2013-02-11 17:15   ` Erik Faye-Lund
  2013-02-11 17:19   ` Jeff King
  2 siblings, 2 replies; 13+ messages in thread
From: Junio C Hamano @ 2013-02-11 17:07 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: git

Erik Faye-Lund <kusmabite@gmail.com> writes:

> However, since git only looks at one byte at the time for
> short-options, it ends up reporting a partial UTF-8 sequence
> in such cases, leading to corruption of the output.

Isn't it a workable, easier and more robust alternative to punt and
use the entire ctx.argv[0] as unrecognized?

>
> The "real fix" would probably be to add proper multi-byte
> support to the short-option parser, but this serves little
> purpose in Git; we don't internationalize the command-line
> switches.
>
> So perhaps this is a suitable band-aid instead?
>
>  parse-options.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/parse-options.c b/parse-options.c
> index 67e98a6..20dc742 100644
> --- a/parse-options.c
> +++ b/parse-options.c
> @@ -3,6 +3,7 @@
>  #include "cache.h"
>  #include "commit.h"
>  #include "color.h"
> +#include "utf8.h"
>  
>  static int parse_options_usage(struct parse_opt_ctx_t *ctx,
>  			       const char * const *usagestr,
> @@ -462,7 +463,9 @@ int parse_options(int argc, const char **argv, const char *prefix,
>  		if (ctx.argv[0][1] == '-') {
>  			error("unknown option `%s'", ctx.argv[0] + 2);
>  		} else {
> -			error("unknown switch `%c'", *ctx.opt);
> +			const char *next = ctx.opt;
> +			utf8_width(&next, NULL);
> +			error("unknown switch `%.*s'", (int)(next - ctx.opt), ctx.opt);
>  		}
>  		usage_with_options(usagestr, options);
>  	}

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 17:07 ` Junio C Hamano
@ 2013-02-11 17:15   ` Erik Faye-Lund
  2013-02-11 17:19   ` Jeff King
  1 sibling, 0 replies; 13+ messages in thread
From: Erik Faye-Lund @ 2013-02-11 17:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Feb 11, 2013 at 6:07 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Erik Faye-Lund <kusmabite@gmail.com> writes:
>
>> However, since git only looks at one byte at the time for
>> short-options, it ends up reporting a partial UTF-8 sequence
>> in such cases, leading to corruption of the output.
>
> Isn't it a workable, easier and more robust alternative to punt and
> use the entire ctx.argv[0] as unrecognized?
>

Perhaps. It doesn't match the output of the usual GNU tools like we
currently do, but even the GNU tools only report a single byte.

However, I'm unsure if that totals to an improvement in the common
case. We stop telling the user exactly what option was problematic,
making it slightly more annoying to read through the options.

So, we'd end up making the common-case worse, by making a special case
(that only sometimes affects some users) more robust. Isn't that
making the user interface worse?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 17:07 ` Junio C Hamano
  2013-02-11 17:15   ` Erik Faye-Lund
@ 2013-02-11 17:19   ` Jeff King
  2013-02-11 17:21     ` Erik Faye-Lund
  2013-02-11 17:54     ` Junio C Hamano
  1 sibling, 2 replies; 13+ messages in thread
From: Jeff King @ 2013-02-11 17:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Erik Faye-Lund, git

On Mon, Feb 11, 2013 at 09:07:53AM -0800, Junio C Hamano wrote:

> Erik Faye-Lund <kusmabite@gmail.com> writes:
> 
> > However, since git only looks at one byte at the time for
> > short-options, it ends up reporting a partial UTF-8 sequence
> > in such cases, leading to corruption of the output.
> 
> Isn't it a workable, easier and more robust alternative to punt and
> use the entire ctx.argv[0] as unrecognized?

Yes, but it regresses the usability:

  [before]
  $ git foobar -qrxs
  unknown switch: x

  [after]
  $ git foobar -qrxs
  unknown switch: -qrxs

One is much more informative than the other, and you are punishing the
common ascii case for the extremely uncommon case of utf-8. Maybe:

  if (isascii(*ctx.opt))
          error("unknown option `%c'", *ctx.opt);
  else
          error("unknown multi-byte short option in string: `%s'", ctx.argv[0]);

which only kicks in in the uncommon case (and extends the error message
to make it more clear why we are showing the whole string).

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 17:19   ` Jeff King
@ 2013-02-11 17:21     ` Erik Faye-Lund
  2013-02-11 17:54     ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Erik Faye-Lund @ 2013-02-11 17:21 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Mon, Feb 11, 2013 at 6:19 PM, Jeff King <peff@peff.net> wrote:
> On Mon, Feb 11, 2013 at 09:07:53AM -0800, Junio C Hamano wrote:
>
>> Erik Faye-Lund <kusmabite@gmail.com> writes:
>>
>> > However, since git only looks at one byte at the time for
>> > short-options, it ends up reporting a partial UTF-8 sequence
>> > in such cases, leading to corruption of the output.
>>
>> Isn't it a workable, easier and more robust alternative to punt and
>> use the entire ctx.argv[0] as unrecognized?
>
> Yes, but it regresses the usability:
>
>   [before]
>   $ git foobar -qrxs
>   unknown switch: x
>
>   [after]
>   $ git foobar -qrxs
>   unknown switch: -qrxs
>
> One is much more informative than the other, and you are punishing the
> common ascii case for the extremely uncommon case of utf-8. Maybe:
>
>   if (isascii(*ctx.opt))
>           error("unknown option `%c'", *ctx.opt);
>   else
>           error("unknown multi-byte short option in string: `%s'", ctx.argv[0]);
>
> which only kicks in in the uncommon case (and extends the error message
> to make it more clear why we are showing the whole string).

Yes. This is IMO a much better approach, and it doesn't involve trying
to figure out what encoding the string is. Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATH/RFC] parse-options: report invalid UTF-8 switches
  2013-02-11 17:19   ` Jeff King
  2013-02-11 17:21     ` Erik Faye-Lund
@ 2013-02-11 17:54     ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2013-02-11 17:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Erik Faye-Lund, git

Jeff King <peff@peff.net> writes:

> On Mon, Feb 11, 2013 at 09:07:53AM -0800, Junio C Hamano wrote:
>
>> Erik Faye-Lund <kusmabite@gmail.com> writes:
>> 
>> > However, since git only looks at one byte at the time for
>> > short-options, it ends up reporting a partial UTF-8 sequence
>> > in such cases, leading to corruption of the output.
>> 
>> Isn't it a workable, easier and more robust alternative to punt and
>> use the entire ctx.argv[0] as unrecognized?
>
> Yes, but it regresses the usability:
>
>   [before]
>   $ git foobar -qrxs
>   unknown switch: x
>
>   [after]
>   $ git foobar -qrxs
>   unknown switch: -qrxs
>
> One is much more informative than the other, and you are punishing the
> common ascii case for the extremely uncommon case of utf-8. Maybe:
>
>   if (isascii(*ctx.opt))
>           error("unknown option `%c'", *ctx.opt);
>   else
>           error("unknown multi-byte short option in string: `%s'", ctx.argv[0]);
>
> which only kicks in in the uncommon case (and extends the error message
> to make it more clear why we are showing the whole string).

Yup, that is what I meant.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2013-02-11 17:54 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-02-11 13:34 [PATH/RFC] parse-options: report invalid UTF-8 switches Erik Faye-Lund
2013-02-11 13:43 ` Matthieu Moy
2013-02-11 13:57   ` Erik Faye-Lund
2013-02-11 14:05     ` Matthieu Moy
2013-02-11 14:27       ` Erik Faye-Lund
2013-02-11 16:28 ` Torsten Bögershausen
2013-02-11 16:36   ` Erik Faye-Lund
2013-02-11 17:04     ` Torsten Bögershausen
2013-02-11 17:07 ` Junio C Hamano
2013-02-11 17:15   ` Erik Faye-Lund
2013-02-11 17:19   ` Jeff King
2013-02-11 17:21     ` Erik Faye-Lund
2013-02-11 17:54     ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).