git.vger.kernel.org archive mirror
* Feature request: better error messages when UTF-8 bites
@ 2022-07-27 20:21 CH
  2022-07-28  5:42 ` Johannes Sixt
  0 siblings, 1 reply; 4+ messages in thread
From: CH @ 2022-07-27 20:21 UTC
  To: git

Hi;

Just found an annoyance in `git log` (and likely elsewhere) that may 
warrant a change:

Somehow, when copying and pasting a commit ID from a website to the
command line, a UTF-8 byte order mark (BOM)
[https://en.wikipedia.org/wiki/Byte_order_mark] was appended to one of
the commit IDs.  BOMs are invisible, as are many other Unicode code
points.  The upshot was that Git didn't like it, and complained
bitterly:

> $ strace -etrace=execve -s 200 git diff 
> 038179704f0066aa815d5429221cf381ff4ef289  
> 47346a462d8ba40b9a8b073e351c362522c46aa6
> 
> execve("/usr/bin/git", ["git", "diff", 
> "038179704f0066aa815d5429221cf381ff4ef289\357\273\277", 
> "47346a462d8ba40b9a8b073e351c362522c46aa6"], 0x7fffec3c4bb0 /* 80 vars 
> */) = 0
> 
> fatal: ambiguous argument '038179704f0066aa815d5429221cf381ff4ef289': 
> unknown revision or path not in the working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> +++ exited with 128 +++
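
The three trailing bytes in the strace output, \357\273\277, are the
UTF-8 encoding of U+FEFF, the code point used as the BOM.  A quick
Python check, offered only as an illustration and not part of the
original report:

```python
# U+FEFF is the code point used as the byte order mark.
bom = "\ufeff".encode("utf-8")

# strace prints non-printable bytes as three-digit octal escapes.
octal = "".join(f"\\{b:03o}" for b in bom)

print(bom.hex())  # efbbbf
print(octal)      # \357\273\277
```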

Feature request:
================

When printing the "fatal: ambiguous argument '......': ....", perhaps
escape (URL-style or otherwise) the ambiguous argument in the error
message, or maybe add a sentence noting that non-ASCII characters were
found.
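
A rough sketch of what that could look like, written in Python rather
than Git's C purely for illustration.  The names `escape_arg` and
`ambiguous_argument_message` and the wording of the hint line are
hypothetical, not existing Git code; the octal style simply mirrors the
strace output above.  Note that the argument is only reported, never
rejected:

```python
def escape_arg(raw: bytes) -> str:
    """Keep printable ASCII as-is; show every other byte as an octal escape."""
    out = []
    for b in raw:
        if b == 0x5C:            # backslash: double it so escapes stay unambiguous
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:  # printable ASCII
            out.append(chr(b))
        else:                    # control bytes and anything above \177
            out.append(f"\\{b:03o}")
    return "".join(out)


def ambiguous_argument_message(raw: bytes) -> str:
    """Build the error line with the argument escaped, plus a hint if needed."""
    msg = (
        f"fatal: ambiguous argument '{escape_arg(raw)}': "
        "unknown revision or path not in the working tree."
    )
    if any(b > 0x7F for b in raw):
        # Hypothetical extra sentence; the input is still accepted as given.
        msg += "\nhint: the argument contains non-ASCII bytes"
    return msg


arg = b"038179704f0066aa815d5429221cf381ff4ef289\xef\xbb\xbf"
print(ambiguous_argument_message(arg))
```

With the pasted argument from the report, the escaped form ends in
\357\273\277, so the stray BOM is visible in the message itself.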

This is sort of a difficult corner case, in that it is perfectly legal
to have non-ASCII UTF-8 characters in a branch or tag name (see
git-check-ref-format for the allowed characters), so someone could
indeed create a branch named
"038179704f0066aa815d5429221cf381ff4ef289\357\273\277" if they were a
tortured soul bent on overthrowing polite society.  Rejecting input
because it has bytes with values above \177 is therefore not a solution.

Similarly, scanning the input for invisible UTF-8 characters (or even 
invalid UTF-8 sequences) is leaning too far the other way: Git should
not be validating character encodings.  It should stay encoding-neutral, 
as the alternative leads to madness, driving developers into becoming 
tortured souls bent on rigidly enforcing polite society.  We have enough 
of those already.

It's unclear whether violent overthrow or rigid enforcement is the
lesser of two evils, but let's not perform the experiment to find out.
:-)

Cheers!

-- 
CH (ch-and-git.vger.kernel.org@ch.pkts.ca)


end of thread, other threads:[~2022-07-28 18:01 UTC | newest]

Thread overview: 4+ messages
2022-07-27 20:21 Feature request: better error messages when UTF-8 bites CH
2022-07-28  5:42 ` Johannes Sixt
2022-07-28  9:40   ` Thomas Guyot
2022-07-28 18:01     ` Torsten Bögershausen
