UTF-8 trouble with GIT 1.5.0.3

All of lore.kernel.org
 help / color / mirror / Atom feed

* UTF-8 trouble with GIT 1.5.0.3
@ 2007-03-17 10:27 Willy Tarreau
  0 siblings, 0 replies; only message in thread
From: Willy Tarreau @ 2007-03-17 10:27 UTC (permalink / raw)
  To: jukio; +Cc: git

Hi Junio,

after upgrading from GIT 1.4 to 1.5, I noticed that I had got
errors when applying patches from mailboxes and that I had to 
manually add "--no-utf8" to git-am. The error was always :

  "fatal: cannot convert from latin1 to utf-8".

This morning, I decided to take a closer look at it, first
trying to fix it in the configuration. Even this fails :

[i18n]
        commitencoding = latin1

$ git-am -3 username.patch 
fatal: cannot convert from latin1 to latin1

So I took a look at the code and noticed that in builtin-mailinfo,
convert_to_utf8() is always called because metainfo_charset is
always set. Also, this function always calls reencode_string(),
no matter what input and outputs are passed. And since it fails
for an unknown reason, I get the error message.

Now I have several options :

 - always pass --no-utf8 with git-am to leave metainfo_charset
   NULL and avoid the call to convert_to_utf8()
   => very annoying and I have to rm -rf .dotest every time I forget
      about it

 - add a check to convert_to_utf8() so that when input and output
   charsets are equal, it does not malloc/convert/copy/free the
   result. This would seem necessary anyway.

 - check for empty charset name instead of just a NULL pointer,
   so that it is still possible by configuration to stay compatible
   with old default behaviour which is : not to convert by default.

 - switch back to 1.4 : this will be my immediate option while
   waiting for any advice on the right long-term solution.

I would be glad to get a recommendation for one of the solutions
above. I can workout a patch, but I don't want to waste my time
doing crappy patches that will not be retained.

Please note that I don't know a damn thing about the way iconv
works, which may be one of the reason I really hate it. So it
will take far too much time to me to dig below reencode_string().

Also, I am really worried for buffer overflows here. Looking at
the code, it seems like headers from input mails are passed to
decode_header(), which itself calls decode_header_bq() which in
turns calls convert_to_utf8() with input and output buffers of
256 bytes. Same for mailinfo() which calls check_header_line()
with <line> which has been filled up to sizeof(line) (1000 bytes).

Unless I'm mistaken, encoding from latin1 to utf-8 can grow the
string by a 5-to-1 ratio. So this means that the blind strcpy()
in convert_to_utf8() could overflow on user-crafted mail headers.
Maybe I overlooked something :-/

Regards,
Willy

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2007-03-17 10:38 UTC | newest]

Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-17 10:27 UTF-8 trouble with GIT 1.5.0.3 Willy Tarreau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.