[PATCH] utf8: add utf8

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] utf8: add utf8_strwidth()
@ 2009-01-30  9:41 Geoffrey Thomas
  2009-01-30  9:41 ` [PATCH] builtin-blame.c: Use utf8_strwidth for author's names Geoffrey Thomas
  2009-01-31  7:17 ` [PATCH] utf8: add utf8_strwidth() Jeff King
  0 siblings, 2 replies; 13+ messages in thread
From: Geoffrey Thomas @ 2009-01-30  9:41 UTC (permalink / raw)
  To: git; +Cc: Geoffrey Thomas

From: Geoffrey Thomas <geofft@mit.edu>
I'm about to use this pattern more than once, so make it a common function.

Signed-off-by: Geoffrey Thomas <geofft@mit.edu>
---
 utf8.c |   12 ++++++++++++
 utf8.h |    1 +
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/utf8.c b/utf8.c
index dc37353..a2d888d 100644
--- a/utf8.c
+++ b/utf8.c
@@ -246,6 +246,18 @@ int utf8_width(const char **start, size_t *remainder_p)
 	return git_wcwidth(ch);
 }
 
+/*
+ * Returns the total number of columns required by a null-terminated
+ * string.
+ */
+size_t utf8_strwidth(const char *string)
+{
+	size_t width = 0;
+	while (string && *string)
+		width += utf8_width(&string, NULL);
+	return width;
+}
+
 int is_utf8(const char *text)
 {
 	while (*text) {
diff --git a/utf8.h b/utf8.h
index 98cce1b..1ae3450 100644
--- a/utf8.h
+++ b/utf8.h
@@ -5,6 +5,7 @@ typedef unsigned int ucs_char_t;  /* assuming 32bit int */
 
 ucs_char_t pick_one_utf8_char(const char **start, size_t *remainder_p);
 int utf8_width(const char **start, size_t *remainder_p);
+size_t utf8_strwidth(const char *string);
 int is_utf8(const char *text);
 int is_encoding_utf8(const char *name);
 
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-01-30  9:41 [PATCH] utf8: add utf8_strwidth() Geoffrey Thomas
@ 2009-01-30  9:41 ` Geoffrey Thomas
  2009-01-30 17:12   ` Johannes Schindelin
  2009-01-31  7:17 ` [PATCH] utf8: add utf8_strwidth() Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Geoffrey Thomas @ 2009-01-30  9:41 UTC (permalink / raw)
  To: git; +Cc: Geoffrey Thomas

From: Geoffrey Thomas <geofft@mit.edu>
git blame misaligns output if a author's name has a differing display width and
strlen; for instance, an accented Latin letter that takes two bytes to encode
will cause the rest of the line to be shifted to the left by one. To fix this,
use utf8_strwidth instead of strlen (and compute the padding ourselves, since
printf doesn't know about UTF-8).

Signed-off-by: Geoffrey Thomas <geofft@mit.edu>
---
 builtin-blame.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/builtin-blame.c b/builtin-blame.c
index aae14ef..2941fc0 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -19,6 +19,7 @@
 #include "string-list.h"
 #include "mailmap.h"
 #include "parse-options.h"
+#include "utf8.h"
 
 static char blame_usage[] = "git blame [options] [rev-opts] [rev] [--] file";
 
@@ -1619,9 +1620,9 @@ static void emit_other(struct scoreboard *sb, struct blame_entry *ent, int opt)
 				       ent->s_lno + 1 + cnt);
 
 			if (!(opt & OUTPUT_NO_AUTHOR))
-				printf(" (%-*.*s %10s",
-				       longest_author, longest_author,
+				printf(" (%s%*s %10s",
 				       ci.author,
+				       longest_author - utf8_strwidth(ci.author), "",
 				       format_time(ci.author_time,
 						   ci.author_tz,
 						   show_raw_time));
@@ -1755,7 +1756,7 @@ static void find_alignment(struct scoreboard *sb, int *option)
 		if (!(suspect->commit->object.flags & METAINFO_SHOWN)) {
 			suspect->commit->object.flags |= METAINFO_SHOWN;
 			get_commit_info(suspect->commit, &ci, 1);
-			num = strlen(ci.author);
+			num = utf8_strwidth(ci.author);
 			if (longest_author < num)
 				longest_author = num;
 		}
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-01-30  9:41 ` [PATCH] builtin-blame.c: Use utf8_strwidth for author's names Geoffrey Thomas
@ 2009-01-30 17:12   ` Johannes Schindelin
  2009-01-30 22:22     ` Geoffrey Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Schindelin @ 2009-01-30 17:12 UTC (permalink / raw)
  To: Geoffrey Thomas; +Cc: git

Hi,

On Fri, 30 Jan 2009, Geoffrey Thomas wrote:

> From: Geoffrey Thomas <geofft@mit.edu>
> git blame misaligns output if a author's name has a differing display width and
> strlen; for instance, an accented Latin letter that takes two bytes to encode
> will cause the rest of the line to be shifted to the left by one. To fix this,
> use utf8_strwidth instead of strlen (and compute the padding ourselves, since
> printf doesn't know about UTF-8).

Good point (even if your commit message has lines much longer than 72 
chars, ASCII ones at that).

But how certain are you at that point that the authors are in UTF-8 
format?  IOW what encoding conversions were possibly performed up to that 
point?

You might want to make it easier on possible reviewers by putting that 
discussion into the commit message.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-01-30 17:12   ` Johannes Schindelin
@ 2009-01-30 22:22     ` Geoffrey Thomas
  2009-01-31  7:24       ` Jeff King
  2009-02-01 22:34       ` Johannes Schindelin
  0 siblings, 2 replies; 13+ messages in thread
From: Geoffrey Thomas @ 2009-01-30 22:22 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

> Good point (even if your commit message has lines much longer than 72
> chars, ASCII ones at that).

Oops; I'll fix that.

> But how certain are you at that point that the authors are in UTF-8
> format?  IOW what encoding conversions were possibly performed up to that
> point?

I don't believe there are any encoding conversions performed up to that 
point. IIRC git doesn't require any encoding but encourages UTF-8; if it's 
something obscure, I have no way of knowing how wide in screen columns the 
author field is because I likely don't have a library for it in git at 
all. I do have a utf8.c, though.

Currently, however, printf("%*.*s", width, width, author) is simply wrong, 
because printf only cares about bytes, not screen columns. Do you think I 
should fall back on the old behavior if i18n.commitencoding is set, or if 
at least one of the author names isn't parseable as UTF-8, or something? 
Or should I be doing this with iconv and assuming all commits are 
encoded in the current encoding specified via $LANG or $LC_whatever?

-- 
Geoffrey Thomas
geofft@mit.edu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] utf8: add utf8_strwidth()
  2009-01-30  9:41 [PATCH] utf8: add utf8_strwidth() Geoffrey Thomas
  2009-01-30  9:41 ` [PATCH] builtin-blame.c: Use utf8_strwidth for author's names Geoffrey Thomas
@ 2009-01-31  7:17 ` Jeff King
  2009-01-31  8:51   ` Geoffrey Thomas
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff King @ 2009-01-31  7:17 UTC (permalink / raw)
  To: Geoffrey Thomas; +Cc: git

On Fri, Jan 30, 2009 at 04:41:28AM -0500, Geoffrey Thomas wrote:

> I'm about to use this pattern more than once, so make it a common function.

I know next to nothing about our encoding functions, but this seems
suspiciously similar to utf8_width in utf8.c. There is also a
git_wcwidth, but I don't know how they relate.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-01-30 22:22     ` Geoffrey Thomas
@ 2009-01-31  7:24       ` Jeff King
  2009-02-01 22:34       ` Johannes Schindelin
  1 sibling, 0 replies; 13+ messages in thread
From: Jeff King @ 2009-01-31  7:24 UTC (permalink / raw)
  To: Geoffrey Thomas; +Cc: Johannes Schindelin, git

On Fri, Jan 30, 2009 at 05:22:07PM -0500, Geoffrey Thomas wrote:

> I don't believe there are any encoding conversions performed up to that  
> point. IIRC git doesn't require any encoding but encourages UTF-8; if it's 
> something obscure, I have no way of knowing how wide in screen columns the 
> author field is because I likely don't have a library for it in git at  
> all. I do have a utf8.c, though.

Don't we pull the author from the commit message after it has been
converted using reencode_commit_message (see get_commit_info)? That
should be respecting the log output encoding.

It looks like we just throw away the information on what we encoded _to_
(i.e., the second parameter of reencode_commit_message).  Probably we
need to remember that and use a generic "what is the width of this
string in this encoding" function.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] utf8: add utf8_strwidth()
  2009-01-31  7:17 ` [PATCH] utf8: add utf8_strwidth() Jeff King
@ 2009-01-31  8:51   ` Geoffrey Thomas
  2009-01-31  8:56     ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Geoffrey Thomas @ 2009-01-31  8:51 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Sat, 31 Jan 2009, Jeff King wrote:
> I know next to nothing about our encoding functions, but this seems
> suspiciously similar to utf8_width in utf8.c. There is also a
> git_wcwidth, but I don't know how they relate.

git_wcwidth determines the screen columns of a single ucs_char_t. 
utf8_width returns the git_wcwidth of the first character in a string. 
utf8_strwidth (the function added by this patch) is a simple loop around 
utf8_width, because writing the loop every time would be silly.

On that note, there are probably more cases in the code that ought to use 
something like utf8_strwidth. I only noticed this one case because I'm 
working on a project with someone with an accented letter in his last 
name.

-- 
Geoffrey Thomas
geofft@mit.edu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] utf8: add utf8_strwidth()
  2009-01-31  8:51   ` Geoffrey Thomas
@ 2009-01-31  8:56     ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2009-01-31  8:56 UTC (permalink / raw)
  To: Geoffrey Thomas; +Cc: git

On Sat, Jan 31, 2009 at 03:51:48AM -0500, Geoffrey Thomas wrote:

> On Sat, 31 Jan 2009, Jeff King wrote:
>> I know next to nothing about our encoding functions, but this seems
>> suspiciously similar to utf8_width in utf8.c. There is also a
>> git_wcwidth, but I don't know how they relate.
>
> git_wcwidth determines the screen columns of a single ucs_char_t.  
> utf8_width returns the git_wcwidth of the first character in a string.  
> utf8_strwidth (the function added by this patch) is a simple loop around  
> utf8_width, because writing the loop every time would be silly.

Urgh. Sorry. If I had taken 3 seconds to actually _look_ at your patch,
I would have seen that (instead I thought "don't we already have
something that does this" and went straight to the existing code).

But no, it looks like we don't already have this, so your patch is fine.
Sorry for the noise.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-01-30 22:22     ` Geoffrey Thomas
  2009-01-31  7:24       ` Jeff King
@ 2009-02-01 22:34       ` Johannes Schindelin
  2009-02-02  6:48         ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Johannes Schindelin @ 2009-02-01 22:34 UTC (permalink / raw)
  To: Geoffrey Thomas; +Cc: git

Hi,

On Fri, 30 Jan 2009, Geoffrey Thomas wrote:

> Currently, however, printf("%*.*s", width, width, author) is simply 
> wrong, because printf only cares about bytes, not screen columns. Do you 
> think I should fall back on the old behavior if i18n.commitencoding is 
> set, or if at least one of the author names isn't parseable as UTF-8, or 
> something? Or should I be doing this with iconv and assuming all commits 
> are encoded in the current encoding specified via $LANG or $LC_whatever?

I do not know what encoding the author is at that point, but if you cannot 
be sure that it is UTF-8, using utf8_strwidth() is just as wrong as the 
current code, IMHO.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-02-01 22:34       ` Johannes Schindelin
@ 2009-02-02  6:48         ` Junio C Hamano
  2009-02-02 12:40           ` Johannes Schindelin
  2009-02-02 12:41           ` Jeff King
  0 siblings, 2 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-02-02  6:48 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Geoffrey Thomas, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Fri, 30 Jan 2009, Geoffrey Thomas wrote:
>
>> Currently, however, printf("%*.*s", width, width, author) is simply 
>> wrong, because printf only cares about bytes, not screen columns. Do you 
>> think I should fall back on the old behavior if i18n.commitencoding is 
>> set, or if at least one of the author names isn't parseable as UTF-8, or 
>> something? Or should I be doing this with iconv and assuming all commits 
>> are encoded in the current encoding specified via $LANG or $LC_whatever?
>
> I do not know what encoding the author is at that point, but if you cannot 
> be sure that it is UTF-8, using utf8_strwidth() is just as wrong as the 
> current code, IMHO.

That is true, but then we are not losing anything.

This codepath is not about the payload (the contents of the files) but the
author name part of the commit log message, and UTF-8 would probably be
the only sensible encoding to standardize on.

If your project uses UTF-8 for everybody, great, we will align them better
than we did before.  If not, sorry, you will get a different misaligned
names.

That assumes utf8_width() does not barf when fed an invalid byte sequence,
but I did not think it is that fragile (I didn't actually audit the
codepath, though).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-02-02  6:48         ` Junio C Hamano
@ 2009-02-02 12:40           ` Johannes Schindelin
  2009-02-03  4:30             ` Junio C Hamano
  2009-02-02 12:41           ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Johannes Schindelin @ 2009-02-02 12:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Geoffrey Thomas, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2540 bytes --]

Hi,

On Sun, 1 Feb 2009, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > On Fri, 30 Jan 2009, Geoffrey Thomas wrote:
> >
> >> Currently, however, printf("%*.*s", width, width, author) is simply 
> >> wrong, because printf only cares about bytes, not screen columns. Do 
> >> you think I should fall back on the old behavior if 
> >> i18n.commitencoding is set, or if at least one of the author names 
> >> isn't parseable as UTF-8, or something? Or should I be doing this 
> >> with iconv and assuming all commits are encoded in the current 
> >> encoding specified via $LANG or $LC_whatever?
> >
> > I do not know what encoding the author is at that point, but if you 
> > cannot be sure that it is UTF-8, using utf8_strwidth() is just as 
> > wrong as the current code, IMHO.
> 
> That is true, but then we are not losing anything.
> 
> This codepath is not about the payload (the contents of the files) but 
> the author name part of the commit log message, and UTF-8 would probably 
> be the only sensible encoding to standardize on.

Almost agree, except for shops where you have an enforced encoding that 
you cannot easily change.

And last time I checked, many more encodings used 1 character/byte (or for 
that matter, 1 column / byte) than not; utf8_width would be "more wrong" 
than strlen() here, because strlen() would "happen to work" here.

> If your project uses UTF-8 for everybody, great, we will align them 
> better than we did before.  If not, sorry, you will get a different 
> misaligned names.

There _has_ to be a way to check if the current author string is encoded 
in UTF-8.  All I am asking is that the original poster would put just a 
_little_ more effort into the issue and make the thing dependent on the 
knowledge -- as opposed to the assumption -- that the author is encoded in 
UTF-8.

> That assumes utf8_width() does not barf when fed an invalid byte 
> sequence, but I did not think it is that fragile (I didn't actually 
> audit the codepath, though).

That is the code that barfs in wcwidth:

        if (ch < 32 || (ch >= 0x7f && ch < 0xa0))
                return -1;

That is not a big problem, but Geoff's code does not handle that case 
correctly.  For example, in Code-page 437, a name like "£ïñûç" would 
result in a negative width.

But hey, it is definitely not my itch, I will never suffer from the 
fallout of this patch, as I am safely within US-ASCII.  I just thought I 
saw a potential problem and a possible way out.  That's it.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-02-02  6:48         ` Junio C Hamano
  2009-02-02 12:40           ` Johannes Schindelin
@ 2009-02-02 12:41           ` Jeff King
  1 sibling, 0 replies; 13+ messages in thread
From: Jeff King @ 2009-02-02 12:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Geoffrey Thomas, git

On Sun, Feb 01, 2009 at 10:48:51PM -0800, Junio C Hamano wrote:

> > I do not know what encoding the author is at that point, but if you cannot 
> > be sure that it is UTF-8, using utf8_strwidth() is just as wrong as the 
> > current code, IMHO.
> 
> That is true, but then we are not losing anything.
> 
> This codepath is not about the payload (the contents of the files) but the
> author name part of the commit log message, and UTF-8 would probably be
> the only sensible encoding to standardize on.
> 
> If your project uses UTF-8 for everybody, great, we will align them better
> than we did before.  If not, sorry, you will get a different misaligned
> names.
> 
> That assumes utf8_width() does not barf when fed an invalid byte sequence,
> but I did not think it is that fragile (I didn't actually audit the
> codepath, though).

We should be able to know the encoding (we call reencode_commit_message,
but we don't bother to save the result). It should be trivial to do:

int strwidth(const char *s, const char *encoding)
{
  if (!strcmp(encoding, "utf-8"))
    return utf8_strwidth(s);
  /* ideally, else if (some_other_encoding_family) */
  else
    return strlen(s);
}

Then utf-8 is fixed, and other encodings keep identical behavior (and
don't even waste cycles on utf-8 decoding). And it should be obvious to
anyone who wants to add a width detector for their pet encoding where it
should go.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names
  2009-02-02 12:40           ` Johannes Schindelin
@ 2009-02-03  4:30             ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-02-03  4:30 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Geoffrey Thomas, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> And last time I checked, many more encodings used 1 character/byte (or for 
> that matter, 1 column / byte) than not; utf8_width would be "more wrong" 
> than strlen() here, because strlen() would "happen to work" here.

Ahh, you are absolutely right here, and use of utf8_width without checking
is actively breaking things.

> There _has_ to be a way to check if the current author string is encoded 
> in UTF-8.  All I am asking is that the original poster would put just a 
> _little_ more effort into the issue and make the thing dependent on the 
> knowledge -- as opposed to the assumption -- that the author is encoded in 
> UTF-8.

Yeah, that makes sense.

> That is the code that barfs in wcwidth:
>
>         if (ch < 32 || (ch >= 0x7f && ch < 0xa0))
>                 return -1;
>
> That is not a big problem, but Geoff's code does not handle that case 
> correctly.

Thanks for checking --- I suspected something like that would be there
somewhere.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-02-03  4:32 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-01-30  9:41 [PATCH] utf8: add utf8_strwidth() Geoffrey Thomas
2009-01-30  9:41 ` [PATCH] builtin-blame.c: Use utf8_strwidth for author's names Geoffrey Thomas
2009-01-30 17:12   ` Johannes Schindelin
2009-01-30 22:22     ` Geoffrey Thomas
2009-01-31  7:24       ` Jeff King
2009-02-01 22:34       ` Johannes Schindelin
2009-02-02  6:48         ` Junio C Hamano
2009-02-02 12:40           ` Johannes Schindelin
2009-02-03  4:30             ` Junio C Hamano
2009-02-02 12:41           ` Jeff King
2009-01-31  7:17 ` [PATCH] utf8: add utf8_strwidth() Jeff King
2009-01-31  8:51   ` Geoffrey Thomas
2009-01-31  8:56     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).