Re: i18n.logToUTF8: convert commit log message to UTF-8

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: i18n.logToUTF8: convert commit log message to UTF-8
@ 2006-12-24 14:18 Johannes Schindelin
  2006-12-24 15:44 ` [PATCH] commit encoding: store it in commit header rather than mucking with NUL Johannes Schindelin
  0 siblings, 1 reply; 6+ messages in thread
From: Johannes Schindelin @ 2006-12-24 14:18 UTC (permalink / raw)
  To: git, junkio

Hi,

I don't like the convention to add the encoding after a trailing NUL. I'd 
rather have it as an "encoding blabl" header like the author and committer 
headers. I.e.

-- snip --
tree 519fba5ec35f25cbac7f46574f214fb5eb95d2c8
parent 41c7c86ccbcb846cacf48c5e283983fa797cf37b
author Johannes Schindelin <Johannes.Schindelin@gmx.de> 1166969650 +0100
committer Johannes Schindelin <Johannes.Schindelin@gmx.de> 1166969650 +0100
encoding latin1

    git-commit-tree: add encoding header if commitEncoding != utf-8
-- snap --

Our log routines all grok these commit objects.

And I would prefer commit-tree to fail if we explicitely ask for 
conversion to UTF-8, but NO_ICONV is set, or reencoding fails for other 
reasons.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] commit encoding: store it in commit header rather than mucking with NUL
  2006-12-24 14:18 i18n.logToUTF8: convert commit log message to UTF-8 Johannes Schindelin
@ 2006-12-24 15:44 ` Johannes Schindelin
  2006-12-25  0:13   ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Johannes Schindelin @ 2006-12-24 15:44 UTC (permalink / raw)
  To: git, junkio


It also fixes a segmentation fault when iconv does not know the
encoding (defaulting to no conversion).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	On Sun, 24 Dec 2006, Johannes Schindelin wrote:

	> I don't like the convention to add the encoding after a trailing 
	> NUL. I'd rather have it as an "encoding blabl" header like the 
	> author and committer headers.

	This is a patch implementing that, on top of pu.

 builtin-commit-tree.c |   20 ++++++-------
 commit.c              |   70 ++++++++++++++++++++----------------------------
 2 files changed, 38 insertions(+), 52 deletions(-)

diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
index 9aff980..33c29f7 100644
--- a/builtin-commit-tree.c
+++ b/builtin-commit-tree.c
@@ -92,6 +92,7 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 	char comment[1000];
 	char *buffer;
 	unsigned int size;
+	int encoding_is_utf8;
 
 	setup_ident();
 	git_config(git_default_config);
@@ -117,6 +118,8 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 			parents++;
 	}
 
+	encoding_is_utf8 = !strcmp(git_commit_encoding, "utf-8");
+
 	init_buffer(&buffer, &size);
 	add_buffer(&buffer, &size, "tree %s\n", sha1_to_hex(tree_sha1));
 
@@ -130,7 +133,11 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 
 	/* Person/date information */
 	add_buffer(&buffer, &size, "author %s\n", git_author_info(1));
-	add_buffer(&buffer, &size, "committer %s\n\n", git_committer_info(1));
+	add_buffer(&buffer, &size, "committer %s\n", git_committer_info(1));
+	if (!encoding_is_utf8)
+		add_buffer(&buffer, &size,
+				"encoding %s\n", git_commit_encoding);
+	add_buffer(&buffer, &size, "\n");
 
 	/* And add the comment */
 	while (fgets(comment, sizeof(comment), stdin) != NULL)
@@ -138,16 +145,7 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 
 	/* And check the encoding */
 	buffer[size] = '\0';
-	if (strcmp(git_commit_encoding, "utf-8")) {
-		/*
-		 * Add trailing section: this is safe because
-		 * add_buffer always allocates one byte more.
-		 */
-		size++;
-		add_buffer(&buffer, &size,
-			   "encoding %s\n", git_commit_encoding);
-	}
-	else if (!is_utf8(buffer))
+	if (encoding_is_utf8 && !is_utf8(buffer))
 		fprintf(stderr, commit_utf8_warn);
 
 	if (!write_sha1_file(buffer, size, commit_type, commit_sha1)) {
diff --git a/commit.c b/commit.c
index 58c4ecb..3132323 100644
--- a/commit.c
+++ b/commit.c
@@ -598,56 +598,43 @@ static int add_merge_info(enum cmit_fmt fmt, char *buf, const struct commit *com
 	return offset;
 }
 
-const char *commit_trailer(const struct commit *commit)
+static char *get_header(const struct commit *commit, const char *key)
 {
-	unsigned long size, len;
-	char type[10];
+	int key_len = strlen(key);
+	const char *line = commit->buffer;
 
-	if (sha1_object_info(commit->object.sha1, type, &size))
-		return NULL;
-	len = strlen(commit->buffer);
-	if (len == size)
-		return NULL;
-	return commit->buffer + len + 1;
-}
-
-char *find_commit_trailer(const struct commit *commit, const char *key)
-{
-	const char *trailer = commit_trailer(commit);
-	int keylen = strlen(key);
-	if (!trailer)
-		return NULL;
-	while (*trailer) {
-		const char *eol = strchr(trailer, '\n');
-		if (!eol)
-			eol = trailer + strlen(trailer);
-		if (!strncmp(trailer, key, keylen) && trailer[keylen] == ' ') {
-			int valsz = eol - trailer - keylen;
-			char *val = xmalloc(valsz + 1);
-			memcpy(val, trailer + keylen + 1, valsz);
-			val[valsz] = '\0';
-			if (val[valsz-1] == '\n')
-				val[valsz-1] = '\0';
-			return val;
+	for (;;) {
+		const char *eol = strchr(line, '\n'), *next;
+
+		if (line == eol)
+			return NULL;
+		if (!eol) {
+			eol = line + strlen(line);
+			next = NULL;
+		} else
+			next = eol + 1;
+		if (!strncmp(line, key, key_len) && line[key_len] == ' ') {
+			int len = eol - line - key_len;
+			char *ret = xmalloc(len);
+			memcpy(ret, line + key_len + 1, len - 1);
+			ret[len - 1] = '\0';
+			return ret;
 		}
-		if (!eol)
-			break;
-		trailer = eol + 1;
+		line = next;
 	}
-	return NULL;
 }
 
 static char *logmsg_to_utf8(const struct commit *commit)
 {
-	char *encoding = find_commit_trailer(commit, "encoding");
+	char *encoding = get_header(commit, "encoding");
 	char *out;
 
 	if (!encoding)
-		return commit->buffer;
+		return NULL;
 	out = reencode_string(commit->buffer, "utf-8", encoding);
 	free(encoding);
 	if (!out)
-		return commit->buffer;
+		return NULL;
 	return out;
 }
 
@@ -662,11 +649,12 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt, const struct commit *commit
 	int parents_shown = 0;
 	const char *msg = commit->buffer;
 	int plain_non_ascii = 0;
-	int msg_reencoded = 0;
+	char *reencoded = NULL;
 
 	if (strcmp(git_commit_encoding, "utf-8") && i18n_log_to_utf8) {
-		msg = logmsg_to_utf8(commit);
-		msg_reencoded = 1;
+		reencoded = logmsg_to_utf8(commit);
+		if (reencoded)
+			msg = reencoded;
 	}
 
 	if (fmt == CMIT_FMT_ONELINE || fmt == CMIT_FMT_EMAIL)
@@ -816,8 +804,8 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt, const struct commit *commit
 		buf[offset++] = '\n';
 	buf[offset] = '\0';
 
-	if (msg_reencoded)
-		free((char*)msg);
+	if (reencoded)
+		free(reencoded);
 	return offset;
 }
 
-- 
1.4.4.3.g0001f-dirty

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit encoding: store it in commit header rather than mucking with NUL
  2006-12-24 15:44 ` [PATCH] commit encoding: store it in commit header rather than mucking with NUL Johannes Schindelin
@ 2006-12-25  0:13   ` Junio C Hamano
  2006-12-25  0:41     ` Johannes Schindelin
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2006-12-25  0:13 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, junkio

Thanks for paying a very close attention to what is in 'pu'.
The series was a work in progress not yet ready for publication,
which I merged by mistake because I was way too sleepy.

There are three reasons (and a half -- it was not even compile
tested) that I did not mean to merge it yet.

 - I did not want to break existing git implementations, but
   hadn't audited the commit header parsers to see if they do
   not get upset when they see unrecognized header fields in
   commit objects.  The 'trailer' was just a quick hack to have
   something working for me to test the output conversion code.
   I've looked at the code since then and I think it is Ok to
   add new ones after author/name fields like you did.

   If we _were_ to do this, I think it is preferable to make it
   a header, not a trailer like I did.

 - I was not sure if the "assume the whole commit->buffer is in
   the local encoding and recode it into UTF-8" is correct.  The
   message body ought to be in i18n.commitencoding from the
   repository the commit comes from, which is now recorded in
   the trailer (or the header as in your patch), but what I was
   unsure about was the encoding in which author and committer
   names are recorded.  I think they are set either by the
   environment variables or user.name to be consistent with
   i18n.commitencoding in the repository the commit is made, so
   it is Ok after all.

 - Existing Porcelains such as gitk know i18n.commitencoding is
   a hint to them by the core, and expect the core to give them
   output in the local encoding.  With the change, the core
   feeds UTF-8 to the caller, unless the Porcelain gets the log
   with plumbing "cat-file".  This means they either have to
   lose code to do their own recoding (which is probably a good
   thing in the long run), or we would need to have a flag for
   them to tell the core not to do the conversion.  But a new
   flag to ask for older behaviour is always a wrong way of
   transitioning across backward incompatibility.

   I think the output conversion from the log should be more
   explicitly asked for it, than just a mere configuration
   variable that cannot be overriden by gitk and friends.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit encoding: store it in commit header rather than mucking with NUL
  2006-12-25  0:13   ` Junio C Hamano
@ 2006-12-25  0:41     ` Johannes Schindelin
  2006-12-25  7:27       ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Johannes Schindelin @ 2006-12-25  0:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hi,

On Sun, 24 Dec 2006, Junio C Hamano wrote:

> Thanks for paying a very close attention to what is in 'pu'.

You're welcome ;-)

>  - I did not want to break existing git implementations, but
>    hadn't audited the commit header parsers to see if they do
>    not get upset when they see unrecognized header fields in
>    commit objects.

I did not really audit it, but am quite sure that only parse_commit() 
really walks the buffer, and even this function just ignores what comes 
after committer. But if I read the code correctly, then the mandatory 
headers ("tree parent* author committer") have to come first, in that 
order.

>  - I was not sure if the "assume the whole commit->buffer is in
>    the local encoding and recode it into UTF-8" is correct.

For the purpose of showing it, there is no point in using two different 
encodings. I am not aware of any terminal (and do not own such a terminal 
anyway) which can display text with parts encoded differently from the 
rest.

>  - Existing Porcelains such as gitk know i18n.commitencoding is
>    a hint to them by the core, and expect the core to give them
>    output in the local encoding.  With the change, the core
>    feeds UTF-8 to the caller, unless the Porcelain gets the log
>    with plumbing "cat-file".  This means they either have to
>    lose code to do their own recoding (which is probably a good
>    thing in the long run), or we would need to have a flag for
>    them to tell the core not to do the conversion.  But a new
>    flag to ask for older behaviour is always a wrong way of
>    transitioning across backward incompatibility.
> 
>    I think the output conversion from the log should be more
>    explicitly asked for it, than just a mere configuration
>    variable that cannot be overriden by gitk and friends.

Probably a new flag "--encoding" to log, which sets git_commit_encoding. 
And the function you introduced should probably not blindly convert to 
UTF-8, but to the current git_commit_encoding.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit encoding: store it in commit header rather than mucking with NUL
  2006-12-25  0:41     ` Johannes Schindelin
@ 2006-12-25  7:27       ` Junio C Hamano
  2006-12-25 22:54         ` Jakub Narebski
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2006-12-25  7:27 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>>  - I was not sure if the "assume the whole commit->buffer is in
>>    the local encoding and recode it into UTF-8" is correct.
>
> For the purpose of showing it, there is no point in using two different 
> encodings. I am not aware of any terminal (and do not own such a terminal 
> anyway) which can display text with parts encoded differently from the 
> rest.

That's not what I meant.  Because the definition of the contents
of the commit has been (and will be --- the unpublished topic
was about suggesting stronger convention across Porcelains) just
the set of fixed headers plus binary blob, the use of different
encodings is entirely up to the users.

I was afraid that there might be something we did (or we did not
do) that encouraged people to have their names (via environment
variables, or perhaps user.name) always in UTF-8 while recording
the log messages in the legacy encoding, and if that kind of use
is already done in the wild, we would end up having to not
reencode the header field but reencode the body.

But I do not think we ever encouraged encoding names in UTF-8 or
anything else (we did encourage use of UTF-8 in the commit log),
so I think we are Ok.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit encoding: store it in commit header rather than mucking with NUL
  2006-12-25  7:27       ` Junio C Hamano
@ 2006-12-25 22:54         ` Jakub Narebski
  0 siblings, 0 replies; 6+ messages in thread
From: Jakub Narebski @ 2006-12-25 22:54 UTC (permalink / raw)
  To: git

Junio C Hamano wrote:

> I was afraid that there might be something we did (or we did not
> do) that encouraged people to have their names (via environment
> variables, or perhaps user.name) always in UTF-8 while recording
> the log messages in the legacy encoding, and if that kind of use
> is already done in the wild, we would end up having to not
> reencode the header field but reencode the body.
> 
> But I do not think we ever encouraged encoding names in UTF-8 or
> anything else (we did encourage use of UTF-8 in the commit log),
> so I think we are Ok.
  
By the way, it would be nice to have .mailmap like mechanism to tell git
that those two (three or more) names are the same comitter and should be
shown (at least in git-shortlog) as this. This is because some people
changed their email, some people have sometimes middle initial and
sometimes not, some people have name outside US-ASCII and it is sometimes
broken...
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2006-12-25 22:52 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-12-24 14:18 i18n.logToUTF8: convert commit log message to UTF-8 Johannes Schindelin
2006-12-24 15:44 ` [PATCH] commit encoding: store it in commit header rather than mucking with NUL Johannes Schindelin
2006-12-25  0:13   ` Junio C Hamano
2006-12-25  0:41     ` Johannes Schindelin
2006-12-25  7:27       ` Junio C Hamano
2006-12-25 22:54         ` Jakub Narebski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).