* Re: i18n.logToUTF8: convert commit log message to UTF-8
@ 2006-12-24 14:18 Johannes Schindelin
2006-12-24 15:44 ` [PATCH] commit encoding: store it in commit header rather than mucking with NUL Johannes Schindelin
0 siblings, 1 reply; 6+ messages in thread
From: Johannes Schindelin @ 2006-12-24 14:18 UTC (permalink / raw)
To: git, junkio
Hi,
I don't like the convention to add the encoding after a trailing NUL. I'd
rather have it as an "encoding blabl" header like the author and committer
headers. I.e.
-- snip --
tree 519fba5ec35f25cbac7f46574f214fb5eb95d2c8
parent 41c7c86ccbcb846cacf48c5e283983fa797cf37b
author Johannes Schindelin <Johannes.Schindelin@gmx.de> 1166969650 +0100
committer Johannes Schindelin <Johannes.Schindelin@gmx.de> 1166969650 +0100
encoding latin1
git-commit-tree: add encoding header if commitEncoding != utf-8
-- snap --
Our log routines all grok these commit objects.
And I would prefer commit-tree to fail if we explicitely ask for
conversion to UTF-8, but NO_ICONV is set, or reencoding fails for other
reasons.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 6+ messages in thread* [PATCH] commit encoding: store it in commit header rather than mucking with NUL 2006-12-24 14:18 i18n.logToUTF8: convert commit log message to UTF-8 Johannes Schindelin @ 2006-12-24 15:44 ` Johannes Schindelin 2006-12-25 0:13 ` Junio C Hamano 0 siblings, 1 reply; 6+ messages in thread From: Johannes Schindelin @ 2006-12-24 15:44 UTC (permalink / raw) To: git, junkio It also fixes a segmentation fault when iconv does not know the encoding (defaulting to no conversion). Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- On Sun, 24 Dec 2006, Johannes Schindelin wrote: > I don't like the convention to add the encoding after a trailing > NUL. I'd rather have it as an "encoding blabl" header like the > author and committer headers. This is a patch implementing that, on top of pu. builtin-commit-tree.c | 20 ++++++------- commit.c | 70 ++++++++++++++++++++---------------------------- 2 files changed, 38 insertions(+), 52 deletions(-) diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c index 9aff980..33c29f7 100644 --- a/builtin-commit-tree.c +++ b/builtin-commit-tree.c @@ -92,6 +92,7 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix) char comment[1000]; char *buffer; unsigned int size; + int encoding_is_utf8; setup_ident(); git_config(git_default_config); @@ -117,6 +118,8 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix) parents++; } + encoding_is_utf8 = !strcmp(git_commit_encoding, "utf-8"); + init_buffer(&buffer, &size); add_buffer(&buffer, &size, "tree %s\n", sha1_to_hex(tree_sha1)); @@ -130,7 +133,11 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix) /* Person/date information */ add_buffer(&buffer, &size, "author %s\n", git_author_info(1)); - add_buffer(&buffer, &size, "committer %s\n\n", git_committer_info(1)); + add_buffer(&buffer, &size, "committer %s\n", git_committer_info(1)); + if (!encoding_is_utf8) + add_buffer(&buffer, &size, + "encoding %s\n", git_commit_encoding); + add_buffer(&buffer, &size, "\n"); /* And add the comment */ while (fgets(comment, sizeof(comment), stdin) != NULL) @@ -138,16 +145,7 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix) /* And check the encoding */ buffer[size] = '\0'; - if (strcmp(git_commit_encoding, "utf-8")) { - /* - * Add trailing section: this is safe because - * add_buffer always allocates one byte more. - */ - size++; - add_buffer(&buffer, &size, - "encoding %s\n", git_commit_encoding); - } - else if (!is_utf8(buffer)) + if (encoding_is_utf8 && !is_utf8(buffer)) fprintf(stderr, commit_utf8_warn); if (!write_sha1_file(buffer, size, commit_type, commit_sha1)) { diff --git a/commit.c b/commit.c index 58c4ecb..3132323 100644 --- a/commit.c +++ b/commit.c @@ -598,56 +598,43 @@ static int add_merge_info(enum cmit_fmt fmt, char *buf, const struct commit *com return offset; } -const char *commit_trailer(const struct commit *commit) +static char *get_header(const struct commit *commit, const char *key) { - unsigned long size, len; - char type[10]; + int key_len = strlen(key); + const char *line = commit->buffer; - if (sha1_object_info(commit->object.sha1, type, &size)) - return NULL; - len = strlen(commit->buffer); - if (len == size) - return NULL; - return commit->buffer + len + 1; -} - -char *find_commit_trailer(const struct commit *commit, const char *key) -{ - const char *trailer = commit_trailer(commit); - int keylen = strlen(key); - if (!trailer) - return NULL; - while (*trailer) { - const char *eol = strchr(trailer, '\n'); - if (!eol) - eol = trailer + strlen(trailer); - if (!strncmp(trailer, key, keylen) && trailer[keylen] == ' ') { - int valsz = eol - trailer - keylen; - char *val = xmalloc(valsz + 1); - memcpy(val, trailer + keylen + 1, valsz); - val[valsz] = '\0'; - if (val[valsz-1] == '\n') - val[valsz-1] = '\0'; - return val; + for (;;) { + const char *eol = strchr(line, '\n'), *next; + + if (line == eol) + return NULL; + if (!eol) { + eol = line + strlen(line); + next = NULL; + } else + next = eol + 1; + if (!strncmp(line, key, key_len) && line[key_len] == ' ') { + int len = eol - line - key_len; + char *ret = xmalloc(len); + memcpy(ret, line + key_len + 1, len - 1); + ret[len - 1] = '\0'; + return ret; } - if (!eol) - break; - trailer = eol + 1; + line = next; } - return NULL; } static char *logmsg_to_utf8(const struct commit *commit) { - char *encoding = find_commit_trailer(commit, "encoding"); + char *encoding = get_header(commit, "encoding"); char *out; if (!encoding) - return commit->buffer; + return NULL; out = reencode_string(commit->buffer, "utf-8", encoding); free(encoding); if (!out) - return commit->buffer; + return NULL; return out; } @@ -662,11 +649,12 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt, const struct commit *commit int parents_shown = 0; const char *msg = commit->buffer; int plain_non_ascii = 0; - int msg_reencoded = 0; + char *reencoded = NULL; if (strcmp(git_commit_encoding, "utf-8") && i18n_log_to_utf8) { - msg = logmsg_to_utf8(commit); - msg_reencoded = 1; + reencoded = logmsg_to_utf8(commit); + if (reencoded) + msg = reencoded; } if (fmt == CMIT_FMT_ONELINE || fmt == CMIT_FMT_EMAIL) @@ -816,8 +804,8 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt, const struct commit *commit buf[offset++] = '\n'; buf[offset] = '\0'; - if (msg_reencoded) - free((char*)msg); + if (reencoded) + free(reencoded); return offset; } -- 1.4.4.3.g0001f-dirty ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] commit encoding: store it in commit header rather than mucking with NUL 2006-12-24 15:44 ` [PATCH] commit encoding: store it in commit header rather than mucking with NUL Johannes Schindelin @ 2006-12-25 0:13 ` Junio C Hamano 2006-12-25 0:41 ` Johannes Schindelin 0 siblings, 1 reply; 6+ messages in thread From: Junio C Hamano @ 2006-12-25 0:13 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git, junkio Thanks for paying a very close attention to what is in 'pu'. The series was a work in progress not yet ready for publication, which I merged by mistake because I was way too sleepy. There are three reasons (and a half -- it was not even compile tested) that I did not mean to merge it yet. - I did not want to break existing git implementations, but hadn't audited the commit header parsers to see if they do not get upset when they see unrecognized header fields in commit objects. The 'trailer' was just a quick hack to have something working for me to test the output conversion code. I've looked at the code since then and I think it is Ok to add new ones after author/name fields like you did. If we _were_ to do this, I think it is preferable to make it a header, not a trailer like I did. - I was not sure if the "assume the whole commit->buffer is in the local encoding and recode it into UTF-8" is correct. The message body ought to be in i18n.commitencoding from the repository the commit comes from, which is now recorded in the trailer (or the header as in your patch), but what I was unsure about was the encoding in which author and committer names are recorded. I think they are set either by the environment variables or user.name to be consistent with i18n.commitencoding in the repository the commit is made, so it is Ok after all. - Existing Porcelains such as gitk know i18n.commitencoding is a hint to them by the core, and expect the core to give them output in the local encoding. With the change, the core feeds UTF-8 to the caller, unless the Porcelain gets the log with plumbing "cat-file". This means they either have to lose code to do their own recoding (which is probably a good thing in the long run), or we would need to have a flag for them to tell the core not to do the conversion. But a new flag to ask for older behaviour is always a wrong way of transitioning across backward incompatibility. I think the output conversion from the log should be more explicitly asked for it, than just a mere configuration variable that cannot be overriden by gitk and friends. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] commit encoding: store it in commit header rather than mucking with NUL 2006-12-25 0:13 ` Junio C Hamano @ 2006-12-25 0:41 ` Johannes Schindelin 2006-12-25 7:27 ` Junio C Hamano 0 siblings, 1 reply; 6+ messages in thread From: Johannes Schindelin @ 2006-12-25 0:41 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Hi, On Sun, 24 Dec 2006, Junio C Hamano wrote: > Thanks for paying a very close attention to what is in 'pu'. You're welcome ;-) > - I did not want to break existing git implementations, but > hadn't audited the commit header parsers to see if they do > not get upset when they see unrecognized header fields in > commit objects. I did not really audit it, but am quite sure that only parse_commit() really walks the buffer, and even this function just ignores what comes after committer. But if I read the code correctly, then the mandatory headers ("tree parent* author committer") have to come first, in that order. > - I was not sure if the "assume the whole commit->buffer is in > the local encoding and recode it into UTF-8" is correct. For the purpose of showing it, there is no point in using two different encodings. I am not aware of any terminal (and do not own such a terminal anyway) which can display text with parts encoded differently from the rest. > - Existing Porcelains such as gitk know i18n.commitencoding is > a hint to them by the core, and expect the core to give them > output in the local encoding. With the change, the core > feeds UTF-8 to the caller, unless the Porcelain gets the log > with plumbing "cat-file". This means they either have to > lose code to do their own recoding (which is probably a good > thing in the long run), or we would need to have a flag for > them to tell the core not to do the conversion. But a new > flag to ask for older behaviour is always a wrong way of > transitioning across backward incompatibility. > > I think the output conversion from the log should be more > explicitly asked for it, than just a mere configuration > variable that cannot be overriden by gitk and friends. Probably a new flag "--encoding" to log, which sets git_commit_encoding. And the function you introduced should probably not blindly convert to UTF-8, but to the current git_commit_encoding. Ciao, Dscho ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] commit encoding: store it in commit header rather than mucking with NUL 2006-12-25 0:41 ` Johannes Schindelin @ 2006-12-25 7:27 ` Junio C Hamano 2006-12-25 22:54 ` Jakub Narebski 0 siblings, 1 reply; 6+ messages in thread From: Junio C Hamano @ 2006-12-25 7:27 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: >> - I was not sure if the "assume the whole commit->buffer is in >> the local encoding and recode it into UTF-8" is correct. > > For the purpose of showing it, there is no point in using two different > encodings. I am not aware of any terminal (and do not own such a terminal > anyway) which can display text with parts encoded differently from the > rest. That's not what I meant. Because the definition of the contents of the commit has been (and will be --- the unpublished topic was about suggesting stronger convention across Porcelains) just the set of fixed headers plus binary blob, the use of different encodings is entirely up to the users. I was afraid that there might be something we did (or we did not do) that encouraged people to have their names (via environment variables, or perhaps user.name) always in UTF-8 while recording the log messages in the legacy encoding, and if that kind of use is already done in the wild, we would end up having to not reencode the header field but reencode the body. But I do not think we ever encouraged encoding names in UTF-8 or anything else (we did encourage use of UTF-8 in the commit log), so I think we are Ok. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] commit encoding: store it in commit header rather than mucking with NUL 2006-12-25 7:27 ` Junio C Hamano @ 2006-12-25 22:54 ` Jakub Narebski 0 siblings, 0 replies; 6+ messages in thread From: Jakub Narebski @ 2006-12-25 22:54 UTC (permalink / raw) To: git Junio C Hamano wrote: > I was afraid that there might be something we did (or we did not > do) that encouraged people to have their names (via environment > variables, or perhaps user.name) always in UTF-8 while recording > the log messages in the legacy encoding, and if that kind of use > is already done in the wild, we would end up having to not > reencode the header field but reencode the body. > > But I do not think we ever encouraged encoding names in UTF-8 or > anything else (we did encourage use of UTF-8 in the commit log), > so I think we are Ok. By the way, it would be nice to have .mailmap like mechanism to tell git that those two (three or more) names are the same comitter and should be shown (at least in git-shortlog) as this. This is because some people changed their email, some people have sometimes middle initial and sometimes not, some people have name outside US-ASCII and it is sometimes broken... -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2006-12-25 22:52 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-12-24 14:18 i18n.logToUTF8: convert commit log message to UTF-8 Johannes Schindelin 2006-12-24 15:44 ` [PATCH] commit encoding: store it in commit header rather than mucking with NUL Johannes Schindelin 2006-12-25 0:13 ` Junio C Hamano 2006-12-25 0:41 ` Johannes Schindelin 2006-12-25 7:27 ` Junio C Hamano 2006-12-25 22:54 ` Jakub Narebski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).