[PATCH] Reencode committer info to utf-8 before formatting mail header


* [PATCH] Reencode committer info to utf-8 before formatting mail header
@ 2007-01-12 13:06 David Kågedal
  2007-01-12 22:11 ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: David Kågedal @ 2007-01-12 13:06 UTC (permalink / raw)
  To: git

The add_user_info function formats the commit as a mail message, and
uses add_rfc2047 to format the From: line.  The add_rfc2047 assumes
that the string is encoded as utf-8.
---
 builtin-mailinfo.c |    2 +-
 commit.c           |   10 +++++++++-
 utf8.c             |    9 +++++++--
 utf8.h             |    2 +-
 4 files changed, 18 insertions(+), 5 deletions(-)

I was hit by this problem when working with an old repository where I
had used latin1, and I tried to use "git rebase".

Another option would have been to use the correct encoding in the
RFC2047 header, but this was a quicker solution.
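
For comparison, that other option (labelling the encoded-word with the
charset the bytes are actually in, rather than converting them) could look
roughly like the sketch below.  q_encode_word() is a hypothetical helper
written for this illustration, not git's add_rfc2047(); it assumes the
caller supplies the charset name and the output buffer, and it does not
split long input into several encoded-words.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): write `len` bytes of `text` as one
 * RFC 2047 "Q" encoded-word labelled with `charset`.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * output buffer is too small.
 */
static int q_encode_word(char *out, size_t outsz, const char *text,
			 size_t len, const char *charset)
{
	size_t pos, i;
	int n = snprintf(out, outsz, "=?%s?q?", charset);

	if (n < 0 || (size_t)n >= outsz)
		return -1;	/* the charset label did not fit */
	pos = (size_t)n;
	for (i = 0; i < len; i++) {
		unsigned char ch = (unsigned char)text[i];

		if (pos + 4 > outsz)
			return -1;	/* no room for "=XX" plus sprintf's NUL */
		if ((ch & 0x80) || ch == '=' || ch == '?' || ch == '_') {
			sprintf(out + pos, "=%02X", (unsigned)ch);
			pos += 3;
		} else if (ch == ' ') {
			out[pos++] = '_';	/* Q encoding writes space as '_' */
		} else {
			out[pos++] = (char)ch;
		}
	}
	if (pos + 3 > outsz)
		return -1;	/* no room for "?=" plus the final NUL */
	memcpy(out + pos, "?=", 3);	/* copies the terminating NUL too */
	return (int)(pos + 2);
}
```

Called with the latin1 bytes of a name and the label "iso-8859-1", this
leaves the bytes as they are and only changes what the header claims about
them; the patch below instead converts the bytes so that the existing
utf-8 label becomes true.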

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 583da38..3fd8e00 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -513,7 +513,7 @@ static void convert_to_utf8(char *line, char *charset)
 {
 	static char latin_one[] = "latin1";
 	char *input_charset = *charset ? charset : latin_one;
-	char *out = reencode_string(line, metainfo_charset, input_charset);
+	char *out = reencode_string(line, metainfo_charset, input_charset, NULL);
 
 	if (!out)
 		die("cannot convert from %s to %s\n",
diff --git a/commit.c b/commit.c
index 496d37a..8477fa7 100644
--- a/commit.c
+++ b/commit.c
@@ -486,6 +486,10 @@ static int add_rfc2047(char *buf, const char *line, int len)
 	if (!needquote)
 		return sprintf(buf, "%.*s", len, line);
 
+        if (git_commit_encoding)
+                line = reencode_string(line, "utf-8",
+                                       git_commit_encoding, &len);
+
 	memcpy(bp, q_utf8, sizeof(q_utf8)-1);
 	bp += sizeof(q_utf8)-1;
 	for (i = 0; i < len; i++) {
@@ -501,6 +505,10 @@ static int add_rfc2047(char *buf, const char *line, int len)
 	}
 	memcpy(bp, "?=", 2);
 	bp += 2;
+
+        if (git_commit_encoding)
+                free((char *)line);
+
 	return bp - buf;
 }
 
@@ -687,7 +695,7 @@ static char *logmsg_reencode(const struct commit *commit)
 		out = strdup(commit->buffer);
 	else
 		out = reencode_string(commit->buffer,
-				      output_encoding, encoding);
+				      output_encoding, encoding, NULL);
 	if (out)
 		out = replace_encoding_header(out, output_encoding);
 
diff --git a/utf8.c b/utf8.c
index 7c80eec..ee9f514 100644
--- a/utf8.c
+++ b/utf8.c
@@ -291,7 +291,7 @@ int is_encoding_utf8(const char *name)
  * with iconv.  If the conversion fails, returns NULL.
  */
 #ifndef NO_ICONV
-char *reencode_string(const char *in, const char *out_encoding, const char *in_encoding)
+char *reencode_string(const char *in, const char *out_encoding, const char *in_encoding, int *len)
 {
 	iconv_t conv;
 	size_t insz, outsz, outalloc;
@@ -302,7 +302,10 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
 	conv = iconv_open(out_encoding, in_encoding);
 	if (conv == (iconv_t) -1)
 		return NULL;
-	insz = strlen(in);
+        if (len)
+                insz = *len;
+        else
+                insz = strlen(in);
 	outsz = insz;
 	outalloc = outsz + 1; /* for terminating NUL */
 	out = xmalloc(outalloc);
@@ -332,6 +335,8 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
 		}
 		else {
 			*outpos = '\0';
+                        if (len)
+                                *len = outpos - out;
 			break;
 		}
 	}
diff --git a/utf8.h b/utf8.h
index a07c5a8..eb64d46 100644
--- a/utf8.h
+++ b/utf8.h
@@ -8,7 +8,7 @@ int is_encoding_utf8(const char *name);
 void print_wrapped_text(const char *text, int indent, int indent2, int len);
 
 #ifndef NO_ICONV
-char *reencode_string(const char *in, const char *out_encoding, const char *in_encoding);
+char *reencode_string(const char *in, const char *out_encoding, const char *in_encoding, int *len);
 #else
 #define reencode_string(a,b,c) NULL
 #endif
-- 
1.4.4.4.ge10a-dirty


-- 
David Kågedal


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-12 13:06 [PATCH] Reencode committer info to utf-8 before formatting mail header David Kågedal
@ 2007-01-12 22:11 ` Junio C Hamano
  2007-01-13  1:31   ` Junio C Hamano
  2007-01-15 16:53   ` David Kågedal
  0 siblings, 2 replies; 18+ messages in thread
From: Junio C Hamano @ 2007-01-12 22:11 UTC (permalink / raw)
  To: David Kågedal; +Cc: git

David Kågedal <davidk@lysator.liu.se> writes:

> The add_user_info function formats the commit as a mail message, and
> uses add_rfc2047 to format the From: line.  The add_rfc2047 assumes
> that the string is encoded as utf-8.

pretty_print_commit() also labels the commit log message, not just
the author name, as UTF-8 when plain_non_ascii is set.

It might make more sense to just set the log_output_encoding to
be always UTF-8 when generating an e-mail output, in
git-format-patch.

> diff --git a/utf8.h b/utf8.h
> index a07c5a8..eb64d46 100644
> --- a/utf8.h
> +++ b/utf8.h
> @@ -8,7 +8,7 @@ int is_encoding_utf8(const char *name);
>  void print_wrapped_text(const char *text, int indent, int indent2, int len);
>  
>  #ifndef NO_ICONV
> -char *reencode_string(const char *in, const char *out_encoding, const char *in_encoding);
> +char *reencode_string(const char *in, const char *out_encoding, const char *in_encoding, int *len);
>  #else
>  #define reencode_string(a,b,c) NULL
>  #endif

This feels fishy...
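
One likely reading of the concern: the NO_ICONV fallback macro in the hunk
above still takes three arguments, while the callers in the patch now pass
four, so a NO_ICONV build would presumably fail in the preprocessor.  A
minimal sketch of a fallback whose arity matches the new prototype follows;
reencode_or_keep() is a hypothetical caller added only for illustration,
and this is not necessarily how the issue was resolved in git.

```c
#include <stddef.h>

/* Assumption for this sketch: build as if iconv were unavailable. */
#define NO_ICONV 1

#ifndef NO_ICONV
char *reencode_string(const char *in, const char *out_encoding,
		      const char *in_encoding, int *len);
#else
/* Four parameters, matching the four-argument calls in the patch. */
#define reencode_string(in, out_encoding, in_encoding, len) NULL
#endif

/* Hypothetical caller: keeps the input when no conversion is available. */
static const char *reencode_or_keep(const char *line, int *len)
{
	const char *out = reencode_string(line, "utf-8", "latin1", len);

	return out ? out : line;
}
```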


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-12 22:11 ` Junio C Hamano
@ 2007-01-13  1:31   ` Junio C Hamano
  2007-01-13  1:43     ` Junio C Hamano
                       ` (3 more replies)
  2007-01-15 16:53   ` David Kågedal
  1 sibling, 4 replies; 18+ messages in thread
From: Junio C Hamano @ 2007-01-13  1:31 UTC (permalink / raw)
  To: David Kågedal; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

> It might make more sense to just set the log_output_encoding to
> be always UTF-8 when generating an e-mail output, in
> git-format-patch.

Actually, I do not want to be a UTF-8 imperialist, so how about
doing this?

-- >8 --
Use log output encoding in --pretty=email headers.

Private functions add_rfc2047() and pretty_print_commit() assumed
they are only emitting UTF-8.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---
diff --git a/commit.c b/commit.c
index 496d37a..9b2b842 100644
--- a/commit.c
+++ b/commit.c
@@ -464,20 +464,29 @@ static int get_one_line(const char *msg, unsigned long len)
 	return ret;
 }
 
+/* High bit set, or ISO-2022-INT */
+static int non_ascii(int ch)
+{
+	ch = (ch & 0xff);
+	return ((ch & 0x80) || (ch == 0x1b));
+}
+
 static int is_rfc2047_special(char ch)
 {
-	return ((ch & 0x80) || (ch == '=') || (ch == '?') || (ch == '_'));
+	return (non_ascii(ch) || (ch == '=') || (ch == '?') || (ch == '_'));
 }
 
-static int add_rfc2047(char *buf, const char *line, int len)
+static int add_rfc2047(char *buf, const char *line, int len,
+		       const char *encoding)
 {
 	char *bp = buf;
 	int i, needquote;
-	static const char q_utf8[] = "=?utf-8?q?";
+	char q_encoding[128];
+	const char *q_encoding_fmt = "=?%s?q?";
 
 	for (i = needquote = 0; !needquote && i < len; i++) {
-		unsigned ch = line[i];
-		if (ch & 0x80)
+		int ch = line[i];
+		if (non_ascii(ch))
 			needquote++;
 		if ((i + 1 < len) &&
 		    (ch == '=' && line[i+1] == '?'))
@@ -486,8 +495,11 @@ static int add_rfc2047(char *buf, const char *line, int len)
 	if (!needquote)
 		return sprintf(buf, "%.*s", len, line);
 
-	memcpy(bp, q_utf8, sizeof(q_utf8)-1);
-	bp += sizeof(q_utf8)-1;
+	i = snprintf(q_encoding, sizeof(q_encoding), q_encoding_fmt, encoding);
+	if (sizeof(q_encoding) < i)
+		die("Insanely long encoding name %s", encoding);
+	memcpy(bp, q_encoding, i);
+	bp += i;
 	for (i = 0; i < len; i++) {
 		unsigned ch = line[i] & 0xFF;
 		if (is_rfc2047_special(ch)) {
@@ -505,7 +517,8 @@ static int add_rfc2047(char *buf, const char *line, int len)
 }
 
 static int add_user_info(const char *what, enum cmit_fmt fmt, char *buf,
-			 const char *line, int relative_date)
+			 const char *line, int relative_date,
+			 const char *encoding)
 {
 	char *date;
 	int namelen;
@@ -533,7 +546,8 @@ static int add_user_info(const char *what, enum cmit_fmt fmt, char *buf,
 		filler = "";
 		strcpy(buf, "From: ");
 		ret = strlen(buf);
-		ret += add_rfc2047(buf + ret, line, display_name_length);
+		ret += add_rfc2047(buf + ret, line, display_name_length,
+				   encoding);
 		memcpy(buf + ret, name_tail, namelen - display_name_length);
 		ret += namelen - display_name_length;
 		buf[ret++] = '\n';
@@ -668,21 +682,18 @@ static char *replace_encoding_header(char *buf, char *encoding)
 	return buf;
 }
 
-static char *logmsg_reencode(const struct commit *commit)
+static char *logmsg_reencode(const struct commit *commit,
+			     char *output_encoding)
 {
 	char *encoding;
 	char *out;
-	char *output_encoding = (git_log_output_encoding
-				 ? git_log_output_encoding
-				 : git_commit_encoding);
+	char *utf8 = "utf-8";
 
-	if (!output_encoding)
-		output_encoding = "utf-8";
-	else if (!*output_encoding)
+	if (!*output_encoding)
 		return NULL;
 	encoding = get_header(commit, "encoding");
 	if (!encoding)
-		return NULL;
+		encoding = utf8;
 	if (!strcmp(encoding, output_encoding))
 		out = strdup(commit->buffer);
 	else
@@ -691,7 +702,8 @@ static char *logmsg_reencode(const struct commit *commit)
 	if (out)
 		out = replace_encoding_header(out, output_encoding);
 
-	free(encoding);
+	if (encoding != utf8)
+		free(encoding);
 	if (!out)
 		return NULL;
 	return out;
@@ -711,8 +723,15 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt,
 	int parents_shown = 0;
 	const char *msg = commit->buffer;
 	int plain_non_ascii = 0;
-	char *reencoded = logmsg_reencode(commit);
+	char *reencoded;
+	char *encoding;
 
+	encoding = (git_log_output_encoding
+		    ? git_log_output_encoding
+		    : git_commit_encoding);
+	if (!encoding)
+		encoding = "utf-8";
+	reencoded = logmsg_reencode(commit, encoding);
 	if (reencoded)
 		msg = reencoded;
 
@@ -738,7 +757,7 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt,
 				    i + 1 < len && msg[i+1] == '\n')
 					in_body = 1;
 			}
-			else if (ch & 0x80) {
+			else if (non_ascii(ch)) {
 				plain_non_ascii = 1;
 				break;
 			}
@@ -797,13 +816,15 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt,
 				offset += add_user_info("Author", fmt,
 							buf + offset,
 							line + 7,
-							relative_date);
+							relative_date,
+							encoding);
 			if (!memcmp(line, "committer ", 10) &&
 			    (fmt == CMIT_FMT_FULL || fmt == CMIT_FMT_FULLER))
 				offset += add_user_info("Commit", fmt,
 							buf + offset,
 							line + 10,
-							relative_date);
+							relative_date,
+							encoding);
 			continue;
 		}
 
@@ -826,7 +847,8 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt,
 			int slen = strlen(subject);
 			memcpy(buf + offset, subject, slen);
 			offset += slen;
-			offset += add_rfc2047(buf + offset, line, linelen);
+			offset += add_rfc2047(buf + offset, line, linelen,
+					      encoding);
 		}
 		else {
 			memset(buf + offset, ' ', indent);
@@ -837,11 +859,17 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt,
 		if (fmt == CMIT_FMT_ONELINE)
 			break;
 		if (subject && plain_non_ascii) {
-			static const char header[] =
-				"Content-Type: text/plain; charset=UTF-8\n"
+			int sz;
+			char header[512];
+			const char *header_fmt =
+				"Content-Type: text/plain; charset=%s\n"
 				"Content-Transfer-Encoding: 8bit\n";
-			memcpy(buf + offset, header, sizeof(header)-1);
-			offset += sizeof(header)-1;
+			sz = snprintf(header, sizeof(header), header_fmt,
+				      encoding);
+			if (sizeof(header) < sz)
+				die("Encoding name %s too long", encoding);
+			memcpy(buf + offset, header, sz);
+			offset += sz;
 		}
 		if (after_subject) {
 			int slen = strlen(after_subject);


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13  1:31   ` Junio C Hamano
@ 2007-01-13  1:43     ` Junio C Hamano
  2007-01-13 11:19       ` Johannes Schindelin
                         ` (2 more replies)
  2007-01-13 11:02     ` Alex Riesen
                       ` (2 subsequent siblings)
  3 siblings, 3 replies; 18+ messages in thread
From: Junio C Hamano @ 2007-01-13  1:43 UTC (permalink / raw)
  To: David Kågedal; +Cc: git

Side note.  The previous patch does not help if your commits were
made in a non-UTF-8 encoding with a not-too-recent git; the code assumes
that commit messages without the new "encoding" header are in UTF-8.

We might want to help transitioning people by doing something
like this on top of the previous patch.  Then when dealing with
an ancient commit (sorry, I am not saying commits older than 3
weeks are ancient -- but it will be 6 months from now ;-), you
can override that default by setting an environment variable.

---
diff --git a/commit.c b/commit.c
index 9b2b842..a1b5705 100644
--- a/commit.c
+++ b/commit.c
@@ -692,8 +692,12 @@ static char *logmsg_reencode(const struct commit *commit,
 	if (!*output_encoding)
 		return NULL;
 	encoding = get_header(commit, "encoding");
-	if (!encoding)
-		encoding = utf8;
+	if (!encoding) {
+		if (getenv("GIT_OLD_COMMIT_ENCODING"))
+			encoding = strdup(getenv("GIT_OLD_COMMIT_ENCODING"));
+		else
+			encoding = utf8;
+	}
 	if (!strcmp(encoding, output_encoding))
 		out = strdup(commit->buffer);
 	else


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13  1:31   ` Junio C Hamano
  2007-01-13  1:43     ` Junio C Hamano
@ 2007-01-13 11:02     ` Alex Riesen
  2007-01-14  0:42       ` Horst H. von Brand
  2007-01-13 22:18     ` Junio C Hamano
  2007-01-15 16:57     ` David Kågedal
  3 siblings, 1 reply; 18+ messages in thread
From: Alex Riesen @ 2007-01-13 11:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Kågedal, git

Junio C Hamano, Sat, Jan 13, 2007 02:31:35 +0100:
> +/* High bit set, or ISO-2022-INT */
> +static int non_ascii(int ch)
> +{
> +	ch = (ch & 0xff);
> +	return ((ch & 0x80) || (ch == 0x1b));
> +}
> +

"return (ch & 0x0x80) || (ch & 0xff) == 0x1b;" :)


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13  1:43     ` Junio C Hamano
@ 2007-01-13 11:19       ` Johannes Schindelin
  2007-01-13 17:57         ` Junio C Hamano
  2007-01-15 16:58         ` David Kågedal
  2007-01-13 12:23       ` Robin Rosenberg
  2007-01-15 16:54       ` David Kågedal
  2 siblings, 2 replies; 18+ messages in thread
From: Johannes Schindelin @ 2007-01-13 11:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Kågedal, git

Hi,

On Fri, 12 Jan 2007, Junio C Hamano wrote:

> Side note.  The previous patch does not help if your commit were
> made in non UTF-8 with not too recent git; the code assumes that
> commit messages without the new "encoding" headers are in UTF-8.

Why not just use is_utf8() and warn, or error out, if the message is not 
UTF-8? (I tend towards the erroring out, since this _is_ a new feature, 
and gives undesired results with "old" commits.)
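
A check along those lines could look like the sketch below.
looks_like_utf8() is a hypothetical stand-in written for this illustration,
not the is_utf8() mentioned above.  It validates structure only (lead and
continuation bytes, shortest-form sequences, no surrogates, nothing above
U+10FFFF), so pure ASCII always passes and some legacy-encoded text can
pass by coincidence.

```c
#include <stddef.h>

/* Hypothetical helper: 1 if `text` is structurally valid UTF-8, else 0. */
static int looks_like_utf8(const char *text, size_t len)
{
	const unsigned char *p = (const unsigned char *)text;
	size_t i = 0;

	while (i < len) {
		unsigned char c = p[i];
		unsigned long cp;
		size_t need, k;

		if (c < 0x80) {
			i++;		/* plain ASCII byte */
			continue;
		} else if ((c & 0xe0) == 0xc0) {
			need = 1; cp = c & 0x1f;
		} else if ((c & 0xf0) == 0xe0) {
			need = 2; cp = c & 0x0f;
		} else if ((c & 0xf8) == 0xf0) {
			need = 3; cp = c & 0x07;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}
		if (len - i <= need)
			return 0;	/* sequence is cut off at the end */
		for (k = 1; k <= need; k++) {
			if ((p[i + k] & 0xc0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (p[i + k] & 0x3f);
		}
		if ((need == 1 && cp < 0x80) ||
		    (need == 2 && cp < 0x800) ||
		    (need == 3 && cp < 0x10000))
			return 0;	/* overlong (non-shortest) form */
		if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return 0;	/* out of range or a surrogate */
		i += need + 1;
	}
	return 1;
}
```

A latin1 name such as "K\xe5gedal" fails this check because 0xe5 looks like
the start of a three-byte sequence that never arrives.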

Ciao,
Dscho


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13  1:43     ` Junio C Hamano
  2007-01-13 11:19       ` Johannes Schindelin
@ 2007-01-13 12:23       ` Robin Rosenberg
  2007-01-13 17:54         ` Junio C Hamano
  2007-01-15 16:54       ` David Kågedal
  2 siblings, 1 reply; 18+ messages in thread
From: Robin Rosenberg @ 2007-01-13 12:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Kågedal, git

On Saturday 13 January 2007 02:43, Junio C Hamano wrote:
> Side note.  The previous patch does not help if your commit were
> made in non UTF-8 with not too recent git; the code assumes that
> commit messages without the new "encoding" headers are in UTF-8.

Wasn't there a repository option, "commitencoding"?  I can't see it being
used here. I.e., we should err out if the log message is not UTF-8 and the 
option is not set (giving a message telling the user to set it).  If it is 
set we should consider the repository encoding to be the one and if that too 
is wrong (only possible to detect for some encodings), just assume iso-8859-1 
as anything could in theory be iso-8859-1 encoded.  

-- robin


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13 12:23       ` Robin Rosenberg
@ 2007-01-13 17:54         ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2007-01-13 17:54 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: David Kågedal, git

Robin Rosenberg <robin.rosenberg.lists@dewire.com> writes:

> On Saturday 13 January 2007 02:43, Junio C Hamano wrote:
>> Side note.  The previous patch does not help if your commit were
>> made in non UTF-8 with not too recent git; the code assumes that
>> commit messages without the new "encoding" headers are in UTF-8.
>
> Wasn't there a repository option, "commitencoding"?  I can't see it being
> used here.

commitencoding is about what encoding the commit newly created
in this repository right now should claim to have -- in other
words what is fed to commit-tree.

We are talking about examining existing commit that might have
come from another repository or created some time ago when the
repository configuration was set differently.


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13 11:19       ` Johannes Schindelin
@ 2007-01-13 17:57         ` Junio C Hamano
  2007-01-15 16:58         ` David Kågedal
  1 sibling, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2007-01-13 17:57 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: David Kågedal, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Why not just use is_utf8() and warn, or error out, if the message is not 
> UTF-8? (I tend towards the erroring out, since this _is_ a new feature, 
> and gives undesired results with "old" commits.)

That sounds sensible.


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13  1:31   ` Junio C Hamano
  2007-01-13  1:43     ` Junio C Hamano
  2007-01-13 11:02     ` Alex Riesen
@ 2007-01-13 22:18     ` Junio C Hamano
  2007-01-15 16:57     ` David Kågedal
  3 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2007-01-13 22:18 UTC (permalink / raw)
  To: git; +Cc: David Kågedal, Johannes Schindelin

On this topic, along with the "format-patch" fix (which
automatically makes "rebase without --merge" do the right thing
because it is "format-patch piped to am" in essence), I have
another commit to make "cherry-pick", "rebase --merge" and
"commit -c/-C" do the right thing according to the
commitencoding specified in the repository where the new commit
is being created.

The issue is that an existing commit might have come from a
different repository or from the past when this repository had
commitencoding that was different from the current value.
Running "cat-file commit" to extract the old commit log message
and feeding it directly to create the new commit would not work,
because the value of commitencoding in this repository may be
different.

This should not affect old encoding-unaware setup where people
use _only_ a legacy encoding and do not bother to specify any
commitencoding.  In such a case, both input and output are the
same and while we pretend both are UTF-8, we actually do not
trigger conversion.  To support such a configuration is one
reason I did not actually take Johannes's suggestion to error
out on an existing commit that does _not_ have an encoding header
but whose contents do not look like valid UTF-8.
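
The "same label on both sides means no conversion" behaviour described in
this paragraph can be sketched as follows.  reencode_or_copy() is a
hypothetical function written for this illustration, not git's
logmsg_reencode(); it assumes a glibc-style iconv() prototype (some
platforms declare the input pointer as const char **) and uses a rough
guess for the output buffer size.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch: return a malloc'ed copy of `in` expressed in
 * `out_enc`, or NULL on failure.
 */
static char *reencode_or_copy(const char *in, const char *out_enc,
			      const char *in_enc)
{
	size_t inleft = strlen(in);
	size_t outsz = inleft * 4 + 16;	/* generous guess for this sketch */
	size_t outleft = outsz - 1;	/* keep one byte for the NUL */
	char *inp = (char *)in;	/* glibc's iconv() takes char **, not const */
	char *out, *outp;
	iconv_t conv;

	if (!strcmp(in_enc, out_enc)) {
		/* Same label on both sides: the bytes are copied, never validated. */
		out = malloc(inleft + 1);
		if (out)
			memcpy(out, in, inleft + 1);
		return out;
	}
	conv = iconv_open(out_enc, in_enc);
	if (conv == (iconv_t)-1)
		return NULL;	/* unknown or unsupported encoding pair */
	out = outp = malloc(outsz);
	if (out &&
	    (iconv(conv, &inp, &inleft, &outp, &outleft) == (size_t)-1 ||
	     /* second call flushes any pending shift state */
	     iconv(conv, NULL, NULL, &outp, &outleft) == (size_t)-1)) {
		free(out);
		out = NULL;
	}
	if (out)
		*outp = '\0';
	iconv_close(conv);
	return out;
}
```

With identical names on both sides the latin1 bytes survive untouched even
though nothing checks that they match the label, which is the legacy setup
this paragraph wants to keep working.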

The series is currently sitting in 'next'.  If people do not see
a problem with it, I think it should go in v1.5.0.


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13 11:02     ` Alex Riesen
@ 2007-01-14  0:42       ` Horst H. von Brand
  2007-01-14 19:25         ` Alex Riesen
  0 siblings, 1 reply; 18+ messages in thread
From: Horst H. von Brand @ 2007-01-14  0:42 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Junio C Hamano, David Kågedal, git

Alex Riesen <fork0@t-online.de> wrote:
> Junio C Hamano, Sat, Jan 13, 2007 02:31:35 +0100:
> > +/* High bit set, or ISO-2022-INT */
> > +static int non_ascii(int ch)
> > +{
> > +	ch = (ch & 0xff);
> > +	return ((ch & 0x80) || (ch == 0x1b));
> > +}
> > +
> 
> "return (ch & 0x0x80) || (ch & 0xff) == 0x1b;" :)
                ^^

Is the same, if ch == 0x9b, it will match the first part anyway.

The outer parentheses can (should?) go.
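
One way to check the "is the same" claim is a brute-force comparison over
every value a signed or unsigned char can hold.  In the sketch below,
non_ascii() is the function from the patch, non_ascii_oneline() is the
quoted one-liner with "0x0x80" corrected to 0x80, and
non_ascii_unmasked() drops the 0xff mask entirely; that third variant is
only an assumed reading of the remark, added for illustration.

```c
/* The version from the patch under discussion. */
static int non_ascii(int ch)
{
	ch = (ch & 0xff);
	return ((ch & 0x80) || (ch == 0x1b));
}

/* The quoted one-liner with the "0x0x80" typo corrected. */
static int non_ascii_oneline(int ch)
{
	return (ch & 0x80) || (ch & 0xff) == 0x1b;
}

/* Assumed variant: no mask on the ESC comparison at all. */
static int non_ascii_unmasked(int ch)
{
	return (ch & 0x80) || ch == 0x1b;
}

/* Returns 1 if all three agree for every signed or unsigned char value. */
static int variants_agree(void)
{
	int ch;

	for (ch = -128; ch <= 255; ch++) {
		int a = !!non_ascii(ch);
		int b = !!non_ascii_oneline(ch);
		int c = !!non_ascii_unmasked(ch);

		if (a != b || a != c)
			return 0;
	}
	return 1;
}
```

The first two agree for any int.  The unmasked variant only differs for
values outside the char range, such as 0x11b, which cannot arise when the
argument comes from a char.
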
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-14  0:42       ` Horst H. von Brand
@ 2007-01-14 19:25         ` Alex Riesen
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Riesen @ 2007-01-14 19:25 UTC (permalink / raw)
  To: Horst H. von Brand; +Cc: Junio C Hamano, David Kågedal, git

Horst H. von Brand, Sun, Jan 14, 2007 01:42:57 +0100:
> Alex Riesen <fork0@t-online.de> wrote:
> > Junio C Hamano, Sat, Jan 13, 2007 02:31:35 +0100:
> > > +/* High bit set, or ISO-2022-INT */
> > > +static int non_ascii(int ch)
> > > +{
> > > +	ch = (ch & 0xff);
> > > +	return ((ch & 0x80) || (ch == 0x1b));
> > > +}
> > > +
> > 
> > "return (ch & 0x0x80) || (ch & 0xff) == 0x1b;" :)
>                 ^^

Oops :)

> Is the same, if ch == 0x9b, it will match the first part anyway.

So it should. 0x9b isn't ASCII.

> The outer parentheses can (should?) go.

It's "question of style", I'm afraid :)


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-12 22:11 ` Junio C Hamano
  2007-01-13  1:31   ` Junio C Hamano
@ 2007-01-15 16:53   ` David Kågedal
  1 sibling, 0 replies; 18+ messages in thread
From: David Kågedal @ 2007-01-15 16:53 UTC (permalink / raw)
  To: git

Junio C Hamano <junkio@cox.net> writes:

>> diff --git a/utf8.h b/utf8.h
>> index a07c5a8..eb64d46 100644
>> --- a/utf8.h
>> +++ b/utf8.h
>> @@ -8,7 +8,7 @@ int is_encoding_utf8(const char *name);
>>  void print_wrapped_text(const char *text, int indent, int indent2, int len);
>>  
>>  #ifndef NO_ICONV
>> -char *reencode_string(const char *in, const char *out_encoding, const char *in_encoding);
>> +char *reencode_string(const char *in, const char *out_encoding, const char *in_encoding, int *len);
>>  #else
>>  #define reencode_string(a,b,c) NULL
>>  #endif
>
> This feels fishy...

I admit that I didn't test-compile with NO_ICONV.

-- 
David Kågedal


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13  1:43     ` Junio C Hamano
  2007-01-13 11:19       ` Johannes Schindelin
  2007-01-13 12:23       ` Robin Rosenberg
@ 2007-01-15 16:54       ` David Kågedal
  2 siblings, 0 replies; 18+ messages in thread
From: David Kågedal @ 2007-01-15 16:54 UTC (permalink / raw)
  To: git

Junio C Hamano <junkio@cox.net> writes:

> Side note.  The previous patch does not help if your commit were
> made in non UTF-8 with not too recent git; the code assumes that
> commit messages without the new "encoding" headers are in UTF-8.

This was exactly the problem I was trying to solve.

> We might want to help transitioning people by doing something
> like this on top of the previous patch.  Then when dealing with
> an ancient commit (sorry, I am not saying commits older than 3
> weeks are ancient -- but it will be 6 months from now ;-), you
> can override that default by setting an environment variable.
>
> ---
> diff --git a/commit.c b/commit.c
> index 9b2b842..a1b5705 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -692,8 +692,12 @@ static char *logmsg_reencode(const struct commit *commit,
>  	if (!*output_encoding)
>  		return NULL;
>  	encoding = get_header(commit, "encoding");
> -	if (!encoding)
> -		encoding = utf8;
> +	if (!encoding) {
> +		if (getenv("GIT_OLD_COMMIT_ENCODING"))
> +			encoding = strdup(getenv("GIT_OLD_COMMIT_ENCODING"));
> +		else
> +			encoding = utf8;
> +	}
>  	if (!strcmp(encoding, output_encoding))
>  		out = strdup(commit->buffer);
>  	else
>
>
>

-- 
David Kågedal


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13  1:31   ` Junio C Hamano
                       ` (2 preceding siblings ...)
  2007-01-13 22:18     ` Junio C Hamano
@ 2007-01-15 16:57     ` David Kågedal
  3 siblings, 0 replies; 18+ messages in thread
From: David Kågedal @ 2007-01-15 16:57 UTC (permalink / raw)
  To: git

Junio C Hamano <junkio@cox.net> writes:

> -static int add_rfc2047(char *buf, const char *line, int len)
> +static int add_rfc2047(char *buf, const char *line, int len,
> +		       const char *encoding)
>  {
>  	char *bp = buf;
>  	int i, needquote;
> -	static const char q_utf8[] = "=?utf-8?q?";
> +	char q_encoding[128];
> +	const char *q_encoding_fmt = "=?%s?q?";

This goes against the old principle of being forgiving in what you
accept, and strict in what you send.  The names of the encoding in the
headers should probably be normalized before putting them in an
e-mail.  I.e. we might accept "utf-8", "utf8", "UTF-8", and "UTF8"
(this depends on iconv, I suppose), but the RFC2047 encoding should be
the one blessed by RFC4027.  But I admit that I haven't read the RFC,
and I'm writing this offline so I can't check right now.
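
The normalization suggested here could be sketched as a small lookup that
maps the spellings accepted on input to a single name emitted in mail
headers.  mail_charset_name() and its table are hypothetical and only
illustrative; the set of aliases a real implementation accepts would
depend on iconv, as noted above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical table: accepted spelling -> name written into mail headers. */
static const struct {
	const char *accepted;
	const char *canonical;
} charset_names[] = {
	{ "utf8", "UTF-8" },
	{ "utf-8", "UTF-8" },
	{ "latin1", "ISO-8859-1" },
	{ "iso8859-1", "ISO-8859-1" },
	{ "iso-8859-1", "ISO-8859-1" },
};

/* Case-insensitive comparison restricted to ASCII letters. */
static int same_ignoring_case(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	return *a == *b;	/* both strings must end together */
}

/* Be liberal in what is accepted, strict in what is sent. */
static const char *mail_charset_name(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(charset_names) / sizeof(charset_names[0]); i++)
		if (same_ignoring_case(name, charset_names[i].accepted))
			return charset_names[i].canonical;
	return name;	/* unknown names pass through unchanged */
}
```

The result would then be what gets substituted into the "=?%s?q?" prefix
and the Content-Type charset, while the user's original spelling is still
what gets handed to iconv.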

-- 
David Kågedal


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-13 11:19       ` Johannes Schindelin
  2007-01-13 17:57         ` Junio C Hamano
@ 2007-01-15 16:58         ` David Kågedal
  2007-01-16 11:41           ` Johannes Schindelin
  1 sibling, 1 reply; 18+ messages in thread
From: David Kågedal @ 2007-01-15 16:58 UTC (permalink / raw)
  To: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi,
>
> On Fri, 12 Jan 2007, Junio C Hamano wrote:
>
>> Side note.  The previous patch does not help if your commit were
>> made in non UTF-8 with not too recent git; the code assumes that
>> commit messages without the new "encoding" headers are in UTF-8.
>
> Why not just use is_utf8() and warn, or error out, if the message is not 
> UTF-8? (I tend towards the erroring out, since this _is_ a new feature, 
> and gives undesired results with "old" commits.)

What do you mean? I have an old repository with latin1 commits without
any encoding markers.  I want to be able to use format-patch from that
and at least get a From: line with something readable.  You can't just
barf and say "This isn't UTF-8, go away".

-- 
David Kågedal


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-15 16:58         ` David Kågedal
@ 2007-01-16 11:41           ` Johannes Schindelin
  2007-01-16 12:43             ` David Kågedal
  0 siblings, 1 reply; 18+ messages in thread
From: Johannes Schindelin @ 2007-01-16 11:41 UTC (permalink / raw)
  To: David Kågedal; +Cc: git


Hi,

On Mon, 15 Jan 2007, David Kågedal wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > On Fri, 12 Jan 2007, Junio C Hamano wrote:
> >
> >> Side note.  The previous patch does not help if your commit were
> >> made in non UTF-8 with not too recent git; the code assumes that
> >> commit messages without the new "encoding" headers are in UTF-8.
> >
> > Why not just use is_utf8() and warn, or error out, if the message is not 
> > UTF-8? (I tend towards the erroring out, since this _is_ a new feature, 
> > and gives undesired results with "old" commits.)
> 
> What do you mean? I have an old repository with latin1 commits without
> any encoding markers.  I want to be able to use format-patch from that
> and at least get a From: line with something readable.  You can't just
> barf and say "This isn't UTF-8, go away".

So what do you want to do instead? Just pretend that the unrecoded -- 
Latin-1 encoded -- text is UTF-8? That's plain wrong.

Ciao,
Dscho


* Re: [PATCH] Reencode committer info to utf-8 before formatting mail header
  2007-01-16 11:41           ` Johannes Schindelin
@ 2007-01-16 12:43             ` David Kågedal
  0 siblings, 0 replies; 18+ messages in thread
From: David Kågedal @ 2007-01-16 12:43 UTC (permalink / raw)
  To: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi,
>
> On Mon, 15 Jan 2007, David Kågedal wrote:
>
>> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>> 
>> > On Fri, 12 Jan 2007, Junio C Hamano wrote:
>> >
>> >> Side note.  The previous patch does not help if your commit were
>> >> made in non UTF-8 with not too recent git; the code assumes that
>> >> commit messages without the new "encoding" headers are in UTF-8.
>> >
>> > Why not just use is_utf8() and warn, or error out, if the message is not 
>> > UTF-8? (I tend towards the erroring out, since this _is_ a new feature, 
>> > and gives undesired results with "old" commits.)
>> 
>> What do you mean? I have an old repository with latin1 commits without
>> any encoding markers.  I want to be able to use format-patch from that
>> and at least get a From: line with something readable.  You can't just
>> barf and say "This isn't UTF-8, go away".
>
> So what do you want to do instead? Just pretend that the unrecoded -- 
> Latin-1 encoded -- text is UTF-8? That's plain wrong.

That is what git did before I wrote my patch, so it is obviously not
what I want.  I want to be able to tell git what encoding it is.

My patch reused the i18n.commitencoding configuration parameter for
that, but Junio is probably right in that that is only meant for new
commits, and an environment variable makes more sense.

So just barfing on a commit that isn't utf-8 isn't a complete
solution.  But maybe there was some context to your comment above that
I missed.

-- 
David Kågedal

