git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/4] add --recode-patch parameter to mailinfo and am
@ 2010-11-28 19:10 ZHANG, Le
  2010-11-28 19:10 ` [PATCH v4 1/4] mailinfo.c: convert_to_utf8(): added a target_charset parameter ZHANG, Le
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: ZHANG, Le @ 2010-11-28 19:10 UTC (permalink / raw)
  To: git; +Cc: ZHANG, Le

I have a translation project which uses UTF-8 as charset.
So the patch must be encoded in UTF-8, not just the commit msg etc.
And we use google group as our mailing list.

Recently, mails saved from gmail are encoded using local charset if all the
characters in the patch are in that specific local charset even if the orignal
mail is in UTF-8. This seems smart but it caused inconvenience for
our project.

Since we have no control on what google will do, so I took another way,
i.e. add this option to git-mailinfo. I hope this could benefit others as
well.

Changelog:

v4 -> v3:
* Added a target_charset parameter to convert_to_utf8() in mailinfo.c.
* Introduced a new config varible: i18n.patchencoding, which will be used solely
  by --recode-patch parameter.

v2 -> v3:
* Removed 'const' type qualifier from handle_patch()'s parameter
* Fixed typos in commit msg

v1 -> v2:
* Clarified how -u/--encoding is handled in git-mailinfo's documentation


ZHANG, Le (4):
  mailinfo.c: convert_to_utf8(): added a target_charset parameter
  i18n.patchencoding: introduce a new config variable
  git mailinfo: added a --recode-patch parameter
  git am: added a --recode-patch parameter

 Documentation/git-am.txt       |    4 ++++
 Documentation/git-mailinfo.txt |    6 +++++-
 builtin/mailinfo.c             |   27 +++++++++++++++++----------
 cache.h                        |    1 +
 config.c                       |    3 +++
 environment.c                  |    1 +
 git-am.sh                      |   13 +++++++++++--
 7 files changed, 42 insertions(+), 13 deletions(-)

-- 
1.7.3.2.344.gb3680.dirty

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v4 1/4] mailinfo.c: convert_to_utf8(): added a target_charset parameter
  2010-11-28 19:10 [PATCH v4 0/4] add --recode-patch parameter to mailinfo and am ZHANG, Le
@ 2010-11-28 19:10 ` ZHANG, Le
  2010-11-29 20:23   ` Junio C Hamano
  2010-11-28 19:10 ` [PATCH v4 2/4] i18n.patchencoding: introduce a new config variable ZHANG, Le
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: ZHANG, Le @ 2010-11-28 19:10 UTC (permalink / raw)
  To: git; +Cc: ZHANG, Le

This is required for my recode-patch patch which needs a seperate patch_charset variable.

Signed-off-by: ZHANG, Le <r0bertz@gentoo.org>
---
 builtin/mailinfo.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/builtin/mailinfo.c b/builtin/mailinfo.c
index 2320d98..1406d9f 100644
--- a/builtin/mailinfo.c
+++ b/builtin/mailinfo.c
@@ -492,22 +492,22 @@ static const char *guess_charset(const struct strbuf *line, const char *target_c
 	return "ISO8859-1";
 }
 
-static void convert_to_utf8(struct strbuf *line, const char *charset)
+static void convert_to_utf8(struct strbuf *line, const char *charset, const char *target_charset)
 {
 	char *out;
 
 	if (!charset || !*charset) {
-		charset = guess_charset(line, metainfo_charset);
+		charset = guess_charset(line, target_charset);
 		if (!charset)
 			return;
 	}
 
-	if (!strcasecmp(metainfo_charset, charset))
+	if (!strcasecmp(target_charset, charset))
 		return;
-	out = reencode_string(line->buf, metainfo_charset, charset);
+	out = reencode_string(line->buf, target_charset, charset);
 	if (!out)
 		die("cannot convert from %s to %s",
-		    charset, metainfo_charset);
+		    charset, target_charset);
 	strbuf_attach(line, out, strlen(out), strlen(out));
 }
 
@@ -577,7 +577,7 @@ static int decode_header_bq(struct strbuf *it)
 			break;
 		}
 		if (metainfo_charset)
-			convert_to_utf8(dec, charset_q.buf);
+			convert_to_utf8(dec, charset_q.buf, metainfo_charset);
 
 		strbuf_addbuf(&outbuf, dec);
 		strbuf_release(dec);
@@ -602,7 +602,7 @@ static void decode_header(struct strbuf *it)
 	 * This can be binary guck but there is no charset specified.
 	 */
 	if (metainfo_charset)
-		convert_to_utf8(it, "");
+		convert_to_utf8(it, "", metainfo_charset);
 }
 
 static void decode_transfer_encoding(struct strbuf *line)
@@ -796,7 +796,7 @@ static int handle_commit_msg(struct strbuf *line)
 
 	/* normalize the log message to UTF-8. */
 	if (metainfo_charset)
-		convert_to_utf8(line, charset.buf);
+		convert_to_utf8(line, charset.buf, metainfo_charset);
 
 	if (use_scissors && is_scissors_line(line)) {
 		int i;
-- 
1.7.3.2.344.gb3680.dirty

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 2/4] i18n.patchencoding: introduce a new config variable
  2010-11-28 19:10 [PATCH v4 0/4] add --recode-patch parameter to mailinfo and am ZHANG, Le
  2010-11-28 19:10 ` [PATCH v4 1/4] mailinfo.c: convert_to_utf8(): added a target_charset parameter ZHANG, Le
@ 2010-11-28 19:10 ` ZHANG, Le
  2010-11-29 20:23   ` Junio C Hamano
  2010-11-28 19:10 ` [PATCH v4 3/4] git mailinfo: added a --recode-patch parameter ZHANG, Le
  2010-11-28 19:10 ` [PATCH v4 4/4] git am: " ZHANG, Le
  3 siblings, 1 reply; 9+ messages in thread
From: ZHANG, Le @ 2010-11-28 19:10 UTC (permalink / raw)
  To: git; +Cc: ZHANG, Le

This varible will be used by git mailinfo's --recode-patch parameter only.

Signed-off-by: ZHANG, Le <r0bertz@gentoo.org>
---
 cache.h       |    1 +
 config.c      |    3 +++
 environment.c |    1 +
 3 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/cache.h b/cache.h
index 33decd9..d04aeff 100644
--- a/cache.h
+++ b/cache.h
@@ -1015,6 +1015,7 @@ extern int user_ident_explicitly_given;
 extern int user_ident_sufficiently_given(void);
 
 extern const char *git_commit_encoding;
+extern const char *git_patch_encoding;
 extern const char *git_log_output_encoding;
 extern const char *git_mailmap_file;
 
diff --git a/config.c b/config.c
index c63d683..14b0f92 100644
--- a/config.c
+++ b/config.c
@@ -674,6 +674,9 @@ static int git_default_i18n_config(const char *var, const char *value)
 	if (!strcmp(var, "i18n.commitencoding"))
 		return git_config_string(&git_commit_encoding, var, value);
 
+	if (!strcmp(var, "i18n.patchencoding"))
+		return git_config_string(&git_patch_encoding, var, value);
+
 	if (!strcmp(var, "i18n.logoutputencoding"))
 		return git_config_string(&git_log_output_encoding, var, value);
 
diff --git a/environment.c b/environment.c
index de5581f..b2870f4 100644
--- a/environment.c
+++ b/environment.c
@@ -23,6 +23,7 @@ int log_all_ref_updates = -1; /* unspecified */
 int warn_ambiguous_refs = 1;
 int repository_format_version;
 const char *git_commit_encoding;
+const char *git_patch_encoding;
 const char *git_log_output_encoding;
 int shared_repository = PERM_UMASK;
 const char *apply_default_whitespace;
-- 
1.7.3.2.344.gb3680.dirty

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 3/4] git mailinfo: added a --recode-patch parameter
  2010-11-28 19:10 [PATCH v4 0/4] add --recode-patch parameter to mailinfo and am ZHANG, Le
  2010-11-28 19:10 ` [PATCH v4 1/4] mailinfo.c: convert_to_utf8(): added a target_charset parameter ZHANG, Le
  2010-11-28 19:10 ` [PATCH v4 2/4] i18n.patchencoding: introduce a new config variable ZHANG, Le
@ 2010-11-28 19:10 ` ZHANG, Le
  2010-11-29 20:23   ` Junio C Hamano
  2010-11-28 19:10 ` [PATCH v4 4/4] git am: " ZHANG, Le
  3 siblings, 1 reply; 9+ messages in thread
From: ZHANG, Le @ 2010-11-28 19:10 UTC (permalink / raw)
  To: git; +Cc: ZHANG, Le

When this parameter is specified, patch will be converted to a target encoding before applied.
The target encoding defaults to UTF-8. It could also be specified by i18n.patchencoding.

Signed-off-by: ZHANG, Le <r0bertz@gentoo.org>
---
 Documentation/git-mailinfo.txt |    6 +++++-
 builtin/mailinfo.c             |   11 +++++++++--
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-mailinfo.txt b/Documentation/git-mailinfo.txt
index 3ea5aad..3e84817 100644
--- a/Documentation/git-mailinfo.txt
+++ b/Documentation/git-mailinfo.txt
@@ -45,7 +45,7 @@ OPTIONS
 	them.  This used to be optional but now it is the default.
 +
 Note that the patch is always used as-is without charset
-conversion, even with this flag.
+conversion, even with this flag; use '--recode-patch' for that.
 
 --encoding=<encoding>::
 	Similar to -u.  But when re-coding, the charset specified here is
@@ -54,6 +54,10 @@ conversion, even with this flag.
 -n::
 	Disable all charset re-coding of the metadata.
 
+--recode-patch::
+	Convert the patch from the e-mail to UTF-8 (or the value of the
+	configuration variable i18n.patchencoding if it is set).
+
 --scissors::
 	Remove everything in body before a scissors line.  A line that
 	mainly consists of scissors (either ">8" or "8<") and perforation
diff --git a/builtin/mailinfo.c b/builtin/mailinfo.c
index 1406d9f..96181e6 100644
--- a/builtin/mailinfo.c
+++ b/builtin/mailinfo.c
@@ -12,6 +12,8 @@ static FILE *cmitmsg, *patchfile, *fin, *fout;
 static int keep_subject;
 static int keep_non_patch_brackets_in_subject;
 static const char *metainfo_charset;
+static const char *patch_charset;
+static int recode_patch;
 static struct strbuf line = STRBUF_INIT;
 static struct strbuf name = STRBUF_INIT;
 static struct strbuf email = STRBUF_INIT;
@@ -828,8 +830,10 @@ static int handle_commit_msg(struct strbuf *line)
 	return 0;
 }
 
-static void handle_patch(const struct strbuf *line)
+static void handle_patch(struct strbuf *line)
 {
+	if (recode_patch)
+		convert_to_utf8(line, charset.buf, patch_charset);
 	fwrite(line->buf, 1, line->len, patchfile);
 	patch_lines++;
 }
@@ -1021,7 +1025,7 @@ static int git_mailinfo_config(const char *var, const char *value, void *unused)
 }
 
 static const char mailinfo_usage[] =
-	"git mailinfo [-k|-b] [-u | --encoding=<encoding> | -n] [--scissors | --no-scissors] msg patch < mail >info";
+	"git mailinfo [-k|-b] [-u | --encoding=<encoding> | -n] [--recode-patch] [--scissors | --no-scissors] msg patch < mail >info";
 
 int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 {
@@ -1034,6 +1038,7 @@ int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 
 	def_charset = (git_commit_encoding ? git_commit_encoding : "UTF-8");
 	metainfo_charset = def_charset;
+	patch_charset = git_patch_encoding ? git_patch_encoding : "UTF-8";
 
 	while (1 < argc && argv[1][0] == '-') {
 		if (!strcmp(argv[1], "-k"))
@@ -1046,6 +1051,8 @@ int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 			metainfo_charset = NULL;
 		else if (!prefixcmp(argv[1], "--encoding="))
 			metainfo_charset = argv[1] + 11;
+		else if (!prefixcmp(argv[1], "--recode-patch"))
+			recode_patch = 1;
 		else if (!strcmp(argv[1], "--scissors"))
 			use_scissors = 1;
 		else if (!strcmp(argv[1], "--no-scissors"))
-- 
1.7.3.2.344.gb3680.dirty

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 4/4] git am: added a --recode-patch parameter
  2010-11-28 19:10 [PATCH v4 0/4] add --recode-patch parameter to mailinfo and am ZHANG, Le
                   ` (2 preceding siblings ...)
  2010-11-28 19:10 ` [PATCH v4 3/4] git mailinfo: added a --recode-patch parameter ZHANG, Le
@ 2010-11-28 19:10 ` ZHANG, Le
  3 siblings, 0 replies; 9+ messages in thread
From: ZHANG, Le @ 2010-11-28 19:10 UTC (permalink / raw)
  To: git; +Cc: ZHANG, Le

When this parameter is specified, git am will pass this parameter to git mailinfo.

Signed-off-by: ZHANG, Le <r0bertz@gentoo.org>
---
 Documentation/git-am.txt |    4 ++++
 git-am.sh                |   13 +++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-am.txt b/Documentation/git-am.txt
index 51297d0..24ba5ec 100644
--- a/Documentation/git-am.txt
+++ b/Documentation/git-am.txt
@@ -73,6 +73,10 @@ default.   You can use `--no-utf8` to override this.
 	Pass `-n` flag to 'git mailinfo' (see
 	linkgit:git-mailinfo[1]).
 
+--recode-patch::
+	Pass `--recode-patch` flag to 'git mailinfo' (see
+	linkgit:git-mailinfo[1]).
+
 -3::
 --3way::
 	When the patch does not apply cleanly, fall back on
diff --git a/git-am.sh b/git-am.sh
index df09b42..8010119 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -14,6 +14,7 @@ b,binary*       (historical option -- no-op)
 q,quiet         be quiet
 s,signoff       add a Signed-off-by line to the commit message
 u,utf8          recode into utf8 (default)
+recode-patch    pass --recode-patch flag to git-mailinfo
 k,keep          pass -k flag to git-mailinfo
 keep-cr         pass --keep-cr flag to git-mailsplit for mbox format
 no-keep-cr      do not pass --keep-cr flag to git-mailsplit independent of am.keepcr
@@ -295,7 +296,7 @@ split_patches () {
 prec=4
 dotest="$GIT_DIR/rebase-apply"
 sign= utf8=t keep= keepcr= skip= interactive= resolved= rebasing= abort=
-resolvemsg= resume= scissors= no_inbody_headers=
+resolvemsg= resume= scissors= no_inbody_headers= recode_patch=
 git_apply_opt=
 committer_date_is_author_date=
 ignore_date=
@@ -321,6 +322,8 @@ do
 		utf8=t ;; # this is now default
 	--no-utf8)
 		utf8= ;;
+	--recode-patch)
+		recode_patch=t ;;
 	-k|--keep)
 		keep=t ;;
 	-c|--scissors)
@@ -464,6 +467,7 @@ else
 	echo "$threeway" >"$dotest/threeway"
 	echo "$sign" >"$dotest/sign"
 	echo "$utf8" >"$dotest/utf8"
+	echo "$recode_patch" >"$dotest/recode_patch"
 	echo "$keep" >"$dotest/keep"
 	echo "$keepcr" >"$dotest/keepcr"
 	echo "$scissors" >"$dotest/scissors"
@@ -505,6 +509,10 @@ then
 else
 	utf8=-n
 fi
+if test "$(cat "$dotest/recode_patch")" = t
+then
+	recodepatch=--recode-patch
+fi
 if test "$(cat "$dotest/keep")" = t
 then
 	keep=-k
@@ -581,7 +589,8 @@ do
 	# by the user, or the user can tell us to do so by --resolved flag.
 	case "$resume" in
 	'')
-		git mailinfo $keep $no_inbody_headers $scissors $utf8 "$dotest/msg" "$dotest/patch" \
+		git mailinfo $keep $no_inbody_headers $scissors $utf8 \
+			$recodepatch "$dotest/msg" "$dotest/patch" \
 			<"$dotest/$msgnum" >"$dotest/info" ||
 			stop_here $this
 
-- 
1.7.3.2.344.gb3680.dirty

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 1/4] mailinfo.c: convert_to_utf8(): added a target_charset parameter
  2010-11-28 19:10 ` [PATCH v4 1/4] mailinfo.c: convert_to_utf8(): added a target_charset parameter ZHANG, Le
@ 2010-11-29 20:23   ` Junio C Hamano
  2011-04-16  6:22     ` ZHANG, Le
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2010-11-29 20:23 UTC (permalink / raw)
  To: ZHANG, Le; +Cc: git

"ZHANG, Le" <r0bertz@gentoo.org> writes:

> This is required for my recode-patch patch which needs a seperate patch_charset variable.
>
> Signed-off-by: ZHANG, Le <r0bertz@gentoo.org>
> ---

Thanks.

Please describe what the new parameter means.  Is it used to convert the
contents in "line" from "charset" to "target_charset"?  Perhaps it is a
good time to rename the function to "convert_to()", its "charset"
parameter to "from_charset", and name the new parameter "to_charset", in
order to reduce confusion?

It is not just _your_ patch but will help other/later patches, so you may
want to phrase the proposed commit log message a bit differently (and with
a narrower page width, like 68-74 chars per line).  Perhaps like...

    mailinfo.c: Allow convert_to_utf8() to specify both src/dst charset

    The convert_to_utf8() function actually converts to whatever charset
    "metainfo_charset" variable contains, which is not necessarily UTF-8.
    Rename it to convert_to(), and give an extra parameter "to_charset" to
    specify what charset to re-encode to.  Also rename its "charset"
    parameter to "from_charset" to clarify which is which.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 2/4] i18n.patchencoding: introduce a new config variable
  2010-11-28 19:10 ` [PATCH v4 2/4] i18n.patchencoding: introduce a new config variable ZHANG, Le
@ 2010-11-29 20:23   ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2010-11-29 20:23 UTC (permalink / raw)
  To: ZHANG, Le; +Cc: git

"ZHANG, Le" <r0bertz@gentoo.org> writes:

> This varible will be used by git mailinfo's --recode-patch parameter only.

I have a few complaints and observations about this:

 - The patch order is screwed up in the series.  Without knowing what
   the --recode-patch option does, the reader is forced to look-ahead
   before judging this patch.

 - No documentation in the same patch as the feature is added.  I am
   guessing that the new configuration variable (and the new option we
   will see laster) means "the patchfile I got is in this encoding but the
   mail header does not mark it as such, so I am giving what encoding it
   is", but this forces the reader to look-ahead.

 - "It will be used by ... only", says who?  In an environment where
   people send patches in a local encoding but want to keep their
   repository in a different encoding, it may not be totally implausible
   to wish "format-patch" to pay attention to this variable to _produce_
   the output in that encoding, especially given the name of the variable
   that does not say anything about in which direction it is used, no?

 - Assuming that I guessed the meaning of this option and parameter right,
   I am not sure if this should be a configuration variable.  It implies
   that the majority of patches, if not all, are in this single local
   encoding that is different from the encoding used in the repository.
   Is it common?  I dunno.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 3/4] git mailinfo: added a --recode-patch parameter
  2010-11-28 19:10 ` [PATCH v4 3/4] git mailinfo: added a --recode-patch parameter ZHANG, Le
@ 2010-11-29 20:23   ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2010-11-29 20:23 UTC (permalink / raw)
  To: ZHANG, Le; +Cc: git

"ZHANG, Le" <r0bertz@gentoo.org> writes:

> When this parameter is specified, patch will be converted to a target
> encoding before applied.  The target encoding defaults to UTF-8. It
> could also be specified by i18n.patchencoding.
>
> Signed-off-by: ZHANG, Le <r0bertz@gentoo.org>
> ---

Ah, please forget (most of) what I said in my review of 2/4.  This series
is not about incoming e-mails that misidentify their content encoding, but
is about accepting patches from e-mails that are in different encoding and
correctly says which encoding they are in.  We make mailinfo convert the
payload to UTF-8 or "i18n.patchencoding".

The above is a clear indication that "i18n.patchencoding" is grossly
misnamed.  It sounds as if it is about the encoding of patches, and
because we are discussing "mailinfo", I mistook that is about the encoding
used in the incoming patches.

But it is not.

It is about encoding of blobs and paths used in the repository.  With your
series, the mailinfo codepath that deals with incoming patches happens to
become the first user to pay attention to the repository data encoding,
but that is not a good reason to name it "i18n.patchencoding".

In the longer run, it is entirely plausible that we will want to know
about different encodings used for these things:

 - encoding in which blobs and tree paths are stored in the repository.
   This is what you called "i18n.patchencoding".

 - encoding in which the log messages of the commits are stored in the
   repository.  We have "i18n.commitencoding" variable for that.

 - encoding in which incoming patches are given to "mailinfo".  This is
   read from the e-mail message.

 - encoding in which blobs are checked out to the working tree.  In a
   distant future, we may want to re-encode blobs from the repository
   encoding to this encoding while checking out, and the other way while
   checking in.

 - encoding in which readdir(3) returns paths from the working tree.  In a
   distant future, we may want to re-encode paths from the repository
   encoding to this encoding while checking out, and the other way while
   checking in.

I think as a short-term solution to your immediate issue, your series may
be good enough, but if we are to go this route, we really should make it
"the repository encoding".  And the variable that replaces your
"patch_charset" should be declared in cache.h, defined in environment.c,
parsed by git_default_config(), and be accessible to everybody.

An alternative is to use i18n.commitencoding for that purpose, but that
will cause issues with existing repositories---they have been happy
without us recoding the payload, and they will get upset if we suddenly
start doing so.  So I'd say that a new "i18n.repositoryencoding"
configuration variable and repository_encoding variable would be a better
design in the longer term.  They

 (1) default to "binary" (or "bytestring", "literal", "verbatim",
     whatever) to tell us _not_ to do any encoding conversion to the
     contents when accepting patches, generating patches, checking out,
     checking in, or running git-archive; and

 (2) when set, certain operations will pay attention to it; in your
     situation, "mailinfo" will convert incoming e-mails to its value
     (presumably "UTF-8").

Yet even longer term, we probably would need to make the blob encoding
into an attribute to the path (e.g. think about an i18n documentation
project, where different translations may need to be stored in different
encodings).

This will open another big can of worms, though.  Unless the incoming
e-mail is split into a MIME multipart that contains one-patch per file
with each part in different encoding, a single plaintext message needs to
be in a single encoding, so your mailinfo patch would most likely need to
encode to one canonical encoding (i.e. "UTF-8"), and then reencode to per
path encoding when feeding the patch.  Then there is another issue of what
to do with the tree paths that appear in "diff --git" and "rename from ..."
headers which most likely to be a single canonical encoding in the project.

But we need to start from somewhere, so let's make the first step "a
single project wide blob and tree path encoding".

Opinions?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 1/4] mailinfo.c: convert_to_utf8(): added a target_charset parameter
  2010-11-29 20:23   ` Junio C Hamano
@ 2011-04-16  6:22     ` ZHANG, Le
  0 siblings, 0 replies; 9+ messages in thread
From: ZHANG, Le @ 2011-04-16  6:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 4696 bytes --]

On 12:23 Mon 29 Nov     , Junio C Hamano wrote:
> "ZHANG, Le" <r0bertz@gentoo.org> writes:
> 
> > This is required for my recode-patch patch which needs a seperate patch_charset variable.
> >
> > Signed-off-by: ZHANG, Le <r0bertz@gentoo.org>
> > ---
> 
> Thanks.
> 
> Please describe what the new parameter means.  Is it used to convert the
> contents in "line" from "charset" to "target_charset"?  Perhaps it is a
> good time to rename the function to "convert_to()", its "charset"
> parameter to "from_charset", and name the new parameter "to_charset", in
> order to reduce confusion?
> 
> It is not just _your_ patch but will help other/later patches, so you may
> want to phrase the proposed commit log message a bit differently (and with
> a narrower page width, like 68-74 chars per line).  Perhaps like...
> 
>     mailinfo.c: Allow convert_to_utf8() to specify both src/dst charset
> 
>     The convert_to_utf8() function actually converts to whatever charset
>     "metainfo_charset" variable contains, which is not necessarily UTF-8.
>     Rename it to convert_to(), and give an extra parameter "to_charset" to
>     specify what charset to re-encode to.  Also rename its "charset"
>     parameter to "from_charset" to clarify which is which.

Thank you for reviewing! And sorry for late response.

For this problem only, I'd like to propose a slightly better approach based on
your suggestion.

Currently, in mailinfo.c, there are 3 calls to convert_to_utf8(). Two of calls
specifies from_charset, the third doesn't. For those two calls which already
specifies from_charset, there is no need to guess from_charset. So I make two
convert functions. One is guess_and_convert_to(), the other is convert_to().
The next version of repository_encoding patch[1] will use convert_to().

[1] http://thread.gmane.org/gmane.comp.version-control.git/162345/focus=162424

Here is the patch:

diff --git a/builtin/mailinfo.c b/builtin/mailinfo.c
index 71e6262..0f42ff1 100644
--- a/builtin/mailinfo.c
+++ b/builtin/mailinfo.c
@@ -472,6 +472,20 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
 	return out;
 }
 
+static void convert_to(struct strbuf *line, const char *to_charset, const char *from_charset)
+{
+	char *out;
+
+	if (!strcasecmp(to_charset, from_charset))
+		return;
+
+	out = reencode_string(line->buf, to_charset, from_charset);
+	if (!out)
+		die("cannot convert from %s to %s",
+		    from_charset, to_charset);
+	strbuf_attach(line, out, strlen(out), strlen(out));
+}
+
 /*
  * When there is no known charset, guess.
  *
@@ -483,32 +497,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
  * Otherwise, we default to assuming it is Latin1 for historical
  * reasons.
  */
-static const char *guess_charset(const struct strbuf *line, const char *target_charset)
+static void guess_and_convert_to(struct strbuf *line, const char *to_charset)
 {
-	if (is_encoding_utf8(target_charset)) {
+	if (is_encoding_utf8(to_charset)) {
 		if (is_utf8(line->buf))
-			return NULL;
-	}
-	return "ISO8859-1";
-}
-
-static void convert_to_utf8(struct strbuf *line, const char *charset)
-{
-	char *out;
-
-	if (!charset || !*charset) {
-		charset = guess_charset(line, metainfo_charset);
-		if (!charset)
 			return;
 	}
 
-	if (!strcasecmp(metainfo_charset, charset))
-		return;
-	out = reencode_string(line->buf, metainfo_charset, charset);
-	if (!out)
-		die("cannot convert from %s to %s",
-		    charset, metainfo_charset);
-	strbuf_attach(line, out, strlen(out), strlen(out));
+    convert_to(line, to_charset, "ISO8859-1");
 }
 
 static int decode_header_bq(struct strbuf *it)
@@ -577,7 +573,7 @@ static int decode_header_bq(struct strbuf *it)
 			break;
 		}
 		if (metainfo_charset)
-			convert_to_utf8(dec, charset_q.buf);
+			convert_to(dec, metainfo_charset, charset_q.buf);
 
 		strbuf_addbuf(&outbuf, dec);
 		strbuf_release(dec);
@@ -602,7 +598,7 @@ static void decode_header(struct strbuf *it)
 	 * This can be binary guck but there is no charset specified.
 	 */
 	if (metainfo_charset)
-		convert_to_utf8(it, "");
+		guess_and_convert_to(it, metainfo_charset);
 }
 
 static void decode_transfer_encoding(struct strbuf *line)
@@ -796,7 +792,7 @@ static int handle_commit_msg(struct strbuf *line)
 
 	/* normalize the log message to UTF-8. */
 	if (metainfo_charset)
-		convert_to_utf8(line, charset.buf);
+		convert_to(line, metainfo_charset, charset.buf);
 
 	if (use_scissors && is_scissors_line(line)) {
 		int i;


-- 
ZHANG, Le
http://zhangle.co
0260 C902 B8F8 6506 6586 2B90 BC51 C808 1E4E 2973

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2011-04-16  6:23 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-28 19:10 [PATCH v4 0/4] add --recode-patch parameter to mailinfo and am ZHANG, Le
2010-11-28 19:10 ` [PATCH v4 1/4] mailinfo.c: convert_to_utf8(): added a target_charset parameter ZHANG, Le
2010-11-29 20:23   ` Junio C Hamano
2011-04-16  6:22     ` ZHANG, Le
2010-11-28 19:10 ` [PATCH v4 2/4] i18n.patchencoding: introduce a new config variable ZHANG, Le
2010-11-29 20:23   ` Junio C Hamano
2010-11-28 19:10 ` [PATCH v4 3/4] git mailinfo: added a --recode-patch parameter ZHANG, Le
2010-11-29 20:23   ` Junio C Hamano
2010-11-28 19:10 ` [PATCH v4 4/4] git am: " ZHANG, Le

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).