git.vger.kernel.org archive mirror
From: Jonathan Nieder <jrnieder@gmail.com>
To: Richard MICHAEL <rmichael@leadformance.com>
Cc: git@vger.kernel.org
Subject: Re: git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
Date: Tue, 31 Aug 2010 20:08:55 -0500
Message-ID: <20100901010855.GD22968@burratino>
In-Reply-To: <4C6E86AA.2020903@leadformance.com>

Hi Richard,

Richard MICHAEL wrote:
>>Richard MICHAEL wrote:

>>> I am filtering our repo with git-filter-branch, but as the sed
>>> script runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on
>>> commits authored by our team members with accented names.
[...]
> What about special casing the bad sed (or whitelisting good sed)?
> Surely a hack, but those of us with GNU or BSD would be happy.
> Which was the troublesome sed?

Sorry for the slow response.  The problematic sed is GNU sed from
MacPorts (I think).  Even with LC_ALL=C, .* no longer matches
arbitrary sequences of bytes with that sed; you can check yours with

 $ echo 'étale' | LC_ALL=C sed 's/.*//'

Unfortunately I have not been able to reproduce it on Linux.  Debian
sed 4.2.1-7 and GNU sed v4.2.1-21-gc6d32f0 both produce the expected
result:

 $ echo 'étale' | LC_ALL=C sed 's/.*//'
 $
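
The same check can be scripted so that it reports a verdict instead of
relying on eyeballing the output.  This is only a minimal sketch:
sed_strips_line is a hypothetical helper, not something that exists in
git, and it assumes a POSIX shell with printf.

```shell
# Hypothetical helper, not part of git: succeeds when `sed 's/.*//'`
# run under LC_ALL=C leaves nothing of the input line behind.
sed_strips_line () {
	out=$(printf '%s\n' "$1" | LC_ALL=C sed 's/.*//') &&
	test -z "$out"
}

if sed_strips_line 'étale'
then
	echo "sed matches arbitrary bytes under LC_ALL=C"
else
	echo "sed does not cope with non-ASCII bytes under LC_ALL=C"
fi
```

The helper also fails if sed itself exits with an error, so both kinds
of misbehavior end up in the else branch.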

> Unfortunately, it
> doesn't "die" well either; the 'export' shell var fails but it keeps
> processing commits.

Hmm, that sounds like a bug indeed.  Here is what the start of a fix
might look like, but I stopped early because there's quite a lot of
sed usage in git that expects to be able to process arbitrary data
with short, newline-terminated lines (regardless of encoding).

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 962a93b..34a5fa3 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -68,8 +68,8 @@ eval "$functions"
 # "author" or "committer
 
 set_ident () {
-	lid="$(echo "$1" | tr "[A-Z]" "[a-z]")"
-	uid="$(echo "$1" | tr "[a-z]" "[A-Z]")"
+	lid="$(echo "$1" | tr "[A-Z]" "[a-z]")" &&
+	uid="$(echo "$1" | tr "[a-z]" "[A-Z]")" &&
 	pick_id_script='
 		/^'$lid' /{
 			s/'\''/'\''\\'\'\''/g
@@ -90,9 +90,9 @@ set_ident () {
 
 			q
 		}
-	'
+	' &&
 
-	LANG=C LC_ALL=C sed -ne "$pick_id_script"
+	LANG=C LC_ALL=C sed -ne "$pick_id_script" &&
 	# Ensure non-empty id name.
 	echo "case \"\$GIT_${uid}_NAME\" in \"\") GIT_${uid}_NAME=\"\${GIT_${uid}_EMAIL%%@*}\" && export GIT_${uid}_NAME;; esac"
 }
@@ -322,9 +322,11 @@ while read commit parents; do
 	git cat-file commit "$commit" >../commit ||
 		die "Cannot read commit $commit"
 
-	eval "$(set_ident AUTHOR <../commit)" ||
+	set_author=$(set_ident AUTHOR <../commit) &&
+	eval "$set_author" ||
 		die "setting author failed for commit $commit"
-	eval "$(set_ident COMMITTER <../commit)" ||
+	set_committer=$(set_ident COMMITTER <../commit) &&
+	eval "$set_committer" ||
 		die "setting committer failed for commit $commit"
 	eval "$filter_env" < /dev/null ||
 		die "env filter failed: $filter_env"
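
To illustrate why the patch touches both set_ident and its callers,
here is a small standalone sketch.  emit_unchained and emit_chained are
hypothetical stand-ins for set_ident, with `false` playing the part of
a failing sed; none of this is code from git.

```shell
# Without &&, the function's status is that of the final echo.
emit_unchained () {
	false
	echo 'ident=unset'
}

# With &&, the failure stops the chain and becomes the function's status.
emit_chained () {
	false &&
	echo 'ident=unset'
}

# Old call pattern: eval only sees the output of the command
# substitution, so the helper's exit status is lost.
if eval "$(emit_chained)"; then old=unnoticed; else old=caught; fi

# New call pattern with an unchained helper: the trailing echo still
# masks the failure.
if script=$(emit_unchained) && eval "$script"; then half=unnoticed; else half=caught; fi

# New call pattern with a chained helper: the failure reaches the caller.
if script=$(emit_chained) && eval "$script"; then full=unnoticed; else full=caught; fi

echo "old: $old, half: $half, full: $full"
```

Only the last combination reports the failure, which is why the plain
assignment (whose status is that of the command substitution) and the
&& chain inside the function go together.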
