git.vger.kernel.org archive mirror
* git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
@ 2010-08-20 13:20 Richard MICHAEL
  2010-08-20 13:32 ` Jonathan Nieder
  0 siblings, 1 reply; 4+ messages in thread
From: Richard MICHAEL @ 2010-08-20 13:20 UTC
  To: git

  Hello all,

I am filtering our repo with git-filter-branch, but as the sed script 
runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on commits authored 
by our team members with accented names. Why are the locales "C"?  For 
compatibility with older sed?  I've changed it to LANG=en_US.UTF-8; will my 
change cause other git breakage?


git-filter-branch

95: LANG=C LC_ALL=C sed -ne "$pick_id_script"

95: LANG=en_US.UTF-8 sed -ne "$pick_id_script"
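
One point worth checking with the edited line: it sets only LANG, so an LC_ALL or LC_CTYPE inherited from the environment still takes precedence over it. A minimal sketch of that POSIX precedence follows; `effective_ctype` is a hypothetical helper written for illustration, not something git provides, and the fallback to "C" when nothing is set is an assumption about typical systems.

```shell
# effective_ctype is a hypothetical helper, not part of git.
# It prints the locale name that governs character handling,
# following the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
effective_ctype () {
	if test -n "${LC_ALL:-}"
	then
		printf '%s\n' "$LC_ALL"
	elif test -n "${LC_CTYPE:-}"
	then
		printf '%s\n' "$LC_CTYPE"
	elif test -n "${LANG:-}"
	then
		printf '%s\n' "$LANG"
	else
		# Assumption: with nothing set, most systems fall back to "C".
		printf 'C\n'
	fi
}
```

For example, with LC_ALL=fr_FR.UTF-8 exported in the calling shell, a command prefixed only with LANG=en_US.UTF-8 would still run under fr_FR.UTF-8, which is why the original line sets both variables.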


Regards,
Richard


* Re: git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
  2010-08-20 13:20 git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names Richard MICHAEL
@ 2010-08-20 13:32 ` Jonathan Nieder
  2010-08-20 13:44   ` Richard MICHAEL
  0 siblings, 1 reply; 4+ messages in thread
From: Jonathan Nieder @ 2010-08-20 13:32 UTC
  To: Richard MICHAEL; +Cc: git

Richard MICHAEL wrote:

> I am filtering our repo with git-filter-branch, but as the sed
> script runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on
> commits authored by our team members with accented names.

Yep, someone else recently sent a report about such a sed version,
too.  It is breaking our fragile minds; we ought to find some way to
deal with it, but we haven't yet.

Jonathan


* Re: git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
  2010-08-20 13:32 ` Jonathan Nieder
@ 2010-08-20 13:44   ` Richard MICHAEL
  2010-09-01  1:08     ` Jonathan Nieder
  0 siblings, 1 reply; 4+ messages in thread
From: Richard MICHAEL @ 2010-08-20 13:44 UTC
  To: Jonathan Nieder; +Cc: git

  On 10-08-20 3:32 PM, Jonathan Nieder wrote:
> Richard MICHAEL wrote:
>
>> I am filtering our repo with git-filter-branch, but as the sed
>> script runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on
>> commits authored by our team members with accented names.
> Yep, someone else recently sent a report about such a sed version,
> too.  It is breaking our fragile minds; we ought to find some way to
> deal with it, but we haven't yet.
>
> Jonathan

Jonathan, thanks for your reply.

What about special-casing the bad sed (or whitelisting good sed)?  
Surely a hack, but those of us with GNU or BSD sed would be happy.  
Which was the troublesome sed?

That, as opposed to figuring out the problem, reading about Unicode, and 
re-cloning and re-filtering 5,000 commits. :-)  Unfortunately, it 
doesn't "die" well either; the 'export' shell var fails but it keeps 
processing commits.  (If I hadn't investigated and changed the LANG, 
would I have lost those commits?)

Regards,
Richard


* Re: git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
  2010-08-20 13:44   ` Richard MICHAEL
@ 2010-09-01  1:08     ` Jonathan Nieder
  0 siblings, 0 replies; 4+ messages in thread
From: Jonathan Nieder @ 2010-09-01  1:08 UTC
  To: Richard MICHAEL; +Cc: git

Hi Richard,

Richard MICHAEL wrote:
>>Richard MICHAEL wrote:

>>> I am filtering our repo with git-filter-branch, but as the sed
>>> script runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on
>>> commits authored by our team members with accented names.
[...]
> What about special-casing the bad sed (or whitelisting good sed)?
> Surely a hack, but those of us with GNU or BSD sed would be happy.
> Which was the troublesome sed?

Sorry for the slow response.  The problematic sed is GNU sed from
MacPorts (I think).  Even with LC_ALL=C, .* no longer matches
arbitrary sequences of bytes with such sed: you can check yours with

 $ echo 'étale' | LC_ALL=C sed 's/.*//'

Unfortunately I have not been able to reproduce it on Linux.  Debian
sed 4.2.1-7 and GNU sed v4.2.1-21-gc6d32f0 both produce the expected
result:

 $ echo 'étale' | LC_ALL=C sed 's/.*//'
 $
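
The same check can be wrapped in a function so a script can test a sed before relying on it. This is only a sketch: `check_sed_bytes` is a hypothetical name, not something git ships, and it assumes that an empty result means `.*` consumed every byte of the line. The input is written with printf octal escapes so the UTF-8 bytes for 'é' do not depend on the terminal's encoding.

```shell
# check_sed_bytes [SED_COMMAND]
# Hypothetical helper: returns 0 if `.*` matched the whole line under
# LC_ALL=C (so the substitution left nothing), non-zero otherwise.
check_sed_bytes () {
	sed_cmd=${1:-sed}
	# \303\251 are the two UTF-8 bytes of 'é', followed by "tale".
	out=$(printf '\303\251tale\n' | LC_ALL=C "$sed_cmd" 's/.*//') || return 1
	# A sed that handles arbitrary bytes deletes the entire line.
	test -z "$out"
}
```

A caller could then run `check_sed_bytes || echo "sed cannot handle arbitrary bytes" >&2` before filtering, or pass the path of an alternative sed as the first argument.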

> Unfortunately, it
> doesn't "die" well either; the 'export' shell var fails but it keeps
> processing commits.

Hmm, that sounds like a bug indeed.  Here is what the start of a fix
might look like, but I stopped early because there's quite a lot of
sed usage in git that expects to be able to process arbitrary data
with short, newline-terminated lines (regardless of encoding).

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 962a93b..34a5fa3 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -68,8 +68,8 @@ eval "$functions"
 # "author" or "committer
 
 set_ident () {
-	lid="$(echo "$1" | tr "[A-Z]" "[a-z]")"
-	uid="$(echo "$1" | tr "[a-z]" "[A-Z]")"
+	lid="$(echo "$1" | tr "[A-Z]" "[a-z]")" &&
+	uid="$(echo "$1" | tr "[a-z]" "[A-Z]")" &&
 	pick_id_script='
 		/^'$lid' /{
 			s/'\''/'\''\\'\'\''/g
@@ -90,9 +90,9 @@ set_ident () {
 
 			q
 		}
-	'
+	' &&
 
-	LANG=C LC_ALL=C sed -ne "$pick_id_script"
+	LANG=C LC_ALL=C sed -ne "$pick_id_script" &&
 	# Ensure non-empty id name.
 	echo "case \"\$GIT_${uid}_NAME\" in \"\") GIT_${uid}_NAME=\"\${GIT_${uid}_EMAIL%%@*}\" && export GIT_${uid}_NAME;; esac"
 }
@@ -322,9 +322,11 @@ while read commit parents; do
 	git cat-file commit "$commit" >../commit ||
 		die "Cannot read commit $commit"
 
-	eval "$(set_ident AUTHOR <../commit)" ||
+	set_author=$(set_ident AUTHOR <../commit) &&
+	eval "$set_author" ||
 		die "setting author failed for commit $commit"
-	eval "$(set_ident COMMITTER <../commit)" ||
+	set_committer=$(set_ident COMMITTER <../commit) &&
+	eval "$set_committer" ||
 		die "setting committer failed for commit $commit"
 	eval "$filter_env" < /dev/null ||
 		die "env filter failed: $filter_env"
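
The last hunk splits each eval in two because the exit status of a command substitution is lost when its output is passed straight to eval, while a plain assignment keeps it. A small sketch of the difference, with the hypothetical `failing_producer` standing in for a set_ident whose sed has failed:

```shell
# Hypothetical stand-in for set_ident: prints nothing and fails.
failing_producer () {
	return 1
}

# Old pattern: eval receives an empty string and succeeds, so the
# failure of the command substitution is never seen.
old_pattern () {
	eval "$(failing_producer)" || return 1
}

# New pattern: the assignment's status is that of the command
# substitution, so the failure short-circuits the && chain.
new_pattern () {
	script=$(failing_producer) &&
	eval "$script" || return 1
}
```

Here old_pattern returns 0 even though the producer failed, which matches the reported behaviour of continuing to process commits, and new_pattern returns 1.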

