* git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
From: Richard MICHAEL @ 2010-08-20 13:20 UTC
To: git
Hello all,
I am filtering our repo with git-filter-branch, but as the sed script
runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on commits authored
by our team members with accented names. Why are the locales "C"? For
compatibility with older sed? I've changed it to LANG=en_US.UTF-8; will my
change cause other git breakage?
git-filter-branch
95: LANG=C LC_ALL=C sed -ne "$pick_id_script"
95: LANG=en_US.UTF-8 sed -ne "$pick_id_script"
Regards,
Richard
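A note on the line-95 edit quoted above: POSIX gives LC_ALL precedence over the per-category variables (such as LC_CTYPE) and over LANG, so swapping "LANG=C LC_ALL=C" for "LANG=en_US.UTF-8" only changes sed's locale when the calling environment leaves LC_ALL and LC_CTYPE unset. The sketch below models that lookup order for the character-type category. effective_ctype is a hypothetical helper written for illustration, not part of git, and the final fallback to "C" is an assumption (POSIX leaves the default implementation-defined).

```shell
# effective_ctype LC_ALL LC_CTYPE LANG
# Hypothetical helper: prints the locale name that would govern character
# classification, following the POSIX lookup order
# LC_ALL, then LC_CTYPE, then LANG. Empty arguments stand for "unset".
effective_ctype () {
	# Assumption: fall back to "C" when all three are empty.
	echo "${1:-${2:-${3:-C}}}"
}

# The edited line, run where LC_ALL is unset: LANG decides.
effective_ctype "" "" "en_US.UTF-8"

# The same edited line, run where the caller exported LC_ALL=C:
# LANG is ignored and sed would still see the C locale.
effective_ctype "C" "" "en_US.UTF-8"
```

If that reading is right, a per-command override would need to set LC_ALL itself (or clear it) rather than LANG alone.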
* Re: git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
From: Jonathan Nieder @ 2010-08-20 13:32 UTC
To: Richard MICHAEL; +Cc: git
Richard MICHAEL wrote:
> I am filtering our repo with git-filter-branch, but as the sed
> script runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on
> commits authored by our team members with accented names.
Yep, someone else recently sent a report about such a sed version,
too. It is breaking our fragile minds; we ought to find some way to
deal with it, but we haven't yet.
Jonathan
* Re: git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
From: Richard MICHAEL @ 2010-08-20 13:44 UTC
To: Jonathan Nieder; +Cc: git
On 10-08-20 3:32 PM, Jonathan Nieder wrote:
> Richard MICHAEL wrote:
>
>> I am filtering our repo with git-filter-branch, but as the sed
>> script runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on
>> commits authored by our team members with accented names.
> Yep, someone else recently sent a report about such a sed version,
> too. It is breaking our fragile minds; we ought to find some way to
> deal with it, but we haven't yet.
>
> Jonathan
Jonathan, thanks for your reply.
What about special-casing the bad sed (or whitelisting good sed)?
Surely a hack, but those of us with GNU or BSD sed would be happy.
Which was the troublesome sed?
That, as opposed to figuring out the problem, reading about Unicode, and
re-cloning and re-filtering 5,000 commits. :-) Unfortunately, it
doesn't "die" well either; the 'export' shell var fails but it keeps
processing commits. (If I hadn't investigated and changed the LANG,
would I have lost those commits?)
Regards,
Richard
* Re: git-filter-branch : LANG / LC_ALL = C breaks UTF-8 author names
From: Jonathan Nieder @ 2010-09-01 1:08 UTC
To: Richard MICHAEL; +Cc: git
Hi Richard,
Richard MICHAEL wrote:
>>Richard MICHAEL wrote:
>>> I am filtering our repo with git-filter-branch, but as the sed
>>> script runs with LANG=C LC_ALL=C (7 bit US ASCII), it dies on
>>> commits authored by our team members with accented names.
[...]
> What about special casing the bad sed (or whitelisting good sed)?
> Surely a hack, but those of us with GNU or BSD sed would be happy.
> Which was the troublesome sed?
Sorry for the slow response. The problematic sed is GNU sed from
MacPorts (I think). Even with LC_ALL=C, .* no longer matches
arbitrary sequences of bytes with such sed: you can check yours with
$ echo 'étale' | LC_ALL=C sed 's/.*//'
Unfortunately I have not been able to reproduce it on Linux. Debian
sed 4.2.1-7 and GNU sed v4.2.1-21-gc6d32f0 both produce the expected
result:
$ echo 'étale' | LC_ALL=C sed 's/.*//'
$
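The one-line check above can be wrapped into a small script, sketched here under a few assumptions: check_sed_bytes is a hypothetical name, the octal escapes \303\251 are the UTF-8 encoding of "é" (used so the test does not depend on the script file's own encoding), and an empty result is taken to mean that .* consumed every byte of the line.

```shell
# check_sed_bytes: hypothetical helper, not part of git.
# Returns 0 if sed's ".*" swallows a line containing non-ASCII bytes
# under LC_ALL=C, 1 if bytes are left over, 2 if sed itself failed.
check_sed_bytes () {
	# Feed the UTF-8 bytes of "étale" to sed in the C locale.
	out=$(printf '\303\251tale\n' | LC_ALL=C sed 's/.*//') || return 2
	# An empty result means the whole line was matched and removed.
	test -z "$out"
}

if check_sed_bytes
then
	echo "sed OK: .* matched every byte"
else
	echo "sed suspect: .* did not match the whole line"
fi
```

A check like this could in principle gate the special-casing idea raised earlier in the thread, though that is only a suggestion, not something the patch below does.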
> Unfortunately, it
> doesn't "die" well either; the 'export' shell var fails but it keeps
> processing commits.
Hmm, that sounds like a bug indeed. Here is what the start of a fix
might look like, but I stopped early because there's quite a lot of
sed usage in git that expects to be able to process arbitrary data
with short, newline-terminated lines (regardless of encoding).
diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 962a93b..34a5fa3 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -68,8 +68,8 @@ eval "$functions"
# "author" or "committer
set_ident () {
- lid="$(echo "$1" | tr "[A-Z]" "[a-z]")"
- uid="$(echo "$1" | tr "[a-z]" "[A-Z]")"
+ lid="$(echo "$1" | tr "[A-Z]" "[a-z]")" &&
+ uid="$(echo "$1" | tr "[a-z]" "[A-Z]")" &&
pick_id_script='
/^'$lid' /{
s/'\''/'\''\\'\'\''/g
@@ -90,9 +90,9 @@ set_ident () {
q
}
- '
+ ' &&
- LANG=C LC_ALL=C sed -ne "$pick_id_script"
+ LANG=C LC_ALL=C sed -ne "$pick_id_script" &&
# Ensure non-empty id name.
echo "case \"\$GIT_${uid}_NAME\" in \"\") GIT_${uid}_NAME=\"\${GIT_${uid}_EMAIL%%@*}\" && export GIT_${uid}_NAME;; esac"
}
@@ -322,9 +322,11 @@ while read commit parents; do
git cat-file commit "$commit" >../commit ||
die "Cannot read commit $commit"
- eval "$(set_ident AUTHOR <../commit)" ||
+ set_author=$(set_ident AUTHOR <../commit) &&
+ eval "$set_author" ||
die "setting author failed for commit $commit"
- eval "$(set_ident COMMITTER <../commit)" ||
+ set_committer=$(set_ident COMMITTER <../commit) &&
+ eval "$set_committer" ||
die "setting committer failed for commit $commit"
eval "$filter_env" < /dev/null ||
die "env filter failed: $filter_env"
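The last hunk rests on a shell detail that is easy to miss: eval "$(cmd)" returns the status of the evaluated text, not of cmd, whereas a plain assignment var=$(cmd) returns the status of cmd. A minimal sketch of the difference follows; emit_script is a hypothetical stand-in for set_ident that prints part of a script and then fails, and old_pattern / new_pattern are illustrative names only.

```shell
# Hypothetical stand-in for set_ident: prints some script text, then
# reports failure (as a pipeline ending in a broken sed might).
emit_script () {
	echo 'name=partial'
	return 1
}

# Pre-patch shape: the status seen by || is that of the evaluated text
# ("name=partial" succeeds), so the failure of emit_script is lost.
old_pattern () {
	eval "$(emit_script)" || return 1
}

# Patched shape: the assignment carries the exit status of the command
# substitution, so && stops before eval and || sees the failure.
new_pattern () {
	script=$(emit_script) &&
	eval "$script" || return 1
}

if old_pattern; then echo "old: failure missed"; else echo "old: failure caught"; fi
if new_pattern; then echo "new: failure missed"; else echo "new: failure caught"; fi
```

With the pre-patch shape, the die branch in the loop is only reached when the evaluated text itself fails, which would be consistent with the report that processing carried on after the error.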