* [ANNOUNCE] GIT 1.5.5-rc2
@ 2008-03-28 6:30 Junio C Hamano
2008-03-28 18:13 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2008-03-28 6:30 UTC (permalink / raw)
To: git
GIT 1.5.5-rc2 was tagged tonight, and it is available from the usual
places.
http://www.kernel.org/pub/software/scm/git/
git-1.5.5.rc2.tar.{gz,bz2} (tarball)
git-htmldocs-1.5.5.rc2.tar.{gz,bz2} (preformatted docs)
git-manpages-1.5.5.rc2.tar.{gz,bz2} (preformatted docs)
testing/git-*-1.5.5.rc2-1.$arch.rpm (RPM)
The draft release notes as of tonight follow.
GIT v1.5.5 Release Notes
========================
Updates since v1.5.4
--------------------
(subsystems)
* Comes with git-gui 0.9.3.
(portability)
* We shouldn't ask for BSD group ownership semantics by setting g+s bit
on directories on older BSD systems that refuse chmod() by non root
users. BSD semantics is the default there anyway.
* A bunch of portability improvement patches coming from an effort to port
to Solaris have been applied.
(performance)
* On platforms with a suboptimal qsort(3) implementation, there
is an option to use a more reasonable substitute we ship with
our software.
* New configuration variable "pack.packsizelimit" can be used
in place of command line option --max-pack-size.
* "git fetch" over the native git protocol used to make a
connection to find out the set of current remote refs and
another to actually download the pack data. We now use only
one connection for these tasks.
* "git commit" does not run lstat(2) more than necessary
anymore.
(usability, bells and whistles)
* The bash completion script (in contrib) is aware of more commands and
options.
* You can be warned when core.autocrlf conversion is applied in
such a way that results in an irreversible conversion.
* A catch-all "color.ui" configuration variable can be used to
enable coloring of all color-capable commands, instead of
individual ones such as "color.status" and "color.branch".
* The commands refused to take absolute pathnames where they
require pathnames relative to the work tree or the current
subdirectory. They now can take absolute pathnames in such a
case as long as the pathnames do not refer outside of the
work tree. E.g. "git add $(pwd)/foo" now works.
* Error messages used to be sent to stderr, only to get hidden,
when $PAGER was in use. They now are sent to stdout along
with the command output to be shown in the $PAGER.
* A pattern "foo/" in .gitignore file now matches a directory
"foo". Pattern "foo" also matches as before.
* bash completion's prompt helper function can talk about
operation in-progress (e.g. merge, rebase, etc.).
* Configuration variables "url.<usethis>.insteadof = <otherurl>" can be
used to tell "git-fetch" and "git-push" to use a different URL than what
is given on the command line.
* "git add -i" behaves better even before you make an initial commit.
* "git am" refused to run from a subdirectory without a good reason.
* After "git apply --whitespace=fix" fixes whitespace errors in a patch,
a line before the fix can appear as a context or preimage line in a
later patch, causing the patch not to apply. The command now knows to
see through whitespace fixes done to context lines to successfully
apply such a patch series.
* "git branch" (and "git checkout -b") to branch from a local branch can
optionally set "branch.<name>.merge" to mark the new branch to build on
the other local branch, when "branch.autosetupmerge" is set to
"always", or when passing the command line option "--track" (this option
was ignored when branching from local branches). By default, this does
not happen when branching from a local branch.
* "git checkout" to switch to a branch that has "branch.<name>.merge" set
(i.e. marked to build on another branch) reports how much the branch
and the other branch diverged.
* When "git checkout" has to update a lot of paths, it used to be silent
for 4 seconds before it showed any progress report. It is now a bit
more impatient and starts showing progress report early.
* "git commit" learned a new hook "prepare-commit-msg" that can
inspect what is going to be committed and prepare the commit
log message template to be edited.
* "git cvsimport" can now take more than one -M option.
* "git describe" learned to limit the tags to be used for
naming with --match option.
* "git describe --contains" now barfs when the named commit
cannot be described.
* "git describe --exact-match" describes only commits that are tagged.
* "git describe --long" describes a tagged commit as $tag-0-$sha1,
instead of just showing the exact tagname.
* "git describe" warns when using a tag whose name and path contradict
each other.
* "git diff" learned "--relative" option to limit and output paths
relative to the current directory when working in a subdirectory.
* "git diff" learned "--dirstat" option to show birds-eye-summary of
changes more concisely than "--diffstat".
* "git format-patch" learned --cover-letter option to generate a cover
letter template.
* "git gc" learned --quiet option.
* "git gc" now automatically prunes unreachable objects that are two
weeks old or older.
* "git gc --auto" can be disabled more easily by just setting gc.auto
to zero. It also tolerates more packfiles by default.
* "git grep" now knows "--name-only" is a synonym for the "-l" option.
* "git help <alias>" now reports "'git <alias>' is alias to <what>",
instead of saying "No manual entry for git-<alias>".
* "git help" can use different backends to show manual pages and this can
be configured using "man.viewer" configuration.
* "gitk" does not restore window position from $HOME/.gitk anymore (it
still restores the size).
* "git log --grep=<what>" learned "--fixed-strings" option to look for
<what> without treating it as a regular expression.
* "git gui" learned automatic spell checking.
* "git push <somewhere> HEAD" and "git push <somewhere> +HEAD" work as
expected; they push the current branch (and only the current branch).
In addition, HEAD can be written as the value of "remote.<there>.push"
configuration variable.
* When the configuration variable "pack.threads" is set to 0, "git
repack" auto detects the number of CPUs and uses that many threads.
* "git send-email" learned to prompt for passwords
interactively.
* "git send-email" learned an easier way to suppress CC
recipients.
* "git stash" learned "pop" command, that applies the latest stash and
removes it from the stash, and "drop" command to discard the named
stash entry.
* "git submodule" learned a new subcommand "summary" to show the
symmetric difference between the HEAD version and the work tree version
of the submodule commits.
* Various "git cvsimport", "git cvsexportcommit", "git svn" and
"git p4" improvements.
(internal)
* Duplicated code between git-help and git-instaweb that
launches user's preferred browser has been refactored.
* It is now easier to write test scripts that record known
breakages.
* "git checkout" is rewritten in C.
* "git remote" is rewritten in C.
* Two conflict hunks that are separated by a very short span of common
lines are now coalesced into one larger hunk, to make the result easier
to read.
* Run-command API's use of file descriptors is documented more clearly
and is more consistent now.
* diff output can be sent to FILE * that is different from stdout. This
will help reimplementing more things in C.
Fixes since v1.5.4
------------------
All of the fixes in v1.5.4 maintenance series are included in
this release, unless otherwise noted.
* "git-http-push" did not allow deletion of remote ref with the usual
"push <remote> :<branch>" syntax.
* "git-rebase --abort" did not go back to the right location if
"git-reset" was run during the "git-rebase" session.
* "git imap-send" without setting imap.host did not error out but
segfaulted.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [ANNOUNCE] GIT 1.5.5-rc2
2008-03-28 6:30 [ANNOUNCE] GIT 1.5.5-rc2 Junio C Hamano
@ 2008-03-28 18:13 ` Jeff King
2008-03-28 21:05 ` Junio C Hamano
0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2008-03-28 18:13 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Thu, Mar 27, 2008 at 11:30:27PM -0700, Junio C Hamano wrote:
> GIT 1.5.5-rc2 was tagged tonight, and it is available from the usual
> places.
I never got a response to my patches to fix encoding issues in
"send-email --compose". It _is_ a bugfix, but I don't know if it is
1.5.5-worthy. Forgotten (and I should resubmit now), or should I wait
until after the release?
-Peff
* Re: [ANNOUNCE] GIT 1.5.5-rc2
2008-03-28 18:13 ` Jeff King
@ 2008-03-28 21:05 ` Junio C Hamano
2008-03-28 21:23 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2008-03-28 21:05 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> On Thu, Mar 27, 2008 at 11:30:27PM -0700, Junio C Hamano wrote:
>
>> GIT 1.5.5-rc2 was tagged tonight, and it is available from the usual
>> places.
>
> I never got a response to my patches to fix encoding issues in
> "send-email --compose". It _is_ a bugfix, but I don't know if it is
> 1.5.5-worthy. Forgotten (and I should resubmit now), or should I wait
> until after the release?
I was getting the impression that it was still in "ah, but this is
better", "you are right, but how about doing this", stage and was hoping
that "ok, based on the discussion here is the final one" will come soon.
* Re: [ANNOUNCE] GIT 1.5.5-rc2
2008-03-28 21:05 ` Junio C Hamano
@ 2008-03-28 21:23 ` Jeff King
2008-03-28 21:27 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2008-03-28 21:23 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Fri, Mar 28, 2008 at 02:05:09PM -0700, Junio C Hamano wrote:
> > I never got a response to my patches to fix encoding issues in
> > "send-email --compose". It _is_ a bugfix, but I don't know if it is
> > 1.5.5-worthy. Forgotten (and I should resubmit now), or should I wait
> > until after the release?
>
> I was getting the impression that it was still in "ah, but this is
> better", "you are right, but how about doing this", stage and was hoping
> that "ok, based on the discussion here is the final one" will come soon.
Ah. I think the current status is "here are two patches that work, but
will always assume utf-8 encoding" which I think is not unreasonable as
a bugfix. A nice feature would be to allow setting the encoding, but:
- I think that is a feature, and one that nobody has expressed an
interest in. In fact, the little rfc2047 encoding already being done
in send-email blindly assumed utf-8.
- if that feature is going to be done, I think some thought would have
to go into how encodings should be specified so we don't end up with
too many (or too few) places where you have to specify the encoding
(IOW, I think that send-email.compose-encoding is probably too
specific, but reusing an existing encoding variable is not quite
right).
- As a non-user of send-email, a rare user of encodings at all, and an
always user of utf-8, I'm not too interested in such a feature, nor
would I feel comfortable speaking on behalf of users who _would_ use
such a feature.
So I think it is worth taking the patches for 1.5.5 as they are a strict
improvement over the old behavior (the only reason they would not be is
if somebody used a mail pipeline that assumed non-MIME stuff was in some
random charset instead of us-ascii, and by setting utf-8 we are now
confusing that pipeline; it seems unlikely to me, and it violates the
standards).
On top of which I think they are a fine stepping stone to selecting the
encoding (IOW, if I _were_ going to do such a feature, I think I would
still submit those two patches as-is, and add configurability as a third
patch anyway).
My only real concern is that they break something unrelated, as we are
late in the -rc cycle.
-Peff
* Re: [ANNOUNCE] GIT 1.5.5-rc2
2008-03-28 21:23 ` Jeff King
@ 2008-03-28 21:27 ` Jeff King
2008-03-28 21:28 ` [PATCH 1/2] send-email: specify content-type of --compose body Jeff King
2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
0 siblings, 2 replies; 33+ messages in thread
From: Jeff King @ 2008-03-28 21:27 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Fri, Mar 28, 2008 at 05:23:40PM -0400, Jeff King wrote:
> > I was getting the impression that it was still in "ah, but this is
> > better", "you are right, but how about doing this", stage and was hoping
> > that "ok, based on the discussion here is the final one" will come soon.
>
> Ah. I think the current status is "here are two patches that work, but
> will always assume utf-8 encoding" which I think is not unreasonable as
> a bugfix. A nice feature would be to allow setting the encoding, but:
After sending my lengthy response, I realized that you might have
actually been talking about the minor fixups that happened, and missed
the "correct" 2/2 which was sent later in the thread. So following this
are the most up-to-date versions. 1/2 is the same as the original, 2/2
does Teemu's "let user set subject in editor" suggestion, plus the
follow-on syntax fixup.
-Peff
* [PATCH 1/2] send-email: specify content-type of --compose body
2008-03-28 21:27 ` Jeff King
@ 2008-03-28 21:28 ` Jeff King
2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
1 sibling, 0 replies; 33+ messages in thread
From: Jeff King @ 2008-03-28 21:28 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
If the compose message contains non-ascii characters, then
we assume it is in utf-8 and include the appropriate MIME
headers. If the user has already included a MIME-Version
header, then we assume they know what they are doing and
don't add any headers.
Signed-off-by: Jeff King <peff@peff.net>
---
git-send-email.perl | 24 ++++++++++++++++++++++++
t/t9001-send-email.sh | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 68 insertions(+), 0 deletions(-)
diff --git a/git-send-email.perl b/git-send-email.perl
index 9e568bf..7c4f06c 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -520,8 +520,22 @@ EOT
open(C,"<",$compose_filename)
or die "Failed to open $compose_filename : " . $!;
+ my $need_8bit_cte = file_has_nonascii($compose_filename);
+ my $in_body = 0;
while(<C>) {
next if m/^GIT: /;
+ if (!$in_body && /^\n$/) {
+ $in_body = 1;
+ if ($need_8bit_cte) {
+ print C2 "MIME-Version: 1.0\n",
+ "Content-Type: text/plain; ",
+ "charset=utf-8\n",
+ "Content-Transfer-Encoding: 8bit\n";
+ }
+ }
+ if (!$in_body && /^MIME-Version:/i) {
+ $need_8bit_cte = 0;
+ }
print C2 $_;
}
close(C);
@@ -958,3 +972,13 @@ sub validate_patch {
}
return undef;
}
+
+sub file_has_nonascii {
+ my $fn = shift;
+ open(my $fh, '<', $fn)
+ or die "unable to open $fn: $!\n";
+ while (my $line = <$fh>) {
+ return 1 if $line =~ /[^[:ascii:]]/;
+ }
+ return 0;
+}
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index c0973b4..e222c49 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -166,4 +166,48 @@ test_expect_success 'second message is patch' '
grep "Subject:.*Second" msgtxt2
'
+test_expect_success '--compose adds MIME for utf8 body' '
+ clean_fake_sendmail &&
+ (echo "#!/bin/sh" &&
+ echo "echo utf8 body: àéìöú >>\$1"
+ ) >fake-editor-utf8 &&
+ chmod +x fake-editor-utf8 &&
+ echo y | \
+ GIT_EDITOR=$(pwd)/fake-editor-utf8 \
+ GIT_SEND_EMAIL_NOTTY=1 \
+ git send-email \
+ --compose --subject foo \
+ --from="Example <nobody@example.com>" \
+ --to=nobody@example.com \
+ --smtp-server="$(pwd)/fake.sendmail" \
+ $patches &&
+ grep "^utf8 body" msgtxt1 &&
+ grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
+'
+
+test_expect_success '--compose respects user mime type' '
+ clean_fake_sendmail &&
+ (echo "#!/bin/sh" &&
+ echo "(echo MIME-Version: 1.0"
+ echo " echo Content-Type: text/plain\\; charset=iso-8859-1"
+ echo " echo Content-Transfer-Encoding: 8bit"
+ echo " echo Subject: foo"
+ echo " echo "
+ echo " echo utf8 body: àéìöú) >\$1"
+ ) >fake-editor-utf8-mime &&
+ chmod +x fake-editor-utf8-mime &&
+ echo y | \
+ GIT_EDITOR=$(pwd)/fake-editor-utf8-mime \
+ GIT_SEND_EMAIL_NOTTY=1 \
+ git send-email \
+ --compose --subject foo \
+ --from="Example <nobody@example.com>" \
+ --to=nobody@example.com \
+ --smtp-server="$(pwd)/fake.sendmail" \
+ $patches &&
+ grep "^utf8 body" msgtxt1 &&
+ grep "^Content-Type: text/plain; charset=iso-8859-1" msgtxt1 &&
+ ! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
+'
+
test_done
--
1.5.5.rc1.141.g50ecd.dirty
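The header handling described in this patch can be tried in isolation. The following is only a sketch: the helper name add_mime_headers and its string-in/string-out interface are assumptions made for illustration, whereas the patch itself reads the compose file line by line and prints to a second filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of git-send-email.perl: takes the raw
# compose text as a byte string and returns it with MIME headers added
# when the text contains non-ASCII bytes.
sub add_mime_headers {
    my ($text) = @_;
    # Any non-ASCII byte anywhere in the message triggers the headers.
    my $need_8bit_cte = $text =~ /[^[:ascii:]]/ ? 1 : 0;
    my $in_body = 0;
    my $out = '';
    for my $line (split /^/, $text) {
        next if $line =~ /^GIT: /;      # drop editor instructions
        if (!$in_body && $line =~ /^\n$/) {
            # The first blank line ends the header block.
            $in_body = 1;
            if ($need_8bit_cte) {
                $out .= "MIME-Version: 1.0\n"
                      . "Content-Type: text/plain; charset=utf-8\n"
                      . "Content-Transfer-Encoding: 8bit\n";
            }
        }
        if (!$in_body && $line =~ /^MIME-Version:/i) {
            # The user supplied MIME headers; assume they are correct.
            $need_8bit_cte = 0;
        }
        $out .= $line;
    }
    return $out;
}

# Body bytes are written with \x escapes so the script is encoding-neutral.
print add_mime_headers("GIT: edit below\nSubject: foo\n\nutf8 body: \xc3\xa0\n");
```

Note that the headers are emitted just before the blank line that separates headers from body, so a user-supplied MIME-Version header is always seen first.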
* [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-28 21:27 ` Jeff King
2008-03-28 21:28 ` [PATCH 1/2] send-email: specify content-type of --compose body Jeff King
@ 2008-03-28 21:29 ` Jeff King
2008-03-29 7:19 ` Robin Rosenberg
2008-05-21 19:39 ` Junio C Hamano
1 sibling, 2 replies; 33+ messages in thread
From: Jeff King @ 2008-03-28 21:29 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
We always use 'utf-8' as the encoding, since we currently
have no way of getting the information from the user.
This also refactors the quoting of recipient names, since
both processes can share the rfc2047 quoting code.
Signed-off-by: Jeff King <peff@peff.net>
---
git-send-email.perl | 19 +++++++++++++++++--
t/t9001-send-email.sh | 15 +++++++++++++++
2 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/git-send-email.perl b/git-send-email.perl
index 7c4f06c..d0f9d4a 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -536,6 +536,14 @@ EOT
if (!$in_body && /^MIME-Version:/i) {
$need_8bit_cte = 0;
}
+ if (!$in_body && /^Subject: ?(.*)/i) {
+ my $subject = $1;
+ $_ = "Subject: " .
+ ($subject =~ /[^[:ascii:]]/ ?
+ quote_rfc2047($subject) :
+ $subject) .
+ "\n";
+ }
print C2 $_;
}
close(C);
@@ -626,6 +634,14 @@ sub unquote_rfc2047 {
return wantarray ? ($_, $encoding) : $_;
}
+sub quote_rfc2047 {
+ local $_ = shift;
+ my $encoding = shift || 'utf-8';
+ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
+ s/(.*)/=\?$encoding\?q\?$1\?=/;
+ return $_;
+}
+
# use the simplest quoting being able to handle the recipient
sub sanitize_address
{
@@ -643,8 +659,7 @@ sub sanitize_address
# rfc2047 is needed if a non-ascii char is included
if ($recipient_name =~ /[^[:ascii:]]/) {
- $recipient_name =~ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
- $recipient_name =~ s/(.*)/=\?utf-8\?q\?$1\?=/;
+ $recipient_name = quote_rfc2047($recipient_name);
}
# double quotes are needed if specials or CTLs are included
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index e222c49..a4bcd28 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -210,4 +210,19 @@ test_expect_success '--compose respects user mime type' '
! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
'
+test_expect_success '--compose adds MIME for utf8 subject' '
+ clean_fake_sendmail &&
+ echo y | \
+ GIT_EDITOR=$(pwd)/fake-editor \
+ GIT_SEND_EMAIL_NOTTY=1 \
+ git send-email \
+ --compose --subject utf8-sübjëct \
+ --from="Example <nobody@example.com>" \
+ --to=nobody@example.com \
+ --smtp-server="$(pwd)/fake.sendmail" \
+ $patches &&
+ grep "^fake edit" msgtxt1 &&
+ grep "^Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=" msgtxt1
+'
+
test_done
--
1.5.5.rc1.141.g50ecd.dirty
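For reference, the quoting step can be exercised on its own. This is a sketch that restates the quote_rfc2047 logic from the patch in a standalone script; the wrapper is written with plain string interpolation instead of the second substitution, which should be equivalent for single-line input. The sample subject is the one used in the new test, written as raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Standalone restatement of quote_rfc2047 for experimentation.
sub quote_rfc2047 {
    my ($text, $encoding) = @_;
    $encoding ||= 'utf-8';
    # Replace every byte outside the safe set with =XX (uppercase hex).
    $text =~ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
    # Wrap the result as an RFC 2047 "Q" encoded-word.
    return "=?$encoding?q?$text?=";
}

# "utf8-sübjëct" as UTF-8 bytes: ü is C3 BC, ë is C3 AB.
my $subject = "utf8-s\xc3\xbcbj\xc3\xabct";
print quote_rfc2047($subject), "\n";
# prints: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=
```

RFC 2047 limits a single encoded-word to 75 characters; neither this sketch nor the patch splits long subjects into several encoded-words.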
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
@ 2008-03-29 7:19 ` Robin Rosenberg
2008-03-29 7:22 ` Jeff King
2008-05-21 19:39 ` Junio C Hamano
1 sibling, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 7:19 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Friday 28 March 2008 22.29.01, Jeff King wrote:
> We always use 'utf-8' as the encoding, since we currently
> have no way of getting the information from the user.
Don't set encoding to UTF-8 unless it actually looks like UTF-8.
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 7:19 ` Robin Rosenberg
@ 2008-03-29 7:22 ` Jeff King
2008-03-29 8:41 ` Robin Rosenberg
2008-03-29 8:44 ` Robin Rosenberg
0 siblings, 2 replies; 33+ messages in thread
From: Jeff King @ 2008-03-29 7:22 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:
> On Friday 28 March 2008 22.29.01, Jeff King wrote:
> > We always use 'utf-8' as the encoding, since we currently
> > have no way of getting the information from the user.
>
> Don't set encoding to UTF-8 unless it actually looks like UTF-8.
OK. Do you have an example function that guesses with high probability
whether a string is utf-8? If there are non-ascii characters but we
_don't_ guess utf-8, what should we do?
-Peff
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 7:22 ` Jeff King
@ 2008-03-29 8:41 ` Robin Rosenberg
2008-03-29 8:49 ` Jeff King
2008-03-30 23:47 ` Junio C Hamano
2008-03-29 8:44 ` Robin Rosenberg
1 sibling, 2 replies; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 8:41 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Saturday 29 March 2008 08.22.03, Jeff King wrote:
> On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:
> > On Friday 28 March 2008 22.29.01, Jeff King wrote:
> > > We always use 'utf-8' as the encoding, since we currently
> > > have no way of getting the information from the user.
> >
> > Don't set encoding to UTF-8 unless it actually looks like UTF-8.
>
> OK. Do you have an example function that guesses with high probability
> whether a string is utf-8? If there are non-ascii characters but we
> _don't_ guess utf-8, what should we do?
Any test for valid UTF-8 will do that with a very high probability. The
perl UTF-8 "api" is a mess. I couldn't find such a routine!?. Calling
decode/encode and see if you get the original string works, but that is too
clumsy, IMHO.
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 8:41 ` Robin Rosenberg
@ 2008-03-29 8:49 ` Jeff King
2008-03-29 9:02 ` Robin Rosenberg
2008-03-30 23:47 ` Junio C Hamano
1 sibling, 1 reply; 33+ messages in thread
From: Jeff King @ 2008-03-29 8:49 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 09:41:53AM +0100, Robin Rosenberg wrote:
> > OK. Do you have an example function that guesses with high probability
> > whether a string is utf-8? If there are non-ascii characters but we
> > _don't_ guess utf-8, what should we do?
>
> Any test for valid UTF-8 will do that with a very high probability. The
> perl UTF-8 "api" is a mess. I couldn't find such a routine!?. Calling
> decode/encode and see if you get the original string works, but that is too
> clumsy, IMHO.
Does that work? I would think you would have to compare the normalized
versions of each string, since decode(encode($x)) is not, AIUI,
guaranteed to produce $x.
-Peff
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 8:49 ` Jeff King
@ 2008-03-29 9:02 ` Robin Rosenberg
2008-03-29 9:11 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 9:02 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Saturday 29 March 2008 09.49.48, Jeff King wrote:
> On Sat, Mar 29, 2008 at 09:41:53AM +0100, Robin Rosenberg wrote:
> > > OK. Do you have an example function that guesses with high probability
> > > whether a string is utf-8? If there are non-ascii characters but we
> > > _don't_ guess utf-8, what should we do?
> >
> > Any test for valid UTF-8 will do that with a very high probability. The
> > perl UTF-8 "api" is a mess. I couldn't find such a routine!?. Calling
> > decode/encode and see if you get the original string works, but that is
> > too clumsy, IMHO.
>
> Does that work? I would think you would have to compare the normalized
> versions of each string, since decode(encode($x)) is not, AIUI,
> guaranteed to produce $x.
I don't claim to understand it either. Hopefully some perl guru will step
forward and just explain how to do this in perl.
My proof is entirely empirical. What happens is that attempting to decode a
non-UTF-8 string will put a unicode surrogate pair into the (now Unicode)
string and encoding will just encode the surrogate pair into UTF-8 and not
the original. As a result, the encode(decode($x)) eq $x *only* if $x is a
valid UTF-8 octet sequence. Why would you not get the original back if
you start with valid UTF-8?
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 9:02 ` Robin Rosenberg
@ 2008-03-29 9:11 ` Jeff King
2008-03-29 9:39 ` Robin Rosenberg
0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2008-03-29 9:11 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 10:02:43AM +0100, Robin Rosenberg wrote:
> My proof is entirely empirical. What happens is that attempting to decode a
> non-UTF-8 string will put a unicode surrogate pair into the (now Unicode)
> string and encoding will just encode the surrogate pair into UTF-8 and not
> the original. As a result, the encode(decode($x)) eq $x *only* if $x is a
> valid UTF-8 octet sequence. Why would you not get the original back if
> you start with valid UTF-8?
Because some UTF-8 sequences have multiple representations, and that
information may be lost by whatever intermediate form is the result of
decode($x). In practice, I don't know if this happens or not.
Though it looks like there is an Encode::is_utf8 function (which is also
utf8::is_utf8, but only in perl >= 5.8.1). So we could use that, but it
needs the utf-8 flag turned on for the string. Maybe utf8::valid is
actually what we want.
But there is still a larger question. You have some binary bytes that
will go in a subject header. There are non-ascii bytes. There are
non-utf8 sequences. What do you do?
-Peff
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 9:11 ` Jeff King
@ 2008-03-29 9:39 ` Robin Rosenberg
2008-03-29 9:43 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 9:39 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Saturday 29 March 2008 10.11.45, Jeff King wrote:
> On Sat, Mar 29, 2008 at 10:02:43AM +0100, Robin Rosenberg wrote:
> > My proof is entirely empirical. What happens is that attempting to decode
> > a non-UTF-8 string will put a unicode surrogate pair into the (now
> > Unicode) string and encoding will just encode the surrogate pair into
> > UTF-8 and not the original. As a result, the encode(decode($x)) eq $x
> > *only* if $x is a valid UTF-8 octet sequence. Why would you not get the
> > original back if you start with valid UTF-8?
>
> Because some UTF-8 sequences have multiple representations, and that
Care to give an example?
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 9:39 ` Robin Rosenberg
@ 2008-03-29 9:43 ` Jeff King
2008-03-29 12:54 ` Robin Rosenberg
0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2008-03-29 9:43 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 10:39:43AM +0100, Robin Rosenberg wrote:
> > Because some UTF-8 sequences have multiple representations, and that
>
> Care to give an example?
There were several given in the "OS X normalize your UTF-8 filenames"
thread a while back. They generally boil down to "a<UMLAUT MODIFIER>"
versus "<A WITH UMLAUT>" both of which are valid UTF-8.
-Peff
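The two representations mentioned here can be compared directly. The following is a sketch using the core Encode and Unicode::Normalize modules; the variable names are chosen for illustration only. It shows that the precomposed and decomposed spellings are different byte sequences that compare equal only after normalization, and that a plain decode/encode round trip does not normalize.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# "o with diaeresis" in two valid UTF-8 spellings, as raw bytes.
my $precomposed = "\xc3\xb6";    # U+00F6
my $decomposed  = "o\xcc\x88";   # U+006F followed by U+0308

# Decode the bytes into Perl character strings.
my $pre_chars = decode('UTF-8', $precomposed);
my $dec_chars = decode('UTF-8', $decomposed);

print "same bytes: ",
    ($precomposed eq $decomposed ? "yes" : "no"), "\n";             # no
print "equal after NFC: ",
    (NFC($pre_chars) eq NFC($dec_chars) ? "yes" : "no"), "\n";      # yes
print "round trip keeps bytes: ",
    (encode('UTF-8', $dec_chars) eq $decomposed ? "yes" : "no"), "\n";  # yes
```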
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 9:43 ` Jeff King
@ 2008-03-29 12:54 ` Robin Rosenberg
2008-03-29 21:45 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 12:54 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Saturday 29 March 2008 10.43.22, Jeff King wrote:
> On Sat, Mar 29, 2008 at 10:39:43AM +0100, Robin Rosenberg wrote:
> > > Because some UTF-8 sequences have multiple representations, and that
> >
> > Care to give an example?
>
> There were several given in the "OS X normalize your UTF-8 filenames"
> thread a while back. They generally boil down to "a<UMLAUT MODIFIER>"
> versus "<A WITH UMLAUT>" both of which are valid UTF-8.
That is what /OS X/ does with file names. It changes one unicode code point
to a sequence of other "equivalent" code points. I'm pretty sure perl does
not do that.
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 12:54 ` Robin Rosenberg
@ 2008-03-29 21:45 ` Jeff King
2008-03-30 3:40 ` Sam Vilain
0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2008-03-29 21:45 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 01:54:47PM +0100, Robin Rosenberg wrote:
> > There were several given in the "OS X normalize your UTF-8 filenames"
> > thread a while back. They generally boil down to "a<UMLAUT MODIFIER>"
> > versus "<A WITH UMLAUT>" both of which are valid UTF-8.
>
> That is what /OS X/ does with file names. It changes one unicode code point
> to a sequence of other "equivalent" code points. I'm pretty sure perl does
> not do that.
My point is that we don't _know_ what is happening in between the decode
and encode. Does that intermediate form have the information required to
convert back to the exact same bytes as the original form? I don't think
you've provided any evidence that it does or does not.
But here is some evidence that it does work:
$ cat test.pl
sub is_valid {
  my $orig = shift;
  my $test = $orig;
  utf8::decode($test);
  utf8::encode($test);
  return $orig eq $test ? "yes" : "no";
}
print "utf-8: ", is_valid("\xc3\xb6"), "\n";
print "latin-1: ", is_valid("\xc3"), "\n";
print "utf-8 w/ combining: ", is_valid("o\xcc\x88"), "\n";
$ perl test.pl
utf-8: yes
latin-1: no
utf-8 w/ combining: yes
But it still feels a little wrong to test by converting. There must be
some way to ask "is this valid utf-8" (there are several candidate
functions, but I don't think either of us quite knows the right way to
invoke them).
-Peff
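One way to ask "is this valid utf-8" without comparing round-tripped strings is a strict decode that fails on malformed input. This is a sketch, assuming the core Encode module is acceptable as a dependency; the helper name looks_like_utf8 is made up for illustration and is not something proposed in this thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if the byte string decodes as strict
# UTF-8, 0 otherwise. Pure ASCII input also counts as valid.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return eval {
        # FB_CROAK makes decode die on malformed input;
        # LEAVE_SRC keeps decode from modifying $bytes.
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    } ? 1 : 0;
}

print "utf-8:    ", looks_like_utf8("G\xc3\xbct"), "\n";  # UTF-8 bytes for "Güt"
print "latin-1:  ", looks_like_utf8("G\xfct"), "\n";      # Latin-1 bytes for "Güt"
print "us-ascii: ", looks_like_utf8("Gut"), "\n";
```

A check like this only says whether the bytes are well-formed UTF-8; it cannot tell which legacy charset a non-UTF-8 string is in, which is the open question in this thread.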
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 21:45 ` Jeff King
@ 2008-03-30 3:40 ` Sam Vilain
2008-03-30 4:39 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Sam Vilain @ 2008-03-30 3:40 UTC (permalink / raw)
To: Jeff King; +Cc: Robin Rosenberg, Junio C Hamano, git
Jeff King wrote:
> My point is that we don't _know_ what is happening in between the decode
> and encode. Does that intermediate form have the information required to
> convert back to the exact same bytes as the original form?
No, it doesn't. If you want that, save a copy of the string (it's a
lazy copy anyway).
The module that will let you see into the strings to see what is
happening is Devel::Peek. Using that, you will see the state of the
UTF8 scalar flag. For example:
maia:~$ perl -Mutf8 -MDevel::Peek -le 'Dump "Güt"'
SV = PV(0x605d08) at 0x62f230
REFCNT = 1
FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x60cd20 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
CUR = 4
LEN = 8
By default, all strings that are read from files will NOT have this flag
set, unless the filehandle that was read from was marked as being utf-8
(in order to preserve C semantics by default);
maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'Dump $_'
SV = PV(0x6052d0) at 0x604220
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x62f0e0 "G\303\274t"\0
CUR = 4
LEN = 80
maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'BEGIN { binmode STDIN,
":utf8" } Dump $_'
SV = PV(0x6052d0) at 0x604220
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x62f100 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
CUR = 4
LEN = 80
> But it still feels a little wrong to test by converting.
utf8::decode works in-place; it is essentially checking that the string
is valid, and if so, marking it as UTF8.
  my ($encoding);
  if (utf8::decode($string)) {
      if (utf8::is_utf8($string)) {
          $encoding = "UTF-8";
      }
      else {
          $encoding = "US-ASCII";
      }
  }
  else {
      $encoding = "ISO8859-1";
  }
For US-ASCII, you'll only have to encode if the string contains special
characters (those below \037) or any "=" characters.
You could try using langinfo CODESET instead of hardcoding ISO8859-1
like that, but at least on my system it can return bizarre values like
ANSI_X3.4-1968, which may in some contexts be a "correct" description of
the encoding, but is unlikely to be understood by mail clients.
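To make the decode-then-check-the-flag idea concrete, here is a minimal
self-contained sketch. The helper name guess_encoding is made up for
this example, and the ISO8859-1 fallback is only an assumption carried
over from the snippet above; it works on a copy because utf8::decode
modifies its argument in place.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of git-send-email): classify a byte string.
sub guess_encoding {
    my $copy = shift;    # copy, since utf8::decode works in place
    if (utf8::decode($copy)) {
        # Well-formed UTF-8. The UTF8 flag is only turned on when the
        # string actually contained multi-byte sequences.
        return utf8::is_utf8($copy) ? 'UTF-8' : 'US-ASCII';
    }
    # Not well-formed UTF-8; the fallback here is an assumption.
    return 'ISO8859-1';
}

print guess_encoding("Gut"), "\n";           # US-ASCII
print guess_encoding("G\xc3\xbct"), "\n";    # UTF-8
print guess_encoding("G\xfct"), "\n";        # ISO8859-1
```

The last case is a lone 0xfc byte followed by "t", which cannot be
well-formed UTF-8, so it falls through to the fallback.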
> There must be
> some way to ask "is this valid utf-8" (there are several candidate
> functions, but I don't think either of us quite knows the right way to
> invoke them).
I think you were just reading the note on the utf8::valid function a
little too strongly.
You could use this block;
if ($string =~ m/[\200-\377]/) {
Encode::_utf8_on($string);
if (!utf8::valid($string)) {
Encode::_utf8_off($string);
}
}
Anyway, I guess all this rubbish is why people use CPAN modules, so that
they don't have to continually rediscover every single protocol quirk
and reinvent the wheel.
i.e., it would be much, much simpler to use MIME::Entity->build for all
of this, and remove the duplication of code.
Sam.
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-30 3:40 ` Sam Vilain
@ 2008-03-30 4:39 ` Jeff King
0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2008-03-30 4:39 UTC (permalink / raw)
To: Sam Vilain; +Cc: Robin Rosenberg, Junio C Hamano, git
On Sun, Mar 30, 2008 at 04:40:53PM +1300, Sam Vilain wrote:
> > My point is that we don't _know_ what is happening in between the decode
> > and encode. Does that intermediate form have the information required to
> > convert back to the exact same bytes as the original form?
> No, it doesn't. If you want that, save a copy of the string (it's a
> lazy copy anyway).
We do already save a copy. The question is about Robin's proposal to
use decode/encode to check for validity. It was not clear to me that
such a process would always return the exact same bytes even for valid
utf-8.
But it seems like you are saying below that it is really just the
"decode" part of that which is interesting:
> utf8::decode works in-place; it is essentially checking that the string
> is valid, and if so, marking it as UTF8.
>
>   my ($encoding);
>   if (utf8::decode($string)) {
>       if (utf8::is_utf8($string)) {
>           $encoding = "UTF-8";
>       }
>       else {
>           $encoding = "US-ASCII";
>       }
>   }
>   else {
>       $encoding = "ISO8859-1";
>   }
OK, that was the magic invocation we were looking for. Thank you.
> For US-ASCII, you'll only have to encode if the string contains special
> characters (those below \037) or any "=" characters.
Ah, yeah. I think our tests are lacking in that they check for only
[^[:ascii:]].
> Anyway, I guess all this rubbish is why people use CPAN modules, so that
> they don't have to continually rediscover every single protocol quirk
> and reinvent the wheel.
>
> ie, it would be much, much simpler to use MIME::Entity->build for all of
> this, and remove the duplication of code.
Yes, I actually made a similar comment recently. send-email could
probably be shorter, easier to read, and have fewer bugs if it used one
of the many mail-handling CPAN modules. I think it would pretty much
involve scrapping the current send-email and starting fresh, though.
Thanks for your input.
-Peff
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 8:41 ` Robin Rosenberg
2008-03-29 8:49 ` Jeff King
@ 2008-03-30 23:47 ` Junio C Hamano
1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2008-03-30 23:47 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Jeff King, git
Robin Rosenberg <robin.rosenberg.lists@dewire.com> writes:
> Den Saturday 29 March 2008 08.22.03 skrev Jeff King:
>> On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:
>> > Den Friday 28 March 2008 22.29.01 skrev Jeff King:
>> > > We always use 'utf-8' as the encoding, since we currently
>> > > have no way of getting the information from the user.
>> >
>> > Don't set encoding to UTF-8 unless it actually looks like UTF-8.
>>
>> OK. Do you have an example function that guesses with high probability
>> whether a string is utf-8? If there are non-ascii characters but we
>> _don't_ guess utf-8, what should we do?
>
> Any test for valid UTF-8 will do that with a very high probability. The
> perl UTF-8 "api" is a mess. I couldn't find such a routine!?. Calling
> decode/encode and see if you get the original string works, but that is too
> clumsy, IMHO.
The sequence to decode followed by encode will test if you have a valid
one and if it is canonically encoded, which is testing too much. You only
want to check if it is valid, and do not care about normalization.
I see this in perluniintro.pod:
=item *
How Do I Detect Data That's Not Valid In a Particular Encoding?
Use the C<Encode> package to try converting it.
For example,
use Encode 'decode_utf8';
if (decode_utf8($string_of_bytes_that_I_think_is_utf8)) {
# valid
} else {
# invalid
}
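A stricter variant of the same check can be sketched as follows. As far
as I can tell, decode_utf8 called without a CHECK argument substitutes
replacement characters instead of failing, so the test above would come
out true for most input; passing Encode::FB_CROAK and trapping the
exception avoids that. The helper name is_valid_utf8 is made up for
this example. Nothing is re-encoded, so normalization does not enter
into it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my $bytes = shift;    # copy: decode() with a CHECK value may modify its source
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

print "utf-8: ",               (is_valid_utf8("\xc3\xb6")    ? "yes" : "no"), "\n";
print "latin-1: ",             (is_valid_utf8("\xe9t\xe9")   ? "yes" : "no"), "\n";
print "utf-8 w/ combining: ",  (is_valid_utf8("o\xcc\x88")   ? "yes" : "no"), "\n";
```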
For commit log messages, we traditionally use a similar idea: guess by
checking whether it looks like a UTF-8 encoded string, and otherwise
assume Latin-1 (and I think we still do if the user does not tell us).
If this issue is only about the --compose part of send-email, perhaps you
can interactively ask instead of "otherwise assume Latin-1"?
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 7:22 ` Jeff King
2008-03-29 8:41 ` Robin Rosenberg
@ 2008-03-29 8:44 ` Robin Rosenberg
2008-03-29 8:53 ` Jeff King
1 sibling, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 8:44 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
Den Saturday 29 March 2008 08.22.03 skrev Jeff King:
> On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:
> > Den Friday 28 March 2008 22.29.01 skrev Jeff King:
> > > We always use 'utf-8' as the encoding, since we currently
> > > have no way of getting the information from the user.
> >
> > Don't set encoding to UTF-8 unless it actually looks like UTF-8.
>
> OK. Do you have an example function that guesses with high probability
> whether a string is utf-8? If there are non-ascii characters but we
> _don't_ guess utf-8, what should we do?
I guess the best bet is to assume the locale. Btw, is the encoding header
from the commit (when present) completely lost? (not that it can be trusted
anyway).
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 8:44 ` Robin Rosenberg
@ 2008-03-29 8:53 ` Jeff King
2008-03-29 9:38 ` Robin Rosenberg
0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2008-03-29 8:53 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 09:44:55AM +0100, Robin Rosenberg wrote:
> > OK. Do you have an example function that guesses with high probability
> > whether a string is utf-8? If there are non-ascii characters but we
> > _don't_ guess utf-8, what should we do?
>
> I guess the best bet is to assume the locale. Btw, is the encoding header
> from the commit (when present) completely lost? (not that it can be trusted
> anyway).
What do you mean by "assume the locale"? Is there a portable way to say
"this is the encoding of the locale the user has chosen?" On my system I
set LANG=en_US, and behind-the-scenes magic chooses utf-8 versus
iso8859-1.
And there is no encoding header for the commit; the point of this patch
is to handle the "cover letter" message created by "send-email
--compose" (we should already be doing the right thing for the patch
emails, since the commit encoding is output by format-patch in a
content-type header before we even get to send-email).
-Peff
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 8:53 ` Jeff King
@ 2008-03-29 9:38 ` Robin Rosenberg
2008-03-29 9:52 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 9:38 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
Den Saturday 29 March 2008 09.53.04 skrev Jeff King:
> On Sat, Mar 29, 2008 at 09:44:55AM +0100, Robin Rosenberg wrote:
> > > OK. Do you have an example function that guesses with high probability
> > > whether a string is utf-8? If there are non-ascii characters but we
> > > _don't_ guess utf-8, what should we do?
> >
> > I guess the best bet is to assume the locale. Btw, is the encoding header
> > from the commit (when present) completely lost? (not that it can be
> > trusted anyway).
>
> What do you mean by "assume the locale"? Is there a portable way to say
> "this is the encoding of the locale the user has chosen?" On my system I
> set LANG=en_US, and behind-the-scenes magic chooses utf-8 versus
> iso8859-1.
The environment variables are only part of the story. There is a langinfo API
for this. See I18N::Langinfo(3pm) that knows about those and something else.
# perl -e 'require I18N::Langinfo; I18N::Langinfo->import(qw(langinfo
CODESET)); $codeset = langinfo(CODESET()); print "My codeset=".
$codeset."\n";'
My codeset=ISO-8859-15
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 9:38 ` Robin Rosenberg
@ 2008-03-29 9:52 ` Jeff King
2008-03-29 12:54 ` Robin Rosenberg
2008-03-30 2:12 ` Sam Vilain
0 siblings, 2 replies; 33+ messages in thread
From: Jeff King @ 2008-03-29 9:52 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 10:38:48AM +0100, Robin Rosenberg wrote:
> The environment variables are only part of the story. There is a langinfo API
> for this. See I18N::Langinfo(3pm) that knows about those and something else.
>
> # perl -e 'require I18N::Langinfo; I18N::Langinfo->import(qw(langinfo
> CODESET)); $codeset = langinfo(CODESET()); print "My codeset=".
> $codeset."\n";'
> My codeset=ISO-8859-15
Hmm, neat. So perhaps it would make sense to just use this value instead
of utf-8, and not worry about examining the actual text (since any such
examination is at best a guess, anyway)?
Any idea what version of perl started shipping I18N::Langinfo? I
couldn't see anything useful from grepping the Changes files.
-Peff
PS Your 'require' is more simply written as 'use I18N::Langinfo
qw(langinfo CODESET)', or perhaps even simpler:
perl -MI18N::Langinfo=langinfo,CODESET ...
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 9:52 ` Jeff King
@ 2008-03-29 12:54 ` Robin Rosenberg
2008-03-29 21:18 ` Jeff King
2008-03-30 2:12 ` Sam Vilain
1 sibling, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 12:54 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
Den Saturday 29 March 2008 10.52.38 skrev Jeff King:
> On Sat, Mar 29, 2008 at 10:38:48AM +0100, Robin Rosenberg wrote:
> > The environment variables are only part of the story. There is a langinfo
> > API for this. See I18N::Langinfo(3pm) that knows about those and
> > something else.
> >
> > # perl -e 'require I18N::Langinfo; I18N::Langinfo->import(qw(langinfo
> > CODESET)); $codeset = langinfo(CODESET()); print "My codeset=".
> > $codeset."\n";'
> > My codeset=ISO-8859-15
>
> Hmm, neat. So perhaps it would make sense to just use this value instead
> of utf-8, and not worry about examining the actual text (since any such
> examination is at best a guess, anyway)?
I think you really should try the UTF-8 guess, since a file may well be UTF-8
even if the user locale is something else. Especially for XML files, UTF-8
is common, but there are many more cases. Look into git-gui/po for more
examples. The probability of a UTF-8 test being wrong is just so unimaginably
low.
> PS Your 'require' is more simply written as 'use I18N::Langinfo
> qw(langinfo CODESET)', or perhaps even simpler:
See the man page, from which I stole it. It suggests you wrap it all inside
eval {}, just in case your perl does not have langinfo.
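Roughly, the eval-wrapped form looks like the sketch below. The helper
name locale_codeset and the idea of passing in a fallback value are
assumptions for this example, not something send-email has.

```perl
use strict;
use warnings;

# Hypothetical helper: return the locale's codeset, or the given
# fallback if I18N::Langinfo is missing or reports nothing useful.
sub locale_codeset {
    my $fallback = shift;
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo->import(qw(langinfo CODESET));
        langinfo(CODESET());    # note the (), as in the man page
    };
    return (defined $codeset && length $codeset) ? $codeset : $fallback;
}

print "My codeset=", locale_codeset('UTF-8'), "\n";
```

If the module cannot be loaded, the eval returns undef and the caller
gets the fallback instead of a fatal error.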
As for the is_utf8(), I'm not sure what it does, but I can't make it work.
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 12:54 ` Robin Rosenberg
@ 2008-03-29 21:18 ` Jeff King
2008-03-29 21:43 ` Robin Rosenberg
0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2008-03-29 21:18 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 01:54:10PM +0100, Robin Rosenberg wrote:
> I think you really should try the UTF-8 guess, since a file may well be UTF-8
> even if the user locale is something else. Especially for XML files, UTF-8
> is common, but there are many more cases. Look into git-gui/po for more
> examples. The probability of a UTF-8 test being wrong is just so unimaginably
> low.
Thinking about this more, I think it is only half the solution. If
something is not valid utf-8, then we know it must be something else.
But if something is valid utf-8, is it necessarily utf-8? I think we are
going to have a much higher probability of guessing wrong there.
For example, consider the bytes { 0xc3, 0xb6 }. In utf-8, they are 'ö'.
But in iso8859-1, they also have meaning (Ã followed by a paragraph
symbol). Now that is an unlikely combination to come up. And maybe for
Latin-1, having two non-ascii characters next to each other is unlikely.
But over all commonly used encodings, what is the probability in an
average text of that encoding that it contains valid UTF-8?
For example, I have no idea what patterns can be found in EUCJP.
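To illustrate the ambiguity, here is a small sketch that decodes the
same two bytes under both assumptions and prints the resulting code
points. It only uses Encode's decode(); the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xc3\xb6";

# The same octets, interpreted under two different encodings.
my $as_utf8   = decode('UTF-8',      $bytes);
my $as_latin1 = decode('iso-8859-1', $bytes);

# Show each result as a list of code points.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

print "as UTF-8:   ", codepoints($as_utf8),   "\n";   # U+00F6
print "as Latin-1: ", codepoints($as_latin1), "\n";   # U+00C3 U+00B6
```

Both decodes succeed, which is the point: validity alone cannot tell
which of the two the author meant.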
> > PS Your 'require' is more simply written as 'use I18N::Langinfo
> > qw(langinfo CODESET)', or perhaps even simpler:
>
> See the man page, from which I stole it. It suggests you wrap it all inside
> eval {}, just in case your perl does not have langinfo.
Yes, that does make sense for a script (I just couldn't see it because
the entire toy example would be inside the eval).
> As for the is_utf8() i'm not sure what it does, but I can't make it work.
There is some magic with how Perl marks strings as "binary" versus
"utf-8" that I don't quite understand. And I think is_utf8 is really
about asking "is the utf-8 flag set".
I think this discussion would benefit greatly from somebody who has more
of a clue how perl i18n stuff works. Why don't you work up a patch that
makes sense for you, and then hopefully that will get some attention?
-Peff
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 21:18 ` Jeff King
@ 2008-03-29 21:43 ` Robin Rosenberg
2008-03-29 22:00 ` Jeff King
0 siblings, 1 reply; 33+ messages in thread
From: Robin Rosenberg @ 2008-03-29 21:43 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
Den Saturday 29 March 2008 22.18.49 skrev Jeff King:
> On Sat, Mar 29, 2008 at 01:54:10PM +0100, Robin Rosenberg wrote:
> > I think you really should try the UTF-8 guess, since a file may well be
> > UTF-8 even if the user locale is something else. Especially for XML
> > files, UTF-8 is common, but there are many more cases. Look into
> > git-gui/po for more examples. The probability of a UTF-8 test being wrong
> > is just so unimaginably low.
>
> Thinking about this more, I think it is only half the solution. If
> something is not valid utf-8, then we know it must be something else.
> But if something is valid utf-8, is it necessarily utf-8? I think we are
> going to have a much higher probability of guessing wrong there.
>
> For example, consider the bytes { 0xc3, 0xb6 }. In utf-8, they are 'ö'.
> But in iso8859-1, they also have meaning (Ã followed by a paragraph
> symbol). Now that is an unlikely combination to come up. And maybe for
> Latin-1, having two non-ascii characters next to each other is unlikely.
First, that is an unlikely sequence even at random. For any "real" string
it simply won't happen, even in this context. Try scanning everything you
can think of and see if you find such a sequence that is not actually UTF-8.
> But over all commonly used encodings, what is the probability in an
> average text of that encoding that it contains valid UTF-8?
> For example, I have no idea what patterns can be found in EUCJP.
See here http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
Note that a random string is a randomly generated string. Not a random string
from the set of actually existing strings.
> There is some magic with how Perl marks strings as "binary" versus
> "utf-8" that I don't quite understand. And I think is_utf8 is really
> about asking "is the utf-8 flag set".
>
> I think this discussion would benefit greatly from somebody who has more
> of a clue how perl i18n stuff works. Why don't you work up a patch that
> makes sense for you, and then hopefully that will get some attention?
The only real question as I see it is whether perl has a builtin method that
works better than the decode/encode. Anyone?
-- robin
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 21:43 ` Robin Rosenberg
@ 2008-03-29 22:00 ` Jeff King
0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2008-03-29 22:00 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, git
On Sat, Mar 29, 2008 at 10:43:40PM +0100, Robin Rosenberg wrote:
> First, that is an unlikely sequence even at random. For any "real" string
> it simply won't happen, even in this context. Try scanning everything you
> can think of and see if you find such a sequence that is not actually UTF-8.
That's the problem I was mentioning: "everything I can think of" is
basically just us-ascii with a few accented characters. I don't know
how, e.g., Japanese texts will fare with such a test.
> > But over all commonly used encodings, what is the probability in an
> > average text of that encoding that it contains valid UTF-8?
> > For example, I have no idea what patterns can be found in EUCJP.
>
> See here http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
Thanks, that is an interesting read. And he seems to indicate that you
can guess with a reasonable degree of success. But a few points on that
work:
- he has a specific methodology for guessing, which is more elaborate
than what you proposed. So to get his results, you would need to
implement his method. Hopefully if perl does have a "guess if this
looks like utf8" method, it uses a similar scheme.
- he does admit that some encodings have difficult to assess
probabilities, and it will vary from language to language. See page
22:
If a specific language does not use all three letters (a single
letter on the left and the corresponding two letters on the
right), then this combination presents no danger. Further checks
can then be made with a dictionary, although there is the problem
that a dictionary never contains all possible words, and that of
course resource names don't necessarily have to be words.
- he mentions Latin, Cyrillic, and Hebrew encodings. I note the
conspicuous absence of any Asian languages.
> Note that a random string is a randomly generated string. Not a random
> string from the set of actually existing strings.
Sure. But looking at random strings isn't terribly useful; there is a
non-uniform distribution over the set of strings, dependent on the
_actual_ encoding. So there are going to be "good" encodings that will
guess well, and there will be "bad" encodings that might not (and by
"will", I mean "there may be"; that is the very thing I am saying we
don't have good evidence for).
-Peff
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-29 9:52 ` Jeff King
2008-03-29 12:54 ` Robin Rosenberg
@ 2008-03-30 2:12 ` Sam Vilain
2008-03-30 4:31 ` Jeff King
1 sibling, 1 reply; 33+ messages in thread
From: Sam Vilain @ 2008-03-30 2:12 UTC (permalink / raw)
To: Jeff King; +Cc: Robin Rosenberg, Junio C Hamano, git
Jeff King wrote:
> Any idea what version of perl started shipping I18N::Langinfo? I
> couldn't see anything useful from grepping the Changes files.
Module::CoreList knows. See the man page for that.
Sam.
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-30 2:12 ` Sam Vilain
@ 2008-03-30 4:31 ` Jeff King
0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2008-03-30 4:31 UTC (permalink / raw)
To: Sam Vilain; +Cc: Robin Rosenberg, Junio C Hamano, git
On Sun, Mar 30, 2008 at 03:12:46PM +1300, Sam Vilain wrote:
> > Any idea what version of perl started shipping I18N::Langinfo? I
> > couldn't see anything useful from grepping the Changes files.
> Module::CoreList knows. See the man page for that.
Thanks, I didn't know about that (I foolishly assumed that such
information would be, well, along with the core of perl).
The answer is: I18N::Langinfo started shipping with 5.007003. I think we
have pretty much given up on perl < 5.6 (at least from my experience
with 5.005 on Solaris), so it is probably safe to use.
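For anyone who wants to repeat the lookup, something like the following
should do it, assuming Module::CoreList is installed (it is not
guaranteed to be present on every system). The printed version should
match the one quoted above.

```perl
use strict;
use warnings;
use Module::CoreList;

# Ask which perl release first bundled the module.
my $first = Module::CoreList->first_release('I18N::Langinfo');
print "I18N::Langinfo first shipped with perl $first\n";
```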
-Peff
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
2008-03-29 7:19 ` Robin Rosenberg
@ 2008-05-21 19:39 ` Junio C Hamano
2008-05-21 19:47 ` Jeff King
1 sibling, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2008-05-21 19:39 UTC (permalink / raw)
To: git; +Cc: Jeff King
Last night I was going through old mail-logs and found this and another
one that this is a follow-up to, which I think are still needed. Does
anybody see anything wrong with them?
Jeff King <peff@peff.net> writes:
> We always use 'utf-8' as the encoding, since we currently
> have no way of getting the information from the user.
>
> This also refactors the quoting of recipient names, since
> both processes can share the rfc2047 quoting code.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> git-send-email.perl | 19 +++++++++++++++++--
> t/t9001-send-email.sh | 15 +++++++++++++++
> 2 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/git-send-email.perl b/git-send-email.perl
> index 7c4f06c..d0f9d4a 100755
> --- a/git-send-email.perl
> +++ b/git-send-email.perl
> @@ -536,6 +536,14 @@ EOT
> if (!$in_body && /^MIME-Version:/i) {
> $need_8bit_cte = 0;
> }
> + if (!$in_body && /^Subject: ?(.*)/i) {
> + my $subject = $1;
> + $_ = "Subject: " .
> + ($subject =~ /[^[:ascii:]]/ ?
> + quote_rfc2047($subject) :
> + $subject) .
> + "\n";
> + }
> print C2 $_;
> }
> close(C);
> @@ -626,6 +634,14 @@ sub unquote_rfc2047 {
> return wantarray ? ($_, $encoding) : $_;
> }
>
> +sub quote_rfc2047 {
> + local $_ = shift;
> + my $encoding = shift || 'utf-8';
> + s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
> + s/(.*)/=\?$encoding\?q\?$1\?=/;
> + return $_;
> +}
> +
> # use the simplest quoting being able to handle the recipient
> sub sanitize_address
> {
> @@ -643,8 +659,7 @@ sub sanitize_address
>
> # rfc2047 is needed if a non-ascii char is included
> if ($recipient_name =~ /[^[:ascii:]]/) {
> - $recipient_name =~ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
> - $recipient_name =~ s/(.*)/=\?utf-8\?q\?$1\?=/;
> + $recipient_name = quote_rfc2047($recipient_name);
> }
>
> # double quotes are needed if specials or CTLs are included
> diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
> index e222c49..a4bcd28 100755
> --- a/t/t9001-send-email.sh
> +++ b/t/t9001-send-email.sh
> @@ -210,4 +210,19 @@ test_expect_success '--compose respects user mime type' '
> ! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
> '
>
> +test_expect_success '--compose adds MIME for utf8 subject' '
> + clean_fake_sendmail &&
> + echo y | \
> + GIT_EDITOR=$(pwd)/fake-editor \
> + GIT_SEND_EMAIL_NOTTY=1 \
> + git send-email \
> + --compose --subject utf8-sübjëct \
> + --from="Example <nobody@example.com>" \
> + --to=nobody@example.com \
> + --smtp-server="$(pwd)/fake.sendmail" \
> + $patches &&
> + grep "^fake edit" msgtxt1 &&
> + grep "^Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=" msgtxt1
> +'
> +
> test_done
> --
> 1.5.5.rc1.141.g50ecd.dirty
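For reference, the new quoting helper can be exercised on its own. The
sketch below copies quote_rfc2047 from the quoted patch and feeds it the
subject from the new test as raw UTF-8 bytes; the surrounding script is
only an illustration, not part of send-email.

```perl
use strict;
use warnings;

# Copied from the patch; the input is treated as a string of raw bytes.
sub quote_rfc2047 {
    local $_ = shift;
    my $encoding = shift || 'utf-8';
    # Hex-escape every byte outside the small set that is left as-is.
    s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
    # Wrap the result as an encoded-word using the "q" encoding.
    s/(.*)/=\?$encoding\?q\?$1\?=/;
    return $_;
}

my $subject = "utf8-s\xc3\xbcbj\xc3\xabct";    # "utf8-sübjëct" as UTF-8 bytes
print quote_rfc2047($subject), "\n";
# prints =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=
```

That output is the string the new test greps for in msgtxt1.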
* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
2008-05-21 19:39 ` Junio C Hamano
@ 2008-05-21 19:47 ` Jeff King
0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2008-05-21 19:47 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, May 21, 2008 at 12:39:44PM -0700, Junio C Hamano wrote:
> Last night I was going through old mail-logs and found this and another
> one that this is a follow-up to, which I think are still needed. Does
> anybody see anything wrong with them?
>
> Jeff King <peff@peff.net> writes:
>
> > We always use 'utf-8' as the encoding, since we currently
> > have no way of getting the information from the user.
Ah, thanks for bringing this up. I noticed a few weeks ago that it
hadn't been applied and meant to bring it up, but somehow I failed to
do so.
Obviously I'm in support of this one, but I also think Horst's patch
looks correct.
-Peff
* [PATCH 1/2] send-email: specify content-type of --compose body
@ 2008-03-25 23:02 Jeff King
0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2008-03-25 23:02 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Teemu Likonen
If the compose message contains non-ascii characters, then
we assume it is in utf-8 and include the appropriate MIME
headers. If the user has already included a MIME-Version
header, then we assume they know what they are doing and
don't add any headers.
Signed-off-by: Jeff King <peff@peff.net>
---
git-send-email.perl | 24 ++++++++++++++++++++++++
t/t9001-send-email.sh | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 68 insertions(+), 0 deletions(-)
diff --git a/git-send-email.perl b/git-send-email.perl
index 9e568bf..7c4f06c 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -520,8 +520,22 @@ EOT
open(C,"<",$compose_filename)
or die "Failed to open $compose_filename : " . $!;
+ my $need_8bit_cte = file_has_nonascii($compose_filename);
+ my $in_body = 0;
while(<C>) {
next if m/^GIT: /;
+ if (!$in_body && /^\n$/) {
+ $in_body = 1;
+ if ($need_8bit_cte) {
+ print C2 "MIME-Version: 1.0\n",
+ "Content-Type: text/plain; ",
+ "charset=utf-8\n",
+ "Content-Transfer-Encoding: 8bit\n";
+ }
+ }
+ if (!$in_body && /^MIME-Version:/i) {
+ $need_8bit_cte = 0;
+ }
print C2 $_;
}
close(C);
@@ -958,3 +972,13 @@ sub validate_patch {
}
return undef;
}
+
+sub file_has_nonascii {
+ my $fn = shift;
+ open(my $fh, '<', $fn)
+ or die "unable to open $fn: $!\n";
+ while (my $line = <$fh>) {
+ return 1 if $line =~ /[^[:ascii:]]/;
+ }
+ return 0;
+}
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index c0973b4..e222c49 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -166,4 +166,48 @@ test_expect_success 'second message is patch' '
grep "Subject:.*Second" msgtxt2
'
+test_expect_success '--compose adds MIME for utf8 body' '
+ clean_fake_sendmail &&
+ (echo "#!/bin/sh" &&
+ echo "echo utf8 body: àéìöú >>\$1"
+ ) >fake-editor-utf8 &&
+ chmod +x fake-editor-utf8 &&
+ echo y | \
+ GIT_EDITOR=$(pwd)/fake-editor-utf8 \
+ GIT_SEND_EMAIL_NOTTY=1 \
+ git send-email \
+ --compose --subject foo \
+ --from="Example <nobody@example.com>" \
+ --to=nobody@example.com \
+ --smtp-server="$(pwd)/fake.sendmail" \
+ $patches &&
+ grep "^utf8 body" msgtxt1 &&
+ grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
+'
+
+test_expect_success '--compose respects user mime type' '
+ clean_fake_sendmail &&
+ (echo "#!/bin/sh" &&
+ echo "(echo MIME-Version: 1.0"
+ echo " echo Content-Type: text/plain\\; charset=iso-8859-1"
+ echo " echo Content-Transfer-Encoding: 8bit"
+ echo " echo Subject: foo"
+ echo " echo "
+ echo " echo utf8 body: àéìöú) >\$1"
+ ) >fake-editor-utf8-mime &&
+ chmod +x fake-editor-utf8-mime &&
+ echo y | \
+ GIT_EDITOR=$(pwd)/fake-editor-utf8-mime \
+ GIT_SEND_EMAIL_NOTTY=1 \
+ git send-email \
+ --compose --subject foo \
+ --from="Example <nobody@example.com>" \
+ --to=nobody@example.com \
+ --smtp-server="$(pwd)/fake.sendmail" \
+ $patches &&
+ grep "^utf8 body" msgtxt1 &&
+ grep "^Content-Type: text/plain; charset=iso-8859-1" msgtxt1 &&
+ ! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
+'
+
test_done
--
1.5.5.rc1.123.ge5f4e6
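As a standalone illustration of the header logic in this patch, here is
a minimal sketch that applies the same rules to a list of lines instead
of file handles. The helper name add_mime_headers is made up for this
example and does not exist in git-send-email.perl.

```perl
use strict;
use warnings;

# Hypothetical helper: given the composed message as a list of lines,
# return the lines with MIME headers inserted before the body if needed.
sub add_mime_headers {
    my @lines = @_;
    # Non-ascii bytes anywhere in the message mean we need the headers...
    my $need_8bit_cte = grep { /[^[:ascii:]]/ } @lines;
    my $in_body = 0;
    my @out;
    for my $line (@lines) {
        # The first blank line ends the header block.
        if (!$in_body && $line =~ /^\n$/) {
            $in_body = 1;
            if ($need_8bit_cte) {
                push @out, "MIME-Version: 1.0\n",
                           "Content-Type: text/plain; charset=utf-8\n",
                           "Content-Transfer-Encoding: 8bit\n";
            }
        }
        # ...unless the user already supplied their own MIME-Version.
        if (!$in_body && $line =~ /^MIME-Version:/i) {
            $need_8bit_cte = 0;
        }
        push @out, $line;
    }
    return @out;
}

print add_mime_headers("Subject: foo\n", "\n", "utf8 body: \xc3\xa0\n");
```

With the input shown, the three MIME headers are printed between the
Subject line and the blank line that starts the body.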
end of thread, other threads:[~2008-05-21 19:48 UTC | newest]
Thread overview: 33+ messages
2008-03-28 6:30 [ANNOUNCE] GIT 1.5.5-rc2 Junio C Hamano
2008-03-28 18:13 ` Jeff King
2008-03-28 21:05 ` Junio C Hamano
2008-03-28 21:23 ` Jeff King
2008-03-28 21:27 ` Jeff King
2008-03-28 21:28 ` [PATCH 1/2] send-email: specify content-type of --compose body Jeff King
2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
2008-03-29 7:19 ` Robin Rosenberg
2008-03-29 7:22 ` Jeff King
2008-03-29 8:41 ` Robin Rosenberg
2008-03-29 8:49 ` Jeff King
2008-03-29 9:02 ` Robin Rosenberg
2008-03-29 9:11 ` Jeff King
2008-03-29 9:39 ` Robin Rosenberg
2008-03-29 9:43 ` Jeff King
2008-03-29 12:54 ` Robin Rosenberg
2008-03-29 21:45 ` Jeff King
2008-03-30 3:40 ` Sam Vilain
2008-03-30 4:39 ` Jeff King
2008-03-30 23:47 ` Junio C Hamano
2008-03-29 8:44 ` Robin Rosenberg
2008-03-29 8:53 ` Jeff King
2008-03-29 9:38 ` Robin Rosenberg
2008-03-29 9:52 ` Jeff King
2008-03-29 12:54 ` Robin Rosenberg
2008-03-29 21:18 ` Jeff King
2008-03-29 21:43 ` Robin Rosenberg
2008-03-29 22:00 ` Jeff King
2008-03-30 2:12 ` Sam Vilain
2008-03-30 4:31 ` Jeff King
2008-05-21 19:39 ` Junio C Hamano
2008-05-21 19:47 ` Jeff King
-- strict thread matches above, loose matches on Subject: below --
2008-03-25 23:02 [PATCH 1/2] send-email: specify content-type of --compose body Jeff King