From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Thomas Haller <thom311@gmail.com>, Git List <git@vger.kernel.org>
Subject: [PATCH] log: re-encode commit messages before grepping
Date: Mon, 11 Feb 2013 15:59:58 -0500 [thread overview]
Message-ID: <20130211205958.GA32740@sigill.intra.peff.net> (raw)
In-Reply-To: <7v621ymxfv.fsf@alter.siamese.dyndns.org>
If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:
git commit -m 'text1' --allow-empty
git commit -m 'text2' --allow-empty
git log --graph --no-walk --grep 'text2'
which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.
Signed-off-by: Jeff King <peff@peff.net>
---
A few notes:
1. I suppose we could also use $LANG or one of the $LC_* variables to
guess at the encoding of the user's pattern. But I think using the
output encoding makes the most sense, since then the pattern you
searched for will actually be in the output.
2. There are still problems with utf8 normalization. E.g., my tests
represent utf-8 é with \xc3\xa9 (the code point for that glyph),
but it could also be represented by \x65\xcc\x81 (e + combining
acute). But that is not a new problem; it is an inherent issue with
grepping utf8. We might in the future want to offer an option to
normalize utf8 (or possibly the regex library can be taught to
handle this).
3. If the commit does need to be re-encoded, we end up doing so here,
and then potentially again if we actually show the commit. So
there may be some room to optimize that by stashing the re-encoded
version somewhere (or it may not make a big difference, I haven't
measured).
4. I'm still not clear on why "--graph --no-walk" wants to look at
commit_match after we have already cleared the commit buffer. I
agree it's nonsensical, but I wonder if it might be a symptom of an
underlying bug or inefficiency.
revision.c | 27 ++++++++++++++++++-------
t/t4210-log-i18n.sh | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 78 insertions(+), 7 deletions(-)
create mode 100755 t/t4210-log-i18n.sh
diff --git a/revision.c b/revision.c
index d7562ee..ef60205 100644
--- a/revision.c
+++ b/revision.c
@@ -2268,7 +2268,10 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
static int commit_match(struct commit *commit, struct rev_info *opt)
{
int retval;
+ const char *encoding;
+ char *message;
struct strbuf buf = STRBUF_INIT;
+
if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
return 1;
@@ -2279,13 +2282,23 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
strbuf_addch(&buf, '\n');
}
+ /*
+ * We grep in the user's output encoding, under the assumption that it
+ * is the encoding they are most likely to write their grep pattern
+ * for. In addition, it means we will match the "notes" encoding below,
+ * so we will not end up with a buffer that has two different encodings
+ * in it.
+ */
+ encoding = get_log_output_encoding();
+ message = logmsg_reencode(commit, encoding);
+
/* Copy the commit to temporary if we are using "fake" headers */
if (buf.len)
- strbuf_addstr(&buf, commit->buffer);
+ strbuf_addstr(&buf, message);
if (opt->grep_filter.header_list && opt->mailmap) {
if (!buf.len)
- strbuf_addstr(&buf, commit->buffer);
+ strbuf_addstr(&buf, message);
commit_rewrite_person(&buf, "\nauthor ", opt->mailmap);
commit_rewrite_person(&buf, "\ncommitter ", opt->mailmap);
@@ -2294,18 +2307,18 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
/* Append "fake" message parts as needed */
if (opt->show_notes) {
if (!buf.len)
- strbuf_addstr(&buf, commit->buffer);
- format_display_notes(commit->object.sha1, &buf,
- get_log_output_encoding(), 1);
+ strbuf_addstr(&buf, message);
+ format_display_notes(commit->object.sha1, &buf, encoding, 1);
}
- /* Find either in the commit object, or in the temporary */
+ /* Find either in the original commit message, or in the temporary */
if (buf.len)
retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
else
retval = grep_buffer(&opt->grep_filter,
- commit->buffer, strlen(commit->buffer));
+ message, strlen(message));
strbuf_release(&buf);
+ logmsg_free(message, commit);
return retval;
}
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
new file mode 100755
index 0000000..52a7472
--- /dev/null
+++ b/t/t4210-log-i18n.sh
@@ -0,0 +1,58 @@
+#!/bin/sh
+
+test_description='test log with i18n features'
+. ./test-lib.sh
+
+# two forms of é
+utf8_e=$(printf '\303\251')
+latin1_e=$(printf '\351')
+
+test_expect_success 'create commits in different encodings' '
+ test_tick &&
+ cat >msg <<-EOF &&
+ utf8
+
+ t${utf8_e}st
+ EOF
+ git add msg &&
+ git -c i18n.commitencoding=utf8 commit -F msg &&
+ cat >msg <<-EOF &&
+ latin1
+
+ t${latin1_e}st
+ EOF
+ git add msg &&
+ git -c i18n.commitencoding=ISO-8859-1 commit -F msg
+'
+
+test_expect_success 'log --grep searches in log output encoding (utf8)' '
+ cat >expect <<-\EOF &&
+ latin1
+ utf8
+ EOF
+ git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'log --grep searches in log output encoding (latin1)' '
+ cat >expect <<-\EOF &&
+ latin1
+ utf8
+ EOF
+ git log --encoding=ISO-8859-1 --format=%s --grep=$latin1_e >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'log --grep does not find non-reencoded values (utf8)' '
+ >expect &&
+ git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
+ >expect &&
+ git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
+ test_cmp expect actual
+'
+
+test_done
--
1.8.1.2.11.g1a2f572
next prev parent reply other threads:[~2013-02-11 21:00 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-02-08 23:52 segfault for git log --graph --no-walk --grep a Thomas Haller
2013-02-09 0:05 ` Junio C Hamano
2013-02-09 0:22 ` Junio C Hamano
2013-02-09 0:27 ` Jeff King
2013-02-09 0:39 ` Junio C Hamano
2013-02-09 0:47 ` Junio C Hamano
2013-02-09 1:05 ` Jeff King
2013-02-09 1:08 ` Jeff King
2013-02-11 19:16 ` Jeff King
2013-02-11 20:01 ` Junio C Hamano
2013-02-11 20:36 ` Junio C Hamano
2013-02-11 20:41 ` Jeff King
2013-02-11 20:55 ` Junio C Hamano
2013-02-11 20:59 ` Jeff King [this message]
2013-02-11 21:11 ` [PATCH] log: re-encode commit messages before grepping Junio C Hamano
2013-02-11 21:14 ` Jeff King
2013-02-25 8:37 ` [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly Johannes Sixt
2013-02-25 15:19 ` Jeff King
2013-02-25 19:06 ` Junio C Hamano
2013-02-25 20:31 ` Jeff King
2013-02-26 6:47 ` Johannes Sixt
2013-02-25 21:00 ` Torsten Bögershausen
2013-02-25 18:54 ` Torsten Bögershausen
2013-02-25 20:36 ` Jeff King
2013-02-09 0:29 ` segfault for git log --graph --no-walk --grep a Junio C Hamano
2013-02-09 0:39 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130211205958.GA32740@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=thom311@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).