[PATCH] blame: Improve parsing for emails with spaces

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] blame: Improve parsing for emails with spaces
@ 2011-04-21 22:07 Josh Stone
  2011-04-29 13:11 ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Josh Stone @ 2011-04-21 22:07 UTC (permalink / raw)
  To: git; +Cc: Josh Stone

One of my git repositories has some old commits where the authors
obfuscated their email address as <author at example dot com>.  To
handle this, blame needs to look for the leading '<' when scanning
to split the "name <email>", rather then only a space delimiter.

Signed-off-by: Josh Stone <jistone@redhat.com>
---
 builtin/blame.c     |    2 +-
 t/annotate-tests.sh |   12 +++++++++++-
 t/t8002-blame.sh    |    2 +-
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index f6b03f7..41525f1 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1377,7 +1377,7 @@ static void get_ac_line(const char *inbuf, const char *what,
 	timepos = tmp;
 
 	*tmp = 0;
-	while (person < tmp && *tmp != ' ')
+	while (person < tmp && !(*tmp == ' ' && tmp[1] == '<'))
 		tmp--;
 	if (tmp <= person)
 		return;
diff --git a/t/annotate-tests.sh b/t/annotate-tests.sh
index d34208c..abb1885 100644
--- a/t/annotate-tests.sh
+++ b/t/annotate-tests.sh
@@ -1,5 +1,5 @@
 # This file isn't used as a test script directly, instead it is
-# sourced from t8001-annotate.sh and t8001-blame.sh.
+# sourced from t8001-annotate.sh and t8002-blame.sh.
 
 check_count () {
 	head=
@@ -124,3 +124,13 @@ test_expect_success \
 test_expect_success \
     'some edit' \
     'check_count A 1 B 1 B1 1 B2 1 "A U Thor" 1 C 1 D 1'
+
+test_expect_success \
+    'an obfuscated email added' \
+    'sed -e "1i No robots allowed" < file > file.new &&
+     mv file.new file &&
+     GIT_AUTHOR_NAME="E" GIT_AUTHOR_EMAIL="E at test dot git" git commit -a -m "norobots"'
+
+test_expect_success \
+    'obfuscated email parsed' \
+    'check_count A 1 B 1 B1 1 B2 1 "A U Thor" 1 C 1 D 1 E 1'
diff --git a/t/t8002-blame.sh b/t/t8002-blame.sh
index d3a51e1..e2896cf 100755
--- a/t/t8002-blame.sh
+++ b/t/t8002-blame.sh
@@ -8,7 +8,7 @@ PROG='git blame -c'
 
 PROG='git blame -c -e'
 test_expect_success 'Blame --show-email works' '
-    check_count "<A@test.git>" 1 "<B@test.git>" 1 "<B1@test.git>" 1 "<B2@test.git>" 1 "<author@example.com>" 1 "<C@test.git>" 1 "<D@test.git>" 1
+    check_count "<A@test.git>" 1 "<B@test.git>" 1 "<B1@test.git>" 1 "<B2@test.git>" 1 "<author@example.com>" 1 "<C@test.git>" 1 "<D@test.git>" 1 "<E at test dot git>" 1
 '
 
 test_done
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame: Improve parsing for emails with spaces
  2011-04-21 22:07 [PATCH] blame: Improve parsing for emails with spaces Josh Stone
@ 2011-04-29 13:11 ` Jeff King
  2011-04-29 17:59   ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2011-04-29 13:11 UTC (permalink / raw)
  To: Josh Stone; +Cc: git

On Thu, Apr 21, 2011 at 03:07:36PM -0700, Josh Stone wrote:

> One of my git repositories has some old commits where the authors
> obfuscated their email address as <author at example dot com>.  To
> handle this, blame needs to look for the leading '<' when scanning
> to split the "name <email>", rather then only a space delimiter.

Hmm. That address seems bogus, and I wonder if we should be rejecting it
at commit time. Still, it is something we may run across in existing
repositories, so handling it more gracefully makes sense.

> diff --git a/builtin/blame.c b/builtin/blame.c
> index f6b03f7..41525f1 100644
> --- a/builtin/blame.c
> +++ b/builtin/blame.c
> @@ -1377,7 +1377,7 @@ static void get_ac_line(const char *inbuf, const char *what,
>  	timepos = tmp;
>  
>  	*tmp = 0;
> -	while (person < tmp && *tmp != ' ')
> +	while (person < tmp && !(*tmp == ' ' && tmp[1] == '<'))
>  		tmp--;
>  	if (tmp <= person)
>  		return;

The fix looks fine. It's a little gross that we are parsing the commit
headers ourselves here, as opposed to using some library function
provided for commits. I think we could eliminate some code by using
the parser in pretty.c, but it might need some refactoring to do so
efficiently.

Looking over the other places we parse author info, I don't think the
same problem exists elsewhere. Most other places parse from left to
right, and just go to the closing ">".

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame: Improve parsing for emails with spaces
  2011-04-29 13:11 ` Jeff King
@ 2011-04-29 17:59   ` Junio C Hamano
  2011-04-29 18:09     ` Junio C Hamano
                       ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Junio C Hamano @ 2011-04-29 17:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Josh Stone, git

Jeff King <peff@peff.net> writes:

> On Thu, Apr 21, 2011 at 03:07:36PM -0700, Josh Stone wrote:
>
>> One of my git repositories has some old commits where the authors
>> obfuscated their email address as <author at example dot com>.  To
>> handle this, blame needs to look for the leading '<' when scanning
>> to split the "name <email>", rather then only a space delimiter.

Given that we enclose the e-mail inside "<>" pair and excise "<" from
author names in fmt_ident(), I think it makes sense to look for " <" like
this patch does.

> Hmm. That address seems bogus, and I wonder if we should be rejecting it
> at commit time. Still, it is something we may run across in existing
> repositories, so handling it more gracefully makes sense.

Perhaps but within reason.  

What new types of breakages are we proposing to tolerate, what breakages
are we declaring not worth fixing, and what is the price of not loosening
this?  Without this patch, such a broken commit will result in the author
email shown somewhat broken, but the original is already broken to begin
with, and also the entry for the blamed line will come with its commit
object name anyway, so I do not think it is such a big deal.

It would be an entirely different issue if the command barfed and refused
to blame the file.

> Looking over the other places we parse author info, I don't think the
> same problem exists elsewhere. Most other places parse from left to
> right, and just go to the closing ">".

Because fmt_ident() removes "<" or ">" from given email address, I think
it is sufficient.

It would be nice to have an abstraction around the parsing in a way
similar to fmt_ident() is an abstraction around the formatting, but that
would be a separate topic.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame: Improve parsing for emails with spaces
  2011-04-29 17:59   ` Junio C Hamano
@ 2011-04-29 18:09     ` Junio C Hamano
  2011-04-29 18:17     ` Josh Stone
  2011-04-29 19:13     ` Jeff King
  2 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2011-04-29 18:09 UTC (permalink / raw)
  To: Jeff King, Josh Stone; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> On Thu, Apr 21, 2011 at 03:07:36PM -0700, Josh Stone wrote:
>>
>>> One of my git repositories has some old commits where the authors
>>> obfuscated their email address as <author at example dot com>.  To
>>> handle this, blame needs to look for the leading '<' when scanning
>>> to split the "name <email>", rather then only a space delimiter.
>
> Given that we enclose the e-mail inside "<>" pair and excise "<" from
> author names in fmt_ident(), I think it makes sense to look for " <" like
> this patch does.

Will queue.  Thanks.

Author: Josh Stone <jistone@redhat.com>
Date:   Thu Apr 21 15:07:36 2011 -0700

    blame: tolerate bogus e-mail addresses a bit better
    
    The names and e-mails are sanitized by fmt_ident() when creating commits,
    so that they do not contain "<" nor ">", and the "committer" and "author"
    lines in the commit object will always be in the form:
    
        ("author" | "committer") name SP "<" email ">" SP timestamp SP zone
    
    When parsing the email part out, the current code looks for SP starting
    from the end of the email part, but the author could obfuscate the address
    as "author at example dot com".
    
    We should instead look for SP followed by "<", to match the logic of the
    side that formats these lines.
    
    Signed-off-by: Josh Stone <jistone@redhat.com>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame: Improve parsing for emails with spaces
  2011-04-29 17:59   ` Junio C Hamano
  2011-04-29 18:09     ` Junio C Hamano
@ 2011-04-29 18:17     ` Josh Stone
  2011-04-29 19:13     ` Jeff King
  2 siblings, 0 replies; 6+ messages in thread
From: Josh Stone @ 2011-04-29 18:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On 04/29/2011 10:59 AM, Junio C Hamano wrote:
> It would be an entirely different issue if the command barfed and refused
> to blame the file.

Yeah - the command doesn't fail outright.  Anything that builds on
blame/annotate would get bad data, but for me it's more of a formatting
issue.  In my actual bad commit, the author's "name <broken email" takes
up 61 characters, making the whole blame output tediously wide.

Thanks for taking the patch.

Josh

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame: Improve parsing for emails with spaces
  2011-04-29 17:59   ` Junio C Hamano
  2011-04-29 18:09     ` Junio C Hamano
  2011-04-29 18:17     ` Josh Stone
@ 2011-04-29 19:13     ` Jeff King
  2 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2011-04-29 19:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Josh Stone, git

On Fri, Apr 29, 2011 at 10:59:43AM -0700, Junio C Hamano wrote:

> > Hmm. That address seems bogus, and I wonder if we should be rejecting it
> > at commit time. Still, it is something we may run across in existing
> > repositories, so handling it more gracefully makes sense.
> 
> Perhaps but within reason.  
> 
> What new types of breakages are we proposing to tolerate, what breakages
> are we declaring not worth fixing, and what is the price of not loosening
> this?  Without this patch, such a broken commit will result in the author
> email shown somewhat broken, but the original is already broken to begin
> with, and also the entry for the blamed line will come with its commit
> object name anyway, so I do not think it is such a big deal.

I'm pretty sure such an address would make a non-rfc822-compliant "from"
header when used with format-patch. But given that it is obviously a
bogus address, I don't think there's much we can do anyway, and anyone
looking at will say "Oh, that's wrong". So it's probably not a big deal.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-04-29 19:14 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-21 22:07 [PATCH] blame: Improve parsing for emails with spaces Josh Stone
2011-04-29 13:11 ` Jeff King
2011-04-29 17:59   ` Junio C Hamano
2011-04-29 18:09     ` Junio C Hamano
2011-04-29 18:17     ` Josh Stone
2011-04-29 19:13     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).