Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

From: Jeff King <peff@peff.net>
To: A Large Angry SCM <gitzilla@gmail.com>
Cc: Felipe Contreras <felipe.contreras@gmail.com>,
	Michael J Gruber <git@drmicha.warpmail.net>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
Date: Sun, 11 Nov 2012 12:15:18 -0500
Message-ID: <20121111171518.GA20115@sigill.intra.peff.net>
In-Reply-To: <509FD9BC.7050204@gmail.com>

On Sun, Nov 11, 2012 at 12:00:44PM -0500, A Large Angry SCM wrote:

> >>>a) Leave the name conversion to the export tools, and when they miss
> >>>some weird corner case, like 'Author<email', let the user face the
> >>>consequences, perhaps after an hour of the process.
> [...]
> >>>b) Do the name conversion in fast-import itself, perhaps optionally,
> >>>so if a tool missed some weird corner case, the user does not have to
> >>>face the consequences.
> [...]
> >>c) Do the name conversion, and whatever other cleanup and manipulations
> >>you're interested in, in a filter between the exporter and git-fast-import.
> >
> >Such a filter would probably be quite complicated, and would decrease
> >performance.
> >
> 
> Really?
> 
> The fast import stream protocol is pretty simple. All the filter
> really needs to do is pass through everything that isn't a 'commit'
> command. And for the 'commit' command, it only needs to do something
> with the 'author' and 'committer' lines; passing through everything
> else.
> 
> I agree that an additional filter _may_ decrease performance somewhat
> if you are already CPU constrained. But I suspect that the effect
> would be negligible compared to all of the SHA-1 calculations.

It might be measurable, as you are passing every byte of every version
of every file in the repo through an extra pipe. But more importantly, I
don't think it helps.

If there is not a standard filter for fixing up names, we do not need to
care. The user can use "sed" or whatever and pay the performance penalty
(and deal with the possibility of errors from being lazy about parsing
the fast-import stream).
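
To make the parsing hazard concrete: a line-oriented "sed" rewrite would
also touch file contents or commit messages that happen to contain a line
starting with "author ". Below is a minimal sketch, in Python, of a filter
that avoids this by copying "data" payloads through untouched. The names
filter_stream and fix_ident are hypothetical (nothing like them ships with
git), and the sketch assumes only the two "data" forms described in the
git-fast-import documentation (byte-counted and delimited).

```python
import re

# Only these header lines carry an ident in a 'commit' command.
IDENT_LINE = re.compile(rb"^(author|committer) ")


def filter_stream(src, dst, fix_ident):
    """Copy a fast-import stream from src to dst (binary file objects).

    Ident header lines are passed through fix_ident (bytes -> bytes);
    everything else, including data payloads, is copied verbatim.
    """
    while True:
        line = src.readline()
        if not line:
            break
        if line.startswith(b"data "):
            dst.write(line)
            arg = line[5:].rstrip(b"\n")
            if arg.startswith(b"<<"):
                # Delimited form: copy lines up to and including the delimiter.
                delim = arg[2:] + b"\n"
                while True:
                    payload = src.readline()
                    dst.write(payload)
                    if not payload or payload == delim:
                        break
            else:
                # Counted form: copy exactly that many raw bytes.
                dst.write(src.read(int(arg.decode("ascii"))))
            continue
        if IDENT_LINE.match(line):
            line = fix_ident(line)
        dst.write(line)
```

Even this small version has to understand the "data" command to be safe,
which is the kind of detail a quick one-liner tends to skip.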

If there is a standard filter, then what is the advantage in doing it as
a pipe? Why not just teach fast-import the same trick (and possibly make
it optional)? That would be simpler, more efficient, and it would make
it easier for remote helpers to turn it on (they use a command-line
switch rather than setting up an extra process).

But what I don't understand is: what would such a standard filter look
like? Fast-import (or a filter) would already receive the exporter's
best attempt at a git-like ident string. We can clean up and normalize
things like whitespace (and we probably should if we do not do so
already). But beyond that, we have no context about the name; only the
exporter has that.
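
As a rough illustration of the kind of cleanup that needs no context, here
is a sketch in Python that normalizes whitespace in an ident that is
already unambiguous, and refuses to guess otherwise. The function name
normalize_ident and the exact rules (collapse whitespace runs in the name,
trim the email) are assumptions for the example, not what fast-import
does today.

```python
import re

# Exactly one '<' and one '>' and nothing after the closing bracket.
_IDENT = re.compile(r"([^<>\n]*)<([^<>\n]*)>")


def normalize_ident(ident: str) -> str:
    """Return 'Name <email>' with whitespace cleaned up.

    Raises ValueError when the input is not a single name/email pair,
    because nothing here has the context to repair it.
    """
    m = _IDENT.fullmatch(ident.strip())
    if m is None:
        raise ValueError(f"ambiguous ident, refusing to guess: {ident!r}")
    name = " ".join(m.group(1).split())  # collapse runs of whitespace
    email = m.group(2).strip()
    return f"{name} <{email}>" if name else f"<{email}>"
```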

So if we receive:

  Foo Bar<foo.bar@example.com> <none@none>

or:

  Foo Bar<foo.bar@example.com <none@none>

or:

  Foo Bar<foo.bar@example.com

what do we do with it? Is the first part a malformed name/email pair and
the second part crap added by a lazy exporter? Or does the exporter want
to keep the angle brackets as part of the name field? Is there a
malformed email in the last one, or no email at all?
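
The ambiguity can be shown mechanically. The Python sketch below is
purely illustrative; candidate_readings is a made-up helper and its
splitting rules are one arbitrary choice among many. It lists the
name/email splits that a downstream tool could plausibly pick for a
given string.

```python
def candidate_readings(ident):
    """List plausible (name, email) splits for an ident string.

    Each '<' is tried as the start of the email, which is taken to run
    up to the next '>' (or to the end of the string if there is none).
    If the string does not end in '>', "no email at all" is also a
    candidate.
    """
    ident = ident.strip()
    readings = []
    for i, ch in enumerate(ident):
        if ch == "<":
            name = ident[:i].strip()
            email = ident[i + 1:].split(">", 1)[0].strip()
            readings.append((name, email))
    if not ident.endswith(">"):
        readings.append((ident, ""))
    return readings
```

A well-formed "Foo Bar <foo.bar@example.com>" yields exactly one reading;
each of the three strings above yields two, and nothing in the string
itself says which one the exporter meant.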

The exporter is the only program that actually knows where the data came
from, how it should be broken down, and what is appropriate for pulling
data out of its particular source system. For that reason, the exporter
has to be the place where we come up with a syntactically correct and
unambiguous ident.

I am not opposed to adding a mailmap-like feature to fast-import to map
identities, but it has to start with sane, unambiguous output from the
exporter.
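
For what it is worth, the mapping step itself would be simple once the
input is unambiguous. A hypothetical sketch in Python follows; map_ident
and the dictionary format are invented for illustration and do not
correspond to any existing fast-import option or to the real mailmap
file syntax.

```python
import re

# Strict form: 'Name <email>' with a single pair of angle brackets.
_IDENT = re.compile(r"([^<>\n]*) <([^<>\n]*)>")


def map_ident(ident, mapping):
    """Rewrite an ident via a {(name, email): (name, email)} mapping.

    Unknown identities pass through unchanged; malformed input is
    rejected rather than repaired, since only the exporter can fix it.
    """
    m = _IDENT.fullmatch(ident)
    if m is None:
        raise ValueError(f"not an unambiguous ident: {ident!r}")
    key = (m.group(1), m.group(2))
    name, email = mapping.get(key, key)
    return f"{name} <{email}>"
```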

-Peff
