git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git Mailing List <git@vger.kernel.org>,
	Lars Schneider <larsxschneider@gmail.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>,
	Taylor Blau <me@ttaylorr.com>,
	Jonathan Nieder <jrnieder@gmail.com>
Subject: Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename
Date: Tue, 13 Nov 2018 09:10:36 -0800	[thread overview]
Message-ID: <CABPp-BFbtusiT30_gU7SgmmMg25NCdgNTSEEJJysoT-1MwSnkA@mail.gmail.com> (raw)
In-Reply-To: <20181113144554.GB17454@sigill.intra.peff.net>

On Tue, Nov 13, 2018 at 6:45 AM Jeff King <peff@peff.net> wrote:
> It is an expensive log command, but it's the same expense as running
> fast-export, no? And I think maybe that is the disconnect.

I would expect an expensive log command to generally be the same
expense as running fast-export, yes.  But I would expect two expensive
log commands to be twice the expense of a single fast-export (and you
suggested two log commands: both the --find-object= one and the
--diff-filter one).

> I am looking at this problem as "how do you answer question X in a
> repository". And I think you are looking at as "I am receiving a
> fast-export stream, and I need to answer question X on the fly".
>
> And that would explain why you want to get extra annotations into the
> fast-export stream. Is that right?

I'm not trying to get information on the fly during a rewrite or
anything like that.  This is an optional pre-rewrite step (from a
separate invocation of the tool) where I have multiple questions I
want to answer.  I'd like to answer them all relatively quickly, if
possible, and I think all of them should be answerable with a single
history traversal (plus a cat-file --batch-all-objects call to get
object sizes, since I don't know of another way to get those).  I'd be
fine with switching from fast-export to log or something else if it
met the needs better.

As far as I can tell, you're trying to split each question apart and
do a history traversal for each, and I don't see why that's better.
Simpler, perhaps, but it seems worse for performance.  Am I missing
something?

> > > There I think you'd want to assemble the list with something like "git
> > > log --follow --name-only paths-of-interest" except that --follow sucks
> > > too much to handle more than one path at a time.
> > >
> > > But if you wanted to do it manually, then:
> > >
> > >   git log --diff-filter=R --name-only
> > >
> > > would be enough to let you track it down, wouldn't it?
> >
> > Without a -M you'd only catch 100% renames, right?  Those aren't the
> > only ones I'd want to catch, so I'd need to add -M.  You are right
> > that we could get basic renames this way, but it doesn't cover
> > everything I need.  Let's use this as a starting point, though, and
> > build up to what I need...
>
> No, renames are on by default these days, and that includes inexact
> renames. That said, if you're scripting you probably ought to be doing:
>
>   git rev-list HEAD | git diff-tree --stdin
>
> and there yes, you'd have to enable "-M" yourself (you touched on
> scripting and formatting below; diff-tree can accept the format options
> you'd want).

Ah, I didn't know renames were on by default; I somehow missed that.
Also, the rev-list to diff-tree pipe is nice, but I also need parent
and commit timestamp information.

....
> Yeah, I think "-t" would help your tree deletion problem.

Absolutely, thanks for the hint.  Much appreciated.  :-)

> > At this point, let's remember that we had another full git-log
> > invocation for mapping object sizes to filenames.  We might as well
> > coalesce the two log commands into one, by extending this latest one
> > to:
> >
> >   git log -M --diff-filter=RAMD --no-abbrev --raw
>
> What is there besides RAMD? :)

Well, as you pointed out above, log detects renames by default,
whereas it didn't used to.
So, if someone had written some similar-ish history walking/parsing
tool years ago that didn't depend need renames and was based on log
output, there's a good chance their tool might start failing when
rename detection was turned on by default, because instead of getting
both a 'D' and an 'M' change, they'd get an unexpected 'R'.

For my case, do I have to worry about similar future changes?  Will
copy detection ('C') or break detection ('B') become the default in
the future?  Do I have to worry about typechanges ('T")?  Will new
change types be added?  I mean, the fast-export output could maybe
change too, but it seems much less likely than with log.

> > I could potentially switch to using this and drop patch 10/10.
>
> So I'm still not _entirely_ clear on what you're trying to do with
> 10/10. I think maybe the "disconnect" part I wrote above explains it. If
> that's correct, then I think framing it in terms of the operations that
> you'd be able to perform _without running a separate traverse_ would
> make it more obvious.

Let me try to put it as briefly as I can.  With as few traversals as
possible, I want to:
  * Get all blob sizes
  * Map blob shas to filename(s) they appeared under in the history
  * Find when files and directories were deleted (and whether they
were later reinstated, since that means they aren't actually gone)
  * Find sets of filenames referring to the same logical 'file'. (e.g.
foo->bar in commit A and bar->baz in commit B mean that {foo,bar,baz}
refer to the same 'file' so that a user has an easy report to look at
to find out that if they just want to "keep baz and its history" then
they need foo & bar & baz.  I need to know about things like another
foo or bar being introduced after the rename though, since that breaks
the connection between filenames)
  * Do a few aggregations on the above data as well (e.g. all copies
of postgres.exe add up to 20M -- why were those checked in anyway?,
*.webm files in aggregate are .5G, your long-deleted src/video-server/
directory from that aborted experimental project years ago takes up 2G
of your history, etc.)

Right now, my best solution for this combination of questions is
'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10
in place.  I'm totally open to better solutions, including ones that
don't use fast-export.

  reply	other threads:[~2018-11-13 17:10 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-23 13:04 Import/Export as a fast way to purge files from Git? Lars Schneider
2018-09-23 14:55 ` Eric Sunshine
2018-09-23 15:58   ` Lars Schneider
2018-09-23 15:53 ` brian m. carlson
2018-09-23 17:04   ` Jeff King
2018-09-24 17:24 ` Elijah Newren
2018-10-31 19:15   ` Lars Schneider
2018-11-01  7:12     ` Elijah Newren
2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
2018-11-11  6:23         ` [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-11  6:33           ` Jeff King
2018-11-11  6:23         ` [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-11  6:36           ` Jeff King
2018-11-11  7:17             ` Elijah Newren
2018-11-13 23:25               ` Elijah Newren
2018-11-13 23:39                 ` Jonathan Nieder
2018-11-14  0:02                   ` Elijah Newren
2018-11-11  6:23         ` [PATCH 03/10] fast-export: use value from correct enum Elijah Newren
2018-11-11  6:36           ` Jeff King
2018-11-11 20:10             ` Ævar Arnfjörð Bjarmason
2018-11-12  9:12               ` Ævar Arnfjörð Bjarmason
2018-11-12 11:31               ` Jeff King
2018-11-11  6:23         ` [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-11  6:44           ` Jeff King
2018-11-11  7:38             ` Elijah Newren
2018-11-12 12:32               ` Jeff King
2018-11-12 22:50             ` brian m. carlson
2018-11-13 14:38               ` Jeff King
2018-11-11  6:23         ` [PATCH 05/10] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-11  6:47           ` Jeff King
2018-11-11  6:23         ` [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-11  6:53           ` Jeff King
2018-11-11  8:01             ` Elijah Newren
2018-11-12 12:45               ` Jeff King
2018-11-12 15:36                 ` Elijah Newren
2018-11-11  6:23         ` [PATCH 07/10] fast-export: ensure we export requested refs Elijah Newren
2018-11-11  7:02           ` Jeff King
2018-11-11  8:20             ` Elijah Newren
2018-11-11  6:23         ` [PATCH 08/10] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-11  7:11           ` Jeff King
2018-11-11  6:23         ` [PATCH 09/10] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-11  7:20           ` Jeff King
2018-11-11  8:32             ` Elijah Newren
2018-11-12 12:53               ` Jeff King
2018-11-12 15:46                 ` Elijah Newren
2018-11-12 16:31                   ` Jeff King
2018-11-11  6:23         ` [PATCH 10/10] fast-export: add --always-show-modify-after-rename Elijah Newren
2018-11-11  7:23           ` Jeff King
2018-11-11  8:42             ` Elijah Newren
2018-11-12 12:58               ` Jeff King
2018-11-12 18:08                 ` Elijah Newren
2018-11-13 14:45                   ` Jeff King
2018-11-13 17:10                     ` Elijah Newren [this message]
2018-11-14  7:14                       ` Jeff King
2018-11-11  7:27         ` [PATCH 00/10] fast export and import fixes and features Jeff King
2018-11-11  8:44           ` Elijah Newren
2018-11-12 13:00             ` Jeff King
2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
2018-11-14  0:25           ` [PATCH v2 01/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-14  0:25           ` [PATCH v2 02/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-14  0:25           ` [PATCH v2 03/11] fast-export: use value from correct enum Elijah Newren
2018-11-14  0:25           ` [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-14 19:17             ` SZEDER Gábor
2018-11-14 23:13               ` Elijah Newren
2018-11-14  0:25           ` [PATCH v2 05/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-14  0:25           ` [PATCH v2 06/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-14  0:25           ` [PATCH v2 07/11] fast-export: ensure we export requested refs Elijah Newren
2018-11-14  0:25           ` [PATCH v2 08/11] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-14 19:27             ` SZEDER Gábor
2018-11-14 23:16               ` Elijah Newren
2018-11-14  0:25           ` [PATCH v2 09/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
2018-11-14  0:25           ` [PATCH v2 10/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-14  0:26           ` [PATCH v2 11/11] fast-export: add --always-show-modify-after-rename Elijah Newren
2018-11-14  7:25           ` [PATCH v2 00/11] fast export and import fixes and features Jeff King
2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
2018-11-16  7:59             ` [PATCH v3 01/11] fast-export: convert sha1 to oid Elijah Newren
2018-11-16  7:59             ` [PATCH v3 02/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-16  7:59             ` [PATCH v3 03/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-16  7:59             ` [PATCH v3 04/11] fast-export: use value from correct enum Elijah Newren
2018-11-16  7:59             ` [PATCH v3 05/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-16  7:59             ` [PATCH v3 06/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-16  7:59             ` [PATCH v3 07/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-16  7:59             ` [PATCH v3 08/11] fast-export: ensure we export requested refs Elijah Newren
2018-11-16  7:59             ` [PATCH v3 09/11] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-16  7:59             ` [PATCH v3 10/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
2018-11-16  7:59             ` [PATCH v3 11/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-16 12:29               ` SZEDER Gábor
2018-11-16  8:50             ` [PATCH v3 00/11] fast export and import fixes and features Jeff King
2018-11-12  9:17       ` Import/Export as a fast way to purge files from Git? Ævar Arnfjörð Bjarmason
2018-11-12 15:34         ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BFbtusiT30_gU7SgmmMg25NCdgNTSEEJJysoT-1MwSnkA@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jrnieder@gmail.com \
    --cc=larsxschneider@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).