* git-fast-import yields huge packfile
From: Richard Hipp @ 2019-03-16 20:31 UTC
To: git

I'm trying to transform a repository from another VCS into a Git
repository using "git fast-import". It appears to work, but the
resulting Git repository is huge relative to the original - 18 times
larger. Most of the space seems to be taken up by a single large
packfile. That packfile is about 967 MB which is about 1/4th the
total uncompressed size of all 41785 distinct Blobs in the original
repository. The source VCS is able to compress this down to 52 MB by
comparison.

Maybe I'm doing something wrong with the fast-import stream that is
defeating Git's attempts at delta compression....

Are there any utility programs available for analyzing packfiles so
that I can try to figure out where the inefficiencies are cropping up
and address them?

Does anybody have any suggestions on what I should be looking for?

If anyone would care to see this oversized packfile and perhaps offer
suggestions on how I can make it more space-efficient, it can be
cloned from https://github.com/drhsqlite/fossil-mirror.git - at least
for now - surely I will delete that repo and regenerate it once I
figure out this problem.

--
D. Richard Hipp
drh@sqlite.org
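On the question of tools for analyzing a packfile: `git verify-pack -v <pack>.idx` lists every object in a pack, and deltified entries carry two extra columns (delta depth and base object id). The following is a minimal sketch, in Python, of how such a listing could be summarized to see how much of a pack is stored without deltas. The function name `summarize_verify_pack` and the sample listing are made up for illustration; the column layout is the one described in the git-verify-pack documentation.

```python
import re

# One object line of `git verify-pack -v` output:
#   <id> <type> <size> <size-in-pack> <offset>                  (not deltified)
#   <id> <type> <size> <size-in-pack> <offset> <depth> <base>   (deltified)
OBJECT_LINE = re.compile(
    r"^([0-9a-f]{40,64})\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"(?:\s+(\d+)\s+([0-9a-f]{40,64}))?$"
)


def summarize_verify_pack(listing):
    """Count objects and in-pack bytes, split by deltified / not deltified."""
    stats = {
        "objects": 0,
        "deltified": 0,
        "bytes_in_pack": 0,
        "full_bytes_in_pack": 0,  # bytes used by objects stored without a delta
    }
    for line in listing.splitlines():
        match = OBJECT_LINE.match(line.strip())
        if match is None:
            continue  # skip the trailing summary lines ("non delta: ...", etc.)
        size_in_pack = int(match.group(4))
        stats["objects"] += 1
        stats["bytes_in_pack"] += size_in_pack
        if match.group(6) is not None:
            stats["deltified"] += 1
        else:
            stats["full_bytes_in_pack"] += size_in_pack
    return stats


if __name__ == "__main__":
    # Hypothetical two-object listing, only to show the expected input shape.
    sample = "\n".join([
        "a" * 40 + " blob   1000 400 12",
        "b" * 40 + " blob   20 30 412 1 " + "a" * 40,
        "non delta: 1 object",
        "chain length = 1: 1 object",
    ])
    print(summarize_verify_pack(sample))
```

A pack in which `full_bytes_in_pack` is close to `bytes_in_pack` contains few deltas, which is the symptom to look for here.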
* Re: git-fast-import yields huge packfile
From: Linus Torvalds @ 2019-03-16 21:04 UTC
To: Richard Hipp; +Cc: Git List Mailing

On Sat, Mar 16, 2019 at 1:31 PM Richard Hipp <drh@sqlite.org> wrote:
>
> Maybe I'm doing something wrong with the fast-import stream that is
> defeating Git's attempts at delta compression....

fast-import doesn't do fancy delta compression because that would
defeat the "fast" part of fast-import.

Just do a git repack after the import to do the proper repacking. I
get a 41 MB packfile when I try that on your repo.

So a simple

    git repack -adf

should fix things up for you (the "-f" to make sure it doesn't try to
re-use things from the silly bad pack).

Alternatively, use "git gc --aggressive", which will do that forced
repack too.

          Linus
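The advice above can be wrapped in a small import script. This is a minimal sketch in Python, assuming a non-bare repository whose packs live under `.git/objects/pack`; the helper names (`pack_bytes`, `repack_command`, `repack_and_report`) are hypothetical and not part of Git.

```python
import subprocess
from pathlib import Path


def pack_bytes(git_dir):
    """Total size in bytes of all *.pack files under <git_dir>/objects/pack."""
    pack_dir = Path(git_dir) / "objects" / "pack"
    return sum(p.stat().st_size for p in pack_dir.glob("*.pack"))


def repack_command(force=True):
    """Build the argv for a full repack; -f makes Git recompute deltas
    instead of reusing what is already in the existing pack."""
    cmd = ["git", "repack", "-a", "-d"]
    if force:
        cmd.append("-f")
    return cmd


def repack_and_report(work_tree):
    """Run the forced repack and return (bytes_before, bytes_after)."""
    git_dir = Path(work_tree) / ".git"  # assumption: non-bare repository
    before = pack_bytes(git_dir)
    subprocess.run(repack_command(), cwd=work_tree, check=True)
    after = pack_bytes(git_dir)
    return before, after
```

Calling `repack_and_report("/path/to/imported/repo")` right after the fast-import run finishes gives a before/after comparison of the pack size; the path is a placeholder.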
* Re: git-fast-import yields huge packfile
From: Mike Hommey @ 2019-03-16 22:12 UTC
To: Linus Torvalds; +Cc: Richard Hipp, Git List Mailing

On Sat, Mar 16, 2019 at 02:04:33PM -0700, Linus Torvalds wrote:
> On Sat, Mar 16, 2019 at 1:31 PM Richard Hipp <drh@sqlite.org> wrote:
> >
> > Maybe I'm doing something wrong with the fast-import stream that is
> > defeating Git's attempts at delta compression....
>
> fast-import doesn't do fancy delta compression because that would
> defeat the "fast" part of fast-import.

fast-import however does try to do delta compression of blobs against
the last blob that was imported, so if you put your blobs in an order
where they can be delta-ed, you can win without a git repack. For
one-shot conversions, you can just rely on git repack.

Mike
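One way to act on the ordering point above is to emit all blobs up front, grouped so that successive revisions of the same file are adjacent in the stream. The sketch below, in Python, only illustrates that ordering; the function name `write_blobs` and the `(path, revision, content)` input shape are assumptions for the example, and how much it helps depends on how similar neighbouring blobs actually are. The `blob`, `mark` and `data` commands are those of the fast-import stream format.

```python
import io


def write_blobs(out, revisions):
    """Write fast-import `blob` commands to the binary stream `out`.

    `revisions` is an iterable of (path, revision_number, content_bytes).
    Blobs are sorted by path, then revision, so that each blob follows
    the previous revision of the same file whenever there is one.
    Returns a dict mapping (path, revision_number) to the mark assigned.
    """
    marks = {}
    ordered = sorted(revisions, key=lambda r: (r[0], r[1]))
    for mark, (path, rev, content) in enumerate(ordered, start=1):
        out.write(b"blob\n")
        out.write(b"mark :%d\n" % mark)
        out.write(b"data %d\n" % len(content))  # exact byte count
        out.write(content)
        out.write(b"\n")
        marks[(path, rev)] = mark
    return marks


if __name__ == "__main__":
    buf = io.BytesIO()
    marks = write_blobs(buf, [
        ("src/b.c", 1, b"int b;\n"),
        ("src/a.c", 2, b"int a = 2;\n"),
        ("src/a.c", 1, b"int a = 1;\n"),
    ])
    print(marks)
    print(buf.getvalue().decode())
```

The returned marks can then be referenced from the later `commit` commands (for example `M 100644 :<mark> <path>`), so the commit order does not have to match the blob order.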
* Re: git-fast-import yields huge packfile
From: Richard Hipp @ 2019-03-16 23:22 UTC
To: Linus Torvalds; +Cc: Git List Mailing

On 3/16/19, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>     git repack -adf
>

Thanks for the tip!

--
D. Richard Hipp
drh@sqlite.org
* Re: git-fast-import yields huge packfile
From: Johannes Schindelin @ 2019-03-21 14:09 UTC
To: Richard Hipp; +Cc: git

Hi Richard,

On Sat, 16 Mar 2019, Richard Hipp wrote:

> I'm trying to transform a repository from another VCS into a Git
> repository using "git fast-import". It appears to work, but the
> resulting Git repository is huge relative to the original - 18 times
> larger. Most of the space seems to be taken up by a single large
> packfile. That packfile is about 967 MB which is about 1/4th the
> total uncompressed size of all 41785 distinct Blobs in the original
> repository. The source VCS is able to compress this down to 52 MB by
> comparison.

I feel your pain, as I had the same problem back in the day. My use case
was mirroring an upstream Mercurial repository to a Git repository. This
use case went away, so I do not do that anymore (and there are more, less
happy reasons why I would no longer work on that git-remote-hg project,
but that's off topic). As one of the last rem(a)inders, Git for Windows
carries this patch:

https://github.com/git-for-windows/git/commit/b91911ff8d3e2cf279b4708be89de2e3bc8e9e87

Essentially, it *always* runs `git gc --auto` after running `fast-import`.

Which is a lot more high-level advice than the rather low-level `git
repack` hint given elsewhere in this thread.

Now, I wonder whether we should integrate this into `fast-import` proper
(with a knob to turn it off), maybe even offer to run `git gc --auto`
every <N> imported commits?

Ciao,
Johannes
* Re: git-fast-import yields huge packfile
From: Ævar Arnfjörð Bjarmason @ 2019-03-21 14:23 UTC
To: Johannes Schindelin; +Cc: Richard Hipp, git, Mike Hommey, Linus Torvalds

On Thu, Mar 21 2019, Johannes Schindelin wrote:

> Hi Richard,
>
> On Sat, 16 Mar 2019, Richard Hipp wrote:
>
>> I'm trying to transform a repository from another VCS into a Git
>> repository using "git fast-import". It appears to work, but the
>> resulting Git repository is huge relative to the original - 18 times
>> larger. Most of the space seems to be taken up by a single large
>> packfile. That packfile is about 967 MB which is about 1/4th the
>> total uncompressed size of all 41785 distinct Blobs in the original
>> repository. The source VCS is able to compress this down to 52 MB by
>> comparison.
>
> I feel your pain, as I had the same problem back in the day. My use case
> was mirroring an upstream Mercurial repository to a Git repository. This
> use case went away, so I do not do that anymore (and there are more, less
> happy reasons why I would no longer work on that git-remote-hg project,
> but that's off topic). As one of the last rem(a)inders, Git for Windows
> carries this patch:
>
> https://github.com/git-for-windows/git/commit/b91911ff8d3e2cf279b4708be89de2e3bc8e9e87
>
> Essentially, it *always* runs `git gc --auto` after running `fast-import`.
>
> Which is a lot more high-level advice than the rather low-level `git
> repack` hint given elsewhere in this thread.
>
> Now, I wonder whether we should integrate this into `fast-import` proper
> (with a knob to turn it off), maybe even offer to run `git gc --auto`
> every <N> imported commits?

My reading of the combination of Linus's & Mike Hommey's E-Mails is that
this just happened to work for you because the blob import order you
used was such that you didn't get any on-the-fly deltas.

But as Linus notes you need to pass "-f" aka. "--no-reuse-delta" down to
pack-objects for this to work in the general case, so a plain "git gc"
in that GFW patch won't do the right thing *unless* you didn't end up
with any deltas at all (or close enough for it not to matter).

So in the general case you need to run "git gc --aggressive" after a
"fast-import". I'll add some docs about this in my re-roll of my
concurrent gc doc series:
https://public-inbox.org/git/20190318161502.7979-1-avarab@gmail.com/

I wonder if we should just leave it at that. The fast-import command is
plumbing, and e.g. someone running N number of those now and doing a
"git gc --aggressive" afterwards would have their use broken by this;
their "gc" would abort if the "--aggressive" we spawned after the 1st
fast-import invocation was still running.

I was thinking of introducing some sub-mode for --aggressive that
doesn't tweak the window size, but just passes down "-f". It would more
generally cover these cases, and eat less CPU than the increased window
size (although "--no-reuse-delta" by itself is very expensive).