From: Avery Pennarun <apenwarr@gmail.com>
To: skillzero@gmail.com
Cc: "Marc Branchaud" <marcnarc@xiplink.com>,
"Jakub Narebski" <jnareb@gmail.com>,
"Jens Lehmann" <Jens.Lehmann@web.de>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Bryan Larsen" <bryan.larsen@gmail.com>,
git <git@vger.kernel.org>, "Junio C Hamano" <gitster@pobox.com>,
"Linus Torvalds" <torvalds@linux-foundation.org>
Subject: Re: Avery Pennarun's git-subtree?
Date: Fri, 23 Jul 2010 21:20:07 -0400
Message-ID: <AANLkTimLayG_HFxGdq+Tt8hU_MApBpSdHHiYPxcakpRJ@mail.gmail.com>
In-Reply-To: <AANLkTinhd2DYh7WXzMvhMkqp98fYtTWWuQi0RSL9Rome@mail.gmail.com>
On Fri, Jul 23, 2010 at 8:58 PM, <skillzero@gmail.com> wrote:
> On Fri, Jul 23, 2010 at 3:50 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> Honest question: do you care about the wasted disk space and download
>> time for these extra files? Or just the fact that git gets slow when
>> you have them?
>
> I have a similar situation to the original poster (huge trees) and
> for me it's all three: disk space, download time, and performance. My
> tree has a few relatively small (< 20 MB) shared directories of common
> code, a few large (2-6 GB) directories of code for OS's, and then
> several medium size (< 500 MB) directories for application code. The
> application developers only care about the app+shared directories (and
> are very annoyed by the massive space and performance impact of the OS
> directories).
Given how cheap disk space is nowadays, I'm curious about this. Are
they really just annoyed by the performance problem, and they complain
about the extra size because they blame the performance on the extra
files? Or are they honestly short of disk space?

Similarly, are all your developers located at the same office? If so,
then bandwidth ought not be an issue.

I'm pushing extra hard on this because I believe there are lots of
opportunities to just improve git performance on huge repositories.
And if the only *real* reason people need to split repositories is
that performance goes down, then that's fixable, and you may need
neither git-submodule nor git-subtree.
> I work on all of the pieces, but even I would
> prefer to have things separated so when I work on the apps, git
> status/etc doesn't take a big hit for close to a million files in the
> OS directories (particularly when doing git status on Windows). Even
> when using the -uno option to git status, it's still pretty slow (over
> a minute).
This is indeed a problem with large repositories. Of course,
splitting them with git-submodule is kind of cheating, because it just
makes git-status *not look* to see if those files are dirty or not.
If they are dirty and you forget to commit them, you'll never know
until someone tells you later. It would be functionally equivalent to
just have git-status not look inside certain subdirs of a single
repository.

In any case, this is a pretty clear optimization target (especially
since Windows is so amazingly slow at statting files): just have a
daemon running inotify (or the Windows equivalent) that tracks whether
files are up-to-date or not. Then git would never need to recurse
through the entire tree, and operations like status, diff, checkout,
and commit could be fast even with a million-file repository.
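
To make the daemon idea concrete, here is a rough Python sketch of the
bookkeeping only. It is not how git works today, and it does not talk to
inotify: the class name DirtyTracker, the on_event() hook, and the
"signature" callback are all made up for illustration. The assumption is
that some file watcher calls on_event() for each path it sees change.

```python
class DirtyTracker:
    """Tracks paths reported as changed, so status never walks the tree."""

    def __init__(self, index):
        # index maps path -> signature recorded at the last commit/checkout
        # (for example a (size, mtime) pair). Hypothetical structure.
        self.index = dict(index)
        self.candidates = set()

    def on_event(self, path):
        # A real daemon would call this from an inotify (or Windows) callback.
        self.candidates.add(path)

    def status(self, signature):
        # Only paths that produced an event are re-checked; everything else
        # is assumed clean because the watcher never reported it.
        changed = sorted(
            p for p in self.candidates if signature(p) != self.index.get(p)
        )
        # Paths that are still dirty stay candidates for the next query.
        self.candidates = set(changed)
        return changed
```

With this shape, the cost of a status query depends on how many paths
were touched since the last query, not on how many files are in the
work tree.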
> git-subtree could also possibly help, but there's still extra work to
> split and merge each repository. And I'm not sure how it handles
> commit IDs across the repositories because I want to be able to say "I
> fixed that bug in shared/code.c in commit abc123" and have both the
> OS+shared and the apps+shared people be able to git log abc123 and see
> the same change (and merge/cherry-pick/etc.).
git-subtree (if you don't use --squash) keeps all the commit IDs. It
is extra work to split and merge between repositories, though. It
doesn't solve your repository-is-too-large problem.
> I think what I want is a way to do a sparse checkout where some sort
> of module is maintained in the git repository (probably just an
> INI-style file with paths) so I can clone directly from the server and
> it figures out the objects I need for the full history of only
> apps+shared (or firmware+shared, etc.) on the server side and only
> sends those objects. I still want to be able to branch, tag, and refer
> to commit IDs. So I only take the space/download/performance hit of
> directories included in the module, but I don't have to manually
> maintain that view of the repository (as I do with git-submodule and
> git-subtree).
Yes, better sparse checkout and sparse fetch would be very valuable
here and would eliminate a lot of the reasons people have for misusing
submodules.
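
As a sketch of what such a module file might look like, assuming an
INI-style format along the lines described above: the section names,
the "paths" key, and both helper functions below are hypothetical and
not an existing git feature. It only shows the path selection step; a
real sparse fetch would also have to pick the matching objects on the
server side.

```python
import configparser

# Hypothetical module definitions kept in the repository.
MODULES_INI = """
[apps]
paths = apps/ shared/

[firmware]
paths = os/firmware/ shared/
"""


def module_paths(ini_text, module):
    """Return the list of path prefixes that make up one module."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser[module]["paths"].split()


def select(paths, prefixes):
    """Keep only the paths that fall under one of the module's prefixes."""
    return [p for p in paths if any(p.startswith(pre) for pre in prefixes)]
```

For example, select(all_paths, module_paths(MODULES_INI, "apps")) would
keep files under apps/ and shared/ and drop everything under os/.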
> (although just having all those objects in
> the .git directory still slows it down quite a bit).
You're the second person who has mentioned this today (the first one
was to me in a private email). I'd like to understand this better.

In my bup project (http://github.com/apenwarr/bup) we regularly create
git repositories with hundreds of gigabytes of packs, comprising tens
or hundreds of millions of objects, and the repository doesn't get
slow. (Obviously this is a separate issue from having a huge work
tree with a million files in it.) In repositories this thoroughly
huge, we did find a way to improve memory usage versus git's pack .idx
files (bup has '.midx' files that combine multiple indexes into one,
thus reducing the binary search steps). But this only matters when
you get well over 10 gigabytes of stuff and you're wading through it
using crappy python code (as bup does) and frequently inserting a
million objects at a time (as bup does). The git usage pattern is
much simpler and therefore faster.
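
To illustrate why combining indexes saves search steps, here is a toy
Python sketch. It is not bup's actual .midx format or code: the
function names are invented, and each index is modelled as a plain
sorted list of object ids. With separate indexes a lookup may need one
binary search per pack; with a combined index it needs exactly one.

```python
import bisect
import heapq


def lookup_separate(indexes, oid):
    """Binary-search each per-pack index in turn until the id is found."""
    for pack_no, idx in enumerate(indexes):
        pos = bisect.bisect_left(idx, oid)
        if pos < len(idx) and idx[pos] == oid:
            return pack_no
    return None


def build_combined(indexes):
    """Merge sorted per-pack indexes into one sorted list of (oid, pack_no)."""
    tagged = [[(oid, n) for oid in idx] for n, idx in enumerate(indexes)]
    return list(heapq.merge(*tagged))


def lookup_combined(combined, oid):
    """A single binary search over the merged index."""
    pos = bisect.bisect_left(combined, (oid, 0))
    if pos < len(combined) and combined[pos][0] == oid:
        return combined[pos][1]
    return None
```

Both lookups return the number of the pack holding the object, or None
if no pack has it.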
How big is your .git directory and what performance problems do you
see? I assume you've done 'git gc' to clean up all the loose objects,
right?
Have fun,
Avery