Re: [PATCH] multi item packed files

From: Chris Mason <mason@suse.com>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Krzysztof Halasa <khc@pm.waw.pl>, git@vger.kernel.org
Subject: Re: [PATCH] multi item packed files
Date: Fri, 22 Apr 2005 16:32:24 -0400
Message-ID: <200504221632.26278.mason@suse.com>
In-Reply-To: <Pine.LNX.4.58.0504221230020.2344@ppc970.osdl.org>

On Friday 22 April 2005 15:43, Linus Torvalds wrote:
> On Fri, 22 Apr 2005, Chris Mason wrote:
> > The problem I see for git is that once you have enough data, it should
> > degrade over and over again somewhat quickly.
>
> I really doubt that.
>
> There's a more or less constant amount of new data added all the time: the
> number of changes does _not_ grow with history. The number of changes
> grows with the amount of changes going on in the tree, and while that
> isn't exactly constant, it definitely is not something that grows very
> fast.

From a filesystem point of view, it's not the number of changes that matters, 
it's the distance between them.  The amount of new data is constant, but the 
speed of accessing the new data is affected by the bulk of old data on disk.

Even with defragging you hopefully end up with a big chunk of the disk where 
everything is in order.  Then you add a new file and it goes either somewhere 
behind that big chunk or in front of it.  The next new file might go 
somewhere behind or in front etc etc.  Having a big chunk just means the new 
files are likely to be farther apart making reads of the new data very seeky.
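
A toy model of the placement argument above, as a rough sketch only. It assumes a one-dimensional disk, one block per new file, and an allocator that picks at random between the free space just before and just after the old data; mean_gap, chunk_blocks and new_files are made-up names, and no real filesystem allocates this simply.

```python
import random

def mean_gap(chunk_blocks, new_files, seed=0):
    """Mean block distance between consecutively written new files.

    Toy model: old data occupies blocks [0, chunk_blocks). Each new file
    takes one block, allocated at random either just before or just
    after that chunk.
    """
    rng = random.Random(seed)
    before, after = -1, chunk_blocks  # next free block on each side
    positions = []
    for _ in range(new_files):
        if rng.random() < 0.5:
            positions.append(before)
            before -= 1
        else:
            positions.append(after)
            after += 1
    # Distance the head travels between each pair of consecutive new files.
    gaps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)

if __name__ == "__main__":
    for chunk in (1_000, 100_000, 10_000_000):
        print(chunk, round(mean_gap(chunk, 1_000)))
```

The amount of new data is the same in every run; only the size of the old chunk changes, and the mean gap between new files grows with it.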

>
> Btw, this is how git is able to be so fast in the first place. Git is fast
> because it knows that the "size of the change" is a lot smaller than the
> "size of the repository", so it fundamentally at all points tries to make
> sure that it only ever bothers with stuff that has changed.
>
> Stuff that hasn't changed, it ignores very _very_ efficiently.
>
git as a write engine is very fast, and we definitely write more than we read.

> > I grabbed Ingo's tarball of 28,000 patches since 2.4.0 and applied them
> > all into git on ext3 (htree).  It only took ~2.5 hrs to apply.
>
> Ok, I'd actually wish it took even less, but that's still a pretty
> impressive average of three patches a second.

Yeah, and this was a relatively old machine with slowish drives.  One run to 
apply into my packed tree is finished and only took 2 hours.  But, I had 
'tuned' it to make bigger packed files, and the end result is 2MB compressed 
objects.  Great for compression rate, but my dumb format doesn't hold up 
well for reading it back.
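
For reference, the arithmetic behind the two import times, taking the ~2.5 hrs and 2 hours quoted above as exact (they are rounded figures, so the results are approximate):

```python
PATCHES = 28_000  # patches in Ingo's tarball, per the thread

def patches_per_second(hours):
    """Average import rate for a run that took `hours` hours."""
    return PATCHES / (hours * 3600)

if __name__ == "__main__":
    loose = patches_per_second(2.5)   # plain git objects on ext3
    packed = patches_per_second(2.0)  # packed tree
    print(f"loose objects: {loose:.2f} patches/s")    # 3.11
    print(f"packed tree:   {packed:.2f} patches/s")   # 3.89
    print(f"speedup:       {packed / loose:.2f}x")    # 1.25x
```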

If I pack every 64k (uncompressed), the checkout-tree time goes down to 3m14s.  
That's a very big difference considering how stupid my code is.  .git was only 
20% smaller with 64k chunks.  I should be able to do better...I'll do one 
more run.
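
A minimal sketch of what packing "every 64k (uncompressed)" could look like. This is not the format from the patch: the 4-byte length prefix, the single zlib stream per pack, and the names pack_objects, unpack and CHUNK_LIMIT are all assumptions made for illustration.

```python
import struct
import zlib

CHUNK_LIMIT = 64 * 1024  # uncompressed bytes per packed file (assumed policy)

def _flush(items):
    """Length-prefix each object, then compress the whole pack as one stream."""
    raw = b"".join(struct.pack(">I", len(o)) + o for o in items)
    return zlib.compress(raw)

def pack_objects(objects, limit=CHUNK_LIMIT):
    """Group objects into packs holding at most `limit` uncompressed bytes.

    A single object larger than `limit` still gets a pack of its own.
    """
    packs, current, size = [], [], 0
    for obj in objects:
        if current and size + len(obj) > limit:
            packs.append(_flush(current))
            current, size = [], 0
        current.append(obj)
        size += len(obj)
    if current:
        packs.append(_flush(current))
    return packs

def unpack(pack):
    """Return every object in a pack; the whole pack must be decompressed."""
    raw = zlib.decompress(pack)
    out, pos = [], 0
    while pos < len(raw):
        (n,) = struct.unpack_from(">I", raw, pos)
        pos += 4
        out.append(raw[pos:pos + n])
        pos += n
    return out
```

In a layout like this one, reading any single object means decompressing its entire pack, so a larger limit gives the compressor more context but makes each read more expensive.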

>
> > Anyway, I ended up with a 2.6GB .git directory.  Then I:
> >
> > rm .git/index
> > umount ; mount again
> > time read-tree `tree-id` (24.45s)
> > time checkout-cache --prefix=../checkout/ -a -f (4m30s)
> >
> > --prefix is neat ;)
>
> That sounds pretty acceptable. Four minutes is a long time, but I assume
> that the whole point of the exercise was to try to test worst-case
> behaviour.  We can certainly make sure that real usage gets lower numbers
> than that (in particular, my "real usage" ends up being 100% in the disk
> cache ;)

I had a tree with 28,000 patches.  If we pretend that one bk changeset will 
equal one git changeset, we'd have 64,000 patches (57k without empty 
mergesets), and it probably wouldn't fit into ram anymore ;)  Our bk cset 
rate was about 24k/year, so we'll have to trim very aggressively to have 
reasonable performance.
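
A back-of-the-envelope projection from the numbers in this thread. It assumes .git size scales linearly with the number of changesets and that a bk changeset costs the same as one of the 28,000 imported patches; both are rough guesses, and the figures are for the unpacked 2.6GB tree.

```python
GIT_DIR_GB = 2.6         # .git size after the 28,000-patch import
PATCHES = 28_000
BK_CHANGESETS = 64_000   # bk history, counting empty mergesets
CSETS_PER_YEAR = 24_000  # bk changeset rate

def projected_gb(changesets):
    """Linear extrapolation of .git size from the measured import."""
    return GIT_DIR_GB / PATCHES * changesets

if __name__ == "__main__":
    print(f"full bk history: {projected_gb(BK_CHANGESETS):.1f} GB")        # 5.9 GB
    print(f"yearly growth:   {projected_gb(CSETS_PER_YEAR):.1f} GB/year")  # 2.2 GB/year
```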

For a working tree that's fine, but we need some fast central place to pull 
the working .git trees from, and we're really going to feel the random io 
there.

-chris
