From: Linus Torvalds <torvalds@linux-foundation.org>
To: Heikki Orsila <shdl@zakalwe.fi>
Cc: git@vger.kernel.org
Subject: Re: On data structures and parallelism
Date: Sun, 17 May 2009 10:06:26 -0700 (PDT) [thread overview]
Message-ID: <alpine.LFD.2.01.0905170950230.3301@localhost.localdomain> (raw)
In-Reply-To: <20090517152335.GC11543@zakalwe.fi>
On Sun, 17 May 2009, Heikki Orsila wrote:
>
> There was an interesting discussion at
>
> http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2
>
> that involves DAGs and decompression in Git. The problem is achieving
> parallelism. The following comment was made:
>
> "And is it possible to store the block pointers from one object to
> another in uncompressed form?"
For the biggest form of this, we actually already do.
The single biggest win of compression is the delta-compression: the
regular zlib compression is generally about a factor-of-two for unpacked
and "base" delta entries, but much less for already delta-compressed
entries. In comparison, the delta-compression is likely about a factor of
10 or more.
And the delta compression already has the SHA1 pointer to the delta base
entry uncompressed, but the compression is serialized by the fact that we
need to uncompress the base entry in order to then apply a delta on top of
it.
Now, there's no question that we could have higher levels of parallelism
by walking multiple such chains in parallel (we don't always have more
than one chain to walk, but sometimes we do). But as I point out in that
thread, we currently don't have any locking for the core object
datastructures, and adding that locking would likely slow down things more
than it speeds things up for the normal case.
For 'git fsck', we could speed things up a lot by doing parallel work -
and we wouldn't need to have anything else uncompressed, we could just
take advantage of the fact that we could try to uncompress the different
delta chains in parallel. And yes, fsck is really slow, but on the other
hand, it's something that most people never do, and I do about once a
month. The fact that it takes four minutes rather than one is not a big
deal.
There are other forms of compression where the SHA1 pointers are inside
the compressed data (the "regular" commit->{commit,tree} relationships and
tree->{anything} cases).
And yes, we could probably get rid of at least the zlib compression in
some of that. Much of that data doesn't even compress very well (SHA1's
are basically uncompressible), and the compression is done largely because
we have one unified interface for everything (so the basic object code
doesn't need to care about different object types or different formats:
it's all just binary data with a magic header to it).
But in that case, we'd probably not want to keep a separate uncompressed
tree, we'd just decide that "compression is too expensive to be worth it".
That said, on my laptops, CPU time really _never_ is the issue. Every
single time something is slow, the issue is a slow 4200rpm disk that may
get 25MB/s off it for linear things in the best case, but seeks take
milliseconds and any kind of random access will just kill performance.
So in the big picture, I suspect even the wasted CPU-time is worth it. It
makes some operations slower, but anything that makes the on-disk data
denser is good. Because the case I end up caring most about is always the
uncached case.
Linus
next prev parent reply other threads:[~2009-05-17 17:06 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-05-17 15:23 On data structures and parallelism Heikki Orsila
2009-05-17 17:06 ` Linus Torvalds [this message]
2009-05-17 17:46 ` Linus Torvalds
2009-05-17 19:31 ` david
2009-05-17 20:35 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LFD.2.01.0905170950230.3301@localhost.localdomain \
--to=torvalds@linux-foundation.org \
--cc=git@vger.kernel.org \
--cc=shdl@zakalwe.fi \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox