* On data structures and parallelism

From: Heikki Orsila @ 2009-05-17 15:23 UTC
To: git
Cc: Linus Torvalds

There was an interesting discussion at

  http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2

that involves DAGs and decompression in Git. The problem is achieving
parallelism. The following comment was made:

  "And is it possible to store the block pointers from one object to
  another in uncompressed form?"

Is there a case in Git where an "object" could store a SHA1 in
uncompressed form, so that the next object in the chain could be
prefetched before the current object is decompressed? Prefetching could
increase parallelism (and speed things up) in some cases.

A quick glance at Git's source code showed that commit objects are
compressed. Keeping even a single parent SHA1 uncompressed would allow
some prefetching, and all but perhaps a few objects contain at least one
parent SHA1 :-)

-- 
Heikki Orsila
heikki.orsila@iki.fi
http://www.iki.fi/shd
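
The prefetching idea can be sketched roughly as follows, assuming
(hypothetically) that a parent SHA1 were readable without inflating the
current object. locate_in_pack() and inflate_and_parse() are made-up
placeholders, not real Git functions; only posix_fadvise() is an actual
API:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

struct pack_pos {
	int fd;        /* packfile file descriptor */
	off_t offset;  /* offset of the object's compressed data */
	size_t size;   /* on-disk size of that compressed data */
};

/*
 * Made-up helpers for illustration only:
 *  - locate_in_pack() maps a (hypothetically uncompressed) parent SHA1
 *    to its location in a packfile, returning non-zero on success;
 *  - inflate_and_parse() decompresses and parses the current object.
 */
extern int locate_in_pack(const unsigned char sha1[20], struct pack_pos *pos);
extern void inflate_and_parse(const struct pack_pos *pos);

static void visit_commit(const struct pack_pos *cur,
                         const unsigned char parent_sha1[20])
{
	struct pack_pos parent;

	/* Ask the kernel to start paging in the parent's data now ... */
	if (locate_in_pack(parent_sha1, &parent))
		posix_fadvise(parent.fd, parent.offset, parent.size,
		              POSIX_FADV_WILLNEED);

	/* ... and overlap that readahead with inflating the current object. */
	inflate_and_parse(cur);
}

Any win would come from overlapping the readahead IO for the parent with
the CPU work of inflating the current object; whether that pays off
depends entirely on how much of the walk is IO-bound.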
* Re: On data structures and parallelism

From: Linus Torvalds @ 2009-05-17 17:06 UTC
To: Heikki Orsila
Cc: git

On Sun, 17 May 2009, Heikki Orsila wrote:
>
> There was an interesting discussion at
>
>   http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2
>
> that involves DAGs and decompression in Git. The problem is achieving
> parallelism. The following comment was made:
>
>   "And is it possible to store the block pointers from one object to
>   another in uncompressed form?"

For the biggest form of this, we actually already do. The single biggest
win of compression is the delta compression: the regular zlib compression
is generally about a factor of two for unpacked and "base" delta entries,
but much less for already delta-compressed entries. In comparison, the
delta compression is likely a factor of 10 or more.

And the delta compression already keeps the SHA1 pointer to the delta
base entry uncompressed, but the decompression is serialized by the fact
that we need to uncompress the base entry in order to then apply a delta
on top of it.

Now, there's no question that we could get more parallelism by walking
multiple such chains in parallel (we don't always have more than one
chain to walk, but sometimes we do). But as I point out in that thread,
we currently don't have any locking for the core object data structures,
and adding that locking would likely slow things down more than it speeds
them up for the normal case.

For 'git fsck', we could speed things up a lot by doing parallel work -
and we wouldn't need to have anything else uncompressed, we could just
take advantage of the fact that we could try to uncompress the different
delta chains in parallel. And yes, fsck is really slow, but on the other
hand, it's something that most people never do, and that I do about once
a month. The fact that it takes four minutes rather than one is not a
big deal.

There are other forms of compression where the SHA1 pointers are inside
the compressed data (the "regular" commit->{commit,tree} relationships
and the tree->{anything} cases). And yes, we could probably get rid of
at least the zlib compression for some of that. Much of that data doesn't
even compress very well (SHA1s are basically incompressible), and the
compression is done largely because we have one unified interface for
everything (so the basic object code doesn't need to care about different
object types or different formats: it's all just binary data with a magic
header on it).

But in that case, we'd probably not want to keep a separate uncompressed
tree; we'd just decide that "compression is too expensive to be worth
it".

That said, on my laptops, CPU time really _never_ is the issue. Every
single time something is slow, the issue is a slow 4200rpm disk that may
get 25MB/s off it for linear reads in the best case, but seeks take
milliseconds and any kind of random access will just kill performance.

So in the big picture, I suspect even the wasted CPU time is worth it.
It makes some operations slower, but anything that makes the on-disk
data denser is good, because the case I end up caring most about is
always the uncached case.

		Linus
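
To make the serialization concrete, a rough sketch of resolving a delta
chain might look like the following. The entry struct and apply_delta()
are illustrative placeholders rather than Git's actual pack code; only
zlib's uncompress() is a real call (link with -lz):

#include <stdlib.h>
#include <zlib.h>

struct entry {
	struct entry *base;     /* NULL for a non-delta ("base") entry */
	unsigned char *zdata;   /* zlib-compressed bytes from the pack */
	unsigned long zlen;     /* compressed size */
	unsigned long len;      /* size after inflating */
};

/* Placeholder: rebuild a buffer by applying a delta to its base. */
extern unsigned char *apply_delta(const unsigned char *base, unsigned long base_len,
                                  const unsigned char *delta, unsigned long delta_len,
                                  unsigned long *out_len);

static unsigned char *resolve(struct entry *e, unsigned long *out_len)
{
	unsigned char *buf = malloc(e->len);
	uLongf len = e->len;

	if (!buf)
		return NULL;
	if (uncompress(buf, &len, e->zdata, e->zlen) != Z_OK) {
		free(buf);
		return NULL;
	}

	if (!e->base) {         /* base entry: nothing more to do */
		*out_len = len;
		return buf;
	}

	/*
	 * Delta entry: the recursive call is the serialization point -
	 * the base must be fully resolved before the delta can be
	 * applied, so two entries in the *same* chain can never be
	 * inflated in parallel; only independent chains can.
	 */
	unsigned long base_len;
	unsigned char *base = resolve(e->base, &base_len);
	unsigned char *result = base ?
		apply_delta(base, base_len, buf, len, out_len) : NULL;
	free(base);
	free(buf);
	return result;
}

The bytes of the base are an input to the delta, so the only parallelism
available is across independent chains, which is exactly the fsck-style
opportunity described above.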
* Re: On data structures and parallelism

From: Linus Torvalds @ 2009-05-17 17:46 UTC
To: Heikki Orsila
Cc: git

On Sun, 17 May 2009, Linus Torvalds wrote:
>
> That said, on my laptops, CPU time really _never_ is the issue. Every
> single time something is slow, the issue is a slow 4200rpm disk that may
> get 25MB/s off it for linear reads in the best case, but seeks take
> milliseconds and any kind of random access will just kill performance.

Side note - I've several times desperately tried to see if IO parallelism
helps. It doesn't. Some drives do better if they get many independent
reads and can just do them concurrently. Sadly, that's pretty rare for
reads on rotational media, and impossible with legacy IDE drives (which
don't have the ability to do tagged queueing).

So when I try to do IO in parallel (which git does support for many
operations), it just makes the whole system come to a screeching halt,
because it now seeks around the disk a lot more. The same issues that
often kill parallelism on CPUs (bad cache behavior, and lots of
outstanding memory requests) kill parallelism on disks too - disk
performance simply is much _better_ if you do serial work than if you
try to parallelize the same work.

It would be different if I had a fancy high-end RAID system with tagged
queueing and lots of spare bandwidth that could be used in parallel. But
that's not what the git usage scenario often is. All the people pushing
multi-core seem to ignore the big issues, and always work on nice trivial
problems with a small and well-behaved "kernel" that has no IO and
preferably didn't even cache well when single-threaded (ie "streaming"
data).

		Linus
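
One crude way to reproduce this on your own hardware (this is an
experiment, not anything git itself does) is to issue a fixed total
number of random 4KB reads from one thread versus several and compare
wall-clock times; the file, read size and thread limit below are
arbitrary choices:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TOTAL_READS 8192
#define MAX_THREADS 64

struct job {
	int fd;
	off_t filesize;
	int nr_reads;        /* this thread's share of TOTAL_READS */
	unsigned int seed;
};

static void *random_reads(void *arg)
{
	struct job *j = arg;
	char buf[4096];

	for (int i = 0; i < j->nr_reads; i++) {
		/* pick a random 4KB-aligned offset and read it */
		off_t off = (off_t)(rand_r(&j->seed) % (j->filesize / 4096)) * 4096;
		if (pread(j->fd, buf, sizeof(buf), off) < 0)
			perror("pread");
	}
	return NULL;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <big-file> <nthreads>\n", argv[0]);
		return 1;
	}

	int nthreads = atoi(argv[2]);
	int fd = open(argv[1], O_RDONLY);
	off_t size = fd < 0 ? 0 : lseek(fd, 0, SEEK_END);

	if (nthreads < 1 || nthreads > MAX_THREADS || size < 4096) {
		fprintf(stderr, "bad thread count or unreadable/too-small file\n");
		return 1;
	}

	pthread_t tid[MAX_THREADS];
	struct job jobs[MAX_THREADS];

	for (int i = 0; i < nthreads; i++) {
		jobs[i] = (struct job){ fd, size, TOTAL_READS / nthreads, (unsigned int)i + 1 };
		pthread_create(&tid[i], NULL, random_reads, &jobs[i]);
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Build with 'gcc -O2 -pthread randread.c', drop the page cache between
runs (echo 3 > /proc/sys/vm/drop_caches), and time one thread against
eight over a file larger than RAM; on a single rotating disk the
threaded run is typically no faster, and often slower.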
* Re: On data structures and parallelism

From: david@lang.hm @ 2009-05-17 19:31 UTC
To: Linus Torvalds
Cc: Heikki Orsila, git

On Sun, 17 May 2009, Linus Torvalds wrote:
> On Sun, 17 May 2009, Linus Torvalds wrote:
>>
>> That said, on my laptops, CPU time really _never_ is the issue. Every
>> single time something is slow, the issue is a slow 4200rpm disk that
>> may get 25MB/s off it for linear reads in the best case, but seeks
>> take milliseconds and any kind of random access will just kill
>> performance.
>
> Side note - I've several times desperately tried to see if IO
> parallelism helps. It doesn't. Some drives do better if they get many
> independent reads and can just do them concurrently. Sadly, that's
> pretty rare for reads on rotational media, and impossible with legacy
> IDE drives (which don't have the ability to do tagged queueing).
>
> So when I try to do IO in parallel (which git does support for many
> operations), it just makes the whole system come to a screeching halt,
> because it now seeks around the disk a lot more. The same issues that
> often kill parallelism on CPUs (bad cache behavior, and lots of
> outstanding memory requests) kill parallelism on disks too - disk
> performance simply is much _better_ if you do serial work than if you
> try to parallelize the same work.
>
> It would be different if I had a fancy high-end RAID system with
> tagged queueing and lots of spare bandwidth that could be used in
> parallel. But that's not what the git usage scenario often is. All the
> people pushing multi-core seem to ignore the big issues, and always
> work on nice trivial problems with a small and well-behaved "kernel"
> that has no IO and preferably didn't even cache well when
> single-threaded (ie "streaming" data).

Do things change with SSDs? I've heard that even (especially?) with the
Intel SSDs you want to have several operations going in parallel to get
the best out of them.

David Lang
* Re: On data structures and parallelism

From: Linus Torvalds @ 2009-05-17 20:35 UTC
To: david
Cc: Heikki Orsila, git

On Sun, 17 May 2009, david@lang.hm wrote:
>
> Do things change with SSDs? I've heard that even (especially?) with the
> Intel SSDs you want to have several operations going in parallel to get
> the best out of them.

There's a slight, but noticeable, improvement. This is

	echo 3 > /proc/sys/vm/drop_caches; time git diff

run in a loop. With 'core.preloadindex = true':

	real	0m1.138s
	real	0m1.116s
	real	0m1.132s
	real	0m1.120s
	real	0m1.106s
	real	0m1.132s

and with it set to 'false':

	real	0m1.256s
	real	0m1.258s
	real	0m1.242s
	real	0m1.240s
	real	0m1.244s
	real	0m1.242s

so it's about a 10% improvement. Which is pretty good, considering that

 (a) those disks are fast enough that even for that totally cache-cold
     case, I get about 35% CPU utilization in the single-threaded case.
     And that's despite this being a 3.2GHz Nehalem box, so 35% CPU is
     really quite remarkably good. On my (much slower) laptop with a
     1.2GHz Core 2, I get 2-3% CPU time (and the whole operation takes
     20 seconds).

 (b) not all the IO ends up being parallelized, since there is a
     per-directory mutex that means that even though we start 20
     threads, we probably get a much smaller amount of real parallelism
     due to locking.

In general, the IO parallelization obviously helps most when the IO is
slow _and_ overlaps perfectly. Perfect overlap doesn't end up happening,
due to the per-directory lookup semaphore (think of it like a bank
conflict when trying to parallelize memory accesses), but with a slow
NFS connection you should get reasonably close to that optimal
situation.

But with a single spindle and rotating media, there really is sadly very
little room for optimization. I suspect a SATA disk with TCQ might be
able to do _somewhat_ better than my old PATA-only laptop (discounting
the fact that my PATA laptop harddisk is extra slow due to being just
4200rpm: any desktop disk will be much faster), but I doubt the index
preloading is really all that noticeable. In fact, I just tested on
another machine and saw no difference whatsoever. If anything, it was
slightly slower.

I suspect TCQ is a bigger win with writes.

		Linus
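
For context on what core.preloadindex is parallelizing: as far as I
understand it, Git's preload phase spreads the lstat() calls needed to
refresh the index across several threads, so that the later serial
refresh hits a warm stat cache. A simplified sketch of that shape (not
Git's actual preload-index code; the path list and fixed thread count
are placeholders):

#include <pthread.h>
#include <stddef.h>
#include <sys/stat.h>

#define NR_THREADS 8   /* arbitrary for this sketch */

struct slice {
	const char **paths;
	int nr;
};

static void *preload_slice(void *arg)
{
	struct slice *s = arg;
	struct stat st;

	/*
	 * The point is purely the side effect: each lstat() issues
	 * whatever IO is needed to pull the inode into the kernel's
	 * caches, so a later single-threaded refresh finds it warm.
	 */
	for (int i = 0; i < s->nr; i++)
		lstat(s->paths[i], &st);
	return NULL;
}

static void preload_paths(const char **paths, int nr)
{
	pthread_t tid[NR_THREADS];
	struct slice slices[NR_THREADS];
	int per_thread = (nr + NR_THREADS - 1) / NR_THREADS;

	/* hand each thread a disjoint slice of the tracked paths */
	for (int i = 0; i < NR_THREADS; i++) {
		int start = i * per_thread;
		int remaining = nr - start;

		slices[i].paths = paths + start;
		slices[i].nr = remaining > 0 ?
			(remaining < per_thread ? remaining : per_thread) : 0;
		pthread_create(&tid[i], NULL, preload_slice, &slices[i]);
	}
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
}

The per-directory contention mentioned above would show up here as
threads serializing on lookups of shared parent directories, which is
why starting twenty threads does not translate into twenty-fold IO
parallelism.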
Thread overview: 5 messages

2009-05-17 15:23 On data structures and parallelism  Heikki Orsila
2009-05-17 17:06 ` Linus Torvalds
2009-05-17 17:46   ` Linus Torvalds
2009-05-17 19:31     ` david
2009-05-17 20:35       ` Linus Torvalds