* performance on repack
@ 2007-08-11 21:12 Jon Smirl
  2007-08-11 22:09 ` David Kastrup
  2007-08-12 10:33 ` Martin Koegler
  0 siblings, 2 replies; 30+ messages in thread

From: Jon Smirl @ 2007-08-11 21:12 UTC (permalink / raw)
To: Git Mailing List

If anyone is bored and looking for something to do, making the delta
code in git repack multithreaded would help. Yesterday I did a big
repack that took 20 minutes and it only used one of my four cores. It
was compute bound the entire time.

Dell had 2GB quad core boxes on sale for $620 last week. I just had to
buy one. I'm surprised, it's quieter than my old P4 box.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-11 21:12 performance on repack Jon Smirl @ 2007-08-11 22:09 ` David Kastrup 2007-08-11 22:34 ` Linus Torvalds 2007-08-12 10:33 ` Martin Koegler 1 sibling, 1 reply; 30+ messages in thread From: David Kastrup @ 2007-08-11 22:09 UTC (permalink / raw) To: git "Jon Smirl" <jonsmirl@gmail.com> writes: > If anyone is bored and looking for something to do, making the delta > code in git repack multithreaded would help. I severely doubt that. It is like the "coding stuff in assembly language will make it faster" myth. The problem is that of manageable complexity. Making the stuff multithreaded or coded in assembly means that it becomes inaccessible for a sound algorithmic redesign. And anything that takes this long on today's machines has lots of room for algorithmic improvement, usually resulting in a speedup from one to several orders of magnitude. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack
  2007-08-11 22:09 ` David Kastrup
@ 2007-08-11 22:34 ` Linus Torvalds
  2007-08-11 23:21 ` Jon Smirl
  0 siblings, 1 reply; 30+ messages in thread

From: Linus Torvalds @ 2007-08-11 22:34 UTC (permalink / raw)
To: David Kastrup; +Cc: git

On Sun, 12 Aug 2007, David Kastrup wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> > If anyone is bored and looking for something to do, making the delta
> > code in git repack multithreaded would help.
>
> I severely doubt that. It is like the "coding stuff in assembly
> language will make it faster" myth. The problem is that of manageable
> complexity. Making the stuff multithreaded or coded in assembly means
> that it becomes inaccessible for a sound algorithmic redesign.

I have to admit that I'm not a huge fan of threading: the complexity and
locking often kills you, if memory bandwidth constraints do not, and the
end result is often really really hard to debug.

That said, I suspect we could do some *simple* form of this by just
partitioning the problem space up - we could have a MT repack that
generates four *different* packs on four different CPU's: each thread
taking one quarter of the objects.

At that point, you wouldn't even need threads, you could do it with
regular processes, since the problem set is fully partitioned once
you've generated the list of objects!

Then, after you've generated four different packs, doing a "git gc"
(without any threading) will repack them into one big pack, and mostly
just re-use the existing deltas.

So this would not be a generic thing, but it could be something that is
useful for the forced full-repack after importing a large repository
with fast-import, for example.

So while I agree with David in general about the problem of threading,
I think that we can possibly simplify the special case of repacking
into something less complicated than a "real" multi-threading problem.

Linus

^ permalink raw reply [flat|nested] 30+ messages in thread
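The partitioned scheme Linus describes can be pictured with stock plumbing
commands. This is only a rough sketch: the naive four-way split below ignores
the type/size/path ordering that good delta selection depends on, the file
names are made up, and it assumes a POSIX shell run from the repository root.

  git rev-list --objects --all | cut -d' ' -f1 > /tmp/objs
  split -l $(( ($(wc -l < /tmp/objs) + 3) / 4 )) /tmp/objs /tmp/part.
  for p in /tmp/part.*; do
      # one pack-objects process per quarter of the object list
      git pack-objects .git/objects/pack/pack < "$p" &
  done
  wait
  git gc   # fold the partial packs into one, mostly reusing existing deltas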
* Re: performance on repack
  2007-08-11 22:34 ` Linus Torvalds
@ 2007-08-11 23:21 ` Jon Smirl
  0 siblings, 0 replies; 30+ messages in thread

From: Jon Smirl @ 2007-08-11 23:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David Kastrup, git

On 8/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 12 Aug 2007, David Kastrup wrote:
> > "Jon Smirl" <jonsmirl@gmail.com> writes:
> >
> > > If anyone is bored and looking for something to do, making the delta
> > > code in git repack multithreaded would help.

To put this in perspective, monotone takes 40hrs on the same box to
repack a repository of similar size.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack
  2007-08-11 21:12 performance on repack Jon Smirl
  2007-08-11 22:09 ` David Kastrup
@ 2007-08-12 10:33 ` Martin Koegler
  2007-08-12 13:49 ` Jon Smirl
  1 sibling, 1 reply; 30+ messages in thread

From: Martin Koegler @ 2007-08-12 10:33 UTC (permalink / raw)
To: Jon Smirl; +Cc: Git Mailing List

On Sat, Aug 11, 2007 at 05:12:24PM -0400, Jon Smirl wrote:
> If anyone is bored and looking for something to do, making the delta
> code in git repack multithreaded would help. Yesterday I did a big
> repack that took 20 minutes and it only used one of my four cores. It
> was compute bound the entire time.

First, how much time is used by the write and how much by the deltify
phase?

If the writing phase uses too much time and you have enough free
memory, you can try to raise the config variable pack.deltacachelimit
(default 1000). It saves an additional delta operation for every
object whose delta is smaller than pack.deltacachelimit, by caching
the delta.

Have you considered the impact on memory usage if there are large
blobs in the repository?

While repacking, git keeps $window_size (default: 10) objects unpacked
in memory. For all (except one), it additionally stores the delta
index, which has about the same size as the object.

So the worst case memory usage is "sizeof(biggest object)*(2*$window_size - 1)".
If you have blobs >=100 MB, you need some GB of memory.

Partitioning the problem is not trivial:

* To avoid worse packing results, we must first sort all objects by
  type, path, size. Then we can split the list (one part per task),
  and each part can be deltified individually.

  The problems are:

  - We need more memory, as each task keeps its own window of
    $window_size objects (+ delta indexes) in memory.

  - The list must be split into parts which require the same amount of
    time. This is difficult, as it depends on the size of the objects
    as well as how they are stored (delta chain length).

* On the other hand, we could run all try_delta operations for one
  object in parallel. This way, we would not need much more memory,
  but would require more synchronisation (and more complex code).

mfg Martin Kögler

^ permalink raw reply [flat|nested] 30+ messages in thread
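For reference, the cache limit Martin mentions is an ordinary config knob, so
it is easy to experiment with; a sketch (the value below is arbitrary and only
shows the syntax — size it to your memory budget):

  git config pack.deltacachelimit 8192
  git repack -a -d -f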
* Re: performance on repack 2007-08-12 10:33 ` Martin Koegler @ 2007-08-12 13:49 ` Jon Smirl 2007-08-14 3:12 ` Shawn O. Pearce 0 siblings, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-12 13:49 UTC (permalink / raw) To: Martin Koegler; +Cc: Git Mailing List On 8/12/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > On Sat, Aug 11, 2007 at 05:12:24PM -0400, Jon Smirl wrote: > > If anyone is bored and looking for something to do, making the delta > > code in git repack multithreaded would help. Yesterday I did a big > > repack that took 20 minutes and it only used one of my four cores. It > > was compute bound the entire time. > > First, how much time is used by the write and how much by the deltify > phase? 95% in deltify > > If the writing phase uses too much time and you have enough free > memory, you can try to raise the config variable pack.deltacachelimit > (default 1000). It will save an additional delta operation for all > object, whose delta is smaller than pack.deltacachelimit by caching > the delta. I have 4GB RAM and fast RAID 10 disk, writing happens very quickly. Pretty much everything is in RAM. > Have you considered the impact on memory usage, if there are large > blobs in the repository? The process size maxed at 650MB. I'm in 64b mode so there is no virtual memory limit. On 32b there's windowing code for accessing the packfile since we can run out of address space, does this code get turned off for 64b? > > While repacking, git keeps $window_size (default: 10) objects unpacked > in memory. For all (except one), it additionally stores the delta > index, which has about the same size as the object. > > So the worst case memory usage is "sizeof(biggest object)*(2*$window_size - 1)". > If you have blobs >=100 MB, you need some GB of memory. > > Partitioning the problem is not trivial: > > * To get not worse packing resultes, we must first sort all objects by > type, path, size. Then we can split split the list (for each task > one part), which we can deltify individually. > > The problems are: > > - We need more memory, as each tasks keeps its own window of > $window_size objects (+ delta indexes) in memory. > > - The list must be split in parts, which require the same amount of > time. This is difficult, as it depends on the size of the objects as > well as how they are stored (delta chain length). > > * On the other hand, we could run all try_delta operations for one object > parallel. This way, we would need not very much more memory, but > require more synchronization (and more complex code). This solution was my first thought too. Use the main thread to get everything needed for the object into RAM, then multi-thread the compute bound, in-memory delta search operation. Shared CPU caches might make this very fast. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack
  2007-08-12 13:49 ` Jon Smirl
@ 2007-08-14 3:12 ` Shawn O. Pearce
  2007-08-14 4:10 ` Jon Smirl
  ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread

From: Shawn O. Pearce @ 2007-08-14 3:12 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Koegler, Git Mailing List

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/12/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote:
> > Have you considered the impact on memory usage, if there are large
> > blobs in the repository?
>
> The process size maxed at 650MB. I'm in 64b mode so there is no
> virtual memory limit.
>
> On 32b there's windowing code for accessing the packfile since we can
> run out of address space, does this code get turned off for 64b?

The windowing code you are talking about defaults as follows:

  Parameter                    32b    64b
  -----------------------------------------
  core.packedGitWindowSize     32M     1G
  core.packedGitLimit         256M     8G

So I doubt you are having issues with the windowing code on a 64b
system, unless your repository is just *huge*. I did not think that
anyone had a Git repository that exceeded 8G, though the window
size of 1G might be a tad too small if there are many packfiles
and they are each larger than 1G.

> > * On the other hand, we could run all try_delta operations for one object
> > parallel. This way, we would need not very much more memory, but
> > require more synchronization (and more complex code).
>
> This solution was my first thought too. Use the main thread to get
> everything needed for the object into RAM, then multi-thread the
> compute bound, in-memory delta search operation. Shared CPU caches
> might make this very fast.

I have been thinking about doing this, especially now that the
default window size is much larger. I think the default is up as
high as 50, which means we'd keep that shiny new UltraSPARC T2 busy.
Not that I have one... so anyone from Sun is welcome to send me
one if they want. ;-)

I'm not sure it's that complex to run all try_delta calls of the
current window in parallel. Might be a simple enough change that
it's actually worth the extra complexity, especially with these
multi-core systems being so readily available. Repacking is the
most CPU intensive operation Git performs, and the one that is also
the easiest to make parallel.

Maybe someone else will beat me to it, but if not I might give such
a patch a shot in a few weeks.

-- 
Shawn.

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-14 3:12 ` Shawn O. Pearce @ 2007-08-14 4:10 ` Jon Smirl 2007-08-14 5:13 ` Shawn O. Pearce 2007-08-14 14:52 ` Nicolas Pitre 2007-08-14 21:41 ` Nicolas Pitre 2 siblings, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-14 4:10 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Martin Koegler, Git Mailing List On 8/13/07, Shawn O. Pearce <spearce@spearce.org> wrote: > > On 32b there's windowing code for accessing the packfile since we can > > run out of address space, does this code get turned off for 64b? > > The windowing code you are talking about defaults as follows: > > Parameter 32b 64b > ----------------------------------------- > core.packedGitWindowSize 32M 1G > core.packedGitLimit 256M 8G > > So I doubt you are having issues with the windowing code on a 64b > system, unless your repository is just *huge*. I did not think that > anyone had a Git repository that exceeded 8G, though the window > size of 1G might be a tad too small if there are many packfiles > and they are each larger than 1G. Why use windows on 64b? Default core.packedGitWindowSize equal to core.packedGitLimit I haven't measured it but I suspect the OS calls for moving the windows are are quite slow on a relative basis since they have to rewrite a bunch of page tables. Why is the window so small on 32b? I thought we were up to about a 1GB packfile before running out of address space with Mozilla. Shouldn't the window simply be set as large as possible on 32b, this size being a function of the available address space, not the amount of physical memory? > > > * On the other hand, we could run all try_delta operations for one object > > > parallel. This way, we would need not very much more memory, but > > > require more synchronization (and more complex code). > > > > This solution was my first thought too. Use the main thread to get > > everything needed for the object into RAM, then multi-thread the > > compute bound, in-memory delta search operation. Shared CPU caches > > might make this very fast. > > I have been thinking about doing this, especially now that the > default window size is much larger. I think the default is up as > high as 50, which means we'd keep that shiny new UltraSPARC T2 busy. > Not that I have one... so anyone from Sun is welcome to send me > one if they want. ;-) > > I'm not sure its that complex to run all try_delta calls of the > current window in parallel. Might be a simple enough change that > its actually worth the extra complexity, especially with these > multi-core systems being so readily available. Repacking is the > most CPU intensive operation Git performs, and the one that is also > the easiest to make parallel. > > Maybe someone else will beat me to it, but if not I might give such > a patch a shot in a few weeks. I'll test it. But my assignment this week is to figure out how to program the BestComm DMA engine in a PowerPC chip. It's oh so much fun trying to figure out how to program these "engines" that ASIC designers love to build and never fully document. > > -- > Shawn. > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
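Both window parameters are plain config variables, so the "just set them big
on 64b" experiment Jon proposes needs no code change; a sketch (the sizes are
arbitrary examples, not tuned recommendations):

  git config core.packedGitWindowSize 4g
  git config core.packedGitLimit 24g
  time git repack -a -d -f    # compare wall-clock time against the defaults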
* Re: performance on repack 2007-08-14 4:10 ` Jon Smirl @ 2007-08-14 5:13 ` Shawn O. Pearce 2007-08-14 5:57 ` Jon Smirl 0 siblings, 1 reply; 30+ messages in thread From: Shawn O. Pearce @ 2007-08-14 5:13 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Koegler, Git Mailing List Jon Smirl <jonsmirl@gmail.com> wrote: > On 8/13/07, Shawn O. Pearce <spearce@spearce.org> wrote: > > > On 32b there's windowing code for accessing the packfile since we can > > > run out of address space, does this code get turned off for 64b? > > > > The windowing code you are talking about defaults as follows: > > > > Parameter 32b 64b > > ----------------------------------------- > > core.packedGitWindowSize 32M 1G > > core.packedGitLimit 256M 8G > > > > So I doubt you are having issues with the windowing code on a 64b > > system, unless your repository is just *huge*. I did not think that > > anyone had a Git repository that exceeded 8G, though the window > > size of 1G might be a tad too small if there are many packfiles > > and they are each larger than 1G. > > Why use windows on 64b? Default core.packedGitWindowSize equal to > core.packedGitLimit That's *not* a good idea when you have more than one packfile. The limit is for the sum of all packfiles. The settings above allow up to 8 packfiles to be opened and mapped at once on 64b systems, with each packfile being up to 1G in size before we start shifting the window(s) around. Doing as you suggest would reduce the number of open packfiles to 1, which would severely hurt performance when there is more than one packfile and Git keeps bouncing around between them to satisfy the current process' demands. One could probably argue that the defaults for 64b are too small; perhaps they should be closer to 4G/24G seeing as how the 64b address space is so huge that we're unlikely to run into issues with being able to use >24G of virtual address at once. > I haven't measured it but I suspect the OS calls for moving the > windows are are quite slow on a relative basis since they have to > rewrite a bunch of page tables. Maybe. Add a call to pack_report() at the end of the program you are interested in and run it. We keep track of how often we move windows around; you may find that we don't move them often enough (or at all) to cause problems here. Or just run it under strace and watch mmap() activity, filtering out the uninteresting bits. > Why is the window so small on 32b? I > thought we were up to about a 1GB packfile before running out of > address space with Mozilla. Shouldn't the window simply be set as > large as possible on 32b, this size being a function of the available > address space, not the amount of physical memory? Because programs need to malloc() stuff to work. And we need stack space. And we need to let the runtime linker mmap() in the shared libraries we are linked to. All in all we do get tight in some 32b cases. The above defaults for 32b were chosen based on the Linux kernel repository (its under 256M) and based on some (crude) performance testing on Linux (which seemed to say the 32M packedGitWindowSize wasn't really hurting us). They were basically set to give us maximum address space for working heap and yet not have a negative impact on one of our (at the time) largest user groups. In particular repack (aka git-pack-objects) is a real memory pig, especially now with its various caches. The more address space we can let it use in a 32b case the better off we probably are. 
If someone can show that increasing these 32b defaults is the right thing to do even in very large repositories, *especially* with something really brutal like `git-blame` on a very busy file or `git repack -f -a` then please submit a patch to boost them. ;-) -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-14 5:13 ` Shawn O. Pearce @ 2007-08-14 5:57 ` Jon Smirl 0 siblings, 0 replies; 30+ messages in thread From: Jon Smirl @ 2007-08-14 5:57 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Martin Koegler, Git Mailing List On 8/14/07, Shawn O. Pearce <spearce@spearce.org> wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > One could probably argue that the defaults for 64b are too small; > perhaps they should be closer to 4G/24G seeing as how the 64b address > space is so huge that we're unlikely to run into issues with being > able to use >24G of virtual address at once. We've got 24,000GB of address space to play with. Why not set them 64G/512GB and forget about them for a decade? > > I haven't measured it but I suspect the OS calls for moving the > > windows are are quite slow on a relative basis since they have to > > rewrite a bunch of page tables. > > Maybe. Add a call to pack_report() at the end of the program you > are interested in and run it. We keep track of how often we move > windows around; you may find that we don't move them often enough > (or at all) to cause problems here. Or just run it under strace > and watch mmap() activity, filtering out the uninteresting bits. > > > Why is the window so small on 32b? I > > thought we were up to about a 1GB packfile before running out of > > address space with Mozilla. Shouldn't the window simply be set as > > large as possible on 32b, this size being a function of the available > > address space, not the amount of physical memory? > > Because programs need to malloc() stuff to work. And we need > stack space. And we need to let the runtime linker mmap() in the > shared libraries we are linked to. All in all we do get tight in > some 32b cases. The above defaults for 32b were chosen based on > the Linux kernel repository (its under 256M) and based on some > (crude) performance testing on Linux (which seemed to say the > 32M packedGitWindowSize wasn't really hurting us). They were > basically set to give us maximum address space for working heap > and yet not have a negative impact on one of our (at the time) > largest user groups. Kernel is barely under 256M. Do we need two limits, could we just start with the big one and assign first packfile fully to it, then start dividing it up dynamically as more pack files are opened? core.packedGitLimit core.packedGitWindowSize The real question is, what's the access pattern into the windows? If it is random then the windows get moved around continuously. If is it more or less sequential the windows will hardly get moved at all. Of course multi-threading the diff computation is going to do far more for me than playing with the window size. > In particular repack (aka git-pack-objects) is a real memory pig, > especially now with its various caches. The more address space we > can let it use in a 32b case the better off we probably are. > > If someone can show that increasing these 32b defaults is the > right thing to do even in very large repositories, *especially* > with something really brutal like `git-blame` on a very busy file or > `git repack -f -a` then please submit a patch to boost them. ;-) > > -- > Shawn. > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
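One low-effort way to answer Jon's access-pattern question is to watch the
mapping syscalls during a repack instead of instrumenting the code with
pack_report(); a sketch using strace (syscall names vary slightly between
32b and 64b platforms):

  strace -f -e trace=mmap,munmap -o repack.trace git repack -a -d -f
  grep -c munmap repack.trace   # a high count suggests windows are recycled often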
* Re: performance on repack 2007-08-14 3:12 ` Shawn O. Pearce 2007-08-14 4:10 ` Jon Smirl @ 2007-08-14 14:52 ` Nicolas Pitre 2007-08-14 21:41 ` Nicolas Pitre 2 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-14 14:52 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > > This solution was my first thought too. Use the main thread to get > > everything needed for the object into RAM, then multi-thread the > > compute bound, in-memory delta search operation. Shared CPU caches > > might make this very fast. > > I have been thinking about doing this, especially now that the > default window size is much larger. I think the default is up as > high as 50, which means we'd keep that shiny new UltraSPARC T2 busy. > Not that I have one... so anyone from Sun is welcome to send me > one if they want. ;-) Note that the default of 50 applies to the maximum delta depth, not the delta search window. And the delta depth limit is costless on the packing side. I, too, wouldn't mind the UltraSPARC T2 though. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-14 3:12 ` Shawn O. Pearce 2007-08-14 4:10 ` Jon Smirl 2007-08-14 14:52 ` Nicolas Pitre @ 2007-08-14 21:41 ` Nicolas Pitre 2007-08-15 1:20 ` Jon Smirl 2007-08-15 5:32 ` Shawn O. Pearce 2 siblings, 2 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-14 21:41 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > I'm not sure its that complex to run all try_delta calls of the > current window in parallel. Might be a simple enough change that > its actually worth the extra complexity, especially with these > multi-core systems being so readily available. Repacking is the > most CPU intensive operation Git performs, and the one that is also > the easiest to make parallel. > > Maybe someone else will beat me to it, but if not I might give such > a patch a shot in a few weeks. Well, here's my quick attempt at it. Unfortunately, performance isn't as good as I'd expected, especially with relatively small blobs like those found in the linux kernel repo. It looks like the overhead of thread creation/joining might be significant compared to the actual delta computation. I have a P4 with HT which might behave differently from a real SMP machine, or whatever, but the CPU usage never exceeded 110% according to top (sometimes it even dropped below 95%). Actually, a git-repack gets much slower due to 2m27s of system time compared to 0m03s without threads. And this is with NPTL. With a repo containing big blobs it is different though. CPU usage firmly gets to 200% hence all cores are saturated with work. So clearly the naive approach of spawning a thread for each combination in the delta window doesn't work. Remains the approach of calling find_deltas() n times with 1/n of the delta_list, one call per thread, for the bulk of the delta search work. This might even be much simpler than my current patch is. However this approach will require n times the memory for the delta window data. Thread creation overhead will occur only once. Also... read_sha1_file() is currently not thread safe, and threaded delta searching would requires that its usage be serialized with a global mutex (not done in this patch which is a bug), or ideally be made thread aware. Nicolas --- diff --git a/Makefile b/Makefile index 4eb4637..c3c6e68 100644 --- a/Makefile +++ b/Makefile @@ -122,6 +122,9 @@ all:: # If not set it defaults to the bare 'wish'. If it is set to the empty # string then NO_TCLTK will be forced (this is used by configure script). # +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit +# parallel delta searching when packing objects. 
+# GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR COMPAT_OBJS += compat/hstrerror.o endif +ifdef THREADED_DELTA_SEARCH + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH + EXTLIBS += -lpthread +endif + ifeq ($(TCLTK_PATH),) NO_TCLTK=NoThanks endif diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 5e9d1fd..142b6dd 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -15,6 +15,10 @@ #include "list-objects.h" #include "progress.h" +#ifdef THREADED_DELTA_SEARCH +#include <pthread.h> +#endif + static const char pack_usage[] = "\ git-pack-objects [{ -q | --progress | --all-progress }] \n\ [--max-pack-size=N] [--local] [--incremental] \n\ @@ -1273,6 +1277,12 @@ struct unpacked { void *data; struct delta_index *index; unsigned depth; +#ifdef THREADED_DELTA_SEARCH + pthread_t thread; + pthread_mutex_t mutex; + struct unpacked *trg; + unsigned trg_max_depth; +#endif }; static int delta_cacheable(struct unpacked *trg, struct unpacked *src, @@ -1292,6 +1302,18 @@ static int delta_cacheable(struct unpacked *trg, struct unpacked *src, return 0; } +#ifdef THREADED_DELTA_SEARCH +#define delta_lock_init(e) pthread_mutex_init(&(e)->mutex, NULL) +#define delta_lock_destroy(e) pthread_mutex_destroy(&(e)->mutex) +#define delta_lock(e) pthread_mutex_lock(&(e)->mutex) +#define delta_unlock(e) pthread_mutex_unlock(&(e)->mutex) +#else +#define delta_lock_init(e) 0 +#define delta_lock_destroy(e) 0 +#define delta_lock(e) 0 +#define delta_unlock(e) 0 +#endif + /* * We search for deltas _backwards_ in a list sorted by type and * by size, so that we see progressively smaller and smaller files. @@ -1335,6 +1357,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; /* Now some size filtering heuristics. */ + delta_lock(trg); trg_size = trg_entry->size; if (!trg_entry->delta) { max_size = trg_size/2 - 20; @@ -1343,6 +1366,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, max_size = trg_entry->delta_size; ref_depth = trg->depth; } + delta_unlock(trg); max_size = max_size * (max_depth - src->depth) / (max_depth - ref_depth + 1); if (max_size == 0) @@ -1355,6 +1379,8 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; /* Load data if not already done */ + + delta_lock(trg); if (!trg->data) { trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); if (sz != trg_size) @@ -1362,6 +1388,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, sha1_to_hex(trg_entry->idx.sha1), sz, trg_size); window_memory_usage += sz; } + delta_unlock(trg); + + delta_lock(src); if (!src->data) { src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); if (sz != src_size) @@ -1375,20 +1404,30 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, static int warned = 0; if (!warned++) warning("suboptimal pack - out of memory"); + delta_unlock(src); return 0; } window_memory_usage += sizeof_delta_index(src->index); } + delta_unlock(src); delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size); if (!delta_buf) return 0; + delta_lock(trg); if (trg_entry->delta_data) { - /* Prefer only shallower same-sized deltas. */ - if (delta_size == trg_entry->delta_size && - src->depth + 1 >= trg->depth) { +#ifdef THREADED_DELTA_SEARCH + /* A better match might have been found in the mean time */ + max_size = trg_entry->delta_size * (max_depth - src->depth) / + (max_depth - trg->depth + 1); +#endif + /* ... 
also prefer only shallower same-sized deltas. */ + if (delta_size > max_size || + (delta_size == trg_entry->delta_size && + src->depth + 1 >= trg->depth)) { free(delta_buf); + delta_unlock(trg); return 0; } delta_cache_size -= trg_entry->delta_size; @@ -1404,9 +1443,21 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, delta_cache_size += trg_entry->delta_size; } else free(delta_buf); + delta_unlock(trg); return 1; } +#ifdef THREADED_DELTA_SEARCH + +static void *threaded_try_delta(void *arg) +{ + struct unpacked *src = arg; + int ret = try_delta(src->trg, src, src->trg_max_depth); + return (void *)ret; +} + +#endif + static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) { struct object_entry *child = me->delta_child; @@ -1439,19 +1490,20 @@ static void find_deltas(struct object_entry **list, int window, int depth) uint32_t i = nr_objects, idx = 0, count = 0, processed = 0; unsigned int array_size = window * sizeof(struct unpacked); struct unpacked *array; - int max_depth; + int j, max_depth; if (!nr_objects) return; array = xmalloc(array_size); memset(array, 0, array_size); + for (j = 0; j < window; j++) + delta_lock_init(&array[j]); if (progress) start_progress(&progress_state, "Deltifying %u objects...", "", nr_result); do { struct object_entry *entry = list[--i]; struct unpacked *n = array + idx; - int j; if (!entry->preferred_base) processed++; @@ -1494,6 +1546,38 @@ static void find_deltas(struct object_entry **list, int window, int depth) goto next; } +#ifdef THREADED_DELTA_SEARCH + /* Create all threads ... */ + j = window; + while (--j > 0) { + int ret; + uint32_t other_idx = idx + j; + struct unpacked *m; + if (other_idx >= window) + other_idx -= window; + m = array + other_idx; + if (!m->entry) + break; + m->trg = n; + m->trg_max_depth= max_depth; + ret = pthread_create(&m->thread, NULL, + threaded_try_delta, m); + if (ret) + die("unable to create thread: %s", strerror(ret)); + } + /* ... then join them. */ + while (++j < window) { + int ret; + uint32_t other_idx = idx + j; + struct unpacked *m; + if (other_idx >= window) + other_idx -= window; + m = array + other_idx; + ret = pthread_join(m->thread, NULL); + if (ret) + die("unable to join thread: %s", strerror(ret)); + } +#else j = window; while (--j > 0) { uint32_t other_idx = idx + j; @@ -1506,6 +1590,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) if (try_delta(n, m, max_depth) < 0) break; } +#endif /* if we made n a delta, and if n is already at max * depth, leaving it in the window is pointless. we @@ -1528,6 +1613,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) for (i = 0; i < window; ++i) { free_delta_index(array[i].index); free(array[i].data); + delta_lock_destroy(&array[i]); } free(array); } ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-14 21:41 ` Nicolas Pitre @ 2007-08-15 1:20 ` Jon Smirl 2007-08-15 1:59 ` Nicolas Pitre 2007-08-15 5:32 ` Shawn O. Pearce 1 sibling, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-15 1:20 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List On 8/14/07, Nicolas Pitre <nico@cam.org> wrote: > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > > > I'm not sure its that complex to run all try_delta calls of the > > current window in parallel. Might be a simple enough change that > > its actually worth the extra complexity, especially with these > > multi-core systems being so readily available. Repacking is the > > most CPU intensive operation Git performs, and the one that is also > > the easiest to make parallel. > > > > Maybe someone else will beat me to it, but if not I might give such > > a patch a shot in a few weeks. > > Well, here's my quick attempt at it. Unfortunately, performance isn't > as good as I'd expected, especially with relatively small blobs like > those found in the linux kernel repo. It looks like the overhead of > thread creation/joining might be significant compared to the actual > delta computation. I have a P4 with HT which might behave differently > from a real SMP machine, or whatever, but the CPU usage never exceeded > 110% according to top (sometimes it even dropped below 95%). Actually, a > git-repack gets much slower due to 2m27s of system time compared to > 0m03s without threads. And this is with NPTL. Thread creation/destruction overhead is way too high to make these threads for every delta. Another strategy is to create four worker threads once when the process is loaded. Then use synchronization primitives to feed the threads lumps of work. The threads persist for the life of the process. > > With a repo containing big blobs it is different though. CPU usage > firmly gets to 200% hence all cores are saturated with work. > > So clearly the naive approach of spawning a thread for each combination > in the delta window doesn't work. > > Remains the approach of calling find_deltas() n times with 1/n of the > delta_list, one call per thread, for the bulk of the delta search work. > This might even be much simpler than my current patch is. However this > approach will require n times the memory for the delta window data. > Thread creation overhead will occur only once. > > Also... read_sha1_file() is currently not thread safe, and threaded > delta searching would requires that its usage be serialized with a > global mutex (not done in this patch which is a bug), or ideally be made > thread aware. > > > Nicolas > > --- > > diff --git a/Makefile b/Makefile > index 4eb4637..c3c6e68 100644 > --- a/Makefile > +++ b/Makefile > @@ -122,6 +122,9 @@ all:: > # If not set it defaults to the bare 'wish'. If it is set to the empty > # string then NO_TCLTK will be forced (this is used by configure script). > # > +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit > +# parallel delta searching when packing objects. 
> +# > > GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE > @$(SHELL_PATH) ./GIT-VERSION-GEN > @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR > COMPAT_OBJS += compat/hstrerror.o > endif > > +ifdef THREADED_DELTA_SEARCH > + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH > + EXTLIBS += -lpthread > +endif > + > ifeq ($(TCLTK_PATH),) > NO_TCLTK=NoThanks > endif > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c > index 5e9d1fd..142b6dd 100644 > --- a/builtin-pack-objects.c > +++ b/builtin-pack-objects.c > @@ -15,6 +15,10 @@ > #include "list-objects.h" > #include "progress.h" > > +#ifdef THREADED_DELTA_SEARCH > +#include <pthread.h> > +#endif > + > static const char pack_usage[] = "\ > git-pack-objects [{ -q | --progress | --all-progress }] \n\ > [--max-pack-size=N] [--local] [--incremental] \n\ > @@ -1273,6 +1277,12 @@ struct unpacked { > void *data; > struct delta_index *index; > unsigned depth; > +#ifdef THREADED_DELTA_SEARCH > + pthread_t thread; > + pthread_mutex_t mutex; > + struct unpacked *trg; > + unsigned trg_max_depth; > +#endif > }; > > static int delta_cacheable(struct unpacked *trg, struct unpacked *src, > @@ -1292,6 +1302,18 @@ static int delta_cacheable(struct unpacked *trg, struct unpacked *src, > return 0; > } > > +#ifdef THREADED_DELTA_SEARCH > +#define delta_lock_init(e) pthread_mutex_init(&(e)->mutex, NULL) > +#define delta_lock_destroy(e) pthread_mutex_destroy(&(e)->mutex) > +#define delta_lock(e) pthread_mutex_lock(&(e)->mutex) > +#define delta_unlock(e) pthread_mutex_unlock(&(e)->mutex) > +#else > +#define delta_lock_init(e) 0 > +#define delta_lock_destroy(e) 0 > +#define delta_lock(e) 0 > +#define delta_unlock(e) 0 > +#endif > + > /* > * We search for deltas _backwards_ in a list sorted by type and > * by size, so that we see progressively smaller and smaller files. > @@ -1335,6 +1357,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > return 0; > > /* Now some size filtering heuristics. */ > + delta_lock(trg); > trg_size = trg_entry->size; > if (!trg_entry->delta) { > max_size = trg_size/2 - 20; > @@ -1343,6 +1366,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > max_size = trg_entry->delta_size; > ref_depth = trg->depth; > } > + delta_unlock(trg); > max_size = max_size * (max_depth - src->depth) / > (max_depth - ref_depth + 1); > if (max_size == 0) > @@ -1355,6 +1379,8 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > return 0; > > /* Load data if not already done */ > + > + delta_lock(trg); > if (!trg->data) { > trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); > if (sz != trg_size) > @@ -1362,6 +1388,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > sha1_to_hex(trg_entry->idx.sha1), sz, trg_size); > window_memory_usage += sz; > } > + delta_unlock(trg); > + > + delta_lock(src); > if (!src->data) { > src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); > if (sz != src_size) > @@ -1375,20 +1404,30 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > static int warned = 0; > if (!warned++) > warning("suboptimal pack - out of memory"); > + delta_unlock(src); > return 0; > } > window_memory_usage += sizeof_delta_index(src->index); > } > + delta_unlock(src); > > delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size); > if (!delta_buf) > return 0; > > + delta_lock(trg); > if (trg_entry->delta_data) { > - /* Prefer only shallower same-sized deltas. 
*/ > - if (delta_size == trg_entry->delta_size && > - src->depth + 1 >= trg->depth) { > +#ifdef THREADED_DELTA_SEARCH > + /* A better match might have been found in the mean time */ > + max_size = trg_entry->delta_size * (max_depth - src->depth) / > + (max_depth - trg->depth + 1); > +#endif > + /* ... also prefer only shallower same-sized deltas. */ > + if (delta_size > max_size || > + (delta_size == trg_entry->delta_size && > + src->depth + 1 >= trg->depth)) { > free(delta_buf); > + delta_unlock(trg); > return 0; > } > delta_cache_size -= trg_entry->delta_size; > @@ -1404,9 +1443,21 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > delta_cache_size += trg_entry->delta_size; > } else > free(delta_buf); > + delta_unlock(trg); > return 1; > } > > +#ifdef THREADED_DELTA_SEARCH > + > +static void *threaded_try_delta(void *arg) > +{ > + struct unpacked *src = arg; > + int ret = try_delta(src->trg, src, src->trg_max_depth); > + return (void *)ret; > +} > + > +#endif > + > static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) > { > struct object_entry *child = me->delta_child; > @@ -1439,19 +1490,20 @@ static void find_deltas(struct object_entry **list, int window, int depth) > uint32_t i = nr_objects, idx = 0, count = 0, processed = 0; > unsigned int array_size = window * sizeof(struct unpacked); > struct unpacked *array; > - int max_depth; > + int j, max_depth; > > if (!nr_objects) > return; > array = xmalloc(array_size); > memset(array, 0, array_size); > + for (j = 0; j < window; j++) > + delta_lock_init(&array[j]); > if (progress) > start_progress(&progress_state, "Deltifying %u objects...", "", nr_result); > > do { > struct object_entry *entry = list[--i]; > struct unpacked *n = array + idx; > - int j; > > if (!entry->preferred_base) > processed++; > @@ -1494,6 +1546,38 @@ static void find_deltas(struct object_entry **list, int window, int depth) > goto next; > } > > +#ifdef THREADED_DELTA_SEARCH > + /* Create all threads ... */ > + j = window; > + while (--j > 0) { > + int ret; > + uint32_t other_idx = idx + j; > + struct unpacked *m; > + if (other_idx >= window) > + other_idx -= window; > + m = array + other_idx; > + if (!m->entry) > + break; > + m->trg = n; > + m->trg_max_depth= max_depth; > + ret = pthread_create(&m->thread, NULL, > + threaded_try_delta, m); > + if (ret) > + die("unable to create thread: %s", strerror(ret)); > + } > + /* ... then join them. */ > + while (++j < window) { > + int ret; > + uint32_t other_idx = idx + j; > + struct unpacked *m; > + if (other_idx >= window) > + other_idx -= window; > + m = array + other_idx; > + ret = pthread_join(m->thread, NULL); > + if (ret) > + die("unable to join thread: %s", strerror(ret)); > + } > +#else > j = window; > while (--j > 0) { > uint32_t other_idx = idx + j; > @@ -1506,6 +1590,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) > if (try_delta(n, m, max_depth) < 0) > break; > } > +#endif > > /* if we made n a delta, and if n is already at max > * depth, leaving it in the window is pointless. we > @@ -1528,6 +1613,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) > for (i = 0; i < window; ++i) { > free_delta_index(array[i].index); > free(array[i].data); > + delta_lock_destroy(&array[i]); > } > free(array); > } > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 1:20 ` Jon Smirl @ 2007-08-15 1:59 ` Nicolas Pitre 0 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-15 1:59 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List On Tue, 14 Aug 2007, Jon Smirl wrote: > On 8/14/07, Nicolas Pitre <nico@cam.org> wrote: > > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > > > > > I'm not sure its that complex to run all try_delta calls of the > > > current window in parallel. Might be a simple enough change that > > > its actually worth the extra complexity, especially with these > > > multi-core systems being so readily available. Repacking is the > > > most CPU intensive operation Git performs, and the one that is also > > > the easiest to make parallel. > > > > > > Maybe someone else will beat me to it, but if not I might give such > > > a patch a shot in a few weeks. > > > > Well, here's my quick attempt at it. Unfortunately, performance isn't > > as good as I'd expected, especially with relatively small blobs like > > those found in the linux kernel repo. It looks like the overhead of > > thread creation/joining might be significant compared to the actual > > delta computation. I have a P4 with HT which might behave differently > > from a real SMP machine, or whatever, but the CPU usage never exceeded > > 110% according to top (sometimes it even dropped below 95%). Actually, a > > git-repack gets much slower due to 2m27s of system time compared to > > 0m03s without threads. And this is with NPTL. > > Thread creation/destruction overhead is way too high to make these > threads for every delta. > > Another strategy is to create four worker threads once when the > process is loaded. Then use synchronization primitives to feed the > threads lumps of work. The threads persist for the life of the > process. Still, those synchronization primitives would have to be activated for every delta which might also add some overhead. But there is another issue to consider: delta searching is limited by previous results for the same delta. If first attempt for a delta produces a 10x reduction, then the next delta computation has to produce less than 1/10 the original object size or it is aborted early. And so on for subsequent attempts. When performing delta computations in parallel for the same target then early delta computation abort cannot occur since no result is initially available to further limit delta processing. Segmenting the list of objects to deltify into sub-lists for individual threads solves both issues. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
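The segmented approach Nicolas describes — create the threads once and hand
each one its own slice of the sorted object list to deltify — can be sketched
in isolation roughly as follows. Everything here is a stand-in: obj_t,
find_deltas_slice() and the counts are not git's actual structures, and the
unused window/depth fields only hint at the per-slice window cost he mentions.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int size; } obj_t;   /* stand-in for struct object_entry */

struct slice { obj_t *list; int n; int window; int depth; };

/* stand-in for find_deltas(): each thread searches for deltas only inside its slice */
static void *find_deltas_slice(void *arg)
{
	struct slice *s = arg;
	int i;
	for (i = 0; i < s->n; i++)
		s->list[i].size ^= i;   /* placeholder for the real delta search */
	return NULL;
}

int main(void)
{
	enum { NR_OBJECTS = 1000, NR_THREADS = 4 };
	obj_t *list = calloc(NR_OBJECTS, sizeof(*list));
	pthread_t tid[NR_THREADS];
	struct slice part[NR_THREADS];
	int i, per = NR_OBJECTS / NR_THREADS, rest = NR_OBJECTS % NR_THREADS, off = 0;

	for (i = 0; i < NR_THREADS; i++) {
		part[i].list = list + off;
		part[i].n = per + (i < rest ? 1 : 0);   /* spread the remainder */
		part[i].window = 10;                    /* each slice pays for its own window */
		part[i].depth = 50;
		off += part[i].n;
		if (pthread_create(&tid[i], NULL, find_deltas_slice, &part[i]))
			return 1;
	}
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);   /* thread start-up cost is paid once, not per delta */

	printf("deltified %d objects in %d slices\n", off, NR_THREADS);
	free(list);
	return 0;
}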
* Re: performance on repack 2007-08-14 21:41 ` Nicolas Pitre 2007-08-15 1:20 ` Jon Smirl @ 2007-08-15 5:32 ` Shawn O. Pearce 2007-08-15 15:08 ` Jon Smirl 2007-08-15 20:49 ` Nicolas Pitre 1 sibling, 2 replies; 30+ messages in thread From: Shawn O. Pearce @ 2007-08-15 5:32 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jon Smirl, Martin Koegler, Git Mailing List Nicolas Pitre <nico@cam.org> wrote: > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > > > I'm not sure its that complex to run all try_delta calls of the > > current window in parallel. > > Well, here's my quick attempt at it. Unfortunately, performance isn't > as good as I'd expected, especially with relatively small blobs like > those found in the linux kernel repo. It looks like the overhead of > thread creation/joining might be significant compared to the actual > delta computation. I have a P4 with HT which might behave differently > from a real SMP machine, or whatever, but the CPU usage never exceeded > 110% according to top (sometimes it even dropped below 95%). Actually, a > git-repack gets much slower due to 2m27s of system time compared to > 0m03s without threads. And this is with NPTL. Yea, I'm not surprised. This is quick and dirty but is really the wrong approach. The kernel is spending lots of time setting up the thread and its stack, then scheduling it onto a CPU, only to likely find it finish before it even yields the CPU or is interrupted off. So I'd expect to get huge system times here. Even Linux with its pretty amazing NPTL isn't up to the task. No threading package really is. > With a repo containing big blobs it is different though. CPU usage > firmly gets to 200% hence all cores are saturated with work. Right, that's what we want to see. ;-) What makes this bloody complex (yes, I just did disagree with myself) is you need to use a pool of threads, one per core, keep them running for the life of the delta phase (so we amortize the thread start/stop time) and also give them large enough bursts of data that they can almost always use their full time slice before being interrupted by the kernel. > Remains the approach of calling find_deltas() n times with 1/n of the > delta_list, one call per thread, for the bulk of the delta search work. > This might even be much simpler than my current patch is. However this > approach will require n times the memory for the delta window data. > Thread creation overhead will occur only once. Yea, that I agree with. The other thing is the main thread may need to push a couple of windows worth of work into the threads, so that if this window's 1/n unit goes really fast on that thread it doesn't stall out waiting for the main thread to get it more data. > Also... read_sha1_file() is currently not thread safe, and threaded > delta searching would requires that its usage be serialized with a > global mutex (not done in this patch which is a bug), or ideally be made > thread aware. Yea, that's a problem. From the Git library perspective that Luiz Capitulino has been working on for GSOC this summer being able to compile/link Git with -lpthreads and have it actually be thread safe would be useful. Unfortunately its waaaay more than just making read_sha1_file() threadsafe. :-/ read_sha1_file() is actually a reasonably critical part of the major functions of Git. Slowing that down by making it go through pthread_mutex_{lock,unlock} is going to hurt. I tried something like the following, and its a bit slower, and really ain't a great multi-thread aware implementation. 
# unmodified v1.5.3-rc4-91-g9fa3465 from Junio /usr/bin/time ../rl-master/git-rev-list HEAD . >/dev/null 1.67 real 1.42 user 0.20 sys 1.65 real 1.36 user 0.17 sys 1.67 real 1.35 user 0.16 sys 1.70 real 1.35 user 0.16 sys 1.64 real 1.35 user 0.16 sys # v1.5.3-rc4-91-g9fa3465 + patch below /usr/bin/time ../rl-pthread/git-rev-list HEAD . >/dev/null 2.86 real 1.48 user 0.29 sys 1.68 real 1.37 user 0.17 sys 1.59 real 1.37 user 0.16 sys 1.66 real 1.37 user 0.17 sys 1.68 real 1.37 user 0.17 sys That's on Mac OS X. I guess NPTL would probably be faster at this, but not so much faster as to make it not be an increase. -->8-- diff --git a/Makefile b/Makefile index 4eb4637..f31811b 100644 --- a/Makefile +++ b/Makefile @@ -710,7 +710,7 @@ SHELL_PATH_SQ = $(subst ','\'',$(SHELL_PATH)) PERL_PATH_SQ = $(subst ','\'',$(PERL_PATH)) TCLTK_PATH_SQ = $(subst ','\'',$(TCLTK_PATH)) -LIBS = $(GITLIBS) $(EXTLIBS) +LIBS = $(GITLIBS) $(EXTLIBS) -lpthread BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \ -DETC_GITCONFIG='"$(ETC_GITCONFIG_SQ)"' $(COMPAT_CFLAGS) diff --git a/sha1_file.c b/sha1_file.c index aca741b..c239789 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -6,6 +6,7 @@ * This handles basic git sha1 object files - packing, unpacking, * creation etc. */ +#include <pthread.h> #include "cache.h" #include "delta.h" #include "pack.h" @@ -1862,10 +1863,12 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type, void *read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size) { + static pthread_mutex_t locky = PTHREAD_MUTEX_INITIALIZER; unsigned long mapsize; void *map, *buf; struct cached_object *co; + pthread_mutex_lock(&locky); co = find_cached_object(sha1); if (co) { buf = xmalloc(co->size + 1); @@ -1873,20 +1876,26 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type, ((char*)buf)[co->size] = 0; *type = co->type; *size = co->size; + pthread_mutex_unlock(&locky); return buf; } buf = read_packed_sha1(sha1, type, size); - if (buf) + if (buf) { + pthread_mutex_unlock(&locky); return buf; + } map = map_sha1_file(sha1, &mapsize); if (map) { buf = unpack_sha1_file(map, mapsize, type, size, sha1); munmap(map, mapsize); + pthread_mutex_unlock(&locky); return buf; } reprepare_packed_git(); - return read_packed_sha1(sha1, type, size); + buf = read_packed_sha1(sha1, type, size); + pthread_mutex_unlock(&locky); + return buf; } void *read_object_with_reference(const unsigned char *sha1, -- Shawn. ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 5:32 ` Shawn O. Pearce @ 2007-08-15 15:08 ` Jon Smirl 2007-08-15 17:11 ` Martin Koegler 2007-08-15 21:05 ` Nicolas Pitre 2007-08-15 20:49 ` Nicolas Pitre 1 sibling, 2 replies; 30+ messages in thread From: Jon Smirl @ 2007-08-15 15:08 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Nicolas Pitre, Martin Koegler, Git Mailing List On 8/15/07, Shawn O. Pearce <spearce@spearce.org> wrote: > Nicolas Pitre <nico@cam.org> wrote: > > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > Even Linux with its pretty amazing NPTL isn't up to the task. > No threading package really is. You need something like fibers instead of threads for task that small. > Yea, that I agree with. The other thing is the main thread may need > to push a couple of windows worth of work into the threads, so that > if this window's 1/n unit goes really fast on that thread it doesn't > stall out waiting for the main thread to get it more data. > > > Also... read_sha1_file() is currently not thread safe, and threaded > > delta searching would requires that its usage be serialized with a > > global mutex (not done in this patch which is a bug), or ideally be made > > thread aware. You can avoid making all the low level calls thread safe by using the main thread to get everything into RAM before starting to search for the deltas. The worker threads would only deal with things completely in memory. You may need to ref count these in-memory objects if they are shared between worker threads. For simplicity the in-memory input objects should be read only by the threads. The worker threads create new structures to hand their results back to the main thread for writing to disk. A typical solution is to use a queue protected by locks. Main thread faults in all the needed objects to cache. Main thread builds a queue entry and increments reference count on all referenced objects. Main thread uses locks to add entry to queue, while queue is locked it removes any finished jobs. Main thread writes finished results to disks, decrements ref counts. Cache logic can then take over about when objects are actually deleted. Worker threads wait on the queue. When something is placed in the queue a waiting worker thread removes it, processes it, puts the results in RAM, and places the object back on the finished queue. Then waits for another object. It doesn't call into the main body of code. Initially I would just ignore very large objects while getting the basic code to work. After the basic code is working if a very large object is encountered when the main thread is faulting objects in, the main thread should just process this object on the spot using the existing low memory code. Hopefully the new adaptive read ahead code in the kernel will detect the pattern of the main thread faulting objects in and make good guesses about where to read ahead. This is key to not getting stuck waiting on IO to complete. If not the main thread could use ADVISE to give kernel read ahead clues about where it is going to read next. > Yea, that's a problem. From the Git library perspective that Luiz > Capitulino has been working on for GSOC this summer being able to > compile/link Git with -lpthreads and have it actually be thread > safe would be useful. Unfortunately its waaaay more than just > making read_sha1_file() threadsafe. :-/ > > read_sha1_file() is actually a reasonably critical part of the > major functions of Git. Slowing that down by making it go through > pthread_mutex_{lock,unlock} is going to hurt. 
I tried something > like the following, and its a bit slower, and really ain't a great > multi-thread aware implementation. > > # unmodified v1.5.3-rc4-91-g9fa3465 from Junio > /usr/bin/time ../rl-master/git-rev-list HEAD . >/dev/null > 1.67 real 1.42 user 0.20 sys > 1.65 real 1.36 user 0.17 sys > 1.67 real 1.35 user 0.16 sys > 1.70 real 1.35 user 0.16 sys > 1.64 real 1.35 user 0.16 sys > > # v1.5.3-rc4-91-g9fa3465 + patch below > /usr/bin/time ../rl-pthread/git-rev-list HEAD . >/dev/null > 2.86 real 1.48 user 0.29 sys > 1.68 real 1.37 user 0.17 sys > 1.59 real 1.37 user 0.16 sys > 1.66 real 1.37 user 0.17 sys > 1.68 real 1.37 user 0.17 sys > > That's on Mac OS X. I guess NPTL would probably be faster at this, > but not so much faster as to make it not be an increase. > > -->8-- > diff --git a/Makefile b/Makefile > index 4eb4637..f31811b 100644 > --- a/Makefile > +++ b/Makefile > @@ -710,7 +710,7 @@ SHELL_PATH_SQ = $(subst ','\'',$(SHELL_PATH)) > PERL_PATH_SQ = $(subst ','\'',$(PERL_PATH)) > TCLTK_PATH_SQ = $(subst ','\'',$(TCLTK_PATH)) > > -LIBS = $(GITLIBS) $(EXTLIBS) > +LIBS = $(GITLIBS) $(EXTLIBS) -lpthread > > BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \ > -DETC_GITCONFIG='"$(ETC_GITCONFIG_SQ)"' $(COMPAT_CFLAGS) > diff --git a/sha1_file.c b/sha1_file.c > index aca741b..c239789 100644 > --- a/sha1_file.c > +++ b/sha1_file.c > @@ -6,6 +6,7 @@ > * This handles basic git sha1 object files - packing, unpacking, > * creation etc. > */ > +#include <pthread.h> > #include "cache.h" > #include "delta.h" > #include "pack.h" > @@ -1862,10 +1863,12 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type, > void *read_sha1_file(const unsigned char *sha1, enum object_type *type, > unsigned long *size) > { > + static pthread_mutex_t locky = PTHREAD_MUTEX_INITIALIZER; > unsigned long mapsize; > void *map, *buf; > struct cached_object *co; > > + pthread_mutex_lock(&locky); > co = find_cached_object(sha1); > if (co) { > buf = xmalloc(co->size + 1); > @@ -1873,20 +1876,26 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type, > ((char*)buf)[co->size] = 0; > *type = co->type; > *size = co->size; > + pthread_mutex_unlock(&locky); > return buf; > } > > buf = read_packed_sha1(sha1, type, size); > - if (buf) > + if (buf) { > + pthread_mutex_unlock(&locky); > return buf; > + } > map = map_sha1_file(sha1, &mapsize); > if (map) { > buf = unpack_sha1_file(map, mapsize, type, size, sha1); > munmap(map, mapsize); > + pthread_mutex_unlock(&locky); > return buf; > } > reprepare_packed_git(); > - return read_packed_sha1(sha1, type, size); > + buf = read_packed_sha1(sha1, type, size); > + pthread_mutex_unlock(&locky); > + return buf; > } > > void *read_object_with_reference(const unsigned char *sha1, > -- > Shawn. > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
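The persistent worker-pool scheme Jon sketches in prose above — workers
created once, a locked queue fed by the main thread, jobs that are already
fully in memory — can be outlined with a mutex and condition variable roughly
as follows. job_t and deltify() are invented placeholders, not git code; the
jobs are plain integers standing in for objects the main thread has already
faulted into RAM.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_WORKERS 4
#define NR_JOBS    64

typedef struct job { struct job *next; int id; int result; } job_t;

static job_t *head;                 /* simple work list */
static int done;                    /* set when no more jobs will arrive */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void deltify(job_t *j)       /* placeholder for the in-memory delta search */
{
	j->result = j->id * j->id;
}

static void *worker(void *arg)
{
	job_t *j;
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		while (!head && !done)
			pthread_cond_wait(&cond, &lock);
		if (!head) {            /* done was set and the queue is drained */
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		j = head;
		head = j->next;
		pthread_mutex_unlock(&lock);
		deltify(j);             /* the CPU-bound work runs without the lock held */
		free(j);
	}
}

int main(void)
{
	pthread_t tid[NR_WORKERS];
	job_t *j;
	int i;

	for (i = 0; i < NR_WORKERS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);

	for (i = 0; i < NR_JOBS; i++) {   /* main thread feeds fully in-memory jobs */
		j = malloc(sizeof(*j));
		j->id = i;
		pthread_mutex_lock(&lock);
		j->next = head;
		head = j;
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
	}

	pthread_mutex_lock(&lock);
	done = 1;
	pthread_cond_broadcast(&cond);    /* wake everyone so they can exit */
	pthread_mutex_unlock(&lock);

	for (i = 0; i < NR_WORKERS; i++)
		pthread_join(tid[i], NULL);
	puts("all jobs processed");
	return 0;
}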
* Re: performance on repack 2007-08-15 15:08 ` Jon Smirl @ 2007-08-15 17:11 ` Martin Koegler 2007-08-15 18:38 ` Jon Smirl 2007-08-15 21:05 ` Nicolas Pitre 1 sibling, 1 reply; 30+ messages in thread From: Martin Koegler @ 2007-08-15 17:11 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn O. Pearce, Nicolas Pitre, Git Mailing List On Wed, Aug 15, 2007 at 11:08:38AM -0400, Jon Smirl wrote: > On 8/15/07, Shawn O. Pearce <spearce@spearce.org> wrote: > > > Also... read_sha1_file() is currently not thread safe, and threaded > > > delta searching would requires that its usage be serialized with a > > > global mutex (not done in this patch which is a bug), or ideally be made > > > thread aware. > > You can avoid making all the low level calls thread safe by using the > main thread to get everything into RAM before starting to search for > the deltas. The worker threads would only deal with things completely > in memory. You may need to ref count these in-memory objects if they > are shared between worker threads. For simplicity the in-memory input > objects should be read only by the threads. The worker threads create > new structures to hand their results back to the main thread for > writing to disk. git-pack-objects knows the order, in which it will use the objects. A seperate thread could pre-read the next object and wait until the main thread starts processing it. After the read is complete, another thread could start computing the delta index. git-pack-objects currently reads an object (and computes the delta index), if it is really necessary. With the pre-read, unnecessary operations would happen. > Initially I would just ignore very large objects while getting the > basic code to work. After the basic code is working if a very large > object is encountered when the main thread is faulting objects in, the > main thread should just process this object on the spot using the > existing low memory code. I expect that the biggest gain will be for big objects, as they require more time to read+unpack the source objects and compute the delta index as well as the delta. > > @@ -1862,10 +1863,12 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type, > > void *read_sha1_file(const unsigned char *sha1, enum object_type *type, > > unsigned long *size) > > { > > + static pthread_mutex_t locky = PTHREAD_MUTEX_INITIALIZER; > > unsigned long mapsize; > > void *map, *buf; > > struct cached_object *co; > > > > + pthread_mutex_lock(&locky); > > co = find_cached_object(sha1); > > if (co) { > > buf = xmalloc(co->size + 1); > > @@ -1873,20 +1876,26 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type, > > ((char*)buf)[co->size] = 0; > > *type = co->type; > > *size = co->size; > > + pthread_mutex_unlock(&locky); > > return buf; > > } > > > > buf = read_packed_sha1(sha1, type, size); Couldn't we release the mutex at this point? Why do we need to protect from concurrent access, when we are reading a loose object? > > - if (buf) > > + if (buf) { > > + pthread_mutex_unlock(&locky); > > return buf; > > + } > > map = map_sha1_file(sha1, &mapsize); > > if (map) { > > buf = unpack_sha1_file(map, mapsize, type, size, sha1); > > munmap(map, mapsize); > > + pthread_mutex_unlock(&locky); > > return buf; > > } > > reprepare_packed_git(); > > - return read_packed_sha1(sha1, type, size); > > + buf = read_packed_sha1(sha1, type, size); > > + pthread_mutex_unlock(&locky); > > + return buf; > > } mfg Martin Kögler ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 17:11 ` Martin Koegler @ 2007-08-15 18:38 ` Jon Smirl 2007-08-15 19:00 ` Nicolas Pitre 0 siblings, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-15 18:38 UTC (permalink / raw) To: Martin Koegler; +Cc: Shawn O. Pearce, Nicolas Pitre, Git Mailing List On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > git-pack-objects knows the order, in which it will use the objects. A > seperate thread could pre-read the next object and wait until the main > thread starts processing it. After the read is complete, another > thread could start computing the delta index. The hope is that the new adaptive read ahead code in the kernel will get this right and you won't need the second thread. Letting the kernel handle the read ahead will dynamically scale as other demands are made on the host. There's effectively only one read ahead cache in the system, only the kernel really knows how to divide it up between competing apps. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
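
For reference, the explicit read-ahead hint Jon mentioned earlier in the thread ("ADVISE") would presumably be something along the lines of posix_fadvise(); this is a hypothetical sketch, not something pack-objects did at the time:

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/*
 * Hypothetical hint: tell the kernel we will soon read this slice of
 * the pack so readahead can start before the delta search actually
 * touches the data.
 */
static int hint_willneed(int pack_fd, off_t offset, off_t len)
{
	return posix_fadvise(pack_fd, offset, len, POSIX_FADV_WILLNEED);
}

Whether such a hint beats the kernel's adaptive readahead is exactly the open question here; the call is advisory and the kernel is free to ignore it.
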
* Re: performance on repack 2007-08-15 18:38 ` Jon Smirl @ 2007-08-15 19:00 ` Nicolas Pitre 2007-08-15 19:42 ` Jon Smirl 2007-08-16 8:10 ` David Kastrup 0 siblings, 2 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-15 19:00 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Koegler, Shawn O. Pearce, Git Mailing List On Wed, 15 Aug 2007, Jon Smirl wrote: > On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > > git-pack-objects knows the order, in which it will use the objects. A > > seperate thread could pre-read the next object and wait until the main > > thread starts processing it. After the read is complete, another > > thread could start computing the delta index. > > The hope is that the new adaptive read ahead code in the kernel will > get this right and you won't need the second thread. Letting the > kernel handle the read ahead will dynamically scale as other demands > are made on the host. There's effectively only one read ahead cache in > the system, only the kernel really knows how to divide it up between > competing apps. No read ahead will ever help the delta search phase. Objects listed for deltification against each other are sorted in a way that results in reads from completely random location in the object store. Normally the delta search phase is so compute intensive that the read shouldn't matter much. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 19:00 ` Nicolas Pitre @ 2007-08-15 19:42 ` Jon Smirl 2007-08-16 8:10 ` David Kastrup 1 sibling, 0 replies; 30+ messages in thread From: Jon Smirl @ 2007-08-15 19:42 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Martin Koegler, Shawn O. Pearce, Git Mailing List On 8/15/07, Nicolas Pitre <nico@cam.org> wrote: > On Wed, 15 Aug 2007, Jon Smirl wrote: > > > On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > > > git-pack-objects knows the order, in which it will use the objects. A > > > seperate thread could pre-read the next object and wait until the main > > > thread starts processing it. After the read is complete, another > > > thread could start computing the delta index. > > > > The hope is that the new adaptive read ahead code in the kernel will > > get this right and you won't need the second thread. Letting the > > kernel handle the read ahead will dynamically scale as other demands > > are made on the host. There's effectively only one read ahead cache in > > the system, only the kernel really knows how to divide it up between > > competing apps. > > No read ahead will ever help the delta search phase. Objects listed for > deltification against each other are sorted in a way that results in > reads from completely random location in the object store. > > Normally the delta search phase is so compute intensive that the read > shouldn't matter much. I agree, I am compute bound for 20 min, disk light occasionally dimly flickers. Besides, the main thread that is preparing the queue effectively functions as a read ahead thread. It will overlap with the compute bound delta searches. > > > Nicolas > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 19:00 ` Nicolas Pitre 2007-08-15 19:42 ` Jon Smirl @ 2007-08-16 8:10 ` David Kastrup 2007-08-16 15:34 ` Nicolas Pitre 1 sibling, 1 reply; 30+ messages in thread From: David Kastrup @ 2007-08-16 8:10 UTC (permalink / raw) To: git Nicolas Pitre <nico@cam.org> writes: > On Wed, 15 Aug 2007, Jon Smirl wrote: > >> On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: >> > git-pack-objects knows the order, in which it will use the objects. A >> > seperate thread could pre-read the next object and wait until the main >> > thread starts processing it. After the read is complete, another >> > thread could start computing the delta index. >> >> The hope is that the new adaptive read ahead code in the kernel will >> get this right and you won't need the second thread. Letting the >> kernel handle the read ahead will dynamically scale as other demands >> are made on the host. There's effectively only one read ahead cache in >> the system, only the kernel really knows how to divide it up between >> competing apps. > > No read ahead will ever help the delta search phase. Well, the delta search phase consists of computing a delta index and then matching against it. If I understand correctly, delta indices for the search window are kept, and the current file is compared against them. Locality might be better if just one delta index gets calculated and then compared with all _upcoming_ delta candidates in one go. No idea whether it would pay off the complications. -- David Kastrup ^ permalink raw reply [flat|nested] 30+ messages in thread
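
In loop terms, David is proposing to flip which side of the comparison keeps the delta index. A rough sketch of the two shapes, with illustrative names only (the real code lives in try_delta()/find_deltas() in builtin-pack-objects.c):

struct delta_index;
struct obj {
	struct delta_index *index;	/* cached index for this object's data */
};

/* stand-ins for illustration only */
static struct delta_index *index_of(struct obj *o) { return o->index; }
static void try_against(struct obj *target, struct delta_index *src_index)
{
	(void)target; (void)src_index;	/* would call create_delta() here */
}

/* current shape: each window slot keeps its index, and the newest
 * object is deltified against every one of them */
static void window_as_is(struct obj **list, int n, int window)
{
	int i, j;
	for (i = 0; i < n; i++)
		for (j = 1; j <= window && i - j >= 0; j++)
			try_against(list[i], list[i - j]->index);
}

/* David's suggestion: index the current object once, then compare the
 * _upcoming_ candidates against that single index in one go */
static void window_inverted(struct obj **list, int n, int window)
{
	int i, j;
	for (i = 0; i < n; i++) {
		struct delta_index *idx = index_of(list[i]);
		for (j = 1; j <= window && i + j < n; j++)
			try_against(list[i + j], idx);
	}
}

Either way each object is indexed roughly once; the difference is the order in which data and indices are touched, which is where the locality argument comes in. As Nicolas notes below, this was tried and abandoned.
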
* Re: performance on repack 2007-08-16 8:10 ` David Kastrup @ 2007-08-16 15:34 ` Nicolas Pitre 2007-08-16 16:13 ` Jon Smirl 0 siblings, 1 reply; 30+ messages in thread From: Nicolas Pitre @ 2007-08-16 15:34 UTC (permalink / raw) To: David Kastrup; +Cc: git On Thu, 16 Aug 2007, David Kastrup wrote: > Nicolas Pitre <nico@cam.org> writes: > > > On Wed, 15 Aug 2007, Jon Smirl wrote: > > > >> On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > >> > git-pack-objects knows the order, in which it will use the objects. A > >> > seperate thread could pre-read the next object and wait until the main > >> > thread starts processing it. After the read is complete, another > >> > thread could start computing the delta index. > >> > >> The hope is that the new adaptive read ahead code in the kernel will > >> get this right and you won't need the second thread. Letting the > >> kernel handle the read ahead will dynamically scale as other demands > >> are made on the host. There's effectively only one read ahead cache in > >> the system, only the kernel really knows how to divide it up between > >> competing apps. > > > > No read ahead will ever help the delta search phase. > > Well, the delta search phase consists of computing a delta index and > then matching against it. No, what I mean is what happens at a higher level where one object is deltified against several base object candidates to find the best match. Those several objects are presorted according to a combination of heuristics that makes their actual access completely random, so no kernel read ahead can help here. > If I understand correctly, delta indices > for the search window are kept, and the current file is compared > against them. Locality might be better if just one delta index gets > calculated and then compared with all _upcoming_ delta candidates in > one go. This appears so obvious that I attempted that a while ago already. The idea turned out to be so complex to implement correctly, and produced such suboptimal results in practice, that I abandoned it. See http://marc.info/?l=git&m=114610715706599&w=2 for the details if you're interested. PS: please at least CC me when replying to my mails Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-16 15:34 ` Nicolas Pitre @ 2007-08-16 16:13 ` Jon Smirl 2007-08-16 16:21 ` Nicolas Pitre 0 siblings, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-16 16:13 UTC (permalink / raw) To: Nicolas Pitre; +Cc: David Kastrup, git On 8/16/07, Nicolas Pitre <nico@cam.org> wrote: > On Thu, 16 Aug 2007, David Kastrup wrote: > > If I understand correctly, delta indices > > for the search window are kept, and the current file is compared > > against them. Locality might be better if just one delta index gets > > calculated and then compared with all _upcoming_ delta candidates in > > one go. > > This appears so obvious that I attempted that a while ago already. > > The idea turned up to be so complex to implement correctly and produced > suboptimal results in practice that I abandoned it. In practice it doesn't really matter what you do. Most developers have a decent amount of RAM. Disks run at 50MB/sec or more. The entire pack file ends up in the kernel disk cache within a few seconds. Turning the twenty minute delta search into 6-7 minutes is a much more obvious win. > > See http://marc.info/?l=git&m=114610715706599&w=2 for the details if > you're interested. > > PS: please at least CC me when replying to my mails > > > Nicolas > - > To unsubscribe from this list: send the line "unsubscribe git" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-16 16:13 ` Jon Smirl @ 2007-08-16 16:21 ` Nicolas Pitre 0 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-16 16:21 UTC (permalink / raw) To: Jon Smirl; +Cc: David Kastrup, git On Thu, 16 Aug 2007, Jon Smirl wrote: > On 8/16/07, Nicolas Pitre <nico@cam.org> wrote: > > On Thu, 16 Aug 2007, David Kastrup wrote: > > > If I understand correctly, delta indices > > > for the search window are kept, and the current file is compared > > > against them. Locality might be better if just one delta index gets > > > calculated and then compared with all _upcoming_ delta candidates in > > > one go. > > > > This appears so obvious that I attempted that a while ago already. > > > > The idea turned up to be so complex to implement correctly and produced > > suboptimal results in practice that I abandoned it. > > In practice it doesn't really matter what you do. Most developers have > a decent amount of RAM. Disks run at 50MB/sec or more. The entire pack > file ends up in the kernel disk cache within a few seconds. > > Turning the twenty minute delta search into 6-7 minutes is a much more > obvious win. And your point in relation with what was just said is ? Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 15:08 ` Jon Smirl 2007-08-15 17:11 ` Martin Koegler @ 2007-08-15 21:05 ` Nicolas Pitre 1 sibling, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-15 21:05 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List On Wed, 15 Aug 2007, Jon Smirl wrote: > You can avoid making all the low level calls thread safe by using the > main thread to get everything into RAM before starting to search for > the deltas. The worker threads would only deal with things completely > in memory. You may need to ref count these in-memory objects if they > are shared between worker threads. For simplicity the in-memory input > objects should be read only by the threads. The worker threads create > new structures to hand their results back to the main thread for > writing to disk. > > A typical solution is to use a queue protected by locks. Main thread > faults in all the needed objects to cache. Main thread builds a queue > entry and increments reference count on all referenced objects. Main > thread uses locks to add entry to queue, while queue is locked it > removes any finished jobs. Main thread writes finished results to > disks, decrements ref counts. Cache logic can then take over about > when objects are actually deleted. > > Worker threads wait on the queue. When something is placed in the > queue a waiting worker thread removes it, processes it, puts the > results in RAM, and places the object back on the finished queue. Then > waits for another object. It doesn't call into the main body of code. Way too complex and rather impractical with the current algorithms. Currently, information on objects is gathered (the "Counting objects" phase) and that can hardly be parallelized. Once objects are known then a sorted list is created so deltification of object x might be optimally attempted on objects x-1 through to x-10. Creating that list cannot be parallelized either, but it is quick anyway. Then comes the actual deltification phase where the huge cost is. The problem simply has to be partitioned into a few threads, where thread 1 deals with objects 1 to 100000 from that sorted list, thread 2 has objects 100001 to 200000, etc. etc. This is now a partitioning problem where the thread synchronization is dealt with at a higher and non performance critical level. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 5:32 ` Shawn O. Pearce 2007-08-15 15:08 ` Jon Smirl @ 2007-08-15 20:49 ` Nicolas Pitre 2007-08-30 4:27 ` Nicolas Pitre 1 sibling, 1 reply; 30+ messages in thread From: Nicolas Pitre @ 2007-08-15 20:49 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Wed, 15 Aug 2007, Shawn O. Pearce wrote: > Nicolas Pitre <nico@cam.org> wrote: > > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > > > > > I'm not sure its that complex to run all try_delta calls of the > > > current window in parallel. > > > > Well, here's my quick attempt at it. Unfortunately, performance isn't > > as good as I'd expected, especially with relatively small blobs like > > those found in the linux kernel repo. It looks like the overhead of > > thread creation/joining might be significant compared to the actual > > delta computation. I have a P4 with HT which might behave differently > > from a real SMP machine, or whatever, but the CPU usage never exceeded > > 110% according to top (sometimes it even dropped below 95%). Actually, a > > git-repack gets much slower due to 2m27s of system time compared to > > 0m03s without threads. And this is with NPTL. > > Yea, I'm not surprised. This is quick and dirty but is really the > wrong approach. The kernel is spending lots of time setting up the > thread and its stack, then scheduling it onto a CPU, only to likely > find it finish before it even yields the CPU or is interrupted off. > So I'd expect to get huge system times here. So... Here's a variation on the previous attempt. A thread is created for each delta window slot and it is awakened when there is a delta to compute. Now top is reporting an average of 175% CPU usage when delta searching on the linux repo. The threading overhead went down a lot, from 2m of system time to 42s. Still a far cry from the usual 3s when no threads are used though. > > With a repo containing big blobs it is different though. CPU usage > > firmly gets to 200% hence all cores are saturated with work. > > Right, that's what we want to see. ;-) > > What makes this bloody complex (yes, I just did disagree with > myself) is you need to use a pool of threads, one per core, keep > them running for the life of the delta phase (so we amortize the > thread start/stop time) and also give them large enough bursts of > data that they can almost always use their full time slice before > being interrupted by the kernel. This is still way suboptimal. The thread synchronization that has to be involved for _each_ delta attempt simply adds too much overhead. Also, all threads have to be done before sliding the window to the next object since deltification of the new object depends on the results from the previous window position. However, it is highly unlikely that all threads will be done with their share of work at the same time and therefore there will always be a certain amount of time when most threads will be idle while the last ones are still busy, up to the point where there is only one thread left to wait for before moving the window forward. This is what explains why I get around 175% CPU usage instead of 200% when repacking the linux kernel repo.
And still, the total amount of CPU cycles spent on the whole task is far greater than without any thread because, as I mentioned before, running delta computations in parallel doesn't allow early abort of create_delta() calls based on a good max_size value from a previous delta results because there is no such previous results when everything is computed in parallel. So, while the repack might be faster from a wall clock point of view, it wastes more CPU cycles in the end which is not really nice. > > Remains the approach of calling find_deltas() n times with 1/n of the > > delta_list, one call per thread, for the bulk of the delta search work. > > This might even be much simpler than my current patch is. However this > > approach will require n times the memory for the delta window data. > > Thread creation overhead will occur only once. > > Yea, that I agree with. The other thing is the main thread may need > to push a couple of windows worth of work into the threads, so that > if this window's 1/n unit goes really fast on that thread it doesn't > stall out waiting for the main thread to get it more data. It is easier to keep 4 threads busy with 100000 objects each to deal with in a conventional way than keeping 10 threads busy when processing a single delta window in parallel. Or the delta_list can be partitioned in smaller chunks and anytime one of the find_deltas() thread is done then it is fed another chunk. The side effect of that is that no delta will cross chunk boundaries. But just like when repacking multiple existing packs into one with delta data reuse, the actual chunk boundaries can be re-fed to find_deltas() in order to find more deltas across chunks. This approach will require less code than the patch below, it will have almost zero threading overhead and each thread will be sure to be busy for a long period of time. And the early create_delta() abort will just work too. > > Also... read_sha1_file() is currently not thread safe, and threaded > > delta searching would requires that its usage be serialized with a > > global mutex (not done in this patch which is a bug), or ideally be made > > thread aware. > > Yea, that's a problem. From the Git library perspective that Luiz > Capitulino has been working on for GSOC this summer being able to > compile/link Git with -lpthreads and have it actually be thread > safe would be useful. Unfortunately its waaaay more than just > making read_sha1_file() threadsafe. :-/ > > read_sha1_file() is actually a reasonably critical part of the > major functions of Git. Slowing that down by making it go through > pthread_mutex_{lock,unlock} is going to hurt. I tried something > like the following, and its a bit slower, and really ain't a great > multi-thread aware implementation. I'd say that, since almost none of the GIT code currently need thread aware read_sha1_file(), it is best to only serialize its usage at the caller level where it is actually needed. Nicolas --- diff --git a/Makefile b/Makefile index 4eb4637..c3c6e68 100644 --- a/Makefile +++ b/Makefile @@ -122,6 +122,9 @@ all:: # If not set it defaults to the bare 'wish'. If it is set to the empty # string then NO_TCLTK will be forced (this is used by configure script). # +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit +# parallel delta searching when packing objects. 
+# GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR COMPAT_OBJS += compat/hstrerror.o endif +ifdef THREADED_DELTA_SEARCH + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH + EXTLIBS += -lpthread +endif + ifeq ($(TCLTK_PATH),) NO_TCLTK=NoThanks endif diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 5e9d1fd..449f9f3 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -15,6 +15,10 @@ #include "list-objects.h" #include "progress.h" +#ifdef THREADED_DELTA_SEARCH +#include <pthread.h> +#endif + static const char pack_usage[] = "\ git-pack-objects [{ -q | --progress | --all-progress }] \n\ [--max-pack-size=N] [--local] [--incremental] \n\ @@ -1273,6 +1277,14 @@ struct unpacked { void *data; struct delta_index *index; unsigned depth; +#ifdef THREADED_DELTA_SEARCH + pthread_t thread; + pthread_mutex_t mutex; + pthread_mutex_t wait; + pthread_mutex_t done; + struct unpacked *trg; + unsigned trg_max_depth; +#endif }; static int delta_cacheable(struct unpacked *trg, struct unpacked *src, @@ -1292,6 +1304,24 @@ static int delta_cacheable(struct unpacked *trg, struct unpacked *src, return 0; } +#ifdef THREADED_DELTA_SEARCH + +static pthread_mutex_t read_sha1_file_mutex = PTHREAD_MUTEX_INITIALIZER; + +#define read_lock() pthread_mutex_lock(&read_sha1_file_mutex) +#define read_unlock() pthread_mutex_unlock(&read_sha1_file_mutex) +#define delta_lock(e) pthread_mutex_lock(&(e)->mutex) +#define delta_unlock(e) pthread_mutex_unlock(&(e)->mutex) + +#else + +#define read_lock() 0 +#define read_unlock() 0 +#define delta_lock(e) 0 +#define delta_unlock(e) 0 + +#endif + /* * We search for deltas _backwards_ in a list sorted by type and * by size, so that we see progressively smaller and smaller files. @@ -1335,6 +1365,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; /* Now some size filtering heuristics. 
*/ + delta_lock(trg); trg_size = trg_entry->size; if (!trg_entry->delta) { max_size = trg_size/2 - 20; @@ -1343,6 +1374,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, max_size = trg_entry->delta_size; ref_depth = trg->depth; } + delta_unlock(trg); max_size = max_size * (max_depth - src->depth) / (max_depth - ref_depth + 1); if (max_size == 0) @@ -1355,19 +1387,28 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; /* Load data if not already done */ + + delta_lock(trg); if (!trg->data) { + read_lock(); trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); if (sz != trg_size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(trg_entry->idx.sha1), sz, trg_size); window_memory_usage += sz; + read_unlock(); } + delta_unlock(trg); + + delta_lock(src); if (!src->data) { + read_lock(); src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); if (sz != src_size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(src_entry->idx.sha1), sz, src_size); window_memory_usage += sz; + read_unlock(); } if (!src->index) { src->index = create_delta_index(src->data, src_size); @@ -1375,20 +1416,32 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, static int warned = 0; if (!warned++) warning("suboptimal pack - out of memory"); + delta_unlock(src); return 0; } + read_lock(); window_memory_usage += sizeof_delta_index(src->index); + read_unlock(); } + delta_unlock(src); delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size); if (!delta_buf) return 0; + delta_lock(trg); if (trg_entry->delta_data) { - /* Prefer only shallower same-sized deltas. */ - if (delta_size == trg_entry->delta_size && - src->depth + 1 >= trg->depth) { +#ifdef THREADED_DELTA_SEARCH + /* A better match might have been found in the mean time */ + max_size = trg_entry->delta_size * (max_depth - src->depth) / + (max_depth - trg->depth + 1); +#endif + /* ... also prefer only shallower same-sized deltas. 
*/ + if (delta_size > max_size || + (delta_size == trg_entry->delta_size && + src->depth + 1 >= trg->depth)) { free(delta_buf); + delta_unlock(trg); return 0; } delta_cache_size -= trg_entry->delta_size; @@ -1404,9 +1457,27 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, delta_cache_size += trg_entry->delta_size; } else free(delta_buf); + delta_unlock(trg); return 1; } +#ifdef THREADED_DELTA_SEARCH + +static void *threaded_try_delta(void *arg) +{ + struct unpacked *src = arg; + for (;;) { + pthread_mutex_lock(&src->wait); + if (!src->trg) + break; + try_delta(src->trg, src, src->trg_max_depth); + pthread_mutex_unlock(&src->done); + } + return NULL; +} + +#endif + static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) { struct object_entry *child = me->delta_child; @@ -1439,19 +1510,32 @@ static void find_deltas(struct object_entry **list, int window, int depth) uint32_t i = nr_objects, idx = 0, count = 0, processed = 0; unsigned int array_size = window * sizeof(struct unpacked); struct unpacked *array; - int max_depth; + int j, max_depth; if (!nr_objects) return; array = xmalloc(array_size); memset(array, 0, array_size); +#ifdef THREADED_DELTA_SEARCH + for (j = 0; j < window; j++) { + int ret; + pthread_mutex_init(&array[j].mutex, NULL); + pthread_mutex_init(&array[j].wait, NULL); + pthread_mutex_init(&array[j].done, NULL); + pthread_mutex_lock(&array[j].wait); + pthread_mutex_lock(&array[j].done); + ret = pthread_create(&array[j].thread, NULL, + threaded_try_delta, &array[j]); + if (ret) + die("unable to create thread: %s", strerror(ret)); + } +#endif if (progress) start_progress(&progress_state, "Deltifying %u objects...", "", nr_result); do { struct object_entry *entry = list[--i]; struct unpacked *n = array + idx; - int j; if (!entry->preferred_base) processed++; @@ -1494,6 +1578,31 @@ static void find_deltas(struct object_entry **list, int window, int depth) goto next; } +#ifdef THREADED_DELTA_SEARCH + /* Unlock all threads ... */ + j = window; + while (--j > 0) { + uint32_t other_idx = idx + j; + struct unpacked *m; + if (other_idx >= window) + other_idx -= window; + m = array + other_idx; + if (!m->entry) + break; + m->trg = n; + m->trg_max_depth= max_depth; + pthread_mutex_unlock(&m->wait); + } + /* ... then wait for their completion. */ + while (++j < window) { + uint32_t other_idx = idx + j; + struct unpacked *m; + if (other_idx >= window) + other_idx -= window; + m = array + other_idx; + pthread_mutex_lock(&m->done); + } +#else j = window; while (--j > 0) { uint32_t other_idx = idx + j; @@ -1506,6 +1615,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) if (try_delta(n, m, max_depth) < 0) break; } +#endif /* if we made n a delta, and if n is already at max * depth, leaving it in the window is pointless. we @@ -1528,6 +1638,16 @@ static void find_deltas(struct object_entry **list, int window, int depth) for (i = 0; i < window; ++i) { free_delta_index(array[i].index); free(array[i].data); +#ifdef THREADED_DELTA_SEARCH + array[i].trg = NULL; + pthread_mutex_unlock(&array[i].wait); + pthread_join(array[i].thread, NULL); + pthread_mutex_unlock(&array[i].wait); + pthread_mutex_unlock(&array[i].done); + pthread_mutex_destroy(&array[i].mutex); + pthread_mutex_destroy(&array[i].wait); + pthread_mutex_destroy(&array[i].done); +#endif } free(array); } ^ permalink raw reply related [flat|nested] 30+ messages in thread
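
The early-abort argument above is easiest to see in a simplified serial version of the window walk. This is illustrative only: window_slot() is a stand-in, and the real try_delta() also scales the bound by delta depth, as the patch shows.

#include <stdlib.h>

struct delta_index;
struct unpacked {
	struct delta_index *index;	/* index of a source already in the window */
};

/* create_delta() as declared in delta.h: it gives up as soon as the
 * delta being built cannot come in under max_delta_size */
extern void *create_delta(const struct delta_index *index,
			  const void *trg_buf, unsigned long trg_size,
			  unsigned long *delta_size,
			  unsigned long max_delta_size);

/* hypothetical helper: the j-th previous object still in the window */
extern struct unpacked *window_slot(int j);

static void *best_delta_for(const void *trg_data, unsigned long trg_size,
			    int window)
{
	void *best = NULL;
	unsigned long max_size = trg_size / 2 - 20;	/* initial bound */
	int j;

	for (j = 1; j < window; j++) {
		struct unpacked *src = window_slot(j);
		unsigned long delta_size;
		void *d = create_delta(src->index, trg_data, trg_size,
				       &delta_size, max_size);
		if (!d)
			continue;	/* aborted early: cannot beat max_size */
		free(best);
		best = d;
		max_size = delta_size;	/* later candidates must now do better */
	}
	return best;
}

Because max_size only ever tightens, create_delta() can give up on a candidate as soon as the delta under construction exceeds the best result so far. Run all window slots in parallel and every call starts from the loose initial bound, so the same search costs more CPU cycles even when the wall-clock time improves.
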
* Re: performance on repack 2007-08-15 20:49 ` Nicolas Pitre @ 2007-08-30 4:27 ` Nicolas Pitre 2007-08-30 4:36 ` Nicolas Pitre 0 siblings, 1 reply; 30+ messages in thread From: Nicolas Pitre @ 2007-08-30 4:27 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Wed, 15 Aug 2007, Nicolas Pitre wrote: > On Wed, 15 Aug 2007, Shawn O. Pearce wrote: > > > Nicolas Pitre <nico@cam.org> wrote: > > > > > Remains the approach of calling find_deltas() n times with 1/n of the > > > delta_list, one call per thread, for the bulk of the delta search work. > > > This might even be much simpler than my current patch is. However this > > > approach will require n times the memory for the delta window data. > > > Thread creation overhead will occur only once. > > > > Yea, that I agree with. The other thing is the main thread may need > > to push a couple of windows worth of work into the threads, so that > > if this window's 1/n unit goes really fast on that thread it doesn't > > stall out waiting for the main thread to get it more data. > > It is easier to keep 4 threads busy with 100000 objects each to deal > with in a conventional way than keeping 10 threads busy when processing > a single delta window in parallel. Or the delta_list can be partitioned > in smaller chunks and anytime one of the find_deltas() thread is done > then it is fed another chunk. > > The side effect of that is that no delta will cross chunk boundaries. > But just like when repacking multiple existing packs into one with > delta data reuse, the actual chunk boundaries can be re-fed to > find_deltas() in order to find more deltas across chunks. Well, here is a quick implementation of this idea for those who would like to give it a try. Performance is indeed much better. The delta progress status is however fscked up at the moment, so don't pay too much attention to it. Partitioning is also extremely crude. But the idea looks promizing. --- diff --git a/Makefile b/Makefile index 4eb4637..c3c6e68 100644 --- a/Makefile +++ b/Makefile @@ -122,6 +122,9 @@ all:: # If not set it defaults to the bare 'wish'. If it is set to the empty # string then NO_TCLTK will be forced (this is used by configure script). # +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit +# parallel delta searching when packing objects. 
+# GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR COMPAT_OBJS += compat/hstrerror.o endif +ifdef THREADED_DELTA_SEARCH + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH + EXTLIBS += -lpthread +endif + ifeq ($(TCLTK_PATH),) NO_TCLTK=NoThanks endif diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index e917e8e..7d68f82 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -15,6 +15,10 @@ #include "list-objects.h" #include "progress.h" +#ifdef THREADED_DELTA_SEARCH +#include <pthread.h> +#endif + static const char pack_usage[] = "\ git-pack-objects [{ -q | --progress | --all-progress }] \n\ [--max-pack-size=N] [--local] [--incremental] \n\ @@ -1290,6 +1294,20 @@ static int delta_cacheable(unsigned long src_size, unsigned long trg_size, return 0; } +#ifdef THREADED_DELTA_SEARCH + +static pthread_mutex_t read_sha1_file_mutex = PTHREAD_MUTEX_INITIALIZER; + +#define read_lock() pthread_mutex_lock(&read_sha1_file_mutex) +#define read_unlock() pthread_mutex_unlock(&read_sha1_file_mutex) + +#else + +#define read_lock() 0 +#define read_unlock() 0 + +#endif + /* * We search for deltas _backwards_ in a list sorted by type and * by size, so that we see progressively smaller and smaller files. @@ -1348,7 +1366,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, /* Load data if not already done */ if (!trg->data) { + read_lock(); trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); + read_unlock(); if (!trg->data) die("object %s cannot be read", sha1_to_hex(trg_entry->idx.sha1)); @@ -1358,7 +1378,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, *mem_usage += sz; } if (!src->data) { + read_lock(); src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); + read_unlock(); if (!src->data) die("object %s cannot be read", sha1_to_hex(src_entry->idx.sha1)); @@ -1521,6 +1543,56 @@ static void find_deltas(struct object_entry **list, unsigned list_size, free(array); } +#ifdef THREADED_DELTA_SEARCH + +struct thread_params { + pthread_t thread; + struct object_entry **list; + unsigned list_size; + unsigned nr_deltas; + int window; + int depth; +}; + +static void *threaded_find_deltas(void *arg) +{ + struct thread_params *p = arg; + if (p->list_size) + find_deltas(p->list, p->list_size, p->nr_deltas, + p->window, p->depth); + return NULL; +} + +static void ll_find_deltas(struct object_entry **list, unsigned list_size, + unsigned nr_deltas, int window, int depth) +{ + struct thread_params p[4]; + int i, ret; + + for (i = 0; i < 4; i++) { + unsigned sublist_size = list_size / (4 - i); + p[i].list = list; + p[i].list_size = sublist_size; + p[i].nr_deltas = nr_deltas; + p[i].window = window; + p[i].depth = depth; + ret = pthread_create(&p[i].thread, NULL, + threaded_find_deltas, &p[i]); + if (ret) + die("unable to create thread: %s", strerror(ret)); + list += sublist_size; + list_size -= sublist_size; + } + + for (i = 0; i < 4; i++) { + pthread_join(p[i].thread, NULL); + } +} + +#else +#define ll_find_deltas find_deltas +#endif + static void prepare_pack(int window, int depth) { struct object_entry **delta_list; @@ -1557,7 +1629,7 @@ static void prepare_pack(int window, int depth) if (nr_deltas) { qsort(delta_list, n, sizeof(*delta_list), type_size_sort); - find_deltas(delta_list, n, nr_deltas, window+1, depth); + ll_find_deltas(delta_list, n, nr_deltas, window+1, depth); } free(delta_list); } ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-30 4:27 ` Nicolas Pitre @ 2007-08-30 4:36 ` Nicolas Pitre 2007-08-30 16:17 ` Jon Smirl 2007-09-01 21:54 ` Jon Smirl 0 siblings, 2 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-30 4:36 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Thu, 30 Aug 2007, Nicolas Pitre wrote: > Well, here is a quick implementation of this idea for those who would > like to give it a try. [...] Well, that would help if I provided the full diff of course. --- diff --git a/Makefile b/Makefile index 4eb4637..c3c6e68 100644 --- a/Makefile +++ b/Makefile @@ -122,6 +122,9 @@ all:: # If not set it defaults to the bare 'wish'. If it is set to the empty # string then NO_TCLTK will be forced (this is used by configure script). # +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit +# parallel delta searching when packing objects. +# GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR COMPAT_OBJS += compat/hstrerror.o endif +ifdef THREADED_DELTA_SEARCH + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH + EXTLIBS += -lpthread +endif + ifeq ($(TCLTK_PATH),) NO_TCLTK=NoThanks endif diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 9b3ef94..7d68f82 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -15,6 +15,10 @@ #include "list-objects.h" #include "progress.h" +#ifdef THREADED_DELTA_SEARCH +#include <pthread.h> +#endif + static const char pack_usage[] = "\ git-pack-objects [{ -q | --progress | --all-progress }] \n\ [--max-pack-size=N] [--local] [--incremental] \n\ @@ -78,7 +82,6 @@ static unsigned long delta_cache_size = 0; static unsigned long max_delta_cache_size = 0; static unsigned long cache_max_small_delta_size = 1000; -static unsigned long window_memory_usage = 0; static unsigned long window_memory_limit = 0; /* @@ -1291,6 +1294,20 @@ static int delta_cacheable(unsigned long src_size, unsigned long trg_size, return 0; } +#ifdef THREADED_DELTA_SEARCH + +static pthread_mutex_t read_sha1_file_mutex = PTHREAD_MUTEX_INITIALIZER; + +#define read_lock() pthread_mutex_lock(&read_sha1_file_mutex) +#define read_unlock() pthread_mutex_unlock(&read_sha1_file_mutex) + +#else + +#define read_lock() 0 +#define read_unlock() 0 + +#endif + /* * We search for deltas _backwards_ in a list sorted by type and * by size, so that we see progressively smaller and smaller files. @@ -1300,7 +1317,7 @@ static int delta_cacheable(unsigned long src_size, unsigned long trg_size, * one. */ static int try_delta(struct unpacked *trg, struct unpacked *src, - unsigned max_depth) + unsigned max_depth, unsigned long *mem_usage) { struct object_entry *trg_entry = trg->entry; struct object_entry *src_entry = src->entry; @@ -1313,12 +1330,6 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, if (trg_entry->type != src_entry->type) return -1; - /* We do not compute delta to *create* objects we are not - * going to pack. - */ - if (trg_entry->preferred_base) - return -1; - /* * We do not bother to try a delta that we discarded * on an earlier try, but only when reusing delta data. 
@@ -1355,24 +1366,28 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, /* Load data if not already done */ if (!trg->data) { + read_lock(); trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); + read_unlock(); if (!trg->data) die("object %s cannot be read", sha1_to_hex(trg_entry->idx.sha1)); if (sz != trg_size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(trg_entry->idx.sha1), sz, trg_size); - window_memory_usage += sz; + *mem_usage += sz; } if (!src->data) { + read_lock(); src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); + read_unlock(); if (!src->data) die("object %s cannot be read", sha1_to_hex(src_entry->idx.sha1)); if (sz != src_size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(src_entry->idx.sha1), sz, src_size); - window_memory_usage += sz; + *mem_usage += sz; } if (!src->index) { src->index = create_delta_index(src->data, src_size); @@ -1382,7 +1397,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, warning("suboptimal pack - out of memory"); return 0; } - window_memory_usage += sizeof_delta_index(src->index); + *mem_usage += sizeof_delta_index(src->index); } delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size); @@ -1425,68 +1440,59 @@ static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) return m; } -static void free_unpacked(struct unpacked *n) +static unsigned long free_unpacked(struct unpacked *n) { - window_memory_usage -= sizeof_delta_index(n->index); + unsigned long freed_mem = sizeof_delta_index(n->index); free_delta_index(n->index); n->index = NULL; if (n->data) { + freed_mem += n->entry->size; free(n->data); n->data = NULL; - window_memory_usage -= n->entry->size; } n->entry = NULL; n->depth = 0; + return freed_mem; } -static void find_deltas(struct object_entry **list, int window, int depth) +static void find_deltas(struct object_entry **list, unsigned list_size, + unsigned nr_deltas, int window, int depth) { - uint32_t i = nr_objects, idx = 0, count = 0, processed = 0; + uint32_t i = list_size, idx = 0, count = 0, processed = 0; unsigned int array_size = window * sizeof(struct unpacked); struct unpacked *array; - int max_depth; + unsigned long mem_usage = 0; - if (!nr_objects) - return; array = xmalloc(array_size); memset(array, 0, array_size); if (progress) - start_progress(&progress_state, "Deltifying %u objects...", "", nr_result); + start_progress(&progress_state, "Deltifying %u objects...", "", nr_deltas); do { struct object_entry *entry = list[--i]; struct unpacked *n = array + idx; - int j; - - if (!entry->preferred_base) - processed++; + int j, max_depth; - if (progress) - display_progress(&progress_state, processed); - - if (entry->delta) - /* This happens if we decided to reuse existing - * delta from a pack. "!no_reuse_delta &&" is implied. - */ - continue; - - if (entry->size < 50) - continue; - - if (entry->no_try_delta) - continue; - - free_unpacked(n); + mem_usage -= free_unpacked(n); n->entry = entry; while (window_memory_limit && - window_memory_usage > window_memory_limit && + mem_usage > window_memory_limit && count > 1) { uint32_t tail = (idx + window - count) % window; - free_unpacked(array + tail); + mem_usage -= free_unpacked(array + tail); count--; } + /* We do not compute delta to *create* objects we are not + * going to pack. 
+ */ + if (entry->preferred_base) + goto next; + + if (progress) + display_progress(&progress_state, ++processed); + /* * If the current object is at pack edge, take the depth the * objects that depend on the current object into account @@ -1508,7 +1514,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) m = array + other_idx; if (!m->entry) break; - if (try_delta(n, m, max_depth) < 0) + if (try_delta(n, m, max_depth, &mem_usage) < 0) break; } @@ -1537,21 +1543,94 @@ static void find_deltas(struct object_entry **list, int window, int depth) free(array); } +#ifdef THREADED_DELTA_SEARCH + +struct thread_params { + pthread_t thread; + struct object_entry **list; + unsigned list_size; + unsigned nr_deltas; + int window; + int depth; +}; + +static void *threaded_find_deltas(void *arg) +{ + struct thread_params *p = arg; + if (p->list_size) + find_deltas(p->list, p->list_size, p->nr_deltas, + p->window, p->depth); + return NULL; +} + +static void ll_find_deltas(struct object_entry **list, unsigned list_size, + unsigned nr_deltas, int window, int depth) +{ + struct thread_params p[4]; + int i, ret; + + for (i = 0; i < 4; i++) { + unsigned sublist_size = list_size / (4 - i); + p[i].list = list; + p[i].list_size = sublist_size; + p[i].nr_deltas = nr_deltas; + p[i].window = window; + p[i].depth = depth; + ret = pthread_create(&p[i].thread, NULL, + threaded_find_deltas, &p[i]); + if (ret) + die("unable to create thread: %s", strerror(ret)); + list += sublist_size; + list_size -= sublist_size; + } + + for (i = 0; i < 4; i++) { + pthread_join(p[i].thread, NULL); + } +} + +#else +#define ll_find_deltas find_deltas +#endif + static void prepare_pack(int window, int depth) { struct object_entry **delta_list; - uint32_t i; + uint32_t i, n, nr_deltas; get_object_details(); - if (!window || !depth) + if (!nr_objects || !window || !depth) return; delta_list = xmalloc(nr_objects * sizeof(*delta_list)); - for (i = 0; i < nr_objects; i++) - delta_list[i] = objects + i; - qsort(delta_list, nr_objects, sizeof(*delta_list), type_size_sort); - find_deltas(delta_list, window+1, depth); + nr_deltas = n = 0; + + for (i = 0; i < nr_objects; i++) { + struct object_entry *entry = objects + i; + + if (entry->delta) + /* This happens if we decided to reuse existing + * delta from a pack. "!no_reuse_delta &&" is implied. + */ + continue; + + if (entry->size < 50) + continue; + + if (entry->no_try_delta) + continue; + + if (!entry->preferred_base) + nr_deltas++; + + delta_list[n++] = entry; + } + + if (nr_deltas) { + qsort(delta_list, n, sizeof(*delta_list), type_size_sort); + ll_find_deltas(delta_list, n, nr_deltas, window+1, depth); + } free(delta_list); } ^ permalink raw reply related [flat|nested] 30+ messages in thread
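
For what it's worth, the list_size / (4 - i) computation in ll_find_deltas() above splits the remainder evenly with no separate modulo fix-up. Taking Jon's 545660 "Deltifying" count from the next message as the list size for illustration (the actual value passed in may differ slightly), the four slices work out as:

    i = 0:  545660 / 4 = 136415   (409245 objects left)
    i = 1:  409245 / 3 = 136415   (272830 objects left)
    i = 2:  272830 / 2 = 136415   (136415 objects left)
    i = 3:  136415 / 1 = 136415   (0 left)

Equal object counts are not equal amounts of work, though: the list is sorted by type and size, so one slice can end up with the large blobs while another holds small objects, which is the imbalance Jon observes next.
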
* Re: performance on repack 2007-08-30 4:36 ` Nicolas Pitre @ 2007-08-30 16:17 ` Jon Smirl 2007-09-01 21:54 ` Jon Smirl 1 sibling, 0 replies; 30+ messages in thread From: Jon Smirl @ 2007-08-30 16:17 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List Practicing on the kernel, without the patch: Q6600 with 4GB RAM jonsmirl@terra:/home/linux$ time git repack -a -f --window=200 --depth=200 Generating pack... Counting objects: 518288 Done counting 546466 objects. Deltifying 546466 objects... 100% (546466/546466) done Writing 546466 objects... 100% (546466/546466) done Total 546466 (delta 452329), reused 0 (delta 0) Pack pack-04c98effa112233951acbb2d8486eefac17a5a97 created. real 16m27.752s user 16m23.829s sys 0m2.192s With the patch it didn't balance very well. First two threads finished in 2min, thrid in 5min, fourth in 10min. So for fun I set it up to 16 threads, that kept all fours cores running until the end. It added 1:20 in overhead, but it finished in 5m vs 16:30 unthreaded. Threading is an obvious win for this code. jonsmirl@terra:/home/linux$ time git repack -a -f --window=200 --depth=200 Generating pack... Counting objects: 502267 Done counting 546466 objects. Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... 6% (34104/545660) done 6% (34099/545660) done 1% (6929/545660) donee 5% (30037/545660) done 6% (34101/545660) done 4% (25222/545660) done 6% (34102/545660) done 3% (19999/545660) done 6% (34102/545660) done 4% (27203/545660) done 6% (34102/545660) done 4% (23377/545660) done 6% (34042/545660) done 6% (34099/545660) done 6% (32740/545660) done real 5m0.360s user 17m41.822s sys 0m5.596s jonsmirl@terra:/home/linux$ -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
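
Jon's 16-thread run papers over the imbalance by oversubscribing the four cores. The other option Nicolas floated earlier, cutting the sorted list into many small chunks and letting whichever thread finishes first claim the next one, would look roughly like the sketch below. The names are hypothetical and the find_deltas() signature is simplified relative to the one in the patch.

#include <pthread.h>

struct object_entry;

/* simplified stand-in for the real find_deltas() in builtin-pack-objects.c */
extern void find_deltas(struct object_entry **list, unsigned list_size,
			int window, int depth);

static struct object_entry **delta_list;
static unsigned delta_list_size;
static unsigned next_offset;			/* first unclaimed object */
static const unsigned chunk_size = 10000;	/* illustrative value */
static int window = 10, depth = 50;		/* illustrative values */
static pthread_mutex_t claim_lock = PTHREAD_MUTEX_INITIALIZER;

static void *chunk_worker(void *arg)
{
	(void)arg;
	for (;;) {
		unsigned off, n;

		pthread_mutex_lock(&claim_lock);
		off = next_offset;
		n = delta_list_size - off;
		if (n > chunk_size)
			n = chunk_size;
		next_offset += n;
		pthread_mutex_unlock(&claim_lock);

		if (!n)
			return NULL;	/* nothing left, thread retires */
		find_deltas(delta_list + off, n, window, depth);
	}
}

As noted earlier in the thread, deltas would then never cross chunk boundaries unless the boundary regions are re-fed afterwards, the same trick already used when reusing deltas from multiple existing packs.
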
* Re: performance on repack 2007-08-30 4:36 ` Nicolas Pitre 2007-08-30 16:17 ` Jon Smirl @ 2007-09-01 21:54 ` Jon Smirl 1 sibling, 0 replies; 30+ messages in thread From: Jon Smirl @ 2007-09-01 21:54 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List On 8/30/07, Nicolas Pitre <nico@cam.org> wrote: > On Thu, 30 Aug 2007, Nicolas Pitre wrote: > > > Well, here is a quick implementation of this idea for those who would > > like to give it a try. > [...] Appears to be broken for a push. jonsmirl@terra:~/mpc5200b$ git push dreamhost updating 'refs/remotes/linus/master' using 'refs/heads/master' from d1caeb02b17c6bc215a9a40a98a1beb92dcbd310 to 40ffbfad6bb79a99cc7627bdaca0ee22dec526f6 Generating pack... Counting objects: 1 Done counting 3255 objects. Result has 1997 objects. Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... 0% (2/398) donee 1% (4/398) donee 11% (47/398) done 3% (12/398) done 2% (8/398) donee 12% (49/398) done 17% (69/398) done 1% (4/398) donee 25% (100/398) done 19% (77/398) donee 23% (94/398) done 21% (84/398) done 25% (100/398) done 2% (10/398) done 1% (6/398) done error: pack-objects died with strange error unpack eof before pack header was fully read ng refs/remotes/linus/master n/a (unpacker error) error: failed to push to 'ssh://jonsmirl1@git.digispeaker.com/~/mpc5200b.git' jonsmirl@terra:~/mpc5200b$ -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread