* performance on repack
@ 2007-08-11 21:12 Jon Smirl
  2007-08-11 22:09 ` David Kastrup
  2007-08-12 10:33 ` Martin Koegler
  0 siblings, 2 replies; 30+ messages in thread

From: Jon Smirl @ 2007-08-11 21:12 UTC (permalink / raw)
To: Git Mailing List

If anyone is bored and looking for something to do, making the delta
code in git repack multithreaded would help. Yesterday I did a big
repack that took 20 minutes and it only used one of my four cores. It
was compute bound the entire time.

Dell had 2GB quad core boxes on sale for $620 last week. I just had to
buy one. I'm surprised, it's quieter than my old P4 box.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-11 21:12 performance on repack Jon Smirl @ 2007-08-11 22:09 ` David Kastrup 2007-08-11 22:34 ` Linus Torvalds 2007-08-12 10:33 ` Martin Koegler 1 sibling, 1 reply; 30+ messages in thread From: David Kastrup @ 2007-08-11 22:09 UTC (permalink / raw) To: git "Jon Smirl" <jonsmirl@gmail.com> writes: > If anyone is bored and looking for something to do, making the delta > code in git repack multithreaded would help. I severely doubt that. It is like the "coding stuff in assembly language will make it faster" myth. The problem is that of manageable complexity. Making the stuff multithreaded or coded in assembly means that it becomes inaccessible for a sound algorithmic redesign. And anything that takes this long on today's machines has lots of room for algorithmic improvement, usually resulting in a speedup from one to several orders of magnitude. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack
  2007-08-11 22:09 ` David Kastrup
@ 2007-08-11 22:34 ` Linus Torvalds
  2007-08-11 23:21 ` Jon Smirl
  0 siblings, 1 reply; 30+ messages in thread

From: Linus Torvalds @ 2007-08-11 22:34 UTC (permalink / raw)
To: David Kastrup; +Cc: git

On Sun, 12 Aug 2007, David Kastrup wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> > If anyone is bored and looking for something to do, making the delta
> > code in git repack multithreaded would help.
>
> I severely doubt that. It is like the "coding stuff in assembly
> language will make it faster" myth. The problem is that of manageable
> complexity. Making the stuff multithreaded or coded in assembly means
> that it becomes inaccessible for a sound algorithmic redesign.

I have to admit that I'm not a huge fan of threading: the complexity and
locking often kills you, if memory bandwidth constraints do not, and the
end result is often really really hard to debug.

That said, I suspect we could do some *simple* form of this by just
partitioning the problem space up - we could have a MT repack that
generates four *different* packs on four different CPU's: each thread
taking one quarter of the objects.

At that point, you wouldn't even need threads, you could do it with
regular processes, since the problem set is fully partitioned once
you've generated the list of objects!

Then, after you've generated four different packs, doing a "git gc"
(without any threading) will repack them into one big pack, and mostly
just re-use the existing deltas.

So this would not be a generic thing, but it could be something that is
useful for the forced full-repack after importing a large repository
with fast-import, for example.

So while I agree with David in general about the problem of threading,
I think that we can possibly simplify the special case of repacking
into something less complicated than a "real" multi-threading problem.

Linus

^ permalink raw reply [flat|nested] 30+ messages in thread
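The partitioned scheme Linus describes can be pictured with stock plumbing
commands. This is only a rough sketch: the naive four-way split below ignores
the type/size/path ordering that good delta selection depends on, the file
names are made up, and it assumes a POSIX shell run from the repository root.

  git rev-list --objects --all | cut -d' ' -f1 > /tmp/objs
  split -l $(( ($(wc -l < /tmp/objs) + 3) / 4 )) /tmp/objs /tmp/part.
  for p in /tmp/part.*; do
      # one pack-objects process per quarter of the object list
      git pack-objects .git/objects/pack/pack < "$p" &
  done
  wait
  git gc   # fold the partial packs into one, mostly reusing existing deltas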
* Re: performance on repack
  2007-08-11 22:34 ` Linus Torvalds
@ 2007-08-11 23:21 ` Jon Smirl
  0 siblings, 0 replies; 30+ messages in thread

From: Jon Smirl @ 2007-08-11 23:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David Kastrup, git

On 8/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 12 Aug 2007, David Kastrup wrote:
> > "Jon Smirl" <jonsmirl@gmail.com> writes:
> >
> > > If anyone is bored and looking for something to do, making the delta
> > > code in git repack multithreaded would help.

To put this in perspective, monotone takes 40hrs on the same box to
repack a repository of similar size.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack
  2007-08-11 21:12 performance on repack Jon Smirl
  2007-08-11 22:09 ` David Kastrup
@ 2007-08-12 10:33 ` Martin Koegler
  2007-08-12 13:49 ` Jon Smirl
  1 sibling, 1 reply; 30+ messages in thread

From: Martin Koegler @ 2007-08-12 10:33 UTC (permalink / raw)
To: Jon Smirl; +Cc: Git Mailing List

On Sat, Aug 11, 2007 at 05:12:24PM -0400, Jon Smirl wrote:
> If anyone is bored and looking for something to do, making the delta
> code in git repack multithreaded would help. Yesterday I did a big
> repack that took 20 minutes and it only used one of my four cores. It
> was compute bound the entire time.

First, how much time is used by the write and how much by the deltify
phase?

If the writing phase uses too much time and you have enough free
memory, you can try to raise the config variable pack.deltacachelimit
(default 1000). It saves an additional delta operation for every
object whose delta is smaller than pack.deltacachelimit, by caching
the delta.

Have you considered the impact on memory usage if there are large
blobs in the repository?

While repacking, git keeps $window_size (default: 10) objects unpacked
in memory. For all (except one), it additionally stores the delta
index, which has about the same size as the object.

So the worst case memory usage is "sizeof(biggest object)*(2*$window_size - 1)".
If you have blobs >=100 MB, you need some GB of memory.

Partitioning the problem is not trivial:

* To avoid worse packing results, we must first sort all objects by
  type, path, size. Then we can split the list (one part per task),
  and each part can be deltified individually.

  The problems are:

  - We need more memory, as each task keeps its own window of
    $window_size objects (+ delta indexes) in memory.

  - The list must be split into parts which require the same amount of
    time. This is difficult, as it depends on the size of the objects
    as well as how they are stored (delta chain length).

* On the other hand, we could run all try_delta operations for one
  object in parallel. This way, we would not need much more memory,
  but would require more synchronisation (and more complex code).

mfg Martin Kögler

^ permalink raw reply [flat|nested] 30+ messages in thread
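For reference, the cache limit Martin mentions is an ordinary config knob, so
it is easy to experiment with; a sketch (the value below is arbitrary and only
shows the syntax — size it to your memory budget):

  git config pack.deltacachelimit 8192
  git repack -a -d -f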
* Re: performance on repack 2007-08-12 10:33 ` Martin Koegler @ 2007-08-12 13:49 ` Jon Smirl 2007-08-14 3:12 ` Shawn O. Pearce 0 siblings, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-12 13:49 UTC (permalink / raw) To: Martin Koegler; +Cc: Git Mailing List On 8/12/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > On Sat, Aug 11, 2007 at 05:12:24PM -0400, Jon Smirl wrote: > > If anyone is bored and looking for something to do, making the delta > > code in git repack multithreaded would help. Yesterday I did a big > > repack that took 20 minutes and it only used one of my four cores. It > > was compute bound the entire time. > > First, how much time is used by the write and how much by the deltify > phase? 95% in deltify > > If the writing phase uses too much time and you have enough free > memory, you can try to raise the config variable pack.deltacachelimit > (default 1000). It will save an additional delta operation for all > object, whose delta is smaller than pack.deltacachelimit by caching > the delta. I have 4GB RAM and fast RAID 10 disk, writing happens very quickly. Pretty much everything is in RAM. > Have you considered the impact on memory usage, if there are large > blobs in the repository? The process size maxed at 650MB. I'm in 64b mode so there is no virtual memory limit. On 32b there's windowing code for accessing the packfile since we can run out of address space, does this code get turned off for 64b? > > While repacking, git keeps $window_size (default: 10) objects unpacked > in memory. For all (except one), it additionally stores the delta > index, which has about the same size as the object. > > So the worst case memory usage is "sizeof(biggest object)*(2*$window_size - 1)". > If you have blobs >=100 MB, you need some GB of memory. > > Partitioning the problem is not trivial: > > * To get not worse packing resultes, we must first sort all objects by > type, path, size. Then we can split split the list (for each task > one part), which we can deltify individually. > > The problems are: > > - We need more memory, as each tasks keeps its own window of > $window_size objects (+ delta indexes) in memory. > > - The list must be split in parts, which require the same amount of > time. This is difficult, as it depends on the size of the objects as > well as how they are stored (delta chain length). > > * On the other hand, we could run all try_delta operations for one object > parallel. This way, we would need not very much more memory, but > require more synchronization (and more complex code). This solution was my first thought too. Use the main thread to get everything needed for the object into RAM, then multi-thread the compute bound, in-memory delta search operation. Shared CPU caches might make this very fast. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack
  2007-08-12 13:49 ` Jon Smirl
@ 2007-08-14 3:12 ` Shawn O. Pearce
  2007-08-14 4:10 ` Jon Smirl
  ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread

From: Shawn O. Pearce @ 2007-08-14 3:12 UTC (permalink / raw)
To: Jon Smirl; +Cc: Martin Koegler, Git Mailing List

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 8/12/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote:
> > Have you considered the impact on memory usage, if there are large
> > blobs in the repository?
>
> The process size maxed at 650MB. I'm in 64b mode so there is no
> virtual memory limit.
>
> On 32b there's windowing code for accessing the packfile since we can
> run out of address space, does this code get turned off for 64b?

The windowing code you are talking about defaults as follows:

  Parameter                    32b    64b
  -----------------------------------------
  core.packedGitWindowSize     32M     1G
  core.packedGitLimit         256M     8G

So I doubt you are having issues with the windowing code on a 64b
system, unless your repository is just *huge*. I did not think that
anyone had a Git repository that exceeded 8G, though the window
size of 1G might be a tad too small if there are many packfiles
and they are each larger than 1G.

> > * On the other hand, we could run all try_delta operations for one object
> > parallel. This way, we would need not very much more memory, but
> > require more synchronization (and more complex code).
>
> This solution was my first thought too. Use the main thread to get
> everything needed for the object into RAM, then multi-thread the
> compute bound, in-memory delta search operation. Shared CPU caches
> might make this very fast.

I have been thinking about doing this, especially now that the
default window size is much larger. I think the default is up as
high as 50, which means we'd keep that shiny new UltraSPARC T2 busy.
Not that I have one... so anyone from Sun is welcome to send me
one if they want. ;-)

I'm not sure it's that complex to run all try_delta calls of the
current window in parallel. Might be a simple enough change that
it's actually worth the extra complexity, especially with these
multi-core systems being so readily available. Repacking is the
most CPU intensive operation Git performs, and the one that is also
the easiest to make parallel.

Maybe someone else will beat me to it, but if not I might give such
a patch a shot in a few weeks.

-- 
Shawn.

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-14 3:12 ` Shawn O. Pearce @ 2007-08-14 4:10 ` Jon Smirl 2007-08-14 5:13 ` Shawn O. Pearce 2007-08-14 14:52 ` Nicolas Pitre 2007-08-14 21:41 ` Nicolas Pitre 2 siblings, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-14 4:10 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Martin Koegler, Git Mailing List On 8/13/07, Shawn O. Pearce <spearce@spearce.org> wrote: > > On 32b there's windowing code for accessing the packfile since we can > > run out of address space, does this code get turned off for 64b? > > The windowing code you are talking about defaults as follows: > > Parameter 32b 64b > ----------------------------------------- > core.packedGitWindowSize 32M 1G > core.packedGitLimit 256M 8G > > So I doubt you are having issues with the windowing code on a 64b > system, unless your repository is just *huge*. I did not think that > anyone had a Git repository that exceeded 8G, though the window > size of 1G might be a tad too small if there are many packfiles > and they are each larger than 1G. Why use windows on 64b? Default core.packedGitWindowSize equal to core.packedGitLimit I haven't measured it but I suspect the OS calls for moving the windows are are quite slow on a relative basis since they have to rewrite a bunch of page tables. Why is the window so small on 32b? I thought we were up to about a 1GB packfile before running out of address space with Mozilla. Shouldn't the window simply be set as large as possible on 32b, this size being a function of the available address space, not the amount of physical memory? > > > * On the other hand, we could run all try_delta operations for one object > > > parallel. This way, we would need not very much more memory, but > > > require more synchronization (and more complex code). > > > > This solution was my first thought too. Use the main thread to get > > everything needed for the object into RAM, then multi-thread the > > compute bound, in-memory delta search operation. Shared CPU caches > > might make this very fast. > > I have been thinking about doing this, especially now that the > default window size is much larger. I think the default is up as > high as 50, which means we'd keep that shiny new UltraSPARC T2 busy. > Not that I have one... so anyone from Sun is welcome to send me > one if they want. ;-) > > I'm not sure its that complex to run all try_delta calls of the > current window in parallel. Might be a simple enough change that > its actually worth the extra complexity, especially with these > multi-core systems being so readily available. Repacking is the > most CPU intensive operation Git performs, and the one that is also > the easiest to make parallel. > > Maybe someone else will beat me to it, but if not I might give such > a patch a shot in a few weeks. I'll test it. But my assignment this week is to figure out how to program the BestComm DMA engine in a PowerPC chip. It's oh so much fun trying to figure out how to program these "engines" that ASIC designers love to build and never fully document. > > -- > Shawn. > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
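Both window parameters are plain config variables, so the "just set them big
on 64b" experiment Jon proposes needs no code change; a sketch (the sizes are
arbitrary examples, not tuned recommendations):

  git config core.packedGitWindowSize 4g
  git config core.packedGitLimit 24g
  time git repack -a -d -f    # compare wall-clock time against the defaults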
* Re: performance on repack 2007-08-14 4:10 ` Jon Smirl @ 2007-08-14 5:13 ` Shawn O. Pearce 2007-08-14 5:57 ` Jon Smirl 0 siblings, 1 reply; 30+ messages in thread From: Shawn O. Pearce @ 2007-08-14 5:13 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Koegler, Git Mailing List Jon Smirl <jonsmirl@gmail.com> wrote: > On 8/13/07, Shawn O. Pearce <spearce@spearce.org> wrote: > > > On 32b there's windowing code for accessing the packfile since we can > > > run out of address space, does this code get turned off for 64b? > > > > The windowing code you are talking about defaults as follows: > > > > Parameter 32b 64b > > ----------------------------------------- > > core.packedGitWindowSize 32M 1G > > core.packedGitLimit 256M 8G > > > > So I doubt you are having issues with the windowing code on a 64b > > system, unless your repository is just *huge*. I did not think that > > anyone had a Git repository that exceeded 8G, though the window > > size of 1G might be a tad too small if there are many packfiles > > and they are each larger than 1G. > > Why use windows on 64b? Default core.packedGitWindowSize equal to > core.packedGitLimit That's *not* a good idea when you have more than one packfile. The limit is for the sum of all packfiles. The settings above allow up to 8 packfiles to be opened and mapped at once on 64b systems, with each packfile being up to 1G in size before we start shifting the window(s) around. Doing as you suggest would reduce the number of open packfiles to 1, which would severely hurt performance when there is more than one packfile and Git keeps bouncing around between them to satisfy the current process' demands. One could probably argue that the defaults for 64b are too small; perhaps they should be closer to 4G/24G seeing as how the 64b address space is so huge that we're unlikely to run into issues with being able to use >24G of virtual address at once. > I haven't measured it but I suspect the OS calls for moving the > windows are are quite slow on a relative basis since they have to > rewrite a bunch of page tables. Maybe. Add a call to pack_report() at the end of the program you are interested in and run it. We keep track of how often we move windows around; you may find that we don't move them often enough (or at all) to cause problems here. Or just run it under strace and watch mmap() activity, filtering out the uninteresting bits. > Why is the window so small on 32b? I > thought we were up to about a 1GB packfile before running out of > address space with Mozilla. Shouldn't the window simply be set as > large as possible on 32b, this size being a function of the available > address space, not the amount of physical memory? Because programs need to malloc() stuff to work. And we need stack space. And we need to let the runtime linker mmap() in the shared libraries we are linked to. All in all we do get tight in some 32b cases. The above defaults for 32b were chosen based on the Linux kernel repository (its under 256M) and based on some (crude) performance testing on Linux (which seemed to say the 32M packedGitWindowSize wasn't really hurting us). They were basically set to give us maximum address space for working heap and yet not have a negative impact on one of our (at the time) largest user groups. In particular repack (aka git-pack-objects) is a real memory pig, especially now with its various caches. The more address space we can let it use in a 32b case the better off we probably are. 
If someone can show that increasing these 32b defaults is the right thing to do even in very large repositories, *especially* with something really brutal like `git-blame` on a very busy file or `git repack -f -a` then please submit a patch to boost them. ;-) -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-14 5:13 ` Shawn O. Pearce @ 2007-08-14 5:57 ` Jon Smirl 0 siblings, 0 replies; 30+ messages in thread From: Jon Smirl @ 2007-08-14 5:57 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Martin Koegler, Git Mailing List On 8/14/07, Shawn O. Pearce <spearce@spearce.org> wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > One could probably argue that the defaults for 64b are too small; > perhaps they should be closer to 4G/24G seeing as how the 64b address > space is so huge that we're unlikely to run into issues with being > able to use >24G of virtual address at once. We've got 24,000GB of address space to play with. Why not set them 64G/512GB and forget about them for a decade? > > I haven't measured it but I suspect the OS calls for moving the > > windows are are quite slow on a relative basis since they have to > > rewrite a bunch of page tables. > > Maybe. Add a call to pack_report() at the end of the program you > are interested in and run it. We keep track of how often we move > windows around; you may find that we don't move them often enough > (or at all) to cause problems here. Or just run it under strace > and watch mmap() activity, filtering out the uninteresting bits. > > > Why is the window so small on 32b? I > > thought we were up to about a 1GB packfile before running out of > > address space with Mozilla. Shouldn't the window simply be set as > > large as possible on 32b, this size being a function of the available > > address space, not the amount of physical memory? > > Because programs need to malloc() stuff to work. And we need > stack space. And we need to let the runtime linker mmap() in the > shared libraries we are linked to. All in all we do get tight in > some 32b cases. The above defaults for 32b were chosen based on > the Linux kernel repository (its under 256M) and based on some > (crude) performance testing on Linux (which seemed to say the > 32M packedGitWindowSize wasn't really hurting us). They were > basically set to give us maximum address space for working heap > and yet not have a negative impact on one of our (at the time) > largest user groups. Kernel is barely under 256M. Do we need two limits, could we just start with the big one and assign first packfile fully to it, then start dividing it up dynamically as more pack files are opened? core.packedGitLimit core.packedGitWindowSize The real question is, what's the access pattern into the windows? If it is random then the windows get moved around continuously. If is it more or less sequential the windows will hardly get moved at all. Of course multi-threading the diff computation is going to do far more for me than playing with the window size. > In particular repack (aka git-pack-objects) is a real memory pig, > especially now with its various caches. The more address space we > can let it use in a 32b case the better off we probably are. > > If someone can show that increasing these 32b defaults is the > right thing to do even in very large repositories, *especially* > with something really brutal like `git-blame` on a very busy file or > `git repack -f -a` then please submit a patch to boost them. ;-) > > -- > Shawn. > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
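One low-effort way to answer Jon's access-pattern question is to watch the
mapping syscalls during a repack instead of instrumenting the code with
pack_report(); a sketch using strace (syscall names vary slightly between
32b and 64b platforms):

  strace -f -e trace=mmap,munmap -o repack.trace git repack -a -d -f
  grep -c munmap repack.trace   # a high count suggests windows are recycled often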
* Re: performance on repack 2007-08-14 3:12 ` Shawn O. Pearce 2007-08-14 4:10 ` Jon Smirl @ 2007-08-14 14:52 ` Nicolas Pitre 2007-08-14 21:41 ` Nicolas Pitre 2 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-14 14:52 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > > This solution was my first thought too. Use the main thread to get > > everything needed for the object into RAM, then multi-thread the > > compute bound, in-memory delta search operation. Shared CPU caches > > might make this very fast. > > I have been thinking about doing this, especially now that the > default window size is much larger. I think the default is up as > high as 50, which means we'd keep that shiny new UltraSPARC T2 busy. > Not that I have one... so anyone from Sun is welcome to send me > one if they want. ;-) Note that the default of 50 applies to the maximum delta depth, not the delta search window. And the delta depth limit is costless on the packing side. I, too, wouldn't mind the UltraSPARC T2 though. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-14 3:12 ` Shawn O. Pearce 2007-08-14 4:10 ` Jon Smirl 2007-08-14 14:52 ` Nicolas Pitre @ 2007-08-14 21:41 ` Nicolas Pitre 2007-08-15 1:20 ` Jon Smirl 2007-08-15 5:32 ` Shawn O. Pearce 2 siblings, 2 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-14 21:41 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > I'm not sure its that complex to run all try_delta calls of the > current window in parallel. Might be a simple enough change that > its actually worth the extra complexity, especially with these > multi-core systems being so readily available. Repacking is the > most CPU intensive operation Git performs, and the one that is also > the easiest to make parallel. > > Maybe someone else will beat me to it, but if not I might give such > a patch a shot in a few weeks. Well, here's my quick attempt at it. Unfortunately, performance isn't as good as I'd expected, especially with relatively small blobs like those found in the linux kernel repo. It looks like the overhead of thread creation/joining might be significant compared to the actual delta computation. I have a P4 with HT which might behave differently from a real SMP machine, or whatever, but the CPU usage never exceeded 110% according to top (sometimes it even dropped below 95%). Actually, a git-repack gets much slower due to 2m27s of system time compared to 0m03s without threads. And this is with NPTL. With a repo containing big blobs it is different though. CPU usage firmly gets to 200% hence all cores are saturated with work. So clearly the naive approach of spawning a thread for each combination in the delta window doesn't work. Remains the approach of calling find_deltas() n times with 1/n of the delta_list, one call per thread, for the bulk of the delta search work. This might even be much simpler than my current patch is. However this approach will require n times the memory for the delta window data. Thread creation overhead will occur only once. Also... read_sha1_file() is currently not thread safe, and threaded delta searching would requires that its usage be serialized with a global mutex (not done in this patch which is a bug), or ideally be made thread aware. Nicolas --- diff --git a/Makefile b/Makefile index 4eb4637..c3c6e68 100644 --- a/Makefile +++ b/Makefile @@ -122,6 +122,9 @@ all:: # If not set it defaults to the bare 'wish'. If it is set to the empty # string then NO_TCLTK will be forced (this is used by configure script). # +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit +# parallel delta searching when packing objects. 
+# GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR COMPAT_OBJS += compat/hstrerror.o endif +ifdef THREADED_DELTA_SEARCH + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH + EXTLIBS += -lpthread +endif + ifeq ($(TCLTK_PATH),) NO_TCLTK=NoThanks endif diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 5e9d1fd..142b6dd 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -15,6 +15,10 @@ #include "list-objects.h" #include "progress.h" +#ifdef THREADED_DELTA_SEARCH +#include <pthread.h> +#endif + static const char pack_usage[] = "\ git-pack-objects [{ -q | --progress | --all-progress }] \n\ [--max-pack-size=N] [--local] [--incremental] \n\ @@ -1273,6 +1277,12 @@ struct unpacked { void *data; struct delta_index *index; unsigned depth; +#ifdef THREADED_DELTA_SEARCH + pthread_t thread; + pthread_mutex_t mutex; + struct unpacked *trg; + unsigned trg_max_depth; +#endif }; static int delta_cacheable(struct unpacked *trg, struct unpacked *src, @@ -1292,6 +1302,18 @@ static int delta_cacheable(struct unpacked *trg, struct unpacked *src, return 0; } +#ifdef THREADED_DELTA_SEARCH +#define delta_lock_init(e) pthread_mutex_init(&(e)->mutex, NULL) +#define delta_lock_destroy(e) pthread_mutex_destroy(&(e)->mutex) +#define delta_lock(e) pthread_mutex_lock(&(e)->mutex) +#define delta_unlock(e) pthread_mutex_unlock(&(e)->mutex) +#else +#define delta_lock_init(e) 0 +#define delta_lock_destroy(e) 0 +#define delta_lock(e) 0 +#define delta_unlock(e) 0 +#endif + /* * We search for deltas _backwards_ in a list sorted by type and * by size, so that we see progressively smaller and smaller files. @@ -1335,6 +1357,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; /* Now some size filtering heuristics. */ + delta_lock(trg); trg_size = trg_entry->size; if (!trg_entry->delta) { max_size = trg_size/2 - 20; @@ -1343,6 +1366,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, max_size = trg_entry->delta_size; ref_depth = trg->depth; } + delta_unlock(trg); max_size = max_size * (max_depth - src->depth) / (max_depth - ref_depth + 1); if (max_size == 0) @@ -1355,6 +1379,8 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; /* Load data if not already done */ + + delta_lock(trg); if (!trg->data) { trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); if (sz != trg_size) @@ -1362,6 +1388,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, sha1_to_hex(trg_entry->idx.sha1), sz, trg_size); window_memory_usage += sz; } + delta_unlock(trg); + + delta_lock(src); if (!src->data) { src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); if (sz != src_size) @@ -1375,20 +1404,30 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, static int warned = 0; if (!warned++) warning("suboptimal pack - out of memory"); + delta_unlock(src); return 0; } window_memory_usage += sizeof_delta_index(src->index); } + delta_unlock(src); delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size); if (!delta_buf) return 0; + delta_lock(trg); if (trg_entry->delta_data) { - /* Prefer only shallower same-sized deltas. */ - if (delta_size == trg_entry->delta_size && - src->depth + 1 >= trg->depth) { +#ifdef THREADED_DELTA_SEARCH + /* A better match might have been found in the mean time */ + max_size = trg_entry->delta_size * (max_depth - src->depth) / + (max_depth - trg->depth + 1); +#endif + /* ... 
also prefer only shallower same-sized deltas. */ + if (delta_size > max_size || + (delta_size == trg_entry->delta_size && + src->depth + 1 >= trg->depth)) { free(delta_buf); + delta_unlock(trg); return 0; } delta_cache_size -= trg_entry->delta_size; @@ -1404,9 +1443,21 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, delta_cache_size += trg_entry->delta_size; } else free(delta_buf); + delta_unlock(trg); return 1; } +#ifdef THREADED_DELTA_SEARCH + +static void *threaded_try_delta(void *arg) +{ + struct unpacked *src = arg; + int ret = try_delta(src->trg, src, src->trg_max_depth); + return (void *)ret; +} + +#endif + static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) { struct object_entry *child = me->delta_child; @@ -1439,19 +1490,20 @@ static void find_deltas(struct object_entry **list, int window, int depth) uint32_t i = nr_objects, idx = 0, count = 0, processed = 0; unsigned int array_size = window * sizeof(struct unpacked); struct unpacked *array; - int max_depth; + int j, max_depth; if (!nr_objects) return; array = xmalloc(array_size); memset(array, 0, array_size); + for (j = 0; j < window; j++) + delta_lock_init(&array[j]); if (progress) start_progress(&progress_state, "Deltifying %u objects...", "", nr_result); do { struct object_entry *entry = list[--i]; struct unpacked *n = array + idx; - int j; if (!entry->preferred_base) processed++; @@ -1494,6 +1546,38 @@ static void find_deltas(struct object_entry **list, int window, int depth) goto next; } +#ifdef THREADED_DELTA_SEARCH + /* Create all threads ... */ + j = window; + while (--j > 0) { + int ret; + uint32_t other_idx = idx + j; + struct unpacked *m; + if (other_idx >= window) + other_idx -= window; + m = array + other_idx; + if (!m->entry) + break; + m->trg = n; + m->trg_max_depth= max_depth; + ret = pthread_create(&m->thread, NULL, + threaded_try_delta, m); + if (ret) + die("unable to create thread: %s", strerror(ret)); + } + /* ... then join them. */ + while (++j < window) { + int ret; + uint32_t other_idx = idx + j; + struct unpacked *m; + if (other_idx >= window) + other_idx -= window; + m = array + other_idx; + ret = pthread_join(m->thread, NULL); + if (ret) + die("unable to join thread: %s", strerror(ret)); + } +#else j = window; while (--j > 0) { uint32_t other_idx = idx + j; @@ -1506,6 +1590,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) if (try_delta(n, m, max_depth) < 0) break; } +#endif /* if we made n a delta, and if n is already at max * depth, leaving it in the window is pointless. we @@ -1528,6 +1613,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) for (i = 0; i < window; ++i) { free_delta_index(array[i].index); free(array[i].data); + delta_lock_destroy(&array[i]); } free(array); } ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-14 21:41 ` Nicolas Pitre @ 2007-08-15 1:20 ` Jon Smirl 2007-08-15 1:59 ` Nicolas Pitre 2007-08-15 5:32 ` Shawn O. Pearce 1 sibling, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-15 1:20 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List On 8/14/07, Nicolas Pitre <nico@cam.org> wrote: > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > > > I'm not sure its that complex to run all try_delta calls of the > > current window in parallel. Might be a simple enough change that > > its actually worth the extra complexity, especially with these > > multi-core systems being so readily available. Repacking is the > > most CPU intensive operation Git performs, and the one that is also > > the easiest to make parallel. > > > > Maybe someone else will beat me to it, but if not I might give such > > a patch a shot in a few weeks. > > Well, here's my quick attempt at it. Unfortunately, performance isn't > as good as I'd expected, especially with relatively small blobs like > those found in the linux kernel repo. It looks like the overhead of > thread creation/joining might be significant compared to the actual > delta computation. I have a P4 with HT which might behave differently > from a real SMP machine, or whatever, but the CPU usage never exceeded > 110% according to top (sometimes it even dropped below 95%). Actually, a > git-repack gets much slower due to 2m27s of system time compared to > 0m03s without threads. And this is with NPTL. Thread creation/destruction overhead is way too high to make these threads for every delta. Another strategy is to create four worker threads once when the process is loaded. Then use synchronization primitives to feed the threads lumps of work. The threads persist for the life of the process. > > With a repo containing big blobs it is different though. CPU usage > firmly gets to 200% hence all cores are saturated with work. > > So clearly the naive approach of spawning a thread for each combination > in the delta window doesn't work. > > Remains the approach of calling find_deltas() n times with 1/n of the > delta_list, one call per thread, for the bulk of the delta search work. > This might even be much simpler than my current patch is. However this > approach will require n times the memory for the delta window data. > Thread creation overhead will occur only once. > > Also... read_sha1_file() is currently not thread safe, and threaded > delta searching would requires that its usage be serialized with a > global mutex (not done in this patch which is a bug), or ideally be made > thread aware. > > > Nicolas > > --- > > diff --git a/Makefile b/Makefile > index 4eb4637..c3c6e68 100644 > --- a/Makefile > +++ b/Makefile > @@ -122,6 +122,9 @@ all:: > # If not set it defaults to the bare 'wish'. If it is set to the empty > # string then NO_TCLTK will be forced (this is used by configure script). > # > +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit > +# parallel delta searching when packing objects. 
> +# > > GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE > @$(SHELL_PATH) ./GIT-VERSION-GEN > @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR > COMPAT_OBJS += compat/hstrerror.o > endif > > +ifdef THREADED_DELTA_SEARCH > + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH > + EXTLIBS += -lpthread > +endif > + > ifeq ($(TCLTK_PATH),) > NO_TCLTK=NoThanks > endif > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c > index 5e9d1fd..142b6dd 100644 > --- a/builtin-pack-objects.c > +++ b/builtin-pack-objects.c > @@ -15,6 +15,10 @@ > #include "list-objects.h" > #include "progress.h" > > +#ifdef THREADED_DELTA_SEARCH > +#include <pthread.h> > +#endif > + > static const char pack_usage[] = "\ > git-pack-objects [{ -q | --progress | --all-progress }] \n\ > [--max-pack-size=N] [--local] [--incremental] \n\ > @@ -1273,6 +1277,12 @@ struct unpacked { > void *data; > struct delta_index *index; > unsigned depth; > +#ifdef THREADED_DELTA_SEARCH > + pthread_t thread; > + pthread_mutex_t mutex; > + struct unpacked *trg; > + unsigned trg_max_depth; > +#endif > }; > > static int delta_cacheable(struct unpacked *trg, struct unpacked *src, > @@ -1292,6 +1302,18 @@ static int delta_cacheable(struct unpacked *trg, struct unpacked *src, > return 0; > } > > +#ifdef THREADED_DELTA_SEARCH > +#define delta_lock_init(e) pthread_mutex_init(&(e)->mutex, NULL) > +#define delta_lock_destroy(e) pthread_mutex_destroy(&(e)->mutex) > +#define delta_lock(e) pthread_mutex_lock(&(e)->mutex) > +#define delta_unlock(e) pthread_mutex_unlock(&(e)->mutex) > +#else > +#define delta_lock_init(e) 0 > +#define delta_lock_destroy(e) 0 > +#define delta_lock(e) 0 > +#define delta_unlock(e) 0 > +#endif > + > /* > * We search for deltas _backwards_ in a list sorted by type and > * by size, so that we see progressively smaller and smaller files. > @@ -1335,6 +1357,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > return 0; > > /* Now some size filtering heuristics. */ > + delta_lock(trg); > trg_size = trg_entry->size; > if (!trg_entry->delta) { > max_size = trg_size/2 - 20; > @@ -1343,6 +1366,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > max_size = trg_entry->delta_size; > ref_depth = trg->depth; > } > + delta_unlock(trg); > max_size = max_size * (max_depth - src->depth) / > (max_depth - ref_depth + 1); > if (max_size == 0) > @@ -1355,6 +1379,8 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > return 0; > > /* Load data if not already done */ > + > + delta_lock(trg); > if (!trg->data) { > trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); > if (sz != trg_size) > @@ -1362,6 +1388,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > sha1_to_hex(trg_entry->idx.sha1), sz, trg_size); > window_memory_usage += sz; > } > + delta_unlock(trg); > + > + delta_lock(src); > if (!src->data) { > src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); > if (sz != src_size) > @@ -1375,20 +1404,30 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > static int warned = 0; > if (!warned++) > warning("suboptimal pack - out of memory"); > + delta_unlock(src); > return 0; > } > window_memory_usage += sizeof_delta_index(src->index); > } > + delta_unlock(src); > > delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size); > if (!delta_buf) > return 0; > > + delta_lock(trg); > if (trg_entry->delta_data) { > - /* Prefer only shallower same-sized deltas. 
*/ > - if (delta_size == trg_entry->delta_size && > - src->depth + 1 >= trg->depth) { > +#ifdef THREADED_DELTA_SEARCH > + /* A better match might have been found in the mean time */ > + max_size = trg_entry->delta_size * (max_depth - src->depth) / > + (max_depth - trg->depth + 1); > +#endif > + /* ... also prefer only shallower same-sized deltas. */ > + if (delta_size > max_size || > + (delta_size == trg_entry->delta_size && > + src->depth + 1 >= trg->depth)) { > free(delta_buf); > + delta_unlock(trg); > return 0; > } > delta_cache_size -= trg_entry->delta_size; > @@ -1404,9 +1443,21 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, > delta_cache_size += trg_entry->delta_size; > } else > free(delta_buf); > + delta_unlock(trg); > return 1; > } > > +#ifdef THREADED_DELTA_SEARCH > + > +static void *threaded_try_delta(void *arg) > +{ > + struct unpacked *src = arg; > + int ret = try_delta(src->trg, src, src->trg_max_depth); > + return (void *)ret; > +} > + > +#endif > + > static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) > { > struct object_entry *child = me->delta_child; > @@ -1439,19 +1490,20 @@ static void find_deltas(struct object_entry **list, int window, int depth) > uint32_t i = nr_objects, idx = 0, count = 0, processed = 0; > unsigned int array_size = window * sizeof(struct unpacked); > struct unpacked *array; > - int max_depth; > + int j, max_depth; > > if (!nr_objects) > return; > array = xmalloc(array_size); > memset(array, 0, array_size); > + for (j = 0; j < window; j++) > + delta_lock_init(&array[j]); > if (progress) > start_progress(&progress_state, "Deltifying %u objects...", "", nr_result); > > do { > struct object_entry *entry = list[--i]; > struct unpacked *n = array + idx; > - int j; > > if (!entry->preferred_base) > processed++; > @@ -1494,6 +1546,38 @@ static void find_deltas(struct object_entry **list, int window, int depth) > goto next; > } > > +#ifdef THREADED_DELTA_SEARCH > + /* Create all threads ... */ > + j = window; > + while (--j > 0) { > + int ret; > + uint32_t other_idx = idx + j; > + struct unpacked *m; > + if (other_idx >= window) > + other_idx -= window; > + m = array + other_idx; > + if (!m->entry) > + break; > + m->trg = n; > + m->trg_max_depth= max_depth; > + ret = pthread_create(&m->thread, NULL, > + threaded_try_delta, m); > + if (ret) > + die("unable to create thread: %s", strerror(ret)); > + } > + /* ... then join them. */ > + while (++j < window) { > + int ret; > + uint32_t other_idx = idx + j; > + struct unpacked *m; > + if (other_idx >= window) > + other_idx -= window; > + m = array + other_idx; > + ret = pthread_join(m->thread, NULL); > + if (ret) > + die("unable to join thread: %s", strerror(ret)); > + } > +#else > j = window; > while (--j > 0) { > uint32_t other_idx = idx + j; > @@ -1506,6 +1590,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) > if (try_delta(n, m, max_depth) < 0) > break; > } > +#endif > > /* if we made n a delta, and if n is already at max > * depth, leaving it in the window is pointless. we > @@ -1528,6 +1613,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) > for (i = 0; i < window; ++i) { > free_delta_index(array[i].index); > free(array[i].data); > + delta_lock_destroy(&array[i]); > } > free(array); > } > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 1:20 ` Jon Smirl @ 2007-08-15 1:59 ` Nicolas Pitre 0 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-15 1:59 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List On Tue, 14 Aug 2007, Jon Smirl wrote: > On 8/14/07, Nicolas Pitre <nico@cam.org> wrote: > > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > > > > > I'm not sure its that complex to run all try_delta calls of the > > > current window in parallel. Might be a simple enough change that > > > its actually worth the extra complexity, especially with these > > > multi-core systems being so readily available. Repacking is the > > > most CPU intensive operation Git performs, and the one that is also > > > the easiest to make parallel. > > > > > > Maybe someone else will beat me to it, but if not I might give such > > > a patch a shot in a few weeks. > > > > Well, here's my quick attempt at it. Unfortunately, performance isn't > > as good as I'd expected, especially with relatively small blobs like > > those found in the linux kernel repo. It looks like the overhead of > > thread creation/joining might be significant compared to the actual > > delta computation. I have a P4 with HT which might behave differently > > from a real SMP machine, or whatever, but the CPU usage never exceeded > > 110% according to top (sometimes it even dropped below 95%). Actually, a > > git-repack gets much slower due to 2m27s of system time compared to > > 0m03s without threads. And this is with NPTL. > > Thread creation/destruction overhead is way too high to make these > threads for every delta. > > Another strategy is to create four worker threads once when the > process is loaded. Then use synchronization primitives to feed the > threads lumps of work. The threads persist for the life of the > process. Still, those synchronization primitives would have to be activated for every delta which might also add some overhead. But there is another issue to consider: delta searching is limited by previous results for the same delta. If first attempt for a delta produces a 10x reduction, then the next delta computation has to produce less than 1/10 the original object size or it is aborted early. And so on for subsequent attempts. When performing delta computations in parallel for the same target then early delta computation abort cannot occur since no result is initially available to further limit delta processing. Segmenting the list of objects to deltify into sub-lists for individual threads solves both issues. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
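The segmented approach Nicolas describes — create the threads once and hand
each one its own slice of the sorted object list to deltify — can be sketched
in isolation roughly as follows. Everything here is a stand-in: obj_t,
find_deltas_slice() and the counts are not git's actual structures, and the
unused window/depth fields only hint at the per-slice window cost he mentions.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int size; } obj_t;   /* stand-in for struct object_entry */

struct slice { obj_t *list; int n; int window; int depth; };

/* stand-in for find_deltas(): each thread searches for deltas only inside its slice */
static void *find_deltas_slice(void *arg)
{
	struct slice *s = arg;
	int i;
	for (i = 0; i < s->n; i++)
		s->list[i].size ^= i;   /* placeholder for the real delta search */
	return NULL;
}

int main(void)
{
	enum { NR_OBJECTS = 1000, NR_THREADS = 4 };
	obj_t *list = calloc(NR_OBJECTS, sizeof(*list));
	pthread_t tid[NR_THREADS];
	struct slice part[NR_THREADS];
	int i, per = NR_OBJECTS / NR_THREADS, rest = NR_OBJECTS % NR_THREADS, off = 0;

	for (i = 0; i < NR_THREADS; i++) {
		part[i].list = list + off;
		part[i].n = per + (i < rest ? 1 : 0);   /* spread the remainder */
		part[i].window = 10;                    /* each slice pays for its own window */
		part[i].depth = 50;
		off += part[i].n;
		if (pthread_create(&tid[i], NULL, find_deltas_slice, &part[i]))
			return 1;
	}
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);   /* thread start-up cost is paid once, not per delta */

	printf("deltified %d objects in %d slices\n", off, NR_THREADS);
	free(list);
	return 0;
}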
* Re: performance on repack 2007-08-14 21:41 ` Nicolas Pitre 2007-08-15 1:20 ` Jon Smirl @ 2007-08-15 5:32 ` Shawn O. Pearce 2007-08-15 15:08 ` Jon Smirl 2007-08-15 20:49 ` Nicolas Pitre 1 sibling, 2 replies; 30+ messages in thread From: Shawn O. Pearce @ 2007-08-15 5:32 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jon Smirl, Martin Koegler, Git Mailing List Nicolas Pitre <nico@cam.org> wrote: > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > > > I'm not sure its that complex to run all try_delta calls of the > > current window in parallel. > > Well, here's my quick attempt at it. Unfortunately, performance isn't > as good as I'd expected, especially with relatively small blobs like > those found in the linux kernel repo. It looks like the overhead of > thread creation/joining might be significant compared to the actual > delta computation. I have a P4 with HT which might behave differently > from a real SMP machine, or whatever, but the CPU usage never exceeded > 110% according to top (sometimes it even dropped below 95%). Actually, a > git-repack gets much slower due to 2m27s of system time compared to > 0m03s without threads. And this is with NPTL. Yea, I'm not surprised. This is quick and dirty but is really the wrong approach. The kernel is spending lots of time setting up the thread and its stack, then scheduling it onto a CPU, only to likely find it finish before it even yields the CPU or is interrupted off. So I'd expect to get huge system times here. Even Linux with its pretty amazing NPTL isn't up to the task. No threading package really is. > With a repo containing big blobs it is different though. CPU usage > firmly gets to 200% hence all cores are saturated with work. Right, that's what we want to see. ;-) What makes this bloody complex (yes, I just did disagree with myself) is you need to use a pool of threads, one per core, keep them running for the life of the delta phase (so we amortize the thread start/stop time) and also give them large enough bursts of data that they can almost always use their full time slice before being interrupted by the kernel. > Remains the approach of calling find_deltas() n times with 1/n of the > delta_list, one call per thread, for the bulk of the delta search work. > This might even be much simpler than my current patch is. However this > approach will require n times the memory for the delta window data. > Thread creation overhead will occur only once. Yea, that I agree with. The other thing is the main thread may need to push a couple of windows worth of work into the threads, so that if this window's 1/n unit goes really fast on that thread it doesn't stall out waiting for the main thread to get it more data. > Also... read_sha1_file() is currently not thread safe, and threaded > delta searching would requires that its usage be serialized with a > global mutex (not done in this patch which is a bug), or ideally be made > thread aware. Yea, that's a problem. From the Git library perspective that Luiz Capitulino has been working on for GSOC this summer being able to compile/link Git with -lpthreads and have it actually be thread safe would be useful. Unfortunately its waaaay more than just making read_sha1_file() threadsafe. :-/ read_sha1_file() is actually a reasonably critical part of the major functions of Git. Slowing that down by making it go through pthread_mutex_{lock,unlock} is going to hurt. I tried something like the following, and its a bit slower, and really ain't a great multi-thread aware implementation. 
# unmodified v1.5.3-rc4-91-g9fa3465 from Junio /usr/bin/time ../rl-master/git-rev-list HEAD . >/dev/null 1.67 real 1.42 user 0.20 sys 1.65 real 1.36 user 0.17 sys 1.67 real 1.35 user 0.16 sys 1.70 real 1.35 user 0.16 sys 1.64 real 1.35 user 0.16 sys # v1.5.3-rc4-91-g9fa3465 + patch below /usr/bin/time ../rl-pthread/git-rev-list HEAD . >/dev/null 2.86 real 1.48 user 0.29 sys 1.68 real 1.37 user 0.17 sys 1.59 real 1.37 user 0.16 sys 1.66 real 1.37 user 0.17 sys 1.68 real 1.37 user 0.17 sys That's on Mac OS X. I guess NPTL would probably be faster at this, but not so much faster as to make it not be an increase. -->8-- diff --git a/Makefile b/Makefile index 4eb4637..f31811b 100644 --- a/Makefile +++ b/Makefile @@ -710,7 +710,7 @@ SHELL_PATH_SQ = $(subst ','\'',$(SHELL_PATH)) PERL_PATH_SQ = $(subst ','\'',$(PERL_PATH)) TCLTK_PATH_SQ = $(subst ','\'',$(TCLTK_PATH)) -LIBS = $(GITLIBS) $(EXTLIBS) +LIBS = $(GITLIBS) $(EXTLIBS) -lpthread BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \ -DETC_GITCONFIG='"$(ETC_GITCONFIG_SQ)"' $(COMPAT_CFLAGS) diff --git a/sha1_file.c b/sha1_file.c index aca741b..c239789 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -6,6 +6,7 @@ * This handles basic git sha1 object files - packing, unpacking, * creation etc. */ +#include <pthread.h> #include "cache.h" #include "delta.h" #include "pack.h" @@ -1862,10 +1863,12 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type, void *read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size) { + static pthread_mutex_t locky = PTHREAD_MUTEX_INITIALIZER; unsigned long mapsize; void *map, *buf; struct cached_object *co; + pthread_mutex_lock(&locky); co = find_cached_object(sha1); if (co) { buf = xmalloc(co->size + 1); @@ -1873,20 +1876,26 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type, ((char*)buf)[co->size] = 0; *type = co->type; *size = co->size; + pthread_mutex_unlock(&locky); return buf; } buf = read_packed_sha1(sha1, type, size); - if (buf) + if (buf) { + pthread_mutex_unlock(&locky); return buf; + } map = map_sha1_file(sha1, &mapsize); if (map) { buf = unpack_sha1_file(map, mapsize, type, size, sha1); munmap(map, mapsize); + pthread_mutex_unlock(&locky); return buf; } reprepare_packed_git(); - return read_packed_sha1(sha1, type, size); + buf = read_packed_sha1(sha1, type, size); + pthread_mutex_unlock(&locky); + return buf; } void *read_object_with_reference(const unsigned char *sha1, -- Shawn. ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 5:32 ` Shawn O. Pearce @ 2007-08-15 15:08 ` Jon Smirl 2007-08-15 17:11 ` Martin Koegler 2007-08-15 21:05 ` Nicolas Pitre 2007-08-15 20:49 ` Nicolas Pitre 1 sibling, 2 replies; 30+ messages in thread From: Jon Smirl @ 2007-08-15 15:08 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Nicolas Pitre, Martin Koegler, Git Mailing List On 8/15/07, Shawn O. Pearce <spearce@spearce.org> wrote: > Nicolas Pitre <nico@cam.org> wrote: > > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > Even Linux with its pretty amazing NPTL isn't up to the task. > No threading package really is. You need something like fibers instead of threads for task that small. > Yea, that I agree with. The other thing is the main thread may need > to push a couple of windows worth of work into the threads, so that > if this window's 1/n unit goes really fast on that thread it doesn't > stall out waiting for the main thread to get it more data. > > > Also... read_sha1_file() is currently not thread safe, and threaded > > delta searching would requires that its usage be serialized with a > > global mutex (not done in this patch which is a bug), or ideally be made > > thread aware. You can avoid making all the low level calls thread safe by using the main thread to get everything into RAM before starting to search for the deltas. The worker threads would only deal with things completely in memory. You may need to ref count these in-memory objects if they are shared between worker threads. For simplicity the in-memory input objects should be read only by the threads. The worker threads create new structures to hand their results back to the main thread for writing to disk. A typical solution is to use a queue protected by locks. Main thread faults in all the needed objects to cache. Main thread builds a queue entry and increments reference count on all referenced objects. Main thread uses locks to add entry to queue, while queue is locked it removes any finished jobs. Main thread writes finished results to disks, decrements ref counts. Cache logic can then take over about when objects are actually deleted. Worker threads wait on the queue. When something is placed in the queue a waiting worker thread removes it, processes it, puts the results in RAM, and places the object back on the finished queue. Then waits for another object. It doesn't call into the main body of code. Initially I would just ignore very large objects while getting the basic code to work. After the basic code is working if a very large object is encountered when the main thread is faulting objects in, the main thread should just process this object on the spot using the existing low memory code. Hopefully the new adaptive read ahead code in the kernel will detect the pattern of the main thread faulting objects in and make good guesses about where to read ahead. This is key to not getting stuck waiting on IO to complete. If not the main thread could use ADVISE to give kernel read ahead clues about where it is going to read next. > Yea, that's a problem. From the Git library perspective that Luiz > Capitulino has been working on for GSOC this summer being able to > compile/link Git with -lpthreads and have it actually be thread > safe would be useful. Unfortunately its waaaay more than just > making read_sha1_file() threadsafe. :-/ > > read_sha1_file() is actually a reasonably critical part of the > major functions of Git. Slowing that down by making it go through > pthread_mutex_{lock,unlock} is going to hurt. 
I tried something > like the following, and its a bit slower, and really ain't a great > multi-thread aware implementation. > > # unmodified v1.5.3-rc4-91-g9fa3465 from Junio > /usr/bin/time ../rl-master/git-rev-list HEAD . >/dev/null > 1.67 real 1.42 user 0.20 sys > 1.65 real 1.36 user 0.17 sys > 1.67 real 1.35 user 0.16 sys > 1.70 real 1.35 user 0.16 sys > 1.64 real 1.35 user 0.16 sys > > # v1.5.3-rc4-91-g9fa3465 + patch below > /usr/bin/time ../rl-pthread/git-rev-list HEAD . >/dev/null > 2.86 real 1.48 user 0.29 sys > 1.68 real 1.37 user 0.17 sys > 1.59 real 1.37 user 0.16 sys > 1.66 real 1.37 user 0.17 sys > 1.68 real 1.37 user 0.17 sys > > That's on Mac OS X. I guess NPTL would probably be faster at this, > but not so much faster as to make it not be an increase. > > -->8-- > diff --git a/Makefile b/Makefile > index 4eb4637..f31811b 100644 > --- a/Makefile > +++ b/Makefile > @@ -710,7 +710,7 @@ SHELL_PATH_SQ = $(subst ','\'',$(SHELL_PATH)) > PERL_PATH_SQ = $(subst ','\'',$(PERL_PATH)) > TCLTK_PATH_SQ = $(subst ','\'',$(TCLTK_PATH)) > > -LIBS = $(GITLIBS) $(EXTLIBS) > +LIBS = $(GITLIBS) $(EXTLIBS) -lpthread > > BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \ > -DETC_GITCONFIG='"$(ETC_GITCONFIG_SQ)"' $(COMPAT_CFLAGS) > diff --git a/sha1_file.c b/sha1_file.c > index aca741b..c239789 100644 > --- a/sha1_file.c > +++ b/sha1_file.c > @@ -6,6 +6,7 @@ > * This handles basic git sha1 object files - packing, unpacking, > * creation etc. > */ > +#include <pthread.h> > #include "cache.h" > #include "delta.h" > #include "pack.h" > @@ -1862,10 +1863,12 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type, > void *read_sha1_file(const unsigned char *sha1, enum object_type *type, > unsigned long *size) > { > + static pthread_mutex_t locky = PTHREAD_MUTEX_INITIALIZER; > unsigned long mapsize; > void *map, *buf; > struct cached_object *co; > > + pthread_mutex_lock(&locky); > co = find_cached_object(sha1); > if (co) { > buf = xmalloc(co->size + 1); > @@ -1873,20 +1876,26 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type, > ((char*)buf)[co->size] = 0; > *type = co->type; > *size = co->size; > + pthread_mutex_unlock(&locky); > return buf; > } > > buf = read_packed_sha1(sha1, type, size); > - if (buf) > + if (buf) { > + pthread_mutex_unlock(&locky); > return buf; > + } > map = map_sha1_file(sha1, &mapsize); > if (map) { > buf = unpack_sha1_file(map, mapsize, type, size, sha1); > munmap(map, mapsize); > + pthread_mutex_unlock(&locky); > return buf; > } > reprepare_packed_git(); > - return read_packed_sha1(sha1, type, size); > + buf = read_packed_sha1(sha1, type, size); > + pthread_mutex_unlock(&locky); > + return buf; > } > > void *read_object_with_reference(const unsigned char *sha1, > -- > Shawn. > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
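The persistent worker-pool scheme Jon sketches in prose above — workers
created once, a locked queue fed by the main thread, jobs that are already
fully in memory — can be outlined with a mutex and condition variable roughly
as follows. job_t and deltify() are invented placeholders, not git code; the
jobs are plain integers standing in for objects the main thread has already
faulted into RAM.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_WORKERS 4
#define NR_JOBS    64

typedef struct job { struct job *next; int id; int result; } job_t;

static job_t *head;                 /* simple work list */
static int done;                    /* set when no more jobs will arrive */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void deltify(job_t *j)       /* placeholder for the in-memory delta search */
{
	j->result = j->id * j->id;
}

static void *worker(void *arg)
{
	job_t *j;
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		while (!head && !done)
			pthread_cond_wait(&cond, &lock);
		if (!head) {            /* done was set and the queue is drained */
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		j = head;
		head = j->next;
		pthread_mutex_unlock(&lock);
		deltify(j);             /* the CPU-bound work runs without the lock held */
		free(j);
	}
}

int main(void)
{
	pthread_t tid[NR_WORKERS];
	job_t *j;
	int i;

	for (i = 0; i < NR_WORKERS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);

	for (i = 0; i < NR_JOBS; i++) {   /* main thread feeds fully in-memory jobs */
		j = malloc(sizeof(*j));
		j->id = i;
		pthread_mutex_lock(&lock);
		j->next = head;
		head = j;
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
	}

	pthread_mutex_lock(&lock);
	done = 1;
	pthread_cond_broadcast(&cond);    /* wake everyone so they can exit */
	pthread_mutex_unlock(&lock);

	for (i = 0; i < NR_WORKERS; i++)
		pthread_join(tid[i], NULL);
	puts("all jobs processed");
	return 0;
}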
* Re: performance on repack 2007-08-15 15:08 ` Jon Smirl @ 2007-08-15 17:11 ` Martin Koegler 2007-08-15 18:38 ` Jon Smirl 2007-08-15 21:05 ` Nicolas Pitre 1 sibling, 1 reply; 30+ messages in thread From: Martin Koegler @ 2007-08-15 17:11 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn O. Pearce, Nicolas Pitre, Git Mailing List On Wed, Aug 15, 2007 at 11:08:38AM -0400, Jon Smirl wrote: > On 8/15/07, Shawn O. Pearce <spearce@spearce.org> wrote: > > > Also... read_sha1_file() is currently not thread safe, and threaded > > > delta searching would requires that its usage be serialized with a > > > global mutex (not done in this patch which is a bug), or ideally be made > > > thread aware. > > You can avoid making all the low level calls thread safe by using the > main thread to get everything into RAM before starting to search for > the deltas. The worker threads would only deal with things completely > in memory. You may need to ref count these in-memory objects if they > are shared between worker threads. For simplicity the in-memory input > objects should be read only by the threads. The worker threads create > new structures to hand their results back to the main thread for > writing to disk. git-pack-objects knows the order, in which it will use the objects. A seperate thread could pre-read the next object and wait until the main thread starts processing it. After the read is complete, another thread could start computing the delta index. git-pack-objects currently reads an object (and computes the delta index), if it is really necessary. With the pre-read, unnecessary operations would happen. > Initially I would just ignore very large objects while getting the > basic code to work. After the basic code is working if a very large > object is encountered when the main thread is faulting objects in, the > main thread should just process this object on the spot using the > existing low memory code. I expect that the biggest gain will be for big objects, as they require more time to read+unpack the source objects and compute the delta index as well as the delta. > > @@ -1862,10 +1863,12 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type, > > void *read_sha1_file(const unsigned char *sha1, enum object_type *type, > > unsigned long *size) > > { > > + static pthread_mutex_t locky = PTHREAD_MUTEX_INITIALIZER; > > unsigned long mapsize; > > void *map, *buf; > > struct cached_object *co; > > > > + pthread_mutex_lock(&locky); > > co = find_cached_object(sha1); > > if (co) { > > buf = xmalloc(co->size + 1); > > @@ -1873,20 +1876,26 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type, > > ((char*)buf)[co->size] = 0; > > *type = co->type; > > *size = co->size; > > + pthread_mutex_unlock(&locky); > > return buf; > > } > > > > buf = read_packed_sha1(sha1, type, size); Couldn't we release the mutex at this point? Why do we need to protect from concurrent access, when we are reading a loose object? > > - if (buf) > > + if (buf) { > > + pthread_mutex_unlock(&locky); > > return buf; > > + } > > map = map_sha1_file(sha1, &mapsize); > > if (map) { > > buf = unpack_sha1_file(map, mapsize, type, size, sha1); > > munmap(map, mapsize); > > + pthread_mutex_unlock(&locky); > > return buf; > > } > > reprepare_packed_git(); > > - return read_packed_sha1(sha1, type, size); > > + buf = read_packed_sha1(sha1, type, size); > > + pthread_mutex_unlock(&locky); > > + return buf; > > } mfg Martin Kögler ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 17:11 ` Martin Koegler @ 2007-08-15 18:38 ` Jon Smirl 2007-08-15 19:00 ` Nicolas Pitre 0 siblings, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-15 18:38 UTC (permalink / raw) To: Martin Koegler; +Cc: Shawn O. Pearce, Nicolas Pitre, Git Mailing List On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > git-pack-objects knows the order, in which it will use the objects. A > seperate thread could pre-read the next object and wait until the main > thread starts processing it. After the read is complete, another > thread could start computing the delta index. The hope is that the new adaptive read ahead code in the kernel will get this right and you won't need the second thread. Letting the kernel handle the read ahead will dynamically scale as other demands are made on the host. There's effectively only one read ahead cache in the system, only the kernel really knows how to divide it up between competing apps. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
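
For reference, the explicit read-ahead hint Jon mentioned earlier in the thread ("ADVISE") would presumably be something along the lines of posix_fadvise(); this is a hypothetical sketch, not something pack-objects did at the time:

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/*
 * Hypothetical hint: tell the kernel we will soon read this slice of
 * the pack so readahead can start before the delta search actually
 * touches the data.
 */
static int hint_willneed(int pack_fd, off_t offset, off_t len)
{
	return posix_fadvise(pack_fd, offset, len, POSIX_FADV_WILLNEED);
}

Whether such a hint beats the kernel's adaptive readahead is exactly the open question here; the call is advisory and the kernel is free to ignore it.
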
* Re: performance on repack 2007-08-15 18:38 ` Jon Smirl @ 2007-08-15 19:00 ` Nicolas Pitre 2007-08-15 19:42 ` Jon Smirl 2007-08-16 8:10 ` David Kastrup 0 siblings, 2 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-15 19:00 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Koegler, Shawn O. Pearce, Git Mailing List On Wed, 15 Aug 2007, Jon Smirl wrote: > On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > > git-pack-objects knows the order, in which it will use the objects. A > > seperate thread could pre-read the next object and wait until the main > > thread starts processing it. After the read is complete, another > > thread could start computing the delta index. > > The hope is that the new adaptive read ahead code in the kernel will > get this right and you won't need the second thread. Letting the > kernel handle the read ahead will dynamically scale as other demands > are made on the host. There's effectively only one read ahead cache in > the system, only the kernel really knows how to divide it up between > competing apps. No read ahead will ever help the delta search phase. Objects listed for deltification against each other are sorted in a way that results in reads from completely random location in the object store. Normally the delta search phase is so compute intensive that the read shouldn't matter much. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 19:00 ` Nicolas Pitre @ 2007-08-15 19:42 ` Jon Smirl 2007-08-16 8:10 ` David Kastrup 1 sibling, 0 replies; 30+ messages in thread From: Jon Smirl @ 2007-08-15 19:42 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Martin Koegler, Shawn O. Pearce, Git Mailing List On 8/15/07, Nicolas Pitre <nico@cam.org> wrote: > On Wed, 15 Aug 2007, Jon Smirl wrote: > > > On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > > > git-pack-objects knows the order, in which it will use the objects. A > > > seperate thread could pre-read the next object and wait until the main > > > thread starts processing it. After the read is complete, another > > > thread could start computing the delta index. > > > > The hope is that the new adaptive read ahead code in the kernel will > > get this right and you won't need the second thread. Letting the > > kernel handle the read ahead will dynamically scale as other demands > > are made on the host. There's effectively only one read ahead cache in > > the system, only the kernel really knows how to divide it up between > > competing apps. > > No read ahead will ever help the delta search phase. Objects listed for > deltification against each other are sorted in a way that results in > reads from completely random location in the object store. > > Normally the delta search phase is so compute intensive that the read > shouldn't matter much. I agree, I am compute bound for 20 min, disk light occasionally dimly flickers. Besides, the main thread that is preparing the queue effectively functions as a read ahead thread. It will overlap with the compute bound delta searches. > > > Nicolas > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 19:00 ` Nicolas Pitre 2007-08-15 19:42 ` Jon Smirl @ 2007-08-16 8:10 ` David Kastrup 2007-08-16 15:34 ` Nicolas Pitre 1 sibling, 1 reply; 30+ messages in thread From: David Kastrup @ 2007-08-16 8:10 UTC (permalink / raw) To: git Nicolas Pitre <nico@cam.org> writes: > On Wed, 15 Aug 2007, Jon Smirl wrote: > >> On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: >> > git-pack-objects knows the order, in which it will use the objects. A >> > seperate thread could pre-read the next object and wait until the main >> > thread starts processing it. After the read is complete, another >> > thread could start computing the delta index. >> >> The hope is that the new adaptive read ahead code in the kernel will >> get this right and you won't need the second thread. Letting the >> kernel handle the read ahead will dynamically scale as other demands >> are made on the host. There's effectively only one read ahead cache in >> the system, only the kernel really knows how to divide it up between >> competing apps. > > No read ahead will ever help the delta search phase. Well, the delta search phase consists of computing a delta index and then matching against it. If I understand correctly, delta indices for the search window are kept, and the current file is compared against them. Locality might be better if just one delta index gets calculated and then compared with all _upcoming_ delta candidates in one go. No idea whether it would pay off the complications. -- David Kastrup ^ permalink raw reply [flat|nested] 30+ messages in thread
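
In loop terms, David is proposing to flip which side of the comparison keeps the delta index. A rough sketch of the two shapes, with illustrative names only (the real code lives in try_delta()/find_deltas() in builtin-pack-objects.c):

struct delta_index;
struct obj {
	struct delta_index *index;	/* cached index for this object's data */
};

/* stand-ins for illustration only */
static struct delta_index *index_of(struct obj *o) { return o->index; }
static void try_against(struct obj *target, struct delta_index *src_index)
{
	(void)target; (void)src_index;	/* would call create_delta() here */
}

/* current shape: each window slot keeps its index, and the newest
 * object is deltified against every one of them */
static void window_as_is(struct obj **list, int n, int window)
{
	int i, j;
	for (i = 0; i < n; i++)
		for (j = 1; j <= window && i - j >= 0; j++)
			try_against(list[i], list[i - j]->index);
}

/* David's suggestion: index the current object once, then compare the
 * _upcoming_ candidates against that single index in one go */
static void window_inverted(struct obj **list, int n, int window)
{
	int i, j;
	for (i = 0; i < n; i++) {
		struct delta_index *idx = index_of(list[i]);
		for (j = 1; j <= window && i + j < n; j++)
			try_against(list[i + j], idx);
	}
}

Either way each object is indexed roughly once; the difference is the order in which data and indices are touched, which is where the locality argument comes in. As Nicolas notes below, this was tried and abandoned.
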
* Re: performance on repack 2007-08-16 8:10 ` David Kastrup @ 2007-08-16 15:34 ` Nicolas Pitre 2007-08-16 16:13 ` Jon Smirl 0 siblings, 1 reply; 30+ messages in thread From: Nicolas Pitre @ 2007-08-16 15:34 UTC (permalink / raw) To: David Kastrup; +Cc: git On Thu, 16 Aug 2007, David Kastrup wrote: > Nicolas Pitre <nico@cam.org> writes: > > > On Wed, 15 Aug 2007, Jon Smirl wrote: > > > >> On 8/15/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote: > >> > git-pack-objects knows the order, in which it will use the objects. A > >> > seperate thread could pre-read the next object and wait until the main > >> > thread starts processing it. After the read is complete, another > >> > thread could start computing the delta index. > >> > >> The hope is that the new adaptive read ahead code in the kernel will > >> get this right and you won't need the second thread. Letting the > >> kernel handle the read ahead will dynamically scale as other demands > >> are made on the host. There's effectively only one read ahead cache in > >> the system, only the kernel really knows how to divide it up between > >> competing apps. > > > > No read ahead will ever help the delta search phase. > > Well, the delta search phase consists of computing a delta index and > then matching against it. No, what I mean is what happens at a higher level where one object is deltified against several base object candidates to find the best match. Those several objects are presorted according to a combination of heuristics that makes their actual access completely random, so no kernel read ahead can help here. > If I understand correctly, delta indices > for the search window are kept, and the current file is compared > against them. Locality might be better if just one delta index gets > calculated and then compared with all _upcoming_ delta candidates in > one go. This appears so obvious that I attempted that a while ago already. The idea turned out to be so complex to implement correctly, and produced such suboptimal results in practice, that I abandoned it. See http://marc.info/?l=git&m=114610715706599&w=2 for the details if you're interested. PS: please at least CC me when replying to my mails Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-16 15:34 ` Nicolas Pitre @ 2007-08-16 16:13 ` Jon Smirl 2007-08-16 16:21 ` Nicolas Pitre 0 siblings, 1 reply; 30+ messages in thread From: Jon Smirl @ 2007-08-16 16:13 UTC (permalink / raw) To: Nicolas Pitre; +Cc: David Kastrup, git On 8/16/07, Nicolas Pitre <nico@cam.org> wrote: > On Thu, 16 Aug 2007, David Kastrup wrote: > > If I understand correctly, delta indices > > for the search window are kept, and the current file is compared > > against them. Locality might be better if just one delta index gets > > calculated and then compared with all _upcoming_ delta candidates in > > one go. > > This appears so obvious that I attempted that a while ago already. > > The idea turned up to be so complex to implement correctly and produced > suboptimal results in practice that I abandoned it. In practice it doesn't really matter what you do. Most developers have a decent amount of RAM. Disks run at 50MB/sec or more. The entire pack file ends up in the kernel disk cache within a few seconds. Turning the twenty minute delta search into 6-7 minutes is a much more obvious win. > > See http://marc.info/?l=git&m=114610715706599&w=2 for the details if > you're interested. > > PS: please at least CC me when replying to my mails > > > Nicolas > - > To unsubscribe from this list: send the line "unsubscribe git" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-16 16:13 ` Jon Smirl @ 2007-08-16 16:21 ` Nicolas Pitre 0 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-16 16:21 UTC (permalink / raw) To: Jon Smirl; +Cc: David Kastrup, git On Thu, 16 Aug 2007, Jon Smirl wrote: > On 8/16/07, Nicolas Pitre <nico@cam.org> wrote: > > On Thu, 16 Aug 2007, David Kastrup wrote: > > > If I understand correctly, delta indices > > > for the search window are kept, and the current file is compared > > > against them. Locality might be better if just one delta index gets > > > calculated and then compared with all _upcoming_ delta candidates in > > > one go. > > > > This appears so obvious that I attempted that a while ago already. > > > > The idea turned up to be so complex to implement correctly and produced > > suboptimal results in practice that I abandoned it. > > In practice it doesn't really matter what you do. Most developers have > a decent amount of RAM. Disks run at 50MB/sec or more. The entire pack > file ends up in the kernel disk cache within a few seconds. > > Turning the twenty minute delta search into 6-7 minutes is a much more > obvious win. And your point in relation with what was just said is ? Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 15:08 ` Jon Smirl 2007-08-15 17:11 ` Martin Koegler @ 2007-08-15 21:05 ` Nicolas Pitre 1 sibling, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-15 21:05 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List On Wed, 15 Aug 2007, Jon Smirl wrote: > You can avoid making all the low level calls thread safe by using the > main thread to get everything into RAM before starting to search for > the deltas. The worker threads would only deal with things completely > in memory. You may need to ref count these in-memory objects if they > are shared between worker threads. For simplicity the in-memory input > objects should be read only by the threads. The worker threads create > new structures to hand their results back to the main thread for > writing to disk. > > A typical solution is to use a queue protected by locks. Main thread > faults in all the needed objects to cache. Main thread builds a queue > entry and increments reference count on all referenced objects. Main > thread uses locks to add entry to queue, while queue is locked it > removes any finished jobs. Main thread writes finished results to > disks, decrements ref counts. Cache logic can then take over about > when objects are actually deleted. > > Worker threads wait on the queue. When something is placed in the > queue a waiting worker thread removes it, processes it, puts the > results in RAM, and places the object back on the finished queue. Then > waits for another object. It doesn't call into the main body of code. Way too complex and rather impractical with the current algorithms. Currently, information on objects is gathered (the "Counting objects" phase) and that can hardly be parallelized. Once objects are known then a sorted list is created so deltification of object x might be optimally attempted on objects x-1 through to x-10. Creating that list cannot be parallelized either, but it is quick anyway. Then comes the actual deltification phase where the huge cost is. The problem simply has to be partitioned into a few threads, where thread 1 deals with objects 1 to 100000 from that sorted list, thread 2 has objects 100001 to 200000, etc. etc. This is now a partitioning problem where the thread synchronization is dealt with at a higher and non performance critical level. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-15 5:32 ` Shawn O. Pearce 2007-08-15 15:08 ` Jon Smirl @ 2007-08-15 20:49 ` Nicolas Pitre 2007-08-30 4:27 ` Nicolas Pitre 1 sibling, 1 reply; 30+ messages in thread From: Nicolas Pitre @ 2007-08-15 20:49 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Wed, 15 Aug 2007, Shawn O. Pearce wrote: > Nicolas Pitre <nico@cam.org> wrote: > > On Mon, 13 Aug 2007, Shawn O. Pearce wrote: > > > > > I'm not sure its that complex to run all try_delta calls of the > > > current window in parallel. > > > > Well, here's my quick attempt at it. Unfortunately, performance isn't > > as good as I'd expected, especially with relatively small blobs like > > those found in the linux kernel repo. It looks like the overhead of > > thread creation/joining might be significant compared to the actual > > delta computation. I have a P4 with HT which might behave differently > > from a real SMP machine, or whatever, but the CPU usage never exceeded > > 110% according to top (sometimes it even dropped below 95%). Actually, a > > git-repack gets much slower due to 2m27s of system time compared to > > 0m03s without threads. And this is with NPTL. > > Yea, I'm not surprised. This is quick and dirty but is really the > wrong approach. The kernel is spending lots of time setting up the > thread and its stack, then scheduling it onto a CPU, only to likely > find it finish before it even yields the CPU or is interrupted off. > So I'd expect to get huge system times here. So... Here's a variation on the previous attempt. A thread is created for each delta window slot and it is awakened when there is a delta to compute. Now top is reporting an average of 175% CPU usage when delta searching on the linux repo. The threading overhead went down a lot, from 2m of system time to 42s. Still a far cry from the usual 3s when no threads are used though. > > With a repo containing big blobs it is different though. CPU usage > > firmly gets to 200% hence all cores are saturated with work. > > Right, that's what we want to see. ;-) > > What makes this bloody complex (yes, I just did disagree with > myself) is you need to use a pool of threads, one per core, keep > them running for the life of the delta phase (so we amortize the > thread start/stop time) and also give them large enough bursts of > data that they can almost always use their full time slice before > being interrupted by the kernel. This is still way suboptimal. The thread synchronization that has to be involved for _each_ delta attempt simply adds too much overhead. Also, all threads have to be done before sliding the window to the next object since deltification of the new object depends on the results from the previous window position. However, it is highly unlikely that all threads will be done with their share of work at the same time and therefore there will always be a certain amount of time when most threads will be idle while the last ones are still busy, up to the point where there is only one thread left to wait for before moving the window forward. This is what explains why I get around 175% CPU usage instead of 200% when repacking the linux kernel repo.
And still, the total amount of CPU cycles spent on the whole task is far greater than without any thread because, as I mentioned before, running delta computations in parallel doesn't allow early abort of create_delta() calls based on a good max_size value from a previous delta results because there is no such previous results when everything is computed in parallel. So, while the repack might be faster from a wall clock point of view, it wastes more CPU cycles in the end which is not really nice. > > Remains the approach of calling find_deltas() n times with 1/n of the > > delta_list, one call per thread, for the bulk of the delta search work. > > This might even be much simpler than my current patch is. However this > > approach will require n times the memory for the delta window data. > > Thread creation overhead will occur only once. > > Yea, that I agree with. The other thing is the main thread may need > to push a couple of windows worth of work into the threads, so that > if this window's 1/n unit goes really fast on that thread it doesn't > stall out waiting for the main thread to get it more data. It is easier to keep 4 threads busy with 100000 objects each to deal with in a conventional way than keeping 10 threads busy when processing a single delta window in parallel. Or the delta_list can be partitioned in smaller chunks and anytime one of the find_deltas() thread is done then it is fed another chunk. The side effect of that is that no delta will cross chunk boundaries. But just like when repacking multiple existing packs into one with delta data reuse, the actual chunk boundaries can be re-fed to find_deltas() in order to find more deltas across chunks. This approach will require less code than the patch below, it will have almost zero threading overhead and each thread will be sure to be busy for a long period of time. And the early create_delta() abort will just work too. > > Also... read_sha1_file() is currently not thread safe, and threaded > > delta searching would requires that its usage be serialized with a > > global mutex (not done in this patch which is a bug), or ideally be made > > thread aware. > > Yea, that's a problem. From the Git library perspective that Luiz > Capitulino has been working on for GSOC this summer being able to > compile/link Git with -lpthreads and have it actually be thread > safe would be useful. Unfortunately its waaaay more than just > making read_sha1_file() threadsafe. :-/ > > read_sha1_file() is actually a reasonably critical part of the > major functions of Git. Slowing that down by making it go through > pthread_mutex_{lock,unlock} is going to hurt. I tried something > like the following, and its a bit slower, and really ain't a great > multi-thread aware implementation. I'd say that, since almost none of the GIT code currently need thread aware read_sha1_file(), it is best to only serialize its usage at the caller level where it is actually needed. Nicolas --- diff --git a/Makefile b/Makefile index 4eb4637..c3c6e68 100644 --- a/Makefile +++ b/Makefile @@ -122,6 +122,9 @@ all:: # If not set it defaults to the bare 'wish'. If it is set to the empty # string then NO_TCLTK will be forced (this is used by configure script). # +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit +# parallel delta searching when packing objects. 
+# GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR COMPAT_OBJS += compat/hstrerror.o endif +ifdef THREADED_DELTA_SEARCH + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH + EXTLIBS += -lpthread +endif + ifeq ($(TCLTK_PATH),) NO_TCLTK=NoThanks endif diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 5e9d1fd..449f9f3 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -15,6 +15,10 @@ #include "list-objects.h" #include "progress.h" +#ifdef THREADED_DELTA_SEARCH +#include <pthread.h> +#endif + static const char pack_usage[] = "\ git-pack-objects [{ -q | --progress | --all-progress }] \n\ [--max-pack-size=N] [--local] [--incremental] \n\ @@ -1273,6 +1277,14 @@ struct unpacked { void *data; struct delta_index *index; unsigned depth; +#ifdef THREADED_DELTA_SEARCH + pthread_t thread; + pthread_mutex_t mutex; + pthread_mutex_t wait; + pthread_mutex_t done; + struct unpacked *trg; + unsigned trg_max_depth; +#endif }; static int delta_cacheable(struct unpacked *trg, struct unpacked *src, @@ -1292,6 +1304,24 @@ static int delta_cacheable(struct unpacked *trg, struct unpacked *src, return 0; } +#ifdef THREADED_DELTA_SEARCH + +static pthread_mutex_t read_sha1_file_mutex = PTHREAD_MUTEX_INITIALIZER; + +#define read_lock() pthread_mutex_lock(&read_sha1_file_mutex) +#define read_unlock() pthread_mutex_unlock(&read_sha1_file_mutex) +#define delta_lock(e) pthread_mutex_lock(&(e)->mutex) +#define delta_unlock(e) pthread_mutex_unlock(&(e)->mutex) + +#else + +#define read_lock() 0 +#define read_unlock() 0 +#define delta_lock(e) 0 +#define delta_unlock(e) 0 + +#endif + /* * We search for deltas _backwards_ in a list sorted by type and * by size, so that we see progressively smaller and smaller files. @@ -1335,6 +1365,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; /* Now some size filtering heuristics. 
*/ + delta_lock(trg); trg_size = trg_entry->size; if (!trg_entry->delta) { max_size = trg_size/2 - 20; @@ -1343,6 +1374,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, max_size = trg_entry->delta_size; ref_depth = trg->depth; } + delta_unlock(trg); max_size = max_size * (max_depth - src->depth) / (max_depth - ref_depth + 1); if (max_size == 0) @@ -1355,19 +1387,28 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; /* Load data if not already done */ + + delta_lock(trg); if (!trg->data) { + read_lock(); trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); if (sz != trg_size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(trg_entry->idx.sha1), sz, trg_size); window_memory_usage += sz; + read_unlock(); } + delta_unlock(trg); + + delta_lock(src); if (!src->data) { + read_lock(); src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); if (sz != src_size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(src_entry->idx.sha1), sz, src_size); window_memory_usage += sz; + read_unlock(); } if (!src->index) { src->index = create_delta_index(src->data, src_size); @@ -1375,20 +1416,32 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, static int warned = 0; if (!warned++) warning("suboptimal pack - out of memory"); + delta_unlock(src); return 0; } + read_lock(); window_memory_usage += sizeof_delta_index(src->index); + read_unlock(); } + delta_unlock(src); delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size); if (!delta_buf) return 0; + delta_lock(trg); if (trg_entry->delta_data) { - /* Prefer only shallower same-sized deltas. */ - if (delta_size == trg_entry->delta_size && - src->depth + 1 >= trg->depth) { +#ifdef THREADED_DELTA_SEARCH + /* A better match might have been found in the mean time */ + max_size = trg_entry->delta_size * (max_depth - src->depth) / + (max_depth - trg->depth + 1); +#endif + /* ... also prefer only shallower same-sized deltas. 
*/ + if (delta_size > max_size || + (delta_size == trg_entry->delta_size && + src->depth + 1 >= trg->depth)) { free(delta_buf); + delta_unlock(trg); return 0; } delta_cache_size -= trg_entry->delta_size; @@ -1404,9 +1457,27 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, delta_cache_size += trg_entry->delta_size; } else free(delta_buf); + delta_unlock(trg); return 1; } +#ifdef THREADED_DELTA_SEARCH + +static void *threaded_try_delta(void *arg) +{ + struct unpacked *src = arg; + for (;;) { + pthread_mutex_lock(&src->wait); + if (!src->trg) + break; + try_delta(src->trg, src, src->trg_max_depth); + pthread_mutex_unlock(&src->done); + } + return NULL; +} + +#endif + static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) { struct object_entry *child = me->delta_child; @@ -1439,19 +1510,32 @@ static void find_deltas(struct object_entry **list, int window, int depth) uint32_t i = nr_objects, idx = 0, count = 0, processed = 0; unsigned int array_size = window * sizeof(struct unpacked); struct unpacked *array; - int max_depth; + int j, max_depth; if (!nr_objects) return; array = xmalloc(array_size); memset(array, 0, array_size); +#ifdef THREADED_DELTA_SEARCH + for (j = 0; j < window; j++) { + int ret; + pthread_mutex_init(&array[j].mutex, NULL); + pthread_mutex_init(&array[j].wait, NULL); + pthread_mutex_init(&array[j].done, NULL); + pthread_mutex_lock(&array[j].wait); + pthread_mutex_lock(&array[j].done); + ret = pthread_create(&array[j].thread, NULL, + threaded_try_delta, &array[j]); + if (ret) + die("unable to create thread: %s", strerror(ret)); + } +#endif if (progress) start_progress(&progress_state, "Deltifying %u objects...", "", nr_result); do { struct object_entry *entry = list[--i]; struct unpacked *n = array + idx; - int j; if (!entry->preferred_base) processed++; @@ -1494,6 +1578,31 @@ static void find_deltas(struct object_entry **list, int window, int depth) goto next; } +#ifdef THREADED_DELTA_SEARCH + /* Unlock all threads ... */ + j = window; + while (--j > 0) { + uint32_t other_idx = idx + j; + struct unpacked *m; + if (other_idx >= window) + other_idx -= window; + m = array + other_idx; + if (!m->entry) + break; + m->trg = n; + m->trg_max_depth= max_depth; + pthread_mutex_unlock(&m->wait); + } + /* ... then wait for their completion. */ + while (++j < window) { + uint32_t other_idx = idx + j; + struct unpacked *m; + if (other_idx >= window) + other_idx -= window; + m = array + other_idx; + pthread_mutex_lock(&m->done); + } +#else j = window; while (--j > 0) { uint32_t other_idx = idx + j; @@ -1506,6 +1615,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) if (try_delta(n, m, max_depth) < 0) break; } +#endif /* if we made n a delta, and if n is already at max * depth, leaving it in the window is pointless. we @@ -1528,6 +1638,16 @@ static void find_deltas(struct object_entry **list, int window, int depth) for (i = 0; i < window; ++i) { free_delta_index(array[i].index); free(array[i].data); +#ifdef THREADED_DELTA_SEARCH + array[i].trg = NULL; + pthread_mutex_unlock(&array[i].wait); + pthread_join(array[i].thread, NULL); + pthread_mutex_unlock(&array[i].wait); + pthread_mutex_unlock(&array[i].done); + pthread_mutex_destroy(&array[i].mutex); + pthread_mutex_destroy(&array[i].wait); + pthread_mutex_destroy(&array[i].done); +#endif } free(array); } ^ permalink raw reply related [flat|nested] 30+ messages in thread
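
The early-abort argument above is easiest to see in a simplified serial version of the window walk. This is illustrative only: window_slot() is a stand-in, and the real try_delta() also scales the bound by delta depth, as the patch shows.

#include <stdlib.h>

struct delta_index;
struct unpacked {
	struct delta_index *index;	/* index of a source already in the window */
};

/* create_delta() as declared in delta.h: it gives up as soon as the
 * delta being built cannot come in under max_delta_size */
extern void *create_delta(const struct delta_index *index,
			  const void *trg_buf, unsigned long trg_size,
			  unsigned long *delta_size,
			  unsigned long max_delta_size);

/* hypothetical helper: the j-th previous object still in the window */
extern struct unpacked *window_slot(int j);

static void *best_delta_for(const void *trg_data, unsigned long trg_size,
			    int window)
{
	void *best = NULL;
	unsigned long max_size = trg_size / 2 - 20;	/* initial bound */
	int j;

	for (j = 1; j < window; j++) {
		struct unpacked *src = window_slot(j);
		unsigned long delta_size;
		void *d = create_delta(src->index, trg_data, trg_size,
				       &delta_size, max_size);
		if (!d)
			continue;	/* aborted early: cannot beat max_size */
		free(best);
		best = d;
		max_size = delta_size;	/* later candidates must now do better */
	}
	return best;
}

Because max_size only ever tightens, create_delta() can give up on a candidate as soon as the delta under construction exceeds the best result so far. Run all window slots in parallel and every call starts from the loose initial bound, so the same search costs more CPU cycles even when the wall-clock time improves.
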
* Re: performance on repack 2007-08-15 20:49 ` Nicolas Pitre @ 2007-08-30 4:27 ` Nicolas Pitre 2007-08-30 4:36 ` Nicolas Pitre 0 siblings, 1 reply; 30+ messages in thread From: Nicolas Pitre @ 2007-08-30 4:27 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Wed, 15 Aug 2007, Nicolas Pitre wrote: > On Wed, 15 Aug 2007, Shawn O. Pearce wrote: > > > Nicolas Pitre <nico@cam.org> wrote: > > > > > Remains the approach of calling find_deltas() n times with 1/n of the > > > delta_list, one call per thread, for the bulk of the delta search work. > > > This might even be much simpler than my current patch is. However this > > > approach will require n times the memory for the delta window data. > > > Thread creation overhead will occur only once. > > > > Yea, that I agree with. The other thing is the main thread may need > > to push a couple of windows worth of work into the threads, so that > > if this window's 1/n unit goes really fast on that thread it doesn't > > stall out waiting for the main thread to get it more data. > > It is easier to keep 4 threads busy with 100000 objects each to deal > with in a conventional way than keeping 10 threads busy when processing > a single delta window in parallel. Or the delta_list can be partitioned > in smaller chunks and anytime one of the find_deltas() thread is done > then it is fed another chunk. > > The side effect of that is that no delta will cross chunk boundaries. > But just like when repacking multiple existing packs into one with > delta data reuse, the actual chunk boundaries can be re-fed to > find_deltas() in order to find more deltas across chunks. Well, here is a quick implementation of this idea for those who would like to give it a try. Performance is indeed much better. The delta progress status is however fscked up at the moment, so don't pay too much attention to it. Partitioning is also extremely crude. But the idea looks promizing. --- diff --git a/Makefile b/Makefile index 4eb4637..c3c6e68 100644 --- a/Makefile +++ b/Makefile @@ -122,6 +122,9 @@ all:: # If not set it defaults to the bare 'wish'. If it is set to the empty # string then NO_TCLTK will be forced (this is used by configure script). # +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit +# parallel delta searching when packing objects. 
+# GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR COMPAT_OBJS += compat/hstrerror.o endif +ifdef THREADED_DELTA_SEARCH + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH + EXTLIBS += -lpthread +endif + ifeq ($(TCLTK_PATH),) NO_TCLTK=NoThanks endif diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index e917e8e..7d68f82 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -15,6 +15,10 @@ #include "list-objects.h" #include "progress.h" +#ifdef THREADED_DELTA_SEARCH +#include <pthread.h> +#endif + static const char pack_usage[] = "\ git-pack-objects [{ -q | --progress | --all-progress }] \n\ [--max-pack-size=N] [--local] [--incremental] \n\ @@ -1290,6 +1294,20 @@ static int delta_cacheable(unsigned long src_size, unsigned long trg_size, return 0; } +#ifdef THREADED_DELTA_SEARCH + +static pthread_mutex_t read_sha1_file_mutex = PTHREAD_MUTEX_INITIALIZER; + +#define read_lock() pthread_mutex_lock(&read_sha1_file_mutex) +#define read_unlock() pthread_mutex_unlock(&read_sha1_file_mutex) + +#else + +#define read_lock() 0 +#define read_unlock() 0 + +#endif + /* * We search for deltas _backwards_ in a list sorted by type and * by size, so that we see progressively smaller and smaller files. @@ -1348,7 +1366,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, /* Load data if not already done */ if (!trg->data) { + read_lock(); trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); + read_unlock(); if (!trg->data) die("object %s cannot be read", sha1_to_hex(trg_entry->idx.sha1)); @@ -1358,7 +1378,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, *mem_usage += sz; } if (!src->data) { + read_lock(); src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); + read_unlock(); if (!src->data) die("object %s cannot be read", sha1_to_hex(src_entry->idx.sha1)); @@ -1521,6 +1543,56 @@ static void find_deltas(struct object_entry **list, unsigned list_size, free(array); } +#ifdef THREADED_DELTA_SEARCH + +struct thread_params { + pthread_t thread; + struct object_entry **list; + unsigned list_size; + unsigned nr_deltas; + int window; + int depth; +}; + +static void *threaded_find_deltas(void *arg) +{ + struct thread_params *p = arg; + if (p->list_size) + find_deltas(p->list, p->list_size, p->nr_deltas, + p->window, p->depth); + return NULL; +} + +static void ll_find_deltas(struct object_entry **list, unsigned list_size, + unsigned nr_deltas, int window, int depth) +{ + struct thread_params p[4]; + int i, ret; + + for (i = 0; i < 4; i++) { + unsigned sublist_size = list_size / (4 - i); + p[i].list = list; + p[i].list_size = sublist_size; + p[i].nr_deltas = nr_deltas; + p[i].window = window; + p[i].depth = depth; + ret = pthread_create(&p[i].thread, NULL, + threaded_find_deltas, &p[i]); + if (ret) + die("unable to create thread: %s", strerror(ret)); + list += sublist_size; + list_size -= sublist_size; + } + + for (i = 0; i < 4; i++) { + pthread_join(p[i].thread, NULL); + } +} + +#else +#define ll_find_deltas find_deltas +#endif + static void prepare_pack(int window, int depth) { struct object_entry **delta_list; @@ -1557,7 +1629,7 @@ static void prepare_pack(int window, int depth) if (nr_deltas) { qsort(delta_list, n, sizeof(*delta_list), type_size_sort); - find_deltas(delta_list, n, nr_deltas, window+1, depth); + ll_find_deltas(delta_list, n, nr_deltas, window+1, depth); } free(delta_list); } ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: performance on repack 2007-08-30 4:27 ` Nicolas Pitre @ 2007-08-30 4:36 ` Nicolas Pitre 2007-08-30 16:17 ` Jon Smirl 2007-09-01 21:54 ` Jon Smirl 0 siblings, 2 replies; 30+ messages in thread From: Nicolas Pitre @ 2007-08-30 4:36 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Jon Smirl, Martin Koegler, Git Mailing List On Thu, 30 Aug 2007, Nicolas Pitre wrote: > Well, here is a quick implementation of this idea for those who would > like to give it a try. [...] Well, that would help if I provided the full diff of course. --- diff --git a/Makefile b/Makefile index 4eb4637..c3c6e68 100644 --- a/Makefile +++ b/Makefile @@ -122,6 +122,9 @@ all:: # If not set it defaults to the bare 'wish'. If it is set to the empty # string then NO_TCLTK will be forced (this is used by configure script). # +# Define THREADED_DELTA_SEARCH if you have pthreads and wish to exploit +# parallel delta searching when packing objects. +# GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -662,6 +665,11 @@ ifdef NO_HSTRERROR COMPAT_OBJS += compat/hstrerror.o endif +ifdef THREADED_DELTA_SEARCH + BASIC_CFLAGS += -DTHREADED_DELTA_SEARCH + EXTLIBS += -lpthread +endif + ifeq ($(TCLTK_PATH),) NO_TCLTK=NoThanks endif diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 9b3ef94..7d68f82 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -15,6 +15,10 @@ #include "list-objects.h" #include "progress.h" +#ifdef THREADED_DELTA_SEARCH +#include <pthread.h> +#endif + static const char pack_usage[] = "\ git-pack-objects [{ -q | --progress | --all-progress }] \n\ [--max-pack-size=N] [--local] [--incremental] \n\ @@ -78,7 +82,6 @@ static unsigned long delta_cache_size = 0; static unsigned long max_delta_cache_size = 0; static unsigned long cache_max_small_delta_size = 1000; -static unsigned long window_memory_usage = 0; static unsigned long window_memory_limit = 0; /* @@ -1291,6 +1294,20 @@ static int delta_cacheable(unsigned long src_size, unsigned long trg_size, return 0; } +#ifdef THREADED_DELTA_SEARCH + +static pthread_mutex_t read_sha1_file_mutex = PTHREAD_MUTEX_INITIALIZER; + +#define read_lock() pthread_mutex_lock(&read_sha1_file_mutex) +#define read_unlock() pthread_mutex_unlock(&read_sha1_file_mutex) + +#else + +#define read_lock() 0 +#define read_unlock() 0 + +#endif + /* * We search for deltas _backwards_ in a list sorted by type and * by size, so that we see progressively smaller and smaller files. @@ -1300,7 +1317,7 @@ static int delta_cacheable(unsigned long src_size, unsigned long trg_size, * one. */ static int try_delta(struct unpacked *trg, struct unpacked *src, - unsigned max_depth) + unsigned max_depth, unsigned long *mem_usage) { struct object_entry *trg_entry = trg->entry; struct object_entry *src_entry = src->entry; @@ -1313,12 +1330,6 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, if (trg_entry->type != src_entry->type) return -1; - /* We do not compute delta to *create* objects we are not - * going to pack. - */ - if (trg_entry->preferred_base) - return -1; - /* * We do not bother to try a delta that we discarded * on an earlier try, but only when reusing delta data. 
@@ -1355,24 +1366,28 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, /* Load data if not already done */ if (!trg->data) { + read_lock(); trg->data = read_sha1_file(trg_entry->idx.sha1, &type, &sz); + read_unlock(); if (!trg->data) die("object %s cannot be read", sha1_to_hex(trg_entry->idx.sha1)); if (sz != trg_size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(trg_entry->idx.sha1), sz, trg_size); - window_memory_usage += sz; + *mem_usage += sz; } if (!src->data) { + read_lock(); src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz); + read_unlock(); if (!src->data) die("object %s cannot be read", sha1_to_hex(src_entry->idx.sha1)); if (sz != src_size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(src_entry->idx.sha1), sz, src_size); - window_memory_usage += sz; + *mem_usage += sz; } if (!src->index) { src->index = create_delta_index(src->data, src_size); @@ -1382,7 +1397,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, warning("suboptimal pack - out of memory"); return 0; } - window_memory_usage += sizeof_delta_index(src->index); + *mem_usage += sizeof_delta_index(src->index); } delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, max_size); @@ -1425,68 +1440,59 @@ static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) return m; } -static void free_unpacked(struct unpacked *n) +static unsigned long free_unpacked(struct unpacked *n) { - window_memory_usage -= sizeof_delta_index(n->index); + unsigned long freed_mem = sizeof_delta_index(n->index); free_delta_index(n->index); n->index = NULL; if (n->data) { + freed_mem += n->entry->size; free(n->data); n->data = NULL; - window_memory_usage -= n->entry->size; } n->entry = NULL; n->depth = 0; + return freed_mem; } -static void find_deltas(struct object_entry **list, int window, int depth) +static void find_deltas(struct object_entry **list, unsigned list_size, + unsigned nr_deltas, int window, int depth) { - uint32_t i = nr_objects, idx = 0, count = 0, processed = 0; + uint32_t i = list_size, idx = 0, count = 0, processed = 0; unsigned int array_size = window * sizeof(struct unpacked); struct unpacked *array; - int max_depth; + unsigned long mem_usage = 0; - if (!nr_objects) - return; array = xmalloc(array_size); memset(array, 0, array_size); if (progress) - start_progress(&progress_state, "Deltifying %u objects...", "", nr_result); + start_progress(&progress_state, "Deltifying %u objects...", "", nr_deltas); do { struct object_entry *entry = list[--i]; struct unpacked *n = array + idx; - int j; - - if (!entry->preferred_base) - processed++; + int j, max_depth; - if (progress) - display_progress(&progress_state, processed); - - if (entry->delta) - /* This happens if we decided to reuse existing - * delta from a pack. "!no_reuse_delta &&" is implied. - */ - continue; - - if (entry->size < 50) - continue; - - if (entry->no_try_delta) - continue; - - free_unpacked(n); + mem_usage -= free_unpacked(n); n->entry = entry; while (window_memory_limit && - window_memory_usage > window_memory_limit && + mem_usage > window_memory_limit && count > 1) { uint32_t tail = (idx + window - count) % window; - free_unpacked(array + tail); + mem_usage -= free_unpacked(array + tail); count--; } + /* We do not compute delta to *create* objects we are not + * going to pack. 
+ */ + if (entry->preferred_base) + goto next; + + if (progress) + display_progress(&progress_state, ++processed); + /* * If the current object is at pack edge, take the depth the * objects that depend on the current object into account @@ -1508,7 +1514,7 @@ static void find_deltas(struct object_entry **list, int window, int depth) m = array + other_idx; if (!m->entry) break; - if (try_delta(n, m, max_depth) < 0) + if (try_delta(n, m, max_depth, &mem_usage) < 0) break; } @@ -1537,21 +1543,94 @@ static void find_deltas(struct object_entry **list, int window, int depth) free(array); } +#ifdef THREADED_DELTA_SEARCH + +struct thread_params { + pthread_t thread; + struct object_entry **list; + unsigned list_size; + unsigned nr_deltas; + int window; + int depth; +}; + +static void *threaded_find_deltas(void *arg) +{ + struct thread_params *p = arg; + if (p->list_size) + find_deltas(p->list, p->list_size, p->nr_deltas, + p->window, p->depth); + return NULL; +} + +static void ll_find_deltas(struct object_entry **list, unsigned list_size, + unsigned nr_deltas, int window, int depth) +{ + struct thread_params p[4]; + int i, ret; + + for (i = 0; i < 4; i++) { + unsigned sublist_size = list_size / (4 - i); + p[i].list = list; + p[i].list_size = sublist_size; + p[i].nr_deltas = nr_deltas; + p[i].window = window; + p[i].depth = depth; + ret = pthread_create(&p[i].thread, NULL, + threaded_find_deltas, &p[i]); + if (ret) + die("unable to create thread: %s", strerror(ret)); + list += sublist_size; + list_size -= sublist_size; + } + + for (i = 0; i < 4; i++) { + pthread_join(p[i].thread, NULL); + } +} + +#else +#define ll_find_deltas find_deltas +#endif + static void prepare_pack(int window, int depth) { struct object_entry **delta_list; - uint32_t i; + uint32_t i, n, nr_deltas; get_object_details(); - if (!window || !depth) + if (!nr_objects || !window || !depth) return; delta_list = xmalloc(nr_objects * sizeof(*delta_list)); - for (i = 0; i < nr_objects; i++) - delta_list[i] = objects + i; - qsort(delta_list, nr_objects, sizeof(*delta_list), type_size_sort); - find_deltas(delta_list, window+1, depth); + nr_deltas = n = 0; + + for (i = 0; i < nr_objects; i++) { + struct object_entry *entry = objects + i; + + if (entry->delta) + /* This happens if we decided to reuse existing + * delta from a pack. "!no_reuse_delta &&" is implied. + */ + continue; + + if (entry->size < 50) + continue; + + if (entry->no_try_delta) + continue; + + if (!entry->preferred_base) + nr_deltas++; + + delta_list[n++] = entry; + } + + if (nr_deltas) { + qsort(delta_list, n, sizeof(*delta_list), type_size_sort); + ll_find_deltas(delta_list, n, nr_deltas, window+1, depth); + } free(delta_list); } ^ permalink raw reply related [flat|nested] 30+ messages in thread
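
For what it's worth, the list_size / (4 - i) computation in ll_find_deltas() above splits the remainder evenly with no separate modulo fix-up. Taking Jon's 545660 "Deltifying" count from the next message as the list size for illustration (the actual value passed in may differ slightly), the four slices work out as:

    i = 0:  545660 / 4 = 136415   (409245 objects left)
    i = 1:  409245 / 3 = 136415   (272830 objects left)
    i = 2:  272830 / 2 = 136415   (136415 objects left)
    i = 3:  136415 / 1 = 136415   (0 left)

Equal object counts are not equal amounts of work, though: the list is sorted by type and size, so one slice can end up with the large blobs while another holds small objects, which is the imbalance Jon observes next.
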
* Re: performance on repack 2007-08-30 4:36 ` Nicolas Pitre @ 2007-08-30 16:17 ` Jon Smirl 2007-09-01 21:54 ` Jon Smirl 1 sibling, 0 replies; 30+ messages in thread From: Jon Smirl @ 2007-08-30 16:17 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List Practicing on the kernel, without the patch: Q6600 with 4GB RAM jonsmirl@terra:/home/linux$ time git repack -a -f --window=200 --depth=200 Generating pack... Counting objects: 518288 Done counting 546466 objects. Deltifying 546466 objects... 100% (546466/546466) done Writing 546466 objects... 100% (546466/546466) done Total 546466 (delta 452329), reused 0 (delta 0) Pack pack-04c98effa112233951acbb2d8486eefac17a5a97 created. real 16m27.752s user 16m23.829s sys 0m2.192s With the patch it didn't balance very well. First two threads finished in 2min, thrid in 5min, fourth in 10min. So for fun I set it up to 16 threads, that kept all fours cores running until the end. It added 1:20 in overhead, but it finished in 5m vs 16:30 unthreaded. Threading is an obvious win for this code. jonsmirl@terra:/home/linux$ time git repack -a -f --window=200 --depth=200 Generating pack... Counting objects: 502267 Done counting 546466 objects. Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... Deltifying 545660 objects... 6% (34104/545660) done 6% (34099/545660) done 1% (6929/545660) donee 5% (30037/545660) done 6% (34101/545660) done 4% (25222/545660) done 6% (34102/545660) done 3% (19999/545660) done 6% (34102/545660) done 4% (27203/545660) done 6% (34102/545660) done 4% (23377/545660) done 6% (34042/545660) done 6% (34099/545660) done 6% (32740/545660) done real 5m0.360s user 17m41.822s sys 0m5.596s jonsmirl@terra:/home/linux$ -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread
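
Jon's 16-thread run papers over the imbalance by oversubscribing the four cores. The other option Nicolas floated earlier, cutting the sorted list into many small chunks and letting whichever thread finishes first claim the next one, would look roughly like the sketch below. The names are hypothetical and the find_deltas() signature is simplified relative to the one in the patch.

#include <pthread.h>

struct object_entry;

/* simplified stand-in for the real find_deltas() in builtin-pack-objects.c */
extern void find_deltas(struct object_entry **list, unsigned list_size,
			int window, int depth);

static struct object_entry **delta_list;
static unsigned delta_list_size;
static unsigned next_offset;			/* first unclaimed object */
static const unsigned chunk_size = 10000;	/* illustrative value */
static int window = 10, depth = 50;		/* illustrative values */
static pthread_mutex_t claim_lock = PTHREAD_MUTEX_INITIALIZER;

static void *chunk_worker(void *arg)
{
	(void)arg;
	for (;;) {
		unsigned off, n;

		pthread_mutex_lock(&claim_lock);
		off = next_offset;
		n = delta_list_size - off;
		if (n > chunk_size)
			n = chunk_size;
		next_offset += n;
		pthread_mutex_unlock(&claim_lock);

		if (!n)
			return NULL;	/* nothing left, thread retires */
		find_deltas(delta_list + off, n, window, depth);
	}
}

As noted earlier in the thread, deltas would then never cross chunk boundaries unless the boundary regions are re-fed afterwards, the same trick already used when reusing deltas from multiple existing packs.
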
* Re: performance on repack 2007-08-30 4:36 ` Nicolas Pitre 2007-08-30 16:17 ` Jon Smirl @ 2007-09-01 21:54 ` Jon Smirl 1 sibling, 0 replies; 30+ messages in thread From: Jon Smirl @ 2007-09-01 21:54 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn O. Pearce, Martin Koegler, Git Mailing List On 8/30/07, Nicolas Pitre <nico@cam.org> wrote: > On Thu, 30 Aug 2007, Nicolas Pitre wrote: > > > Well, here is a quick implementation of this idea for those who would > > like to give it a try. > [...] Appears to be broken for a push. jonsmirl@terra:~/mpc5200b$ git push dreamhost updating 'refs/remotes/linus/master' using 'refs/heads/master' from d1caeb02b17c6bc215a9a40a98a1beb92dcbd310 to 40ffbfad6bb79a99cc7627bdaca0ee22dec526f6 Generating pack... Counting objects: 1 Done counting 3255 objects. Result has 1997 objects. Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... Deltifying 398 objects... 0% (2/398) donee 1% (4/398) donee 11% (47/398) done 3% (12/398) done 2% (8/398) donee 12% (49/398) done 17% (69/398) done 1% (4/398) donee 25% (100/398) done 19% (77/398) donee 23% (94/398) done 21% (84/398) done 25% (100/398) done 2% (10/398) done 1% (6/398) done error: pack-objects died with strange error unpack eof before pack header was fully read ng refs/remotes/linus/master n/a (unpacker error) error: failed to push to 'ssh://jonsmirl1@git.digispeaker.com/~/mpc5200b.git' jonsmirl@terra:~/mpc5200b$ -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 30+ messages in thread