git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] always start looking up objects in the last used pack first
@ 2007-05-31  2:48 Nicolas Pitre
  2007-05-31  3:24 ` Nicolas Pitre
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Nicolas Pitre @ 2007-05-31  2:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Jon Smirl said:

| Once an object reference hits a pack file it is very likely that 
| following references will hit the same pack file. So first place to 
| look for an object is the same place the previous object was found.

This is indeed a good heuristic so here it is.  The search always start
with the pack where the last object lookup succeeded.  If the wanted 
object is not available there then the search continues with the normal
pack ordering.

To test this I split the Linux repository into 66 packs and performed a
"time git-rev-list --objects --all > /dev/null".  Best results are as 
follows:

	Pack Sort			w/o this patch	w/ this patch
	-------------------------------------------------------------
	recent objects last		26.4s		20.9s
	recent objects first		24.9s		18.4s

This shows that the pack order based on object age has some influence, 
but that the last-used-pack heuristic is even more significant in 
reducing object lookup.

Signed-off-by: Nicolas Pitre <nico@cam.org> --- Note: the 
--max-pack-size to git-repack currently produces packs with old objects 
after those containing recent objects.  The pack sort based on 
filesystem timestamp is therefore backward for those.  This needs to be 
fixed of course, but at least it made me think about this variable for 
the test.

diff --git a/sha1_file.c b/sha1_file.c
index a3637d7..aa6d499 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1687,20 +1688,25 @@ static int matches_pack_name(struct packed_git *p, const char *ig)
 
 static int find_pack_entry(const unsigned char *sha1, struct pack_entry *e, const char **ignore_packed)
 {
+	static struct packed_git *last_found = (void *)1;
 	struct packed_git *p;
 	off_t offset;
 
 	prepare_packed_git();
+	if (!packed_git)
+		return 0;
+	p = (last_found == (void *)1) ? packed_git : last_found;
 
-	for (p = packed_git; p; p = p->next) {
+	do {
 		if (ignore_packed) {
 			const char **ig;
 			for (ig = ignore_packed; *ig; ig++)
 				if (!matches_pack_name(p, *ig))
 					break;
 			if (*ig)
-				continue;
+				goto next;
 		}
+
 		offset = find_pack_entry_one(sha1, p);
 		if (offset) {
 			/*
@@ -1713,14 +1719,23 @@ static int find_pack_entry(const unsigned char *sha1, struct pack_entry *e, cons
 			 */
 			if (p->pack_fd == -1 && open_packed_git(p)) {
 				error("packfile %s cannot be accessed", p->pack_name);
-				continue;
+				goto next;
 			}
 			e->offset = offset;
 			e->p = p;
 			hashcpy(e->sha1, sha1);
+			last_found = p;
 			return 1;
 		}
-	}
+
+		next:
+		if (p == last_found)
+			p = packed_git;
+		else
+			p = p->next;
+		if (p == last_found)
+			p = p->next;
+	} while (p);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] always start looking up objects in the last used pack first
  2007-05-31  2:48 [PATCH] always start looking up objects in the last used pack first Nicolas Pitre
@ 2007-05-31  3:24 ` Nicolas Pitre
  2007-05-31  5:02 ` Shawn O. Pearce
  2007-06-02 14:53 ` Dana How
  2 siblings, 0 replies; 6+ messages in thread
From: Nicolas Pitre @ 2007-05-31  3:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, 30 May 2007, Nicolas Pitre wrote:

> To test this I split the Linux repository into 66 packs and performed a
> "time git-rev-list --objects --all > /dev/null".  Best results are as 
> follows:
> 
> 	Pack Sort			w/o this patch	w/ this patch
> 	-------------------------------------------------------------
> 	recent objects last		26.4s		20.9s
> 	recent objects first		24.9s		18.4s
> 
> This shows that the pack order based on object age has some influence, 
> but that the last-used-pack heuristic is even more significant in 
> reducing object lookup.

For reference, the same operation on a fully packed into a single pack 
repository takes 17.1s.  So this looks damn good to me.


Nicolas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] always start looking up objects in the last used pack first
  2007-05-31  2:48 [PATCH] always start looking up objects in the last used pack first Nicolas Pitre
  2007-05-31  3:24 ` Nicolas Pitre
@ 2007-05-31  5:02 ` Shawn O. Pearce
  2007-05-31 15:39   ` Nicolas Pitre
  2007-06-02 14:53 ` Dana How
  2 siblings, 1 reply; 6+ messages in thread
From: Shawn O. Pearce @ 2007-05-31  5:02 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

Nicolas Pitre <nico@cam.org> wrote:
> 	Pack Sort			w/o this patch	w/ this patch
> 	-------------------------------------------------------------
> 	recent objects last		26.4s		20.9s
> 	recent objects first		24.9s		18.4s

Looks pretty good.
 
> +		next:
> +		if (p == last_found)
> +			p = packed_git;
> +		else
> +			p = p->next;
> +		if (p == last_found)
> +			p = p->next;
> +	} while (p);

So if we didn't find the object in the pack that we found the
last object in, we restart our search with the most recent pack?
Why not just go to p->next and loop around?  If we missed in this
pack and the packs are sorted by recency, wouldn't we want to just
search the next pack?

-- 
Shawn.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] always start looking up objects in the last used pack first
  2007-05-31  5:02 ` Shawn O. Pearce
@ 2007-05-31 15:39   ` Nicolas Pitre
  2007-06-02 15:00     ` Dana How
  0 siblings, 1 reply; 6+ messages in thread
From: Nicolas Pitre @ 2007-05-31 15:39 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git

On Thu, 31 May 2007, Shawn O. Pearce wrote:

> Nicolas Pitre <nico@cam.org> wrote:
> > 	Pack Sort			w/o this patch	w/ this patch
> > 	-------------------------------------------------------------
> > 	recent objects last		26.4s		20.9s
> > 	recent objects first		24.9s		18.4s
> 
> Looks pretty good.
>  
> > +		next:
> > +		if (p == last_found)
> > +			p = packed_git;
> > +		else
> > +			p = p->next;
> > +		if (p == last_found)
> > +			p = p->next;
> > +	} while (p);
> 
> So if we didn't find the object in the pack that we found the
> last object in, we restart our search with the most recent pack?

Right.

> Why not just go to p->next and loop around?  If we missed in this
> pack and the packs are sorted by recency, wouldn't we want to just
> search the next pack?

Testing shows that with such a strategy the same operation went from 
18.4s to 19.8s.  But this is with split packs resulting from a 
git-repack -a -d --max-pack-size=...

If you search for an object and don't find it in the last used pack, 
that doesn't mean that the object is necessarily in the pack next to the 
last one.  

The object recency order doesn't actually mean object age.  Consider for 
example 3 commits: the first with everything initially created, the 
second commit modifying only half the files, and the third modifying 
only a quarter of the files.  

Object recency order means that the latest commit will see (almost) all 
its objects early and contiguously in the pack stream. The latest commit 
will still reference between 1/4 to 1/2 of the objects that were created 
from the first commit though.

Objects for the middle commit will appear after that, but only those 
that were not referenced by the latest commit.  This means that in this 
example 3/4 of the needed objects for the middle commit will still be 
found early in the pack stream.

And the first commit will see its objects at the end of the pack stream, 
but again only those objects that weren't referenced by younger commits 
already.  Up to 1/2 of those objects needed for the first commit might 
be found in the early portion of the pack stream, or at worst 1/4 in the 
early portion and 1/4 in the mid portion of the pack stream.

If a repack creates split packs more or less according to the commit 
separation above then failure to find an object in the last used pack 
should really continue to scan available packs from the beginning of the 
list.  Merely going on with the next available pack is a bad move in 
this case.

HOWEVER...

If the accumulation of packs is due to multiple fetches/pushes without a 
repack then the situation is somewhat reversed and much less clear.  
Considering the scenario above but with a fetch between commits you'd 
end up with this:

1- a large pack with all objects for the first commit.

2- a pack with only half the objects for the second commit (the other 
   half is still available in pack #1).

3- a pack with only 1/4 of the objects for the latest commit (the rest 
   is shared between pack #1 and pack #2).

Here it is not clear what to do.  Given that the last-used-pack 
heuristic makes such a big difference in the split pack it will 
certainly help a lot in this case too.  But where to go when that 
heuristic fails is unclear.  If you want to favorize speed for latest 
commits then I'd argue that going back to the beginning of the list is 
still the right thing to do.  If you want to have a more balanced 
behavior for all commits (think git-blame) then probably going to the 
pack next to the last used would be the best thing to do in this case as 
younger packs will never contain objects that older commits are 
interested in.

...

Which makes me wonder about a possible incremental improvement to my 
patch: on failure to find an object in the last used pack, the search 
should then start again from the pack containing the commit from which 
this object search is related to.  In the split pack case all commit 
objects will be located in the first pack so nothing will change there.  
In the multiple-fetch case then the search will always reset to packs 
not younger than the commit triggering those object lookups.  Question 
is how to implement that nicely...


Nicolas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] always start looking up objects in the last used pack first
  2007-05-31  2:48 [PATCH] always start looking up objects in the last used pack first Nicolas Pitre
  2007-05-31  3:24 ` Nicolas Pitre
  2007-05-31  5:02 ` Shawn O. Pearce
@ 2007-06-02 14:53 ` Dana How
  2 siblings, 0 replies; 6+ messages in thread
From: Dana How @ 2007-06-02 14:53 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git, danahow

On 5/30/07, Nicolas Pitre <nico@cam.org> wrote:
> Jon Smirl said:
> | Once an object reference hits a pack file it is very likely that
> | following references will hit the same pack file. So first place to
> | look for an object is the same place the previous object was found.
>
> This is indeed a good heuristic so here it is.  The search always start
> with the pack where the last object lookup succeeded.  If the wanted
> object is not available there then the search continues with the normal
> pack ordering.

Nice numbers for performance,
especially your later email showing this makes
split packs almost as quick as one pack.

> Note: the
> --max-pack-size to git-repack currently produces packs with old objects
> after those containing recent objects.  The pack sort based on
> filesystem timestamp is therefore backward for those.  This needs to be
> fixed of course, but at least it made me think about this variable for
> the test.

Yes,  I was intending to submit a patch to builtin-pack-objects.c
to reverse the timestamps when split packs were created.
Haven't got around to it yet.
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] always start looking up objects in the last used pack first
  2007-05-31 15:39   ` Nicolas Pitre
@ 2007-06-02 15:00     ` Dana How
  0 siblings, 0 replies; 6+ messages in thread
From: Dana How @ 2007-06-02 15:00 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Shawn O. Pearce, Junio C Hamano, git, danahow

On 5/31/07, Nicolas Pitre <nico@cam.org> wrote:
> Which makes me wonder about a possible incremental improvement to my
> patch: on failure to find an object in the last used pack, the search
> should then start again from the pack containing the commit from which
> this object search is related to.  In the split pack case all commit
> objects will be located in the first pack so nothing will change there.
> In the multiple-fetch case then the search will always reset to packs
> not younger than the commit triggering those object lookups.  Question
> is how to implement that nicely...

My immediate reaction to this patch  was that there should be
a last-used-pack per object type.  Or perhaps one for commits,
and one for trees+blobs [since the latter are intermingled]?
Unfortunately the interface only specifies
the SHA-1,  not the object type,  and certainly not the "commit
this is related to".  I think your related-commit idea could be
very useful,  but it does require some extra info to be passed
around which currently is not.
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-06-02 15:01 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-31  2:48 [PATCH] always start looking up objects in the last used pack first Nicolas Pitre
2007-05-31  3:24 ` Nicolas Pitre
2007-05-31  5:02 ` Shawn O. Pearce
2007-05-31 15:39   ` Nicolas Pitre
2007-06-02 15:00     ` Dana How
2007-06-02 14:53 ` Dana How

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).