Why are ref_lists sorted?

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Why are ref_lists sorted?
@ 2007-03-19  3:38 Julian Phillips
  2007-03-19  4:50 ` Junio C Hamano
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Julian Phillips @ 2007-03-19  3:38 UTC (permalink / raw)
  To: git

Having got a C implementation of fetch far enough to be able to tell me 
that I'm up-to-date (when using git:// and not using remotes or branches 
files) I thought I'd compare performance with the fetch--tool based 
version.  I was a bit surprised at what I saw.

Hot-cache time went from ~30s to ~3s, which didn't seem too bad.  However 
what did puzzle me was where the C version was spending it's time.  Being 
a work in progress it's a bit chatty in places, but there is an up-front 
delay of just over 1s (i.e. before my first message, which is the name of 
the remote to fetch).

A bit of investigation showed this to be due to the first attempt to read 
a ref causing the packed refs to be loaded.  In my test repo the 
packed-refs file has over 9000 entries, but I still thought that it would 
load faster than that.  It turns out that the overhead is from sorting the 
refs when building the ref_list.  If I remove the code for sorting the 
entries I lose that initial 1s delay, without appearing to break anything 
in the fetch.  However I assume that it's there for a reason ...

So my questions are:

1) what have I broken by removing the sort?
2) is it worth trying to optimise the sort?

-- 
Julian

  ---
<SpanKY> http://www.ananova.com/news/story/sm_806582.html?menu=news.quirkies
<Mr_Bones_> SpanKY: my life would have been much improved without that
  link.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Why are ref_lists sorted?
  2007-03-19  3:38 Why are ref_lists sorted? Julian Phillips
@ 2007-03-19  4:50 ` Junio C Hamano
  2007-03-19  4:54 ` Shawn O. Pearce
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2007-03-19  4:50 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git

Julian Phillips <julian@quantumfyre.co.uk> writes:

> So my questions are:
>
> 1) what have I broken by removing the sort?

See do_for_each_ref().  We walk packed and loose refs in
parallel, preferring loose ref if exists (so that after packing,
updating a ref does not have to remove entry from packed-refs
file).

> 2) is it worth trying to optimise the sort?

If you have 9,000 entries, it probably is worth doing.  Also you
might want to use an expanding array of pointers instead of
linked list.

Currently add_ref() is shared between packed and loose refs, and
the order loose refs are discovered depends on readdir() order,
so we cannot do without sorting on that side.  But I suspect
reading them in in the order they are discovered and then
sorting at the end just once might be a good thing to do.

If we write packed refs _after_ sorting, we should be able to
optimize on the side that reads packed-refs even better.  As you
read one new line from the file, you make sure that it sorts
later than the last entry you saw, and append to the end of
list; if you find an unorderd entry, remember that the list is
not sorted, and sort at the very end after reading all the refs.

Or something like that...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Why are ref_lists sorted?
  2007-03-19  3:38 Why are ref_lists sorted? Julian Phillips
  2007-03-19  4:50 ` Junio C Hamano
@ 2007-03-19  4:54 ` Shawn O. Pearce
  2007-03-19  5:22 ` Julian Phillips
  2007-03-19 17:03 ` Linus Torvalds
  3 siblings, 0 replies; 7+ messages in thread
From: Shawn O. Pearce @ 2007-03-19  4:54 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git

Julian Phillips <julian@quantumfyre.co.uk> wrote:
> A bit of investigation showed this to be due to the first attempt to read 
> a ref causing the packed refs to be loaded.  In my test repo the 
> packed-refs file has over 9000 entries, but I still thought that it would 
> load faster than that.  It turns out that the overhead is from sorting the 
> refs when building the ref_list.  If I remove the code for sorting the 
> entries I lose that initial 1s delay, without appearing to break anything 
> in the fetch.  However I assume that it's there for a reason ...
> 
> So my questions are:
> 
> 1) what have I broken by removing the sort?
> 2) is it worth trying to optimise the sort?

Oh gawd, that thing is an O(n**2) insertion.  Ouch.

The entire reason its sorted is just to allow us to find the
existing ref entry, if one exists, and replace it.  This way
a loose (unpacked) ref will shadow/override its packed version.

I think the other reason is to provide predictable behavior.
By making the list sorted, refs always appear at the same relative
position to other refs within that repository.  This makes it easier
to write the packed-refs file and yet keep the ordering within the
file predictable.

I think this could probably be better done as a hash, not as a
linked list, and that only the packed-refs writer needs to pay
any sort of sorting costs.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Why are ref_lists sorted?
  2007-03-19  3:38 Why are ref_lists sorted? Julian Phillips
  2007-03-19  4:50 ` Junio C Hamano
  2007-03-19  4:54 ` Shawn O. Pearce
@ 2007-03-19  5:22 ` Julian Phillips
  2007-03-19  5:33   ` Shawn O. Pearce
  2007-03-19 17:03 ` Linus Torvalds
  3 siblings, 1 reply; 7+ messages in thread
From: Julian Phillips @ 2007-03-19  5:22 UTC (permalink / raw)
  To: git

On Mon, 19 Mar 2007, Julian Phillips wrote:

> A bit of investigation showed this to be due to the first attempt to read a 
> ref causing the packed refs to be loaded.  In my test repo the packed-refs 
> file has over 9000 entries, but I still thought that it would load faster 
> than that.  It turns out that the overhead is from sorting the refs when 
> building the ref_list.  If I remove the code for sorting the entries I lose 
> that initial 1s delay, without appearing to break anything in the fetch. 
> However I assume that it's there for a reason ...
>

Actually, it's worse than that ... I forgot about the other end.  If I 
point fetch at an upload-pack with the sorted removed the time goes down 
by another second.  So that sorting accounts for 2 of the 3s ...

-- 
Julian

  ---
Know thyself.  If you need help, call the C.I.A.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Why are ref_lists sorted?
  2007-03-19  5:22 ` Julian Phillips
@ 2007-03-19  5:33   ` Shawn O. Pearce
  2007-03-19  5:42     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Shawn O. Pearce @ 2007-03-19  5:33 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git

Julian Phillips <julian@quantumfyre.co.uk> wrote:
> On Mon, 19 Mar 2007, Julian Phillips wrote:
> 
> >A bit of investigation showed this to be due to the first attempt to read 
> >a ref causing the packed refs to be loaded.  In my test repo the 
> >packed-refs file has over 9000 entries, but I still thought that it would 
> >load faster than that.  It turns out that the overhead is from sorting the 
> >refs when building the ref_list.  If I remove the code for sorting the 
> >entries I lose that initial 1s delay, without appearing to break anything 
> >in the fetch. However I assume that it's there for a reason ...

I have to say, I'm looking at the refs.c code and I'm not seeing
any reason why these need to be sorted.  Granted, the current code
expects that as it walks both get_packed_refs() and get_loose_refs()
(as Junio pointed out), but change that code to toss both into a
single refs hash table, with the elements being the current contents
of struct ref_list.

We already have the ISPACKED flag in the flags field to tell us if
the ref is loose or not.  When inserting a ref into the hash table
you keep the loose version (either if its in the table, or the one
being inserted).

-- 
Shawn.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Why are ref_lists sorted?
  2007-03-19  5:33   ` Shawn O. Pearce
@ 2007-03-19  5:42     ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2007-03-19  5:42 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Julian Phillips, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> We already have the ISPACKED flag in the flags field to tell us if
> the ref is loose or not.  When inserting a ref into the hash table
> you keep the loose version (either if its in the table, or the one
> being inserted).

As long as you are careful enough not to break the unwrapped
entries, repacking and re-reading of refs (I think these three
are the reasons we have separate lists), I am fine with that
change.  I also think some callers of do_for_each_refs (this
includes the userland that use git-for-each-ref and/or
git-show-ref) expect the traversal to be sorted, so I would like
to keep that sorted output behaviour.

An additional hash to speed up look-up operation is probably a
good thing to have.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Why are ref_lists sorted?
  2007-03-19  3:38 Why are ref_lists sorted? Julian Phillips
                   ` (2 preceding siblings ...)
  2007-03-19  5:22 ` Julian Phillips
@ 2007-03-19 17:03 ` Linus Torvalds
  3 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2007-03-19 17:03 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git

On Mon, 19 Mar 2007, Julian Phillips wrote:
> 
> So my questions are:
> 
> 1) what have I broken by removing the sort?

The big thing is probably consistency.

I *really* think we need to sort these things. Otherwise you'll see two 
totally identical repositories giving different results to something as 
fundamental as "git ls-remote" just because they didn't get sorted.

So I think sorting is absolutely required, perhaps not so much because it 
is necessarily "incorrect" without the sorting, but because I think 
consistency in this area is too important *not* to sort it.

And sorting it really is simple. The fact that we use a O(n**2) list 
insertion thing that is also probably pessimal for the case of "already 
sorted" input is just a "hey, it was easy, we never actually hit it in 
practice" issue. 

> 2) is it worth trying to optimise the sort?

Absolutely. It might involve changing the "ref_list *" thing into an array 
of ref_entries, and that will cause a lot of (fairly trivial) changes, but 
it should all be entirely internal to refs.c, so it's hopefully not 
painful, just some boring grunt-work.

		Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2007-03-19 17:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-19  3:38 Why are ref_lists sorted? Julian Phillips
2007-03-19  4:50 ` Junio C Hamano
2007-03-19  4:54 ` Shawn O. Pearce
2007-03-19  5:22 ` Julian Phillips
2007-03-19  5:33   ` Shawn O. Pearce
2007-03-19  5:42     ` Junio C Hamano
2007-03-19 17:03 ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).