* [PATCH] Avoid sorting if references are added to ref_cache in order @ 2012-05-24 12:16 mhagger 2012-05-24 17:49 ` Jeff King 0 siblings, 1 reply; 4+ messages in thread From: mhagger @ 2012-05-24 12:16 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jeff King, Martin Fick, git, Michael Haggerty From: Michael Haggerty <mhagger@alum.mit.edu> The old code allowed many references to be efficiently added to a single directory, because it just appended the references to the containing directory unsorted without doing any searching (and therefore without requiring any intermediate sorting). But the old code was inefficient when a large number of subdirectories were added to a directory, because the directory always had to be searched to see if the new subdirectory already existed, and this search required the directory to be sorted first. The same was repeated for every new subdirectory, so the time scaled like O(N^2), where N is the number of subdirectories within a single directory. In practice, references are often added to the ref_cache in lexicographic order, for example when reading the packed-refs file. So build some intelligence into add_entry_to_dir() to optimize for the case of references and/or subdirectories being added in lexicographic order: if the existing entries were already sorted, and the new entry comes after the last existing entry, then adjust ref_dir::sorted to reflect the fact that the ref_dir is still sorted. Thanks to Peff for pointing out the performance regression that inspired this change. Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu> --- This patch can be applied to either mh/ref-api or next. It should fix the performance regression discovered by Peff [1]. [1] http://article.gmane.org/gmane.comp.version-control.git/198163 refs.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/refs.c b/refs.c index 09322fe..98f6425 100644 --- a/refs.c +++ b/refs.c @@ -208,6 +208,12 @@ static void add_entry_to_dir(struct ref_dir *dir, struct ref_entry *entry) { ALLOC_GROW(dir->entries, dir->nr + 1, dir->alloc); dir->entries[dir->nr++] = entry; + /* optimize for the case that entries are added in order */ + if (dir->nr == 1 || + (dir->nr == dir->sorted + 1 && + strcmp(dir->entries[dir->nr - 2]->name, + dir->entries[dir->nr - 1]->name) < 0)) + dir->sorted = dir->nr; } /* -- 1.7.10 ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] Avoid sorting if references are added to ref_cache in order 2012-05-24 12:16 [PATCH] Avoid sorting if references are added to ref_cache in order mhagger @ 2012-05-24 17:49 ` Jeff King 2012-05-24 21:00 ` Michael Haggerty 0 siblings, 1 reply; 4+ messages in thread From: Jeff King @ 2012-05-24 17:49 UTC (permalink / raw) To: mhagger; +Cc: Junio C Hamano, Martin Fick, git On Thu, May 24, 2012 at 02:16:50PM +0200, mhagger@alum.mit.edu wrote: > The old code allowed many references to be efficiently added to a > single directory, because it just appended the references to the > containing directory unsorted without doing any searching (and > therefore without requiring any intermediate sorting). But the old > code was inefficient when a large number of subdirectories were added > to a directory, because the directory always had to be searched to see > if the new subdirectory already existed, and this search required the > directory to be sorted first. The same was repeated for every new > subdirectory, so the time scaled like O(N^2), where N is the number of > subdirectories within a single directory. > > In practice, references are often added to the ref_cache in > lexicographic order, for example when reading the packed-refs file. > So build some intelligence into add_entry_to_dir() to optimize for the > case of references and/or subdirectories being added in lexicographic > order: if the existing entries were already sorted, and the new entry > comes after the last existing entry, then adjust ref_dir::sorted to > reflect the fact that the ref_dir is still sorted. Thanks for a nice analysis and the patch; this definitely fixes the regression I was seeing. The fix feels a little bit like a hack to me. The real problem is that there is a mismatch in how the ref code wants to receive chunks of refs, and how the packed-refs code wants to feed them. The optimal way to feed refs would be breadth-first, giving an entire single level (say, "refs/remotes/*", but not "refs/remotes/*/*") at once, and then sorting the result. And as far as I can tell, that is what read_loose_refs does. But the packed-refs file is read sequentially, and we end up inserting the refs in a depth-first way, which triggers this problem. Your fix relies on the fact that our depth-first feed happens in sorted order, and we can avoid the extra sorts. But I think the problem would still exist if we did a depth-first feed of unsorted refs. The packed-refs file is always in sorted order. The only other source of feed-one-at-a-time refs seems to be clone.c:write_remote_refs. It gets its refs from the mapped_refs, which eventually come from the remote side of the connection (and git tends to list refs in sorted order). So I think in practice we are OK, but would go quadratic again if we ever fed an unsorted list of refs. So the right thing is probably to apply this patch (which makes sense _anyway_, as it is even cheaper than sorting afterwards when we can avoid it), and be aware of the issue for any future unsorted provider, and deal with it if and when it ever happens. > + /* optimize for the case that entries are added in order */ > + if (dir->nr == 1 || > + (dir->nr == dir->sorted + 1 && > + strcmp(dir->entries[dir->nr - 2]->name, > + dir->entries[dir->nr - 1]->name) < 0)) > + dir->sorted = dir->nr; Technically we would still be sorted if strcmp(...) == 0. But I guess it probably doesn't matter, as we shouldn't ever be adding duplicates here. -Peff ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] Avoid sorting if references are added to ref_cache in order 2012-05-24 17:49 ` Jeff King @ 2012-05-24 21:00 ` Michael Haggerty 2012-05-24 21:10 ` Jeff King 0 siblings, 1 reply; 4+ messages in thread From: Michael Haggerty @ 2012-05-24 21:00 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Martin Fick, git On 05/24/2012 07:49 PM, Jeff King wrote: > Thanks for a nice analysis and the patch; this definitely fixes the > regression I was seeing. And thanks for your feedback and for helping me reproduce your problem. > The fix feels a little bit like a hack to me. The real problem is that > there is a mismatch in how the ref code wants to receive chunks of refs, > and how the packed-refs code wants to feed them. The optimal way to feed > refs would be breadth-first, giving an entire single level (say, > "refs/remotes/*", but not "refs/remotes/*/*") at once, and then sorting > the result. And as far as I can tell, that is what read_loose_refs does. > > But the packed-refs file is read sequentially, and we end up inserting > the refs in a depth-first way, which triggers this problem. Your fix > relies on the fact that our depth-first feed happens in sorted order, > and we can avoid the extra sorts. But I think the problem would still > exist if we did a depth-first feed of unsorted refs. > > The packed-refs file is always in sorted order. The only other source of > feed-one-at-a-time refs seems to be clone.c:write_remote_refs. It gets > its refs from the mapped_refs, which eventually come from the remote > side of the connection (and git tends to list refs in sorted order). > > So I think in practice we are OK, but would go quadratic again if we > ever fed an unsorted list of refs. So the right thing is probably to > apply this patch (which makes sense _anyway_, as it is even cheaper than > sorting afterwards when we can avoid it), and be aware of the issue for > any future unsorted provider, and deal with it if and when it ever > happens. I agree with you that this patch is a bit hacky, but it is simple, harmless, and it happens to do the right thing in the cases that are important in today's git. See below for other approaches that could be used in combination with this one. >> + /* optimize for the case that entries are added in order */ >> + if (dir->nr == 1 || >> + (dir->nr == dir->sorted + 1&& >> + strcmp(dir->entries[dir->nr - 2]->name, >> + dir->entries[dir->nr - 1]->name)< 0)) >> + dir->sorted = dir->nr; > > Technically we would still be sorted if strcmp(...) == 0. But I guess it > probably doesn't matter, as we shouldn't ever be adding duplicates here. Yes, duplicate refs should be an exceptional case and needn't be handled efficiently. sort_ref_dir() explicitly accepts the possibility of duplicate references, de-dups them if they are consistent with each other, or dies if they are inconsistent. We shouldn't add a way to bypass that logic. We could implement the duplicate-detection-and-checking code again in add_entry_to_dir(), but my choice was to leave it to sort_ref_dir() to deal with duplicates. More general approaches: The optimization implemented above is unsatisfying in the sense that it only works because refs are inserted in order. What could be done to handle the general case (i.e., handling references that are added to the cache out of order)? If quick, repeated iteration is thought to be important, then the canonical answer would be to use something like a balanced tree or a skip list. These data structures are somewhat involved and have extra space and time overhead for the case of serial processing. If ordered iteration is expected to be rare but lookups, adds, and deletes are expected to be frequent, then one could use a hash map, and sort the entries when needed as part of the iteration. But a hash map discards the ordering that is already present when reading packed-refs and requires an extra sort every time that packed-refs is written. Something that would help read_packed_refs() would be to keep track of the directory that it is currently working in. This would effectively remove the need to access a directory while working in one of its subdirectories, thereby avoiding any need to repeatedly sort intermediate directories. It would also avoid having to traverse the directories starting at the root for each new entry, which itself would save time independent of the time for sorting. I have some patches that implement this change but they are stale. I want to do some benchmarking first though to see whether the extra complication brings measurable benefits. Finally, I did some work on the idea of keeping the refs within a directory *mostly* sorted. (Probably this technique is already known, but I have never run into it.) One would keep the first part of the array sorted, and append new references to the tail of the array unsorted. Searching would be via a binary search over the sorted part of the array, and a linear search over the unsorted tail. The trick is that every so often the tail should be sorted and merged into the head. How often? It is a tradeoff between the work of sorting and merging versus the work saved by avoiding linear searches through the tail. I worked out the theory, and I think the optimum was to re-sort the array when the size of the unsorted tail reached the squareroot of the total size or something like that. This method could reduce the cost of (lookup, add, lookup, add, ...) sequences, albeit not to the extent of a more optimal data structure. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] Avoid sorting if references are added to ref_cache in order 2012-05-24 21:00 ` Michael Haggerty @ 2012-05-24 21:10 ` Jeff King 0 siblings, 0 replies; 4+ messages in thread From: Jeff King @ 2012-05-24 21:10 UTC (permalink / raw) To: Michael Haggerty; +Cc: Junio C Hamano, Martin Fick, git On Thu, May 24, 2012 at 11:00:29PM +0200, Michael Haggerty wrote: > >Technically we would still be sorted if strcmp(...) == 0. But I guess it > >probably doesn't matter, as we shouldn't ever be adding duplicates here. > > Yes, duplicate refs should be an exceptional case and needn't be > handled efficiently. sort_ref_dir() explicitly accepts the > possibility of duplicate references, de-dups them if they are > consistent with each other, or dies if they are inconsistent. We > shouldn't add a way to bypass that logic. We could implement the > duplicate-detection-and-checking code again in add_entry_to_dir(), > but my choice was to leave it to sort_ref_dir() to deal with > duplicates. Ah, I didn't realize that there was duplicate-detection magic was in sort_ref_dir. So comparing to 0 is not even "this doesn't matter, but we could..." but "doing it that way will actively break other code". I withdraw my suggestion. :) > More general approaches: > [...] > Something that would help read_packed_refs() would be to keep track > of the directory that it is currently working in. This would > effectively remove the need to access a directory while working in > one of its subdirectories, thereby avoiding any need to repeatedly > sort intermediate directories. It would also avoid having to > traverse the directories starting at the root for each new entry, > which itself would save time independent of the time for sorting. I > have some patches that implement this change but they are stale. I > want to do some benchmarking first though to see whether the extra > complication brings measurable benefits. This is specifically what I was thinking of when I wrote my previous message. But I rejected it as too complicated to be worth the trouble, if the only code path helped would be mass-inserting unsorted refs, which we don't even currently do. However, I didn't think about the fact that we could avoid traversing the refs tree entirely for each ref. That might actually yield a measurable speedup. I'd be curious to see your results. > Finally, I did some work on the idea of keeping the refs within a > directory *mostly* sorted. (Probably this technique is already > known, but I have never run into it.) One would keep the first part > of the array sorted, and append new references to the tail of the > array unsorted. Searching would be via a binary search over the > sorted part of the array, and a linear search over the unsorted tail. > The trick is that every so often the tail should be sorted and merged > into the head. How often? It is a tradeoff between the work of > sorting and merging versus the work saved by avoiding linear searches > through the tail. I worked out the theory, and I think the optimum > was to re-sort the array when the size of the unsorted tail reached > the squareroot of the total size or something like that. This method > could reduce the cost of (lookup, add, lookup, add, ...) sequences, > albeit not to the extent of a more optimal data structure. That's very clever. In fact, way more clever than this problem deserves. It is not like we are journaling a database; we basically have one mass-insertion when we read packed-refs, and I think the simpler solutions will perform just fine. -Peff ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2012-05-24 21:10 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-05-24 12:16 [PATCH] Avoid sorting if references are added to ref_cache in order mhagger 2012-05-24 17:49 ` Jeff King 2012-05-24 21:00 ` Michael Haggerty 2012-05-24 21:10 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).