git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: "René Scharfe" <l.s.r@web.de>
Cc: git@vger.kernel.org, Michael Haggerty <mhagger@alum.mit.edu>
Subject: Re: [PATCH 0/6] address packed-refs speed regressions
Date: Sun, 5 Apr 2015 14:59:12 -0400	[thread overview]
Message-ID: <20150405185911.GA19902@peff.net> (raw)
In-Reply-To: <20150405185259.GB13096@peff.net>

On Sun, Apr 05, 2015 at 02:52:59PM -0400, Jeff King wrote:

> Right now we parse all of the packed-refs file into an in-memory cache,
> and then do single lookups from that cache. Doing an mmap() and a binary
> search is way faster (and costs less memory) for doing individual
> lookups. It relies on the list being sorted. This is generally true, but
> not something we currently rely on (however, it would be easy to add a
> "sorted" flag to top of the file and have the readers fall back when the
> flag is missing). I've played with a patch to do this (it's not entirely
> trivial, because you jump into the middle of a line, and then have to
> walk backwards to find the start of the record).
> 
> For traversals, it's more complicated. Obviously if you are traversing
> all refs, you have to read the whole thing anyway. If you are traversing
> a subset of the refs, you can binary-search the start of the subset, and
> then walk forward. But that's where it gets tricky with the current
> code.

In case you are curious, here is my proof-of-concept for the packed-refs
binary search. You'll note that it's a separate program, and not
integrated into refs.c. I wrote this last August, and after trying to
integrate it into refs.c, I found the ref_cache problems I described,
and I haven't touched it since.

I also seem to have saved the patch for stuffing it into refs.c, but I
am not sure if it even compiles (I wrote only "horrible wip" in the
commit message ;) ).

-- >8 --
Subject: [PATCH] add git-quick-list

This is a proof of concept for binary-searching the
packed-refs file in order to traverse an ordered subset of
it. Note that it _only_ reads the packed-refs file
currently. To really compare to for-each-ref, it would need
to also walk the loose ref area for its prefix. On a
mostly-packed repository that shouldn't make a big speed
difference, though.

And of course we don't _really_ want a separate command here
at all. This should be part of refs.c, and everyone who
calls for_each_ref should benefit from it.

Still, the numbers are promising. Here's are comparisons
against for-each-ref on torvalds/linux, which has a 218M
packed-refs file:

  $ time git for-each-ref \
      --format='%(objectname) %(refname)' \
      refs/remotes/2325298/ |
      wc -c
  44139

  real    0m1.649s
  user    0m1.332s
  sys     0m0.304s

  $ time ~peff/git-quick-list refs/remotes/2325298/ | wc -c
  44139

  real    0m0.012s
  user    0m0.004s
  sys     0m0.004s
---
 Makefile     |   1 +
 quick-list.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 175 insertions(+)
 create mode 100644 quick-list.c

diff --git a/Makefile b/Makefile
index 2457065..aa32598 100644
--- a/Makefile
+++ b/Makefile
@@ -541,6 +541,7 @@ PROGRAM_OBJS += shell.o
 PROGRAM_OBJS += show-index.o
 PROGRAM_OBJS += upload-pack.o
 PROGRAM_OBJS += remote-testsvn.o
+PROGRAM_OBJS += quick-list.o
 
 # Binary suffix, set to .exe for Windows builds
 X =
diff --git a/quick-list.c b/quick-list.c
new file mode 100644
index 0000000..e423f1f
--- /dev/null
+++ b/quick-list.c
@@ -0,0 +1,174 @@
+#include "cache.h"
+#include "refs.h"
+
+struct packed_refs_iterator {
+	const char *start;
+	const char *end;
+
+	const char *cur;
+	const char *ref;
+	const char *eol;
+	const char *next;
+};
+
+static void iterator_init(struct packed_refs_iterator *pos,
+			  const char *buf, size_t len)
+{
+	pos->start = buf;
+	pos->end = buf + len;
+
+	/* skip past header line */
+	if (pos->start < pos->end && *pos->start == '#') {
+		while (pos->start < pos->end && *pos->start != '\n')
+			pos->start++;
+		if (pos->start < pos->end)
+			pos->start++;
+	}
+}
+
+static int iterator_cmp(const char *key, struct packed_refs_iterator *pos)
+{
+	const char *ref = pos->ref;
+	for (; *key && ref < pos->eol; key++, ref++)
+		if (*key != *ref)
+			return (unsigned char)*key - (unsigned char)*ref;
+	return ref == pos->eol ? *key ? 1 : 0 : -1;
+}
+
+static const char *find_eol(const char *p, const char *end)
+{
+	p = memchr(p, '\n', end - p);
+	return p ? p : end;
+}
+
+static void parse_line(struct packed_refs_iterator *pos, const char *p)
+{
+	pos->cur = p;
+	if (pos->end - p < 41)
+		die("truncated packed-refs file");
+	p += 41;
+
+	pos->ref = p;
+	pos->eol = p = find_eol(p, pos->end);
+
+	/* skip newline, and then past any peel records */
+	if (p < pos->end)
+		p++;
+	while (p < pos->end && *p == '^') {
+		p = find_eol(p, pos->end);
+		if (p < pos->end)
+			p++;
+	}
+	pos->next = p;
+}
+
+static void iterator_next(struct packed_refs_iterator *pos)
+{
+	if (pos->next < pos->end)
+		parse_line(pos, pos->next);
+	else
+		pos->cur = NULL;
+}
+
+static void iterator_start(struct packed_refs_iterator *pos, const char *prefix)
+{
+	const char *lo = pos->start, *hi = pos->end;
+
+	while (lo < hi) {
+		const char *mi = lo + ((hi - lo) / 2);
+		int cmp;
+
+		/*
+		 * We landed somewhere on a line. Walk back to find
+		 * the start of the line.
+		 */
+		while (mi > lo && *(mi-1) != '\n')
+			mi--;
+
+		/*
+		 * We may have hit a peel-line. In that case, try
+		 * to walk back to the actual ref line (and skip as
+		 * many peel lines as we find, for future-proofing).
+		 */
+		while (*mi == '^') {
+			if (mi == lo)
+				die("peel line without a record before it?");
+			mi--;
+			if (mi == lo)
+				die("peel line with bare newline before it?");
+			mi--;
+			while (mi > lo && *(mi-1) != '\n')
+				mi--;
+		}
+
+		/* Now we should be at a real ref line. */
+		parse_line(pos, mi);
+		cmp = iterator_cmp(prefix, pos);
+		if (!cmp)
+			return;
+		else if (cmp < 0)
+			hi = pos->cur;
+		else
+			lo = pos->next;
+	}
+
+	if (hi < pos->end)
+		parse_line(pos, hi);
+	else
+		pos->cur = NULL;
+}
+
+static void quick_list(const char *prefix, each_ref_fn fn, void *data)
+{
+	int fd = open(git_path("packed-refs"), O_RDONLY);
+	struct stat st;
+	const char *buf = NULL;
+	size_t len;
+	struct packed_refs_iterator pos;
+
+	if (fd < 0)
+		goto out;
+	if (fstat(fd, &st) < 0)
+		goto out;
+	len = xsize_t(st.st_size);
+	buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (!buf)
+		goto out;
+
+	iterator_init(&pos, buf, len);
+	for (iterator_start(&pos, prefix);
+	     pos.cur && starts_with(pos.ref, prefix);
+	     iterator_next(&pos)) {
+		unsigned char sha1[20];
+		char *refname;
+
+		if (get_sha1_hex(pos.cur, sha1) < 0)
+			die("packed-refs contained invalid sha1");
+		refname = xmemdupz(pos.ref, pos.eol - pos.ref);
+		fn(refname, sha1, 0, data);
+		free(refname);
+	}
+
+out:
+	close(fd);
+	if (buf)
+		munmap((void *)buf, len);
+}
+
+static int show_ref(const char *refname, const unsigned char *sha1,
+		    int flags, void *data)
+{
+	printf("%s %s\n", sha1_to_hex(sha1), refname);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	if (argc != 2)
+		usage("git quick-list <prefix>");
+
+	setup_git_directory();
+	quick_list(argv[1], show_ref, NULL);
+
+	return 0;
+}
-- 
2.4.0.rc0.363.gf9f328b

  reply	other threads:[~2015-04-05 18:59 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-05  1:06 [PATCH 0/6] address packed-refs speed regressions Jeff King
2015-04-05  1:07 ` [PATCH 1/6] strbuf_getwholeline: use getc macro Jeff King
2015-04-05  1:08 ` [PATCH 2/6] git-compat-util: add fallbacks for unlocked stdio Jeff King
2015-04-05  1:11 ` [PATCH 3/6] strbuf_getwholeline: use getc_unlocked Jeff King
2015-04-05  4:56   ` Jeff King
2015-04-05  5:27     ` Jeff King
2015-04-05  5:35       ` Jeff King
2015-04-05 20:49         ` Junio C Hamano
2015-04-05 14:36     ` Duy Nguyen
2015-04-05 18:24       ` Jeff King
2015-04-05 20:09     ` Junio C Hamano
2015-04-07 13:48     ` Rasmus Villemoes
2015-04-07 19:04       ` Jeff King
2015-04-07 22:43         ` Rasmus Villemoes
2015-04-08  0:17           ` Jeff King
2015-04-05  1:11 ` [PATCH 4/6] strbuf: add an optimized 1-character strbuf_grow Jeff King
2015-04-06  2:13   ` Eric Sunshine
2015-04-06  5:05     ` Jeff King
2015-04-05  1:11 ` [PATCH 5/6] t1430: add another refs-escape test Jeff King
2015-04-05  1:15 ` [PATCH 6/6] refname_is_safe: avoid expensive normalize_path_copy call Jeff King
2015-04-05 13:41 ` [PATCH 0/6] address packed-refs speed regressions René Scharfe
2015-04-05 18:52   ` Jeff King
2015-04-05 18:59     ` Jeff King [this message]
2015-04-05 23:04       ` René Scharfe
2015-04-05 22:39     ` René Scharfe
2015-04-06  4:49       ` Jeff King
2015-04-16  8:47 ` [PATCH v2 0/9] " Jeff King
2015-04-16  8:48   ` [PATCH 1/9] strbuf_getwholeline: use getc macro Jeff King
2015-04-16  8:48   ` [PATCH 2/9] git-compat-util: add fallbacks for unlocked stdio Jeff King
2015-04-16  8:49   ` [PATCH 3/9] strbuf_getwholeline: use getc_unlocked Jeff King
2015-04-16  8:51   ` [PATCH 4/9] config: use getc_unlocked when reading from file Jeff King
2015-04-16  8:53   ` [PATCH 5/9] strbuf_addch: avoid calling strbuf_grow Jeff King
2015-04-16  8:58   ` [PATCH 6/9] strbuf_getwholeline: " Jeff King
2015-04-16  9:01   ` [PATCH 7/9] strbuf_getwholeline: use getdelim if it is available Jeff King
2015-04-17 10:16     ` Eric Sunshine
2015-04-21 23:09       ` Jeff King
2015-05-08 23:56         ` Eric Sunshine
2015-05-09  1:09           ` Jeff King
2015-06-02 18:22             ` Eric Sunshine
2015-04-22 18:00       ` Johannes Schindelin
2015-04-22 18:06         ` Jeff King
2015-04-16  9:03   ` [PATCH 8/9] read_packed_refs: avoid double-checking sane refs Jeff King
2015-04-16  9:04   ` [PATCH 9/9] t1430: add another refs-escape test Jeff King
2015-04-16  9:25   ` [PATCH v2 0/9] address packed-refs speed regressions Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150405185911.GA19902@peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=l.s.r@web.de \
    --cc=mhagger@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).