From: Michael Haggerty <mhagger@alum.mit.edu>
To: Jeff King <peff@peff.net>, git@vger.kernel.org
Subject: Re: [PATCH 09/16] prune: factor out loose-object directory traversal
Date: Tue, 07 Oct 2014 16:07:52 +0200 [thread overview]
Message-ID: <5433F3B8.2080505@alum.mit.edu> (raw)
In-Reply-To: <20141003202943.GI16293@peff.net>
On 10/03/2014 10:29 PM, Jeff King wrote:
> Prune has to walk $GIT_DIR/objects/?? in order to find the
> set of loose objects to prune. Other parts of the code
> (e.g., count-objects) want to do the same. Let's factor it
> out into a reusable for_each-style function.
>
> Note that this is not quite a straight code movement. There
> are two differences:
>
> 1. The original code iterated from 0 to 256, trying to
> opendir("$GIT_DIR/%02x"). The new code just does a
> readdir() on the object directory, and descends into
> any matching directories. This is faster on
> already-pruned repositories, and should not ever be
> slower (nobody ever creates other files in the object
> directory).
This would change the order that the objects are processed. I doubt that
matters to anybody, but it's probably worth mentioning in the commit
message.
> 2. The original code had strange behavior when it found a
> file of the form "[0-9a-f]{2}/.{38}" that did _not_
> contain all hex digits. It executed a "break" from the
> loop, meaning that we stopped pruning in that directory
> (but still pruned other directories!). This was
> probably a bug; we do not want to process the file as
> an object, but we should keep going otherwise.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I admit the speedup in (1) almost certainly doesn't matter. It is real,
> and I found out about it while writing a different program that was
> basically "count-objects" across a large number of repositories. However
> for a single repo it's probably not big enough to matter (calling
> count-objects in a loop while get dominated by the startup costs). The
> end result is a little more obvious IMHO, but that's subjective.
>
> builtin/prune.c | 87 ++++++++++++++++------------------------------------
> cache.h | 31 +++++++++++++++++++
> sha1_file.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 152 insertions(+), 61 deletions(-)
>
> [...]
> diff --git a/cache.h b/cache.h
> index cd16e25..7abe7f6 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1239,6 +1239,37 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
> extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
> extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
>
> +/*
> + * Iterate over the files in the loose-object parts of the object
> + * directory "path", triggering the following callbacks:
> + *
> + * - loose_object is called for each loose object we find.
> + *
> + * - loose_cruft is called for any files that do not appear to be
> + * loose objects.
> + *
> + * - loose_subdir is called for each top-level hashed subdirectory
> + * of the object directory (e.g., "$OBJDIR/f0"). It is called
> + * after the objects in the directory are processed.
> + *
> + * Any callback that is NULL will be ignored. Callbacks returning non-zero
> + * will end the iteration.
> + */
> +typedef int each_loose_object_fn(const unsigned char *sha1,
> + const char *path,
> + void *data);
> +typedef int each_loose_cruft_fn(const char *basename,
> + const char *path,
> + void *data);
> +typedef int each_loose_subdir_fn(const char *basename,
> + const char *path,
> + void *data);
> +int for_each_loose_file_in_objdir(const char *path,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data);
> +
> struct object_info {
> /* Request */
> enum object_type *typep;
> diff --git a/sha1_file.c b/sha1_file.c
> index bae1c15..9fdad47 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -3218,3 +3218,98 @@ void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
> die("%s is not a valid '%s' object", sha1_to_hex(sha1),
> typename(expect));
> }
> +
> +static int opendir_error(const char *path)
> +{
> + if (errno == ENOENT)
> + return 0;
> + return error("unable to open %s: %s", path, strerror(errno));
> +}
> +
> +static int for_each_file_in_obj_subdir(struct strbuf *path,
> + const char *prefix,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data)
> +{
> + size_t baselen = path->len;
> + DIR *dir = opendir(path->buf);
> + struct dirent *de;
> + int r = 0;
> +
> + if (!dir)
> + return opendir_error(path->buf);
OK, so if there is a non-directory named $GIT_DIR/objects/33, then we
emit an "unable to open" error rather than treating it as cruft. I think
this is reasonable.
> +
> + while ((de = readdir(dir))) {
> + if (is_dot_or_dotdot(de->d_name))
> + continue;
> +
> + strbuf_setlen(path, baselen);
> + strbuf_addf(path, "/%s", de->d_name);
> +
> + if (strlen(de->d_name) == 38) {
> + char hex[41];
> + unsigned char sha1[20];
> +
> + memcpy(hex, prefix, 2);
> + memcpy(hex + 2, de->d_name, 38);
> + hex[40] = 0;
> + if (!get_sha1_hex(hex, sha1)) {
> + if (obj_cb) {
> + r = obj_cb(sha1, path->buf, data);
> + if (r)
> + break;
> + }
> + continue;
> + }
> + }
> +
> + if (cruft_cb) {
> + r = cruft_cb(de->d_name, path->buf, data);
So, files *and* directories at the $GIT_DIR/objects/XX/ level are
reported as cruft (as opposed to, say, descending into the directories
and reporting any files found deeper in the hierarchy). This seems fine,
too.
> + if (r)
> + break;
> + }
> + }
> + if (!r && subdir_cb)
> + r = subdir_cb(de->d_name, path->buf, data);
By my reading, path->buf still contains the name of the last file in the
directory at this point. I assume you want to pass it the original
"baselen"-length path here.
> + closedir(dir);
> + return r;
...and anyway, it would be more polite to restore the path strbuf to its
original length before returning.
> +}
> +
> +int for_each_loose_file_in_objdir(const char *path,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data)
> +{
> + struct strbuf buf = STRBUF_INIT;
> + size_t baselen;
> + DIR *dir = opendir(path);
> + struct dirent *de;
> + int r = 0;
> +
> + if (!dir)
> + return opendir_error(path);
> +
> + strbuf_addstr(&buf, path);
> + baselen = buf.len;
> +
> + while ((de = readdir(dir))) {
> + if (!isxdigit(de->d_name[0]) ||
> + !isxdigit(de->d_name[1]) ||
> + de->d_name[2])
> + continue;
So other files or directories at the $GIT_DIR/objects/ level are just
ignored; they are not considered cruft. This is worth clarifying in the
docstring.
> +
> + strbuf_addf(&buf, "/%s", de->d_name);
> + r = for_each_file_in_obj_subdir(&buf, de->d_name, obj_cb,
> + cruft_cb, subdir_cb, data);
> + strbuf_setlen(&buf, baselen);
> + if (r)
> + break;
> + }
> +
> + closedir(dir);
> + strbuf_release(&buf);
> + return r;
> +}
>
Other than my comments above, it looks good to me.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
next prev parent reply other threads:[~2014-10-07 14:08 UTC|newest]
Thread overview: 58+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-03 20:20 [PATCH 0/16] make prune mtime-checking more careful Jeff King
2014-10-03 20:21 ` [PATCH 01/16] foreach_alt_odb: propagate return value from callback Jeff King
2014-10-03 22:55 ` René Scharfe
2014-10-04 0:31 ` Jeff King
2014-10-03 20:22 ` [PATCH 02/16] isxdigit: cast input to unsigned char Jeff King
2014-10-03 20:22 ` [PATCH 03/16] object_array: factor out slopbuf-freeing logic Jeff King
2014-10-07 11:25 ` Michael Haggerty
2014-10-08 7:36 ` Jeff King
2014-10-08 8:40 ` Michael Haggerty
2014-10-08 8:55 ` Jeff King
2014-10-03 20:22 ` [PATCH 04/16] object_array: add a "clear" function Jeff King
2014-10-03 20:23 ` [PATCH 05/16] clean up name allocation in prepare_revision_walk Jeff King
2014-10-03 20:24 ` [PATCH 06/16] reachable: clear pending array after walking it Jeff King
2014-10-03 20:25 ` [PATCH 07/16] t5304: use test_path_is_* instead of "test -f" Jeff King
2014-10-03 22:12 ` Junio C Hamano
2014-10-03 20:27 ` [PATCH 08/16] t5304: use helper to report failure of "test foo = bar" Jeff King
2014-10-03 22:17 ` Junio C Hamano
2014-10-04 0:13 ` Jeff King
2014-10-07 13:21 ` Michael Haggerty
2014-10-07 17:29 ` Junio C Hamano
2014-10-07 20:18 ` Jeff King
2014-10-07 20:35 ` Junio C Hamano
2014-10-07 21:29 ` Jeff King
2014-10-07 21:53 ` Junio C Hamano
2014-10-07 22:17 ` Michael Haggerty
2014-10-08 1:13 ` Jeff King
2014-10-08 16:58 ` Junio C Hamano
2014-10-07 21:16 ` Junio C Hamano
2014-10-03 20:29 ` [PATCH 09/16] prune: factor out loose-object directory traversal Jeff King
2014-10-03 22:19 ` Junio C Hamano
2014-10-04 0:24 ` Jeff King
2014-10-07 14:07 ` Michael Haggerty [this message]
2014-10-08 7:33 ` Jeff King
2014-10-03 20:31 ` [PATCH 10/16] count-objects: do not use xsize_t when counting object size Jeff King
2014-10-03 20:31 ` [PATCH 11/16] count-objects: use for_each_loose_file_in_objdir Jeff King
2014-10-03 20:32 ` [PATCH 12/16] sha1_file: add for_each iterators for loose and packed objects Jeff King
2014-10-05 8:15 ` René Scharfe
2014-10-05 10:47 ` Ramsay Jones
2014-10-03 20:39 ` [PATCH 13/16] prune: keep objects reachable from recent objects Jeff King
2014-10-03 21:47 ` Junio C Hamano
2014-10-04 0:09 ` Jeff King
2014-10-04 0:30 ` Jeff King
2014-10-04 3:04 ` Junio C Hamano
2014-10-07 16:29 ` Michael Haggerty
2014-10-08 7:19 ` Jeff King
2014-10-08 10:37 ` Michael Haggerty
2014-10-03 20:39 ` [PATCH 14/16] pack-objects: refactor unpack-unreachable expiration check Jeff King
2014-10-03 20:40 ` [PATCH 15/16] pack-objects: match prune logic for discarding objects Jeff King
2014-10-03 20:41 ` [PATCH 16/16] write_sha1_file: freshen existing objects Jeff King
2014-10-03 21:29 ` Junio C Hamano
2014-10-04 0:01 ` Jeff King
2014-10-05 9:12 ` René Scharfe
2014-10-03 22:20 ` [PATCH 0/16] make prune mtime-checking more careful Junio C Hamano
2014-10-04 22:22 ` Junio C Hamano
2014-10-05 9:19 ` René Scharfe
2014-10-06 1:42 ` Jeff King
2014-10-08 8:31 ` Jeff King
2014-10-08 17:03 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5433F3B8.2080505@alum.mit.edu \
--to=mhagger@alum.mit.edu \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.