From: Michael Haggerty <mhagger@alum.mit.edu>
To: Jeff King <peff@peff.net>, git@vger.kernel.org
Subject: Re: [PATCH 09/16] prune: factor out loose-object directory traversal
Date: Tue, 07 Oct 2014 16:07:52 +0200 [thread overview]
Message-ID: <5433F3B8.2080505@alum.mit.edu> (raw)
In-Reply-To: <20141003202943.GI16293@peff.net>
On 10/03/2014 10:29 PM, Jeff King wrote:
> Prune has to walk $GIT_DIR/objects/?? in order to find the
> set of loose objects to prune. Other parts of the code
> (e.g., count-objects) want to do the same. Let's factor it
> out into a reusable for_each-style function.
>
> Note that this is not quite a straight code movement. There
> are two differences:
>
> 1. The original code iterated from 0 to 256, trying to
> opendir("$GIT_DIR/%02x"). The new code just does a
> readdir() on the object directory, and descends into
> any matching directories. This is faster on
> already-pruned repositories, and should not ever be
> slower (nobody ever creates other files in the object
> directory).
This would change the order that the objects are processed. I doubt that
matters to anybody, but it's probably worth mentioning in the commit
message.
> 2. The original code had strange behavior when it found a
> file of the form "[0-9a-f]{2}/.{38}" that did _not_
> contain all hex digits. It executed a "break" from the
> loop, meaning that we stopped pruning in that directory
> (but still pruned other directories!). This was
> probably a bug; we do not want to process the file as
> an object, but we should keep going otherwise.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I admit the speedup in (1) almost certainly doesn't matter. It is real,
> and I found out about it while writing a different program that was
> basically "count-objects" across a large number of repositories. However
> for a single repo it's probably not big enough to matter (calling
> count-objects in a loop while get dominated by the startup costs). The
> end result is a little more obvious IMHO, but that's subjective.
>
> builtin/prune.c | 87 ++++++++++++++++------------------------------------
> cache.h | 31 +++++++++++++++++++
> sha1_file.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 152 insertions(+), 61 deletions(-)
>
> [...]
> diff --git a/cache.h b/cache.h
> index cd16e25..7abe7f6 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1239,6 +1239,37 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
> extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
> extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
>
> +/*
> + * Iterate over the files in the loose-object parts of the object
> + * directory "path", triggering the following callbacks:
> + *
> + * - loose_object is called for each loose object we find.
> + *
> + * - loose_cruft is called for any files that do not appear to be
> + * loose objects.
> + *
> + * - loose_subdir is called for each top-level hashed subdirectory
> + * of the object directory (e.g., "$OBJDIR/f0"). It is called
> + * after the objects in the directory are processed.
> + *
> + * Any callback that is NULL will be ignored. Callbacks returning non-zero
> + * will end the iteration.
> + */
> +typedef int each_loose_object_fn(const unsigned char *sha1,
> + const char *path,
> + void *data);
> +typedef int each_loose_cruft_fn(const char *basename,
> + const char *path,
> + void *data);
> +typedef int each_loose_subdir_fn(const char *basename,
> + const char *path,
> + void *data);
> +int for_each_loose_file_in_objdir(const char *path,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data);
> +
> struct object_info {
> /* Request */
> enum object_type *typep;
> diff --git a/sha1_file.c b/sha1_file.c
> index bae1c15..9fdad47 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -3218,3 +3218,98 @@ void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
> die("%s is not a valid '%s' object", sha1_to_hex(sha1),
> typename(expect));
> }
> +
> +static int opendir_error(const char *path)
> +{
> + if (errno == ENOENT)
> + return 0;
> + return error("unable to open %s: %s", path, strerror(errno));
> +}
> +
> +static int for_each_file_in_obj_subdir(struct strbuf *path,
> + const char *prefix,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data)
> +{
> + size_t baselen = path->len;
> + DIR *dir = opendir(path->buf);
> + struct dirent *de;
> + int r = 0;
> +
> + if (!dir)
> + return opendir_error(path->buf);
OK, so if there is a non-directory named $GIT_DIR/objects/33, then we
emit an "unable to open" error rather than treating it as cruft. I think
this is reasonable.
> +
> + while ((de = readdir(dir))) {
> + if (is_dot_or_dotdot(de->d_name))
> + continue;
> +
> + strbuf_setlen(path, baselen);
> + strbuf_addf(path, "/%s", de->d_name);
> +
> + if (strlen(de->d_name) == 38) {
> + char hex[41];
> + unsigned char sha1[20];
> +
> + memcpy(hex, prefix, 2);
> + memcpy(hex + 2, de->d_name, 38);
> + hex[40] = 0;
> + if (!get_sha1_hex(hex, sha1)) {
> + if (obj_cb) {
> + r = obj_cb(sha1, path->buf, data);
> + if (r)
> + break;
> + }
> + continue;
> + }
> + }
> +
> + if (cruft_cb) {
> + r = cruft_cb(de->d_name, path->buf, data);
So, files *and* directories at the $GIT_DIR/objects/XX/ level are
reported as cruft (as opposed to, say, descending into the directories
and reporting any files found deeper in the hierarchy). This seems fine,
too.
> + if (r)
> + break;
> + }
> + }
> + if (!r && subdir_cb)
> + r = subdir_cb(de->d_name, path->buf, data);
By my reading, path->buf still contains the name of the last file in the
directory at this point. I assume you want to pass it the original
"baselen"-length path here.
> + closedir(dir);
> + return r;
...and anyway, it would be more polite to restore the path strbuf to its
original length before returning.
> +}
> +
> +int for_each_loose_file_in_objdir(const char *path,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data)
> +{
> + struct strbuf buf = STRBUF_INIT;
> + size_t baselen;
> + DIR *dir = opendir(path);
> + struct dirent *de;
> + int r = 0;
> +
> + if (!dir)
> + return opendir_error(path);
> +
> + strbuf_addstr(&buf, path);
> + baselen = buf.len;
> +
> + while ((de = readdir(dir))) {
> + if (!isxdigit(de->d_name[0]) ||
> + !isxdigit(de->d_name[1]) ||
> + de->d_name[2])
> + continue;
So other files or directories at the $GIT_DIR/objects/ level are just
ignored; they are not considered cruft. This is worth clarifying in the
docstring.
> +
> + strbuf_addf(&buf, "/%s", de->d_name);
> + r = for_each_file_in_obj_subdir(&buf, de->d_name, obj_cb,
> + cruft_cb, subdir_cb, data);
> + strbuf_setlen(&buf, baselen);
> + if (r)
> + break;
> + }
> +
> + closedir(dir);
> + strbuf_release(&buf);
> + return r;
> +}
>
Other than my comments above, it looks good to me.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
next prev parent reply other threads:[~2014-10-07 14:08 UTC|newest]
Thread overview: 58+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-03 20:20 [PATCH 0/16] make prune mtime-checking more careful Jeff King
2014-10-03 20:21 ` [PATCH 01/16] foreach_alt_odb: propagate return value from callback Jeff King
2014-10-03 22:55 ` René Scharfe
2014-10-04 0:31 ` Jeff King
2014-10-03 20:22 ` [PATCH 02/16] isxdigit: cast input to unsigned char Jeff King
2014-10-03 20:22 ` [PATCH 03/16] object_array: factor out slopbuf-freeing logic Jeff King
2014-10-07 11:25 ` Michael Haggerty
2014-10-08 7:36 ` Jeff King
2014-10-08 8:40 ` Michael Haggerty
2014-10-08 8:55 ` Jeff King
2014-10-03 20:22 ` [PATCH 04/16] object_array: add a "clear" function Jeff King
2014-10-03 20:23 ` [PATCH 05/16] clean up name allocation in prepare_revision_walk Jeff King
2014-10-03 20:24 ` [PATCH 06/16] reachable: clear pending array after walking it Jeff King
2014-10-03 20:25 ` [PATCH 07/16] t5304: use test_path_is_* instead of "test -f" Jeff King
2014-10-03 22:12 ` Junio C Hamano
2014-10-03 20:27 ` [PATCH 08/16] t5304: use helper to report failure of "test foo = bar" Jeff King
2014-10-03 22:17 ` Junio C Hamano
2014-10-04 0:13 ` Jeff King
2014-10-07 13:21 ` Michael Haggerty
2014-10-07 17:29 ` Junio C Hamano
2014-10-07 20:18 ` Jeff King
2014-10-07 20:35 ` Junio C Hamano
2014-10-07 21:29 ` Jeff King
2014-10-07 21:53 ` Junio C Hamano
2014-10-07 22:17 ` Michael Haggerty
2014-10-08 1:13 ` Jeff King
2014-10-08 16:58 ` Junio C Hamano
2014-10-07 21:16 ` Junio C Hamano
2014-10-03 20:29 ` [PATCH 09/16] prune: factor out loose-object directory traversal Jeff King
2014-10-03 22:19 ` Junio C Hamano
2014-10-04 0:24 ` Jeff King
2014-10-07 14:07 ` Michael Haggerty [this message]
2014-10-08 7:33 ` Jeff King
2014-10-03 20:31 ` [PATCH 10/16] count-objects: do not use xsize_t when counting object size Jeff King
2014-10-03 20:31 ` [PATCH 11/16] count-objects: use for_each_loose_file_in_objdir Jeff King
2014-10-03 20:32 ` [PATCH 12/16] sha1_file: add for_each iterators for loose and packed objects Jeff King
2014-10-05 8:15 ` René Scharfe
2014-10-05 10:47 ` Ramsay Jones
2014-10-03 20:39 ` [PATCH 13/16] prune: keep objects reachable from recent objects Jeff King
2014-10-03 21:47 ` Junio C Hamano
2014-10-04 0:09 ` Jeff King
2014-10-04 0:30 ` Jeff King
2014-10-04 3:04 ` Junio C Hamano
2014-10-07 16:29 ` Michael Haggerty
2014-10-08 7:19 ` Jeff King
2014-10-08 10:37 ` Michael Haggerty
2014-10-03 20:39 ` [PATCH 14/16] pack-objects: refactor unpack-unreachable expiration check Jeff King
2014-10-03 20:40 ` [PATCH 15/16] pack-objects: match prune logic for discarding objects Jeff King
2014-10-03 20:41 ` [PATCH 16/16] write_sha1_file: freshen existing objects Jeff King
2014-10-03 21:29 ` Junio C Hamano
2014-10-04 0:01 ` Jeff King
2014-10-05 9:12 ` René Scharfe
2014-10-03 22:20 ` [PATCH 0/16] make prune mtime-checking more careful Junio C Hamano
2014-10-04 22:22 ` Junio C Hamano
2014-10-05 9:19 ` René Scharfe
2014-10-06 1:42 ` Jeff King
2014-10-08 8:31 ` Jeff King
2014-10-08 17:03 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5433F3B8.2080505@alum.mit.edu \
--to=mhagger@alum.mit.edu \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).