From: Junio C Hamano <gitster@pobox.com>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 3/6] object-file: extract logic to approximate object count
Date: Tue, 10 Mar 2026 10:44:06 -0700
Message-ID: <xmqqjyvjvau1.fsf@gitster.g>
In-Reply-To: <20260310-b4-pks-odb-source-count-objects-v1-3-109e07d425f4@pks.im> (Patrick Steinhardt's message of "Tue, 10 Mar 2026 16:18:23 +0100")
Patrick Steinhardt <ps@pks.im> writes:
> static int too_many_loose_objects(int limit)
> {
> ...
> + int auto_threshold = DIV_ROUND_UP(limit, 256) * 256;
> + unsigned long loose_count;
> +
> + if (odb_source_loose_approximate_object_count(the_repository->objects->sources,
> + &loose_count) < 0)
> return 0;
>
> - auto_threshold = DIV_ROUND_UP(limit, 256);
> - while ((ent = readdir(dir)) != NULL) {
> - if (strspn(ent->d_name, "0123456789abcdef") != hexsz_loose ||
> - ent->d_name[hexsz_loose] != '\0')
> - continue;
> - if (++num_loose > auto_threshold) {
> - needed = 1;
> - break;
> - }
> - }
> - closedir(dir);
> - return needed;
> + return loose_count > auto_threshold;
> }
We used to sample one shard directory and stop as soon as we knew we
had more than auto_threshold, which is roughly 1/256 of the given
limit. Now we ask the "approximate" function to count and then compare
the result with the same auto_threshold (i.e., 1/256 of the given
limit), which means we expect the approximate function to count only
1/256 of the total loose objects somehow? Let's keep reading.
> static struct packed_git *find_base_packs(struct string_list *packs,
> diff --git a/object-file.c b/object-file.c
> index a3ff7f586c..da67e3c9ff 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1868,6 +1868,47 @@ int odb_source_loose_for_each_object(struct odb_source *source,
> NULL, NULL, &data);
> }
>
> +int odb_source_loose_approximate_object_count(struct odb_source *source,
> + unsigned long *out)
> +{
> + const unsigned hexsz = source->odb->repo->hash_algo->hexsz - 2;
> + unsigned long count = 0;
> + struct dirent *ent;
> + char *path = NULL;
> + DIR *dir = NULL;
> + int ret;
> +
> + path = xstrfmt("%s/17", source->path);
> +
> + dir = opendir(path);
> + if (!dir) {
> + if (errno == ENOENT) {
> + *out = 0;
> + ret = 0;
> + goto out;
> + }
> +
> + ret = error_errno("cannot open object shard '%s'", path);
> + goto out;
> + }
> +
> + while ((ent = readdir(dir)) != NULL) {
> + if (strspn(ent->d_name, "0123456789abcdef") != hexsz ||
> + ent->d_name[hexsz] != '\0')
> + continue;
> + count++;
> + }
This counts one shard ("17" that is randomly picked) fully and then ...
> + *out = count * 256;
... estimates that the entire world would probably have 256 times as
many objects as that one shard.
Ah, my earlier read of the caller was confused. auto_threshold used
to be 1/256 of the limit, but now the number used is computed with a
strange arithmetic, "DIV_ROUND_UP(limit, 256) * 256", i.e., the limit
rounded up to a multiple of 256. Not directly using "limit" fooled me
into thinking that it somehow kept using the same 1/256 of the limit.
So we are answering the "do we have too many?" question using roughly
the same criterion as before, not 1/256 off as I suspected earlier.
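To convince ourselves that the two comparisons agree, here is a small
standalone sketch. The helper names too_many_old(), too_many_new() and
check_equivalence() are made up for illustration and are not part of
the patch; DIV_ROUND_UP is redefined locally and assumed to be the
usual round-up integer division.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the round-up division macro (assumed definition). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Old check: raw number of entries in one shard vs. 1/256 of the limit. */
static bool too_many_old(unsigned long shard_entries, int limit)
{
	unsigned long auto_threshold = DIV_ROUND_UP(limit, 256);

	return shard_entries > auto_threshold;
}

/*
 * New check: the shard count is scaled by 256 to estimate the total,
 * and compared against the limit rounded up to a multiple of 256.
 */
static bool too_many_new(unsigned long shard_entries, int limit)
{
	unsigned long auto_threshold = DIV_ROUND_UP(limit, 256) * 256;
	unsigned long loose_count = shard_entries * 256;

	return loose_count > auto_threshold;
}

/* Both formulations should agree for every shard size up to max_entries. */
static void check_equivalence(int limit, unsigned long max_entries)
{
	for (unsigned long n = 0; n <= max_entries; n++)
		assert(too_many_old(n, limit) == too_many_new(n, limit));
}
```

With a limit of 6700, DIV_ROUND_UP(6700, 256) is 27 and the new
threshold is 27 * 256 = 6912, so both versions say "too many" once the
shard holds 28 or more entries. Both sides of the old comparison are
simply multiplied by 256, which is why the answer does not change.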
The old implementation exited early as soon as the threshold was
hit. While scanning a single shard directory is likely fast enough
that this may not matter in practice, it is a slight change in
behaviour. If a repository has an extremely large number of loose
objects (e.g. tens of thousands in shard 17), this will now count
all of them instead of stopping at ~30 (if the limit is set to around
7000 objects).
Given that this is an "auto" GC check, the performance difference is
probably negligible, but I thought it worth pointing out.
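If the early exit ever turns out to matter, one possible shape is a
counter that takes an optional cap. The following is only a sketch
under assumptions: count_shard_entries() and its stop_after parameter
are hypothetical and do not appear in the series, and error reporting
is reduced to a plain -1.

```c
#include <dirent.h>
#include <errno.h>
#include <string.h>

/*
 * Count entries in the shard directory `path` whose names consist of
 * exactly `hexsz` lowercase hex digits. If `stop_after` is non-zero,
 * stop scanning once the count exceeds it (the old early-exit
 * behaviour); zero means "count everything".
 *
 * Returns 0 on success (a missing directory counts as 0 entries) and
 * -1 on any other opendir() failure.
 */
static int count_shard_entries(const char *path, size_t hexsz,
			       unsigned long stop_after,
			       unsigned long *out)
{
	unsigned long count = 0;
	struct dirent *ent;
	DIR *dir = opendir(path);

	if (!dir) {
		if (errno != ENOENT)
			return -1;
		*out = 0;
		return 0;
	}

	while ((ent = readdir(dir)) != NULL) {
		if (strspn(ent->d_name, "0123456789abcdef") != hexsz ||
		    ent->d_name[hexsz] != '\0')
			continue;
		count++;
		if (stop_after && count > stop_after)
			break;
	}
	closedir(dir);

	*out = count;
	return 0;
}
```

A caller that only wants a yes/no answer could pass the per-shard
threshold as stop_after, while a caller that wants the estimate would
pass 0 and multiply the result by 256.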