From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: christian.couder@gmail.com, gitster@pobox.com,
johannes.schindelin@gmx.de, johncai86@gmail.com,
karthik.188@gmail.com, kristofferhaugsbakk@fastmail.com,
me@ttaylorr.com, newren@gmail.com, peff@peff.net, ps@pks.im,
Derrick Stolee <stolee@gmail.com>,
Derrick Stolee <stolee@gmail.com>
Subject: [PATCH v2 05/10] path-walk: support blob size limit filter
Date: Mon, 04 May 2026 20:21:14 +0000 [thread overview]
Message-ID: <d309345fece6fe293b392197b717b947b48590fc.1777926079.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.2101.v2.git.1777926079.gitgitgadget@gmail.com>
From: Derrick Stolee <stolee@gmail.com>
Extend the path-walk API to handle the 'blob:limit=<size>' object
filter natively. This filter omits blobs whose size is equal to or
greater than the given limit, matching the semantics used by the
list-objects-filter machinery.
When revs->filter.choice is LOFC_BLOB_LIMIT, the prepare_filters()
method stores the limit value in info->blob_limit and clears the filter
from revs. If the limit is zero, this degenerates to blob:none (all
blobs excluded), so info->blobs is set to 0 instead.
During walk_path(), blob batches are filtered before being delivered to
the callback: each blob's size is checked via odb_read_object_info(),
and only blobs strictly smaller than the limit are included. Blobs whose
size cannot be determined (e.g. missing in a partial clone) are
conservatively included, matching the existing filter behavior. Empty
batches after filtering are skipped entirely.
The check for inclusion in the path batch looks a little strange at
first glance. We use odb_read_object_info() to read the object's size.
Based on all of the assumptions to this point, this _should_ return
OBJ_BLOB. Since we are focused on the size filter, we use a
short-circuited OR (||) to skip the size check if that method returns a
different object type.
Notice that this inspection of object sizes requires the content to be
present in the repository. The odb_read_object_info() call will download
a missing blob on-demand. This means that the use of the path-walk API
within 'git backfill' would not operate nicely with this filter type.
The intention of that command is to download missing blobs in batches.
Downloading objects one-by-one would go against the point. Update the
validation in 'git backfill' to add its own compatibility check on top
of path_walk_filter_compatible().
Add tests for blob:limit=0 (equivalent to blob:none) and blob:limit=3
(which exercises partial filtering within a batch where some blobs are
kept and others are excluded).
Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/git-pack-objects.adoc | 2 +-
builtin/backfill.c | 2 +
path-walk.c | 38 ++++++++++++--
path-walk.h | 8 +++
t/t5620-backfill.sh | 2 +-
t/t6601-path-walk.sh | 78 +++++++++++++++++++++++++++++
6 files changed, 125 insertions(+), 5 deletions(-)
diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
index 917045d5c3..3821bf7e22 100644
--- a/Documentation/git-pack-objects.adoc
+++ b/Documentation/git-pack-objects.adoc
@@ -404,7 +404,7 @@ will be automatically changed to version `1`.
+
Incompatible with `--delta-islands`. The `--use-bitmap-index` option is
ignored in the presence of `--path-walk`. Whe `--path-walk` option
-supports the `--filter=<spec>` form `blob:none`.
+supports the `--filter=<spec>` form `blob:none` and `blob:limit=<n>`.
DELTA ISLANDS
diff --git a/builtin/backfill.c b/builtin/backfill.c
index b80f9ebe69..5254a42711 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -98,6 +98,8 @@ static void reject_unsupported_rev_list_options(struct rev_info *revs)
"--diff-merges");
if (!path_walk_filter_compatible(&revs->filter))
die(_("cannot backfill with these filter options"));
+ if (revs->filter.blob_limit_value)
+ die(_("cannot backfill with blob size limits"));
}
static int do_backfill(struct backfill_context *ctx)
diff --git a/path-walk.c b/path-walk.c
index a4dd197c37..0e7dab7a6a 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -10,6 +10,7 @@
#include "hex.h"
#include "list-objects.h"
#include "list-objects-filter-options.h"
+#include "odb.h"
#include "object.h"
#include "oid-array.h"
#include "path.h"
@@ -315,9 +316,29 @@ static int walk_path(struct path_walk_context *ctx,
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
(list->type == OBJ_BLOB && ctx->info->blobs) ||
- (list->type == OBJ_TAG && ctx->info->tags))
- ret = ctx->info->path_fn(path, &list->oids, list->type,
- ctx->info->path_fn_data);
+ (list->type == OBJ_TAG && ctx->info->tags)) {
+ struct oid_array *oids = &list->oids;
+ struct oid_array filtered = OID_ARRAY_INIT;
+
+ if (list->type == OBJ_BLOB && ctx->info->blob_limit) {
+ for (size_t i = 0; i < list->oids.nr; i++) {
+ unsigned long size;
+
+ if (odb_read_object_info(ctx->repo->objects,
+ &list->oids.oid[i],
+ &size) != OBJ_BLOB ||
+ size < ctx->info->blob_limit)
+ oid_array_append(&filtered,
+ &list->oids.oid[i]);
+ }
+ oids = &filtered;
+ }
+
+ if (oids->nr)
+ ret = ctx->info->path_fn(path, oids, list->type,
+ ctx->info->path_fn_data);
+ oid_array_clear(&filtered);
+ }
/* Expand data for children. */
if (list->type == OBJ_TREE) {
@@ -500,6 +521,17 @@ static int prepare_filters(struct path_walk_info *info,
}
return 1;
+ case LOFC_BLOB_LIMIT:
+ if (info) {
+ if (!options->blob_limit_value) {
+ info->blobs = 0;
+ } else {
+ info->blob_limit = options->blob_limit_value;
+ }
+ list_objects_filter_release(options);
+ }
+ return 1;
+
default:
error(_("object filter '%s' not supported by the path-walk API"),
list_objects_filter_spec(options));
diff --git a/path-walk.h b/path-walk.h
index be8d27b398..bcb81b70a1 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -42,6 +42,14 @@ struct path_walk_info {
int blobs;
int tags;
+ /**
+ * If non-zero, specifies a maximum blob size. Blobs with a
+ * size equal to or greater than this limit will be omitted
+ * from the walk. Blobs smaller than the limit (or blobs
+ * whose size cannot be determined) are still visited.
+ */
+ unsigned long blob_limit;
+
/**
* When 'prune_all_uninteresting' is set and a path has all objects
* marked as UNINTERESTING, then the path-walk will not visit those
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
index ede89f8c33..d2ea68e065 100755
--- a/t/t5620-backfill.sh
+++ b/t/t5620-backfill.sh
@@ -20,7 +20,7 @@ test_expect_success 'backfill rejects incompatible filter options' '
test_grep "cannot backfill with these filter options" err &&
test_must_fail git backfill --objects --filter=blob:limit=10m 2>err &&
- test_grep "cannot backfill with these filter options" err
+ test_grep "cannot backfill with blob size limits" err
'
# We create objects in the 'src' repo.
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 94df309987..d9be7b9cd2 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -475,4 +475,82 @@ test_expect_success 'topic only, blob:none filter' '
test_cmp_sorted expect out
'
+test_expect_success 'all, blob:limit=0 filter' '
+ test-tool path-walk --filter=blob:limit=0 -- --all >out &&
+
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base)
+ 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~2)
+ 1:tag:/tags:$(git rev-parse refs/tags/first)
+ 1:tag:/tags:$(git rev-parse refs/tags/second.1)
+ 1:tag:/tags:$(git rev-parse refs/tags/second.2)
+ 1:tag:/tags:$(git rev-parse refs/tags/third)
+ 1:tag:/tags:$(git rev-parse refs/tags/fourth)
+ 1:tag:/tags:$(git rev-parse refs/tags/tree-tag)
+ 1:tag:/tags:$(git rev-parse refs/tags/blob-tag)
+ 2:tree::$(git rev-parse topic^{tree})
+ 2:tree::$(git rev-parse base^{tree})
+ 2:tree::$(git rev-parse base~1^{tree})
+ 2:tree::$(git rev-parse base~2^{tree})
+ 2:tree::$(git rev-parse refs/tags/tree-tag^{})
+ 2:tree::$(git rev-parse refs/tags/tree-tag2^{})
+ 3:tree:a/:$(git rev-parse base:a)
+ 4:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
+ 5:tree:left/:$(git rev-parse base:left)
+ 5:tree:left/:$(git rev-parse base~2:left)
+ 6:tree:right/:$(git rev-parse topic:right)
+ 6:tree:right/:$(git rev-parse base~1:right)
+ 6:tree:right/:$(git rev-parse base~2:right)
+ blobs:0
+ commits:4
+ tags:7
+ trees:13
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'all, blob:limit=3 filter' '
+ test-tool path-walk --filter=blob:limit=3 -- --all >out &&
+
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base)
+ 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~2)
+ 1:tag:/tags:$(git rev-parse refs/tags/first)
+ 1:tag:/tags:$(git rev-parse refs/tags/second.1)
+ 1:tag:/tags:$(git rev-parse refs/tags/second.2)
+ 1:tag:/tags:$(git rev-parse refs/tags/third)
+ 1:tag:/tags:$(git rev-parse refs/tags/fourth)
+ 1:tag:/tags:$(git rev-parse refs/tags/tree-tag)
+ 1:tag:/tags:$(git rev-parse refs/tags/blob-tag)
+ 2:tree::$(git rev-parse topic^{tree})
+ 2:tree::$(git rev-parse base^{tree})
+ 2:tree::$(git rev-parse base~1^{tree})
+ 2:tree::$(git rev-parse base~2^{tree})
+ 2:tree::$(git rev-parse refs/tags/tree-tag^{})
+ 2:tree::$(git rev-parse refs/tags/tree-tag2^{})
+ 3:blob:a:$(git rev-parse base~2:a)
+ 4:tree:a/:$(git rev-parse base:a)
+ 5:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
+ 6:tree:left/:$(git rev-parse base:left)
+ 6:tree:left/:$(git rev-parse base~2:left)
+ 7:blob:left/b:$(git rev-parse base~2:left/b)
+ 8:tree:right/:$(git rev-parse topic:right)
+ 8:tree:right/:$(git rev-parse base~1:right)
+ 8:tree:right/:$(git rev-parse base~2:right)
+ 9:blob:right/c:$(git rev-parse base~2:right/c)
+ 10:blob:right/d:$(git rev-parse base~1:right/d)
+ blobs:4
+ commits:4
+ tags:7
+ trees:13
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_done
--
gitgitgadget
next prev parent reply other threads:[~2026-05-04 20:21 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-02 14:15 [PATCH 0/7] pack-objects: integrate --path-walk and some --filter options Derrick Stolee via GitGitGadget
2026-05-02 14:15 ` [PATCH 1/7] pack-objects: pass --objects with --path-walk Derrick Stolee via GitGitGadget
2026-05-04 0:49 ` Junio C Hamano
2026-05-04 12:01 ` Derrick Stolee
2026-05-02 14:15 ` [PATCH 2/7] t/perf: add pack-objects filter and path-walk benchmark Derrick Stolee via GitGitGadget
2026-05-02 14:15 ` [PATCH 3/7] path-walk: support blobless filter Derrick Stolee via GitGitGadget
2026-05-02 14:15 ` [PATCH 4/7] backfill: die on incompatible filter options Derrick Stolee via GitGitGadget
2026-05-03 22:59 ` Junio C Hamano
2026-05-04 12:09 ` Derrick Stolee
2026-05-02 14:15 ` [PATCH 5/7] path-walk: support blob size limit filter Derrick Stolee via GitGitGadget
2026-05-02 14:15 ` [PATCH 6/7] path-walk: add pl_sparse_trees to control tree pruning Derrick Stolee via GitGitGadget
2026-05-02 14:15 ` [PATCH 7/7] pack-objects: support sparse:oid filter with path-walk Derrick Stolee via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 00/10] pack-objects: integrate --path-walk and some --filter options Derrick Stolee via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 01/10] pack-objects: pass --objects with --path-walk Derrick Stolee via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 02/10] t/perf: add pack-objects filter and path-walk benchmark Derrick Stolee via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 03/10] path-walk: support blobless filter Derrick Stolee via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 04/10] backfill: die on incompatible filter options Derrick Stolee via GitGitGadget
2026-05-04 20:21 ` Derrick Stolee via GitGitGadget [this message]
2026-05-04 20:21 ` [PATCH v2 06/10] path-walk: add pl_sparse_trees to control tree pruning Derrick Stolee via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 07/10] pack-objects: support sparse:oid filter with path-walk Derrick Stolee via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 08/10] path-walk: support `tree:0` filter Taylor Blau via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 09/10] path-walk: support `object:type` filter Taylor Blau via GitGitGadget
2026-05-04 20:21 ` [PATCH v2 10/10] path-walk: support `combine` filter Taylor Blau via GitGitGadget
2026-05-05 16:18 ` [PATCH v2 00/10] pack-objects: integrate --path-walk and some --filter options Derrick Stolee
2026-05-05 19:01 ` Taylor Blau
2026-05-05 19:44 ` Derrick Stolee
2026-05-05 20:42 ` Taylor Blau
2026-05-07 11:40 ` Derrick Stolee
2026-05-11 3:05 ` Junio C Hamano
2026-05-11 13:58 ` Derrick Stolee
2026-05-11 18:12 ` [PATCH v3 00/12] " Derrick Stolee via GitGitGadget
2026-05-11 18:12 ` [PATCH v3 01/12] t5620: make test work with path-walk var Derrick Stolee via GitGitGadget
2026-05-12 1:03 ` Taylor Blau
2026-05-11 18:12 ` [PATCH v3 02/12] pack-objects: pass --objects with --path-walk Derrick Stolee via GitGitGadget
2026-05-12 1:04 ` Taylor Blau
2026-05-11 18:13 ` [PATCH v3 03/12] t/perf: add pack-objects filter and path-walk benchmark Derrick Stolee via GitGitGadget
2026-05-12 1:11 ` Taylor Blau
2026-05-13 18:23 ` Derrick Stolee
2026-05-11 18:13 ` [PATCH v3 04/12] path-walk: always emit directly-requested objects Derrick Stolee via GitGitGadget
2026-05-12 1:23 ` Taylor Blau
2026-05-13 18:29 ` Derrick Stolee
2026-05-11 18:13 ` [PATCH v3 05/12] path-walk: support blobless filter Derrick Stolee via GitGitGadget
2026-05-11 18:38 ` Taylor Blau
2026-05-11 19:44 ` Derrick Stolee
2026-05-11 18:13 ` [PATCH v3 06/12] backfill: die on incompatible filter options Derrick Stolee via GitGitGadget
2026-05-12 1:26 ` Taylor Blau
2026-05-11 18:13 ` [PATCH v3 07/12] path-walk: support blob size limit filter Derrick Stolee via GitGitGadget
2026-05-12 1:33 ` Taylor Blau
2026-05-13 18:35 ` Derrick Stolee
2026-05-11 18:13 ` [PATCH v3 08/12] path-walk: add pl_sparse_trees to control tree pruning Derrick Stolee via GitGitGadget
2026-05-11 18:13 ` [PATCH v3 09/12] pack-objects: support sparse:oid filter with path-walk Derrick Stolee via GitGitGadget
2026-05-11 18:13 ` [PATCH v3 10/12] path-walk: support `tree:0` filter Taylor Blau via GitGitGadget
2026-05-12 1:41 ` Taylor Blau
2026-05-13 19:46 ` Derrick Stolee
2026-05-11 18:13 ` [PATCH v3 11/12] path-walk: support `object:type` filter Taylor Blau via GitGitGadget
2026-05-11 18:13 ` [PATCH v3 12/12] path-walk: support `combine` filter Taylor Blau via GitGitGadget
2026-05-12 1:43 ` [PATCH v3 00/12] pack-objects: integrate --path-walk and some --filter options Taylor Blau
2026-05-13 21:18 ` [PATCH v4 00/13] " Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 01/13] t5620: make test work with path-walk var Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 02/13] pack-objects: pass --objects with --path-walk Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 03/13] t/perf: add pack-objects filter and path-walk benchmark Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 04/13] path-walk: always emit directly-requested objects Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 05/13] path-walk: support blobless filter Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 06/13] backfill: die on incompatible filter options Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 07/13] path-walk: support blob size limit filter Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 08/13] path-walk: add pl_sparse_trees to control tree pruning Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 09/13] pack-objects: support sparse:oid filter with path-walk Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 10/13] t6601: tag otherwise-unreachable trees Derrick Stolee via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 11/13] path-walk: support `tree:0` filter Taylor Blau via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 12/13] path-walk: support `object:type` filter Taylor Blau via GitGitGadget
2026-05-13 21:18 ` [PATCH v4 13/13] path-walk: support `combine` filter Taylor Blau via GitGitGadget
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d309345fece6fe293b392197b717b947b48590fc.1777926079.git.gitgitgadget@gmail.com \
--to=gitgitgadget@gmail.com \
--cc=christian.couder@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=johncai86@gmail.com \
--cc=karthik.188@gmail.com \
--cc=kristofferhaugsbakk@fastmail.com \
--cc=me@ttaylorr.com \
--cc=newren@gmail.com \
--cc=peff@peff.net \
--cc=ps@pks.im \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox