* [PATCH 0/9] builtin/cat-file: allow filtering objects in batch mode
@ 2025-02-21 7:47 Patrick Steinhardt
2025-02-21 7:47 ` [PATCH 1/9] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
` (10 more replies)
0 siblings, 11 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
Hi,
at GitLab, we sometimes have the need to list all objects regardless of
their reachability. We use git-cat-file(1) with `--batch-all-objects` to
do this, and typically this is quite a good fit. In some cases though,
we only want to list objects of a specific type, where we then basically
have the following pipeline:
    git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
        grep '^commit ' |
        cut -d' ' -f2 |
        git cat-file --batch
This works okayish in medium-sized repositories, but once you reach a
certain size this isn't really an option anymore. In the Chromium
repository for example [1] simply listing all objects in the first
invocation of git-cat-file(1) takes around 80 to 100 seconds. The
workload is completely I/O-bottlenecked: my machine reads at ~500MB/s,
and the packfile is 50GB in size, which matches the 100 seconds that I
observe.
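As a quick sanity check of that estimate, here is a minimal sketch in C. The helper name `scan_seconds` is made up for illustration, and it assumes decimal units (1 GB = 1000 MB) and a purely sequential, I/O-bound read:

```c
#include <assert.h>

/*
 * Hypothetical helper: lower bound, in seconds, for reading a packfile
 * of `pack_size_mb` megabytes at `read_rate_mb_per_s` megabytes per
 * second when the workload is entirely I/O-bound.
 */
double scan_seconds(double pack_size_mb, double read_rate_mb_per_s)
{
	assert(read_rate_mb_per_s > 0.0);
	return pack_size_mb / read_rate_mb_per_s;
}
```

Calling `scan_seconds(50000.0, 500.0)` with the numbers quoted above yields 100 seconds.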
This series addresses the issue by introducing object filters into
git-cat-file(1). These object filters use the exact same syntax as the
filters we have in git-rev-list(1), but only a subset of them is
supported because not all filters can be computed by git-cat-file(1).
Supported are "blob:none", "blob:limit=" as well as "object:type=".
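To make the semantics of the three filters concrete, here is a rough, self-contained model in C. The type and function names (`struct filter`, `filter_omits` and the enums) are simplified stand-ins invented for this sketch; the patches themselves work on git's `enum object_type` and `struct list_objects_filter_options`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for git's object types and filter choices. */
enum obj_type { TYPE_COMMIT, TYPE_TREE, TYPE_BLOB, TYPE_TAG };
enum filter_kind {
	FILTER_DISABLED,
	FILTER_BLOB_NONE,
	FILTER_BLOB_LIMIT,
	FILTER_OBJECT_TYPE,
};

struct filter {
	enum filter_kind kind;
	unsigned long blob_limit; /* only used by FILTER_BLOB_LIMIT */
	enum obj_type type;       /* only used by FILTER_OBJECT_TYPE */
};

/* Returns true if the object should be omitted from the output. */
bool filter_omits(const struct filter *f, enum obj_type type,
		  unsigned long size)
{
	assert(f != NULL);

	switch (f->kind) {
	case FILTER_BLOB_NONE:
		/* "blob:none": drop every blob. */
		return type == TYPE_BLOB;
	case FILTER_BLOB_LIMIT:
		/* "blob:limit=<n>": drop blobs of size at least n. */
		return type == TYPE_BLOB && size >= f->blob_limit;
	case FILTER_OBJECT_TYPE:
		/* "object:type=<t>": drop everything that is not of type t. */
		return type != f->type;
	case FILTER_DISABLED:
	default:
		return false;
	}
}
```

Note that all three predicates only need the object's type and size, which is why they can be evaluated without any reachability information.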
The filters alone don't really help though: we still have to scan
through the whole packfile in order to apply the filters. While we
are able to shed a bit of CPU time because we can stop emitting some of
the objects, we're still I/O-bottlenecked.
The second part of the series thus expands the filters so that they can
make use of bitmap indices for some of the filters, if available. This
allows us to efficiently answer the question where to find all objects
of a specific type, and thus we can avoid scanning through the packfile
and instead directly look up relevant objects, leading to a significant
speedup:
    Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
      Time (mean ± σ):     82.806 s ±  6.363 s    [User: 30.956 s, System: 8.264 s]
      Range (min … max):   73.936 s … 89.690 s    10 runs

    Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
      Time (mean ± σ):      20.8 ms ±   1.3 ms    [User: 6.1 ms, System: 14.5 ms]
      Range (min … max):    18.2 ms …  23.6 ms    127 runs

    Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
      Time (mean ± σ):      1.551 s ±  0.008 s    [User: 1.401 s, System: 0.147 s]
      Range (min … max):    1.541 s …  1.566 s    10 runs

    Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
      Time (mean ± σ):     11.169 s ±  0.046 s    [User: 10.076 s, System: 1.063 s]
      Range (min … max):   11.114 s … 11.245 s    10 runs

    Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
      Time (mean ± σ):     67.342 s ±  3.368 s    [User: 20.318 s, System: 7.787 s]
      Range (min … max):   62.836 s … 73.618 s    10 runs

    Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
      Time (mean ± σ):     13.032 s ±  0.072 s    [User: 11.638 s, System: 1.368 s]
      Range (min … max):   12.960 s … 13.199 s    10 runs

    Summary
      git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
        74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
        538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
        627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
        3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
        3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
We now directly scale with the number of objects of a specific type
contained in the packfile instead of scaling with the overall number of
objects. It's quite fun to see how the math plays out: if you sum up the
times for each of the types you arrive at the time for the unfiltered
case.
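The arithmetic can be checked with a few lines of C. This is only an illustrative sketch: the array simply restates the mean times of benchmarks 2 to 5 above, and the names `per_type_seconds` and `sum_seconds` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Mean wall-clock times in seconds for tag, commit, tree and blob,
 * copied from benchmarks 2 to 5 above. */
const double per_type_seconds[] = { 0.0208, 1.551, 11.169, 67.342 };

/* Hypothetical helper: add up `n` timings. */
double sum_seconds(const double *times, size_t n)
{
	double total = 0.0;

	assert(times != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += times[i];
	return total;
}
```

Summing the four values gives roughly 80.08 seconds, which lies within one standard deviation (6.363 s) of the 82.806 s mean measured for the unfiltered run.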
Thanks!
Patrick
[1]: https://github.com/chromium/chromium.git
---
Patrick Steinhardt (9):
builtin/cat-file: rename variable that tracks usage
builtin/cat-file: wire up an option to filter objects
builtin/cat-file: support "blob:none" objects filter
builtin/cat-file: support "blob:limit=" objects filter
builtin/cat-file: support "object:type=" objects filter
pack-bitmap: expose function to iterate over bitmapped objects
pack-bitmap: introduce function to check whether a pack is bitmapped
builtin/cat-file: deduplicate logic to iterate over all objects
builtin/cat-file: use bitmaps to efficiently filter by object type
Documentation/git-cat-file.adoc | 16 +++
builtin/cat-file.c | 225 +++++++++++++++++++++++++++++-----------
builtin/pack-objects.c | 3 +-
builtin/rev-list.c | 3 +-
pack-bitmap.c | 80 +++++++++-----
pack-bitmap.h | 19 +++-
reachable.c | 3 +-
t/t1006-cat-file.sh | 77 ++++++++++++++
8 files changed, 339 insertions(+), 87 deletions(-)
---
base-commit: a554262210b4a2ee6fa2d594e1f09f5830888c56
change-id: 20250220-pks-cat-file-object-type-filter-9140c0ed5ee1
* [PATCH 1/9] builtin/cat-file: rename variable that tracks usage
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
The usage strings for git-cat-file(1) that we pass to `parse_options()`
and `usage_msg_optf()` are stored in a variable called `usage`. This
variable shadows the declaration of `usage()`, which we'll want to use
in a subsequent commit.
Rename the variable to `builtin_catfile_usage`, which is in line with
how the variable is typically called in other builtins.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/cat-file.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b13561cf73b..8e40016dd24 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -941,7 +941,7 @@ int cmd_cat_file(int argc,
int input_nul_terminated = 0;
int nul_terminated = 0;
- const char * const usage[] = {
+ const char * const builtin_catfile_usage[] = {
N_("git cat-file <type> <object>"),
N_("git cat-file (-e | -p) <object>"),
N_("git cat-file (-t | -s) [--allow-unknown-type] <object>"),
@@ -1007,7 +1007,7 @@ int cmd_cat_file(int argc,
batch.buffer_output = -1;
- argc = parse_options(argc, argv, prefix, options, usage, 0);
+ argc = parse_options(argc, argv, prefix, options, builtin_catfile_usage, 0);
opt_cw = (opt == 'c' || opt == 'w');
opt_epts = (opt == 'e' || opt == 'p' || opt == 't' || opt == 's');
@@ -1021,7 +1021,7 @@ int cmd_cat_file(int argc,
/* Option compatibility */
if (force_path && !opt_cw)
usage_msg_optf(_("'%s=<%s>' needs '%s' or '%s'"),
- usage, options,
+ builtin_catfile_usage, options,
"--path", _("path|tree-ish"), "--filters",
"--textconv");
@@ -1029,19 +1029,19 @@ int cmd_cat_file(int argc,
if (batch.enabled)
;
else if (batch.follow_symlinks)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"--follow-symlinks");
else if (batch.buffer_output >= 0)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"--buffer");
else if (batch.all_objects)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"--batch-all-objects");
else if (input_nul_terminated)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"-z");
else if (nul_terminated)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"-Z");
batch.input_delim = batch.output_delim = '\n';
@@ -1063,9 +1063,9 @@ int cmd_cat_file(int argc,
batch.transform_mode = opt;
else if (opt && opt != 'b')
usage_msg_optf(_("'-%c' is incompatible with batch mode"),
- usage, options, opt);
+ builtin_catfile_usage, options, opt);
else if (argc)
- usage_msg_opt(_("batch modes take no arguments"), usage,
+ usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
options);
return batch_objects(&batch);
@@ -1074,22 +1074,22 @@ int cmd_cat_file(int argc,
if (opt) {
if (!argc && opt == 'c')
usage_msg_optf(_("<rev> required with '%s'"),
- usage, options, "--textconv");
+ builtin_catfile_usage, options, "--textconv");
else if (!argc && opt == 'w')
usage_msg_optf(_("<rev> required with '%s'"),
- usage, options, "--filters");
+ builtin_catfile_usage, options, "--filters");
else if (!argc && opt_epts)
usage_msg_optf(_("<object> required with '-%c'"),
- usage, options, opt);
+ builtin_catfile_usage, options, opt);
else if (argc == 1)
obj_name = argv[0];
else
- usage_msg_opt(_("too many arguments"), usage, options);
+ usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options);
} else if (!argc) {
- usage_with_options(usage, options);
+ usage_with_options(builtin_catfile_usage, options);
} else if (argc != 2) {
usage_msg_optf(_("only two arguments allowed in <type> <object> mode, not %d"),
- usage, options, argc);
+ builtin_catfile_usage, options, argc);
} else if (argc) {
exp_type = argv[0];
obj_name = argv[1];
--
2.48.1.683.gf705b3209c.dirty
* [PATCH 2/9] builtin/cat-file: wire up an option to filter objects
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
In batch mode, git-cat-file(1) enumerates all objects and prints them
by iterating through both loose and packed objects. This works without
considering their reachability at all, and consequently most options to
filter objects as they exist in e.g. git-rev-list(1) are not applicable.
In some situations it may still be useful though to filter objects based
on properties that are inherent to them. This includes the object size
as well as its type.
Such a filter already exists in git-rev-list(1) with the `--filter=`
command line option. While this option supports a couple of filters that
are not applicable to our use case, some of them are quite a neat fit.
Wire up the filter as an option for git-cat-file(1). This allows us to
reuse the same syntax as in git-rev-list(1) so that we don't have to
reinvent the wheel. For now, we die when any of the filter options has
been passed by the user, but they will be wired up in subsequent
commits.
Note that we don't use the same `--filter=` name for the option as we use
in git-rev-list(1). We already have `--filters`, and having both
`--filter=` and `--filters` would be quite confusing. Instead, the new
option is called `--objects-filter`.
Further note that the filters that we are about to introduce don't
significantly speed up the runtime of git-cat-file(1). While we can skip
emitting a lot of objects in case they are uninteresting to us, the
majority of time is spent reading the packfile, which is bottlenecked by
I/O and not the processor. This will change though once we start to make
use of bitmaps, which will allow us to skip reading the whole packfile.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 6 ++++++
builtin/cat-file.c | 37 +++++++++++++++++++++++++++++++++----
t/t1006-cat-file.sh | 32 ++++++++++++++++++++++++++++++++
3 files changed, 71 insertions(+), 4 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index d5890ae3686..7c1c888079a 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -81,6 +81,12 @@ OPTIONS
end-of-line conversion, etc). In this case, `<object>` has to be of
the form `<tree-ish>:<path>`, or `:<path>`.
+--objects-filter=<filter-spec>::
+--no-objects-filter::
+ Omit objects from the list of printed objects. This can only be used in
+ combination with one of the batched modes. The '<filter-spec>' may be
+ one of the following:
+
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
name and a path separately, e.g. when it is difficult to figure out
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 8e40016dd24..723644fbba8 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -15,6 +15,7 @@
#include "gettext.h"
#include "hex.h"
#include "ident.h"
+#include "list-objects-filter-options.h"
#include "parse-options.h"
#include "userdiff.h"
#include "streaming.h"
@@ -35,6 +36,7 @@ enum batch_mode {
};
struct batch_options {
+ struct list_objects_filter_options objects_filter;
int enabled;
int follow_symlinks;
enum batch_mode batch_mode;
@@ -487,6 +489,13 @@ static void batch_object_write(const char *obj_name,
return;
}
+ switch (opt->objects_filter.choice) {
+ case LOFC_DISABLED:
+ break;
+ default:
+ BUG("unsupported objects filter");
+ }
+
if (use_mailmap && (data->type == OBJ_COMMIT || data->type == OBJ_TAG)) {
size_t s = data->size;
char *buf = NULL;
@@ -812,7 +821,8 @@ static int batch_objects(struct batch_options *opt)
struct object_cb_data cb;
struct object_info empty = OBJECT_INFO_INIT;
- if (!memcmp(&data.info, &empty, sizeof(empty)))
+ if (!memcmp(&data.info, &empty, sizeof(empty)) &&
+ opt->objects_filter.choice == LOFC_DISABLED)
data.skip_object_info = 1;
if (repo_has_promisor_remote(the_repository))
@@ -936,10 +946,13 @@ int cmd_cat_file(int argc,
int opt_cw = 0;
int opt_epts = 0;
const char *exp_type = NULL, *obj_name = NULL;
- struct batch_options batch = {0};
+ struct batch_options batch = {
+ .objects_filter = LIST_OBJECTS_FILTER_INIT,
+ };
int unknown_type = 0;
int input_nul_terminated = 0;
int nul_terminated = 0;
+ int ret;
const char * const builtin_catfile_usage[] = {
N_("git cat-file <type> <object>"),
@@ -1000,6 +1013,8 @@ int cmd_cat_file(int argc,
N_("run filters on object's content"), 'w'),
OPT_STRING(0, "path", &force_path, N_("blob|tree"),
N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
+ OPT_CALLBACK(0, "objects-filter", &batch.objects_filter, N_("args"),
+ N_("object filtering"), opt_parse_list_objects_filter),
OPT_END()
};
@@ -1014,6 +1029,14 @@ int cmd_cat_file(int argc,
if (use_mailmap)
read_mailmap(&mailmap);
+ switch (batch.objects_filter.choice) {
+ case LOFC_DISABLED:
+ break;
+ default:
+ usagef(_("objects filter not supported: '%s'"),
+ list_object_filter_config_name(batch.objects_filter.choice));
+ }
+
/* --batch-all-objects? */
if (opt == 'b')
batch.all_objects = 1;
@@ -1068,7 +1091,8 @@ int cmd_cat_file(int argc,
usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
options);
- return batch_objects(&batch);
+ ret = batch_objects(&batch);
+ goto out;
}
if (opt) {
@@ -1097,5 +1121,10 @@ int cmd_cat_file(int argc,
if (unknown_type && opt != 't' && opt != 's')
die("git cat-file --allow-unknown-type: use with -s or -t");
- return cat_one_file(opt, exp_type, obj_name, unknown_type);
+
+ ret = cat_one_file(opt, exp_type, obj_name, unknown_type);
+
+out:
+ list_objects_filter_release(&batch.objects_filter);
+ return ret;
}
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 398865d6ebe..48840a13561 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1353,4 +1353,36 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
perl -e "$script" -- --batch-command $hello_oid "$expect" "info "
'
+test_expect_success 'setup for objects filter' '
+ git init repo
+'
+
+test_expect_success 'objects filter with unknown option' '
+ cat >expect <<-EOF &&
+ fatal: invalid filter-spec ${SQ}unknown${SQ}
+ EOF
+ test_must_fail git -C repo cat-file --objects-filter=unknown 2>err &&
+ test_cmp expect err
+'
+
+for option in blob:none blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+do
+ test_expect_success "objects filter with unsupported option $option" '
+ case "$option" in
+ tree:1)
+ echo "usage: objects filter not supported: ${SQ}tree${SQ}" >expect
+ ;;
+ sparse:path=x)
+ echo "fatal: sparse:path filters support has been dropped" >expect
+ ;;
+ *)
+ option_name=$(echo "$option" | cut -d= -f1) &&
+ printf "usage: objects filter not supported: ${SQ}%s${SQ}\n" "$option_name" >expect
+ ;;
+ esac &&
+ test_must_fail git -C repo cat-file --objects-filter=$option 2>err &&
+ test_cmp expect err
+ '
+done
+
test_done
--
2.48.1.683.gf705b3209c.dirty
* [PATCH 3/9] builtin/cat-file: support "blob:none" objects filter
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
Implement support for the "blob:none" filter in git-cat-file(1), which
causes us to omit all blobs.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 2 ++
builtin/cat-file.c | 11 ++++++++++-
t/t1006-cat-file.sh | 33 +++++++++++++++++++++++++++++++--
3 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index 7c1c888079a..c11952d9eca 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -86,6 +86,8 @@ OPTIONS
Omit objects from the list of printed objects. This can only be used in
combination with one of the batched modes. The '<filter-spec>' may be
one of the following:
++
+The form '--filter=blob:none' omits all blobs.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 723644fbba8..8e5572ba43e 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -472,7 +472,8 @@ static void batch_object_write(const char *obj_name,
if (!data->skip_object_info) {
int ret;
- if (use_mailmap)
+ if (use_mailmap ||
+ opt->objects_filter.choice == LOFC_BLOB_NONE)
data->info.typep = &data->type;
if (pack)
@@ -492,6 +493,10 @@ static void batch_object_write(const char *obj_name,
switch (opt->objects_filter.choice) {
case LOFC_DISABLED:
break;
+ case LOFC_BLOB_NONE:
+ if (data->type == OBJ_BLOB)
+ return;
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1032,6 +1037,10 @@ int cmd_cat_file(int argc,
switch (batch.objects_filter.choice) {
case LOFC_DISABLED:
break;
+ case LOFC_BLOB_NONE:
+ if (!batch.enabled)
+ usage(_("objects filter only supported in batch mode"));
+ break;
default:
usagef(_("objects filter not supported: '%s'"),
list_object_filter_config_name(batch.objects_filter.choice));
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 48840a13561..97533225982 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1354,7 +1354,22 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
'
test_expect_success 'setup for objects filter' '
- git init repo
+ git init repo &&
+ (
+ # Seed the repository with three different sets of objects:
+ #
+ # - The first set is fully packed and has a bitmap.
+ # - The second set is packed, but has no bitmap.
+ # - The third set is loose.
+ #
+ # This ensures that we cover all these types as expected.
+ cd repo &&
+ test_commit first &&
+ git repack -Adb &&
+ test_commit second &&
+ git repack -d &&
+ test_commit third
+ )
'
test_expect_success 'objects filter with unknown option' '
@@ -1365,7 +1380,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in blob:none blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1385,4 +1400,18 @@ do
'
done
+test_objects_filter () {
+ filter="$1"
+
+ test_expect_success "objects filter: $filter" '
+ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --objects-filter="$filter" >actual &&
+ sort actual >actual.sorted &&
+ git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >expect &&
+ sort expect >expect.sorted &&
+ test_cmp expect.sorted actual.sorted
+ '
+}
+
+test_objects_filter "blob:none"
+
test_done
--
2.48.1.683.gf705b3209c.dirty
* [PATCH 4/9] builtin/cat-file: support "blob:limit=" objects filter
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
Implement support for the "blob:limit=" filter in git-cat-file(1), which
causes us to omit all blobs whose size is at least the given limit.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 5 +++++
builtin/cat-file.c | 11 ++++++++++-
t/t1006-cat-file.sh | 18 +++++++++++++++---
3 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index c11952d9eca..8c474418b52 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -88,6 +88,11 @@ OPTIONS
one of the following:
+
The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
+bytes or units. n may be zero. The suffixes k, m, and g can be used
+to name units in KiB, MiB, or GiB. For example, 'blob:limit=1k'
+is the same as 'blob:limit=1024'.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 8e5572ba43e..f57bf65cb03 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -473,8 +473,11 @@ static void batch_object_write(const char *obj_name,
int ret;
if (use_mailmap ||
- opt->objects_filter.choice == LOFC_BLOB_NONE)
+ opt->objects_filter.choice == LOFC_BLOB_NONE ||
+ opt->objects_filter.choice == LOFC_BLOB_LIMIT)
data->info.typep = &data->type;
+ if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
+ data->info.sizep = &data->size;
if (pack)
ret = packed_object_info(the_repository, pack, offset,
@@ -497,6 +500,11 @@ static void batch_object_write(const char *obj_name,
if (data->type == OBJ_BLOB)
return;
break;
+ case LOFC_BLOB_LIMIT:
+ if (data->type == OBJ_BLOB &&
+ data->size >= opt->objects_filter.blob_limit_value)
+ return;
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1038,6 +1046,7 @@ int cmd_cat_file(int argc,
case LOFC_DISABLED:
break;
case LOFC_BLOB_NONE:
+ case LOFC_BLOB_LIMIT:
if (!batch.enabled)
usage(_("objects filter only supported in batch mode"));
break;
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 97533225982..86c53e01b2f 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1356,11 +1356,12 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
test_expect_success 'setup for objects filter' '
git init repo &&
(
- # Seed the repository with three different sets of objects:
+ # Seed the repository with four different sets of objects:
#
# - The first set is fully packed and has a bitmap.
# - The second set is packed, but has no bitmap.
# - The third set is loose.
+ # - The fourth set is loose and contains big objects.
#
# This ensures that we cover all these types as expected.
cd repo &&
@@ -1368,7 +1369,14 @@ test_expect_success 'setup for objects filter' '
git repack -Adb &&
test_commit second &&
git repack -d &&
- test_commit third
+ test_commit third &&
+
+ for n in 1000 10000
+ do
+ printf "%"$n"s" X >large.$n || return 1
+ done &&
+ git add large.* &&
+ git commit -m fourth
)
'
@@ -1380,7 +1388,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in object:type=tag sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1413,5 +1421,9 @@ test_objects_filter () {
}
test_objects_filter "blob:none"
+test_objects_filter "blob:limit=1"
+test_objects_filter "blob:limit=500"
+test_objects_filter "blob:limit=1000"
+test_objects_filter "blob:limit=1g"
test_done
--
2.48.1.683.gf705b3209c.dirty
* [PATCH 5/9] builtin/cat-file: support "object:type=" objects filter
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
Implement support for the "object:type=" filter in git-cat-file(1),
which causes us to omit all objects that don't match the provided object
type.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 3 +++
builtin/cat-file.c | 8 +++++++-
t/t1006-cat-file.sh | 6 +++++-
3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index 8c474418b52..540d9dffdf9 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -93,6 +93,9 @@ The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
bytes or units. n may be zero. The suffixes k, m, and g can be used
to name units in KiB, MiB, or GiB. For example, 'blob:limit=1k'
is the same as 'blob:limit=1024'.
++
+The form '--filter=object:type=(tag|commit|tree|blob)' omits all objects
+which are not of the requested type.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index f57bf65cb03..b374c2bb104 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -474,7 +474,8 @@ static void batch_object_write(const char *obj_name,
if (use_mailmap ||
opt->objects_filter.choice == LOFC_BLOB_NONE ||
- opt->objects_filter.choice == LOFC_BLOB_LIMIT)
+ opt->objects_filter.choice == LOFC_BLOB_LIMIT ||
+ opt->objects_filter.choice == LOFC_OBJECT_TYPE)
data->info.typep = &data->type;
if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
data->info.sizep = &data->size;
@@ -505,6 +506,10 @@ static void batch_object_write(const char *obj_name,
data->size >= opt->objects_filter.blob_limit_value)
return;
break;
+ case LOFC_OBJECT_TYPE:
+ if (data->type != opt->objects_filter.object_type)
+ return;
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1047,6 +1052,7 @@ int cmd_cat_file(int argc,
break;
case LOFC_BLOB_NONE:
case LOFC_BLOB_LIMIT:
+ case LOFC_OBJECT_TYPE:
if (!batch.enabled)
usage(_("objects filter only supported in batch mode"));
break;
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 86c53e01b2f..b908bbf60e1 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1388,7 +1388,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1425,5 +1425,9 @@ test_objects_filter "blob:limit=1"
test_objects_filter "blob:limit=500"
test_objects_filter "blob:limit=1000"
test_objects_filter "blob:limit=1g"
+test_objects_filter "object:type=blob"
+test_objects_filter "object:type=commit"
+test_objects_filter "object:type=tag"
+test_objects_filter "object:type=tree"
test_done
--
2.48.1.683.gf705b3209c.dirty
* [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
Expose a function that allows the caller to iterate over all bitmapped
objects of a specific type. This mechanism allows us to use the object
type-specific bitmaps to enumerate all objects of that type without
having to scan through a complete packfile.
This functionality will be used in a subsequent commit.
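The general idea can be sketched independently of git's data structures. The following is a simplified illustration only: it uses plain, uncompressed arrays of 64-bit words, whereas git stores the per-type bitmaps EWAH-compressed, and the names `bit_fn`, `for_each_typed_bit`, `pos_list` and `record_pos` are invented for the example. The `void *payload` argument mirrors the payload parameter this patch threads through to the callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: receives a bit position and the payload. */
typedef void (*bit_fn)(size_t pos, void *payload);

/*
 * Invoke `fn` for every bit position that is set in both `type_bits`
 * (all objects of one type) and `selected` (the objects of interest).
 * Both arrays hold `nwords` uncompressed 64-bit words.
 */
void for_each_typed_bit(const uint64_t *type_bits, const uint64_t *selected,
			size_t nwords, bit_fn fn, void *payload)
{
	assert(fn != NULL);

	for (size_t i = 0; i < nwords; i++) {
		uint64_t word = type_bits[i] & selected[i];

		/* Walk the remaining set bits from least to most significant. */
		for (unsigned bit = 0; word; bit++, word >>= 1)
			if (word & 1)
				fn(i * 64 + bit, payload);
	}
}

/* Example payload and callback: remember up to eight visited positions. */
struct pos_list {
	size_t pos[8];
	size_t nr;
};

void record_pos(size_t pos, void *payload)
{
	struct pos_list *list = payload;

	if (list->nr < 8)
		list->pos[list->nr++] = pos;
}
```

Because only words that contain matching bits cause any per-object work, the cost scales with the number of objects of the requested type rather than with the total number of objects in the pack.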
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/pack-objects.c | 3 ++-
builtin/rev-list.c | 3 ++-
pack-bitmap.c | 65 +++++++++++++++++++++++++++++++-------------------
pack-bitmap.h | 12 +++++++++-
reachable.c | 3 ++-
5 files changed, 57 insertions(+), 29 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b161262..8f99e2b4fa8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1735,7 +1735,8 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
static int add_object_entry_from_bitmap(const struct object_id *oid,
enum object_type type,
int flags UNUSED, uint32_t name_hash,
- struct packed_git *pack, off_t offset)
+ struct packed_git *pack, off_t offset,
+ void *payload UNUSED)
{
display_progress(progress_state, ++nr_seen);
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index bb26bee0d45..1100dd2abe7 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -429,7 +429,8 @@ static int show_object_fast(
int exclude UNUSED,
uint32_t name_hash UNUSED,
struct packed_git *found_pack UNUSED,
- off_t found_offset UNUSED)
+ off_t found_offset UNUSED,
+ void *payload UNUSED)
{
fprintf(stdout, "%s\n", oid_to_hex(oid));
return 1;
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 6406953d322..fc92e0aae65 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1509,50 +1509,45 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
(obj->type == OBJ_TAG && !revs->tag_objects))
continue;
- show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0);
+ show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0, NULL);
}
}
-static void init_type_iterator(struct ewah_iterator *it,
- struct bitmap_index *bitmap_git,
- enum object_type type)
+static struct ewah_bitmap *ewah_for_type(struct bitmap_index *bitmap_git,
+ enum object_type type)
{
switch (type) {
case OBJ_COMMIT:
- ewah_iterator_init(it, bitmap_git->commits);
- break;
-
+ return bitmap_git->commits;
case OBJ_TREE:
- ewah_iterator_init(it, bitmap_git->trees);
- break;
-
+ return bitmap_git->trees;
case OBJ_BLOB:
- ewah_iterator_init(it, bitmap_git->blobs);
- break;
-
+ return bitmap_git->blobs;
case OBJ_TAG:
- ewah_iterator_init(it, bitmap_git->tags);
- break;
-
+ return bitmap_git->tags;
default:
BUG("object type %d not stored by bitmap type index", type);
- break;
}
}
-static void show_objects_for_type(
- struct bitmap_index *bitmap_git,
- enum object_type object_type,
- show_reachable_fn show_reach)
+static void init_type_iterator(struct ewah_iterator *it,
+ struct bitmap_index *bitmap_git,
+ enum object_type type)
+{
+ ewah_iterator_init(it, ewah_for_type(bitmap_git, type));
+}
+
+static void for_each_bitmapped_object_internal(struct bitmap_index *bitmap_git,
+ struct bitmap *objects,
+ enum object_type object_type,
+ show_reachable_fn show_reach,
+ void *payload)
{
size_t i = 0;
uint32_t offset;
-
struct ewah_iterator it;
eword_t filter;
- struct bitmap *objects = bitmap_git->result;
-
init_type_iterator(&it, bitmap_git, object_type);
for (i = 0; i < objects->word_alloc &&
@@ -1595,11 +1590,31 @@ static void show_objects_for_type(
if (bitmap_git->hashes)
hash = get_be32(bitmap_git->hashes + index_pos);
- show_reach(&oid, object_type, 0, hash, pack, ofs);
+ show_reach(&oid, object_type, 0, hash, pack, ofs, payload);
}
}
}
+static void show_objects_for_type(
+ struct bitmap_index *bitmap_git,
+ enum object_type object_type,
+ show_reachable_fn show_reach)
+{
+ for_each_bitmapped_object_internal(bitmap_git, bitmap_git->result,
+ object_type, show_reach, NULL);
+}
+
+void for_each_bitmapped_object(struct bitmap_index *bitmap_git,
+ enum object_type object_type,
+ show_reachable_fn show_reach,
+ void *payload)
+{
+ struct bitmap *bitmap = ewah_to_bitmap(ewah_for_type(bitmap_git, object_type));
+ for_each_bitmapped_object_internal(bitmap_git, bitmap,
+ object_type, show_reach, payload);
+ bitmap_free(bitmap);
+}
+
static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
struct object_list *roots)
{
diff --git a/pack-bitmap.h b/pack-bitmap.h
index d7f4b8b8e95..3368e79ed5a 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -50,7 +50,8 @@ typedef int (*show_reachable_fn)(
int flags,
uint32_t hash,
struct packed_git *found_pack,
- off_t found_offset);
+ off_t found_offset,
+ void *payload);
struct bitmap_index;
@@ -78,6 +79,15 @@ int test_bitmap_pseudo_merges(struct repository *r);
int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n);
int test_bitmap_pseudo_merge_objects(struct repository *r, uint32_t n);
+/*
+ * Iterate through all bitmapped objects of the given type and execute the
+ * `show_reach` for each of them.
+ */
+ void for_each_bitmapped_object(struct bitmap_index *bitmap_git,
+ enum object_type object_type,
+ show_reachable_fn show_reach,
+ void *payload);
+
#define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \
"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL"
diff --git a/reachable.c b/reachable.c
index ecf7ccf5041..dd33c7f07dd 100644
--- a/reachable.c
+++ b/reachable.c
@@ -337,7 +337,8 @@ static int mark_object_seen(const struct object_id *oid,
int exclude UNUSED,
uint32_t name_hash UNUSED,
struct packed_git *found_pack UNUSED,
- off_t found_offset UNUSED)
+ off_t found_offset UNUSED,
+ void *payload UNUSED)
{
struct object *obj = lookup_object_by_type(the_repository, oid, type);
if (!obj)
--
2.48.1.683.gf705b3209c.dirty
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH 7/9] pack-bitmap: introduce function to check whether a pack is bitmapped
2025-02-21 7:47 [PATCH 0/9] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (5 preceding siblings ...)
2025-02-21 7:47 ` [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects Patrick Steinhardt
@ 2025-02-21 7:47 ` Patrick Steinhardt
2025-02-27 23:33 ` Taylor Blau
2025-02-21 7:47 ` [PATCH 8/9] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
` (3 subsequent siblings)
10 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
Introduce a function that allows us to verify whether a pack is
bitmapped or not. This functionality will be used in a subsequent
commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
pack-bitmap.c | 15 +++++++++++++++
pack-bitmap.h | 7 +++++++
2 files changed, 22 insertions(+)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index fc92e0aae65..3cbe5bfe909 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -658,6 +658,21 @@ struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
return NULL;
}
+int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack)
+{
+ if (bitmap->pack)
+ return bitmap->pack == pack;
+
+ if (!bitmap->midx->chunk_bitmapped_packs)
+ return 0;
+
+ for (size_t i = 0; i < bitmap->midx->num_packs; i++)
+ if (bitmap->midx->packs[i] == pack)
+ return 1;
+
+ return 0;
+}
+
struct include_data {
struct bitmap_index *bitmap_git;
struct bitmap *base;
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 3368e79ed5a..45e96b213e2 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -67,6 +67,13 @@ struct bitmapped_pack {
struct bitmap_index *prepare_bitmap_git(struct repository *r);
struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
+
+/*
+ * Given a bitmap index, determine whether it contains the pack either directly
+ * or via the multi-pack-index.
+ */
+int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack);
+
void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
uint32_t *trees, uint32_t *blobs, uint32_t *tags);
void traverse_bitmap_commit_list(struct bitmap_index *,
--
2.48.1.683.gf705b3209c.dirty
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH 8/9] builtin/cat-file: deduplicate logic to iterate over all objects
2025-02-21 7:47 [PATCH 0/9] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (6 preceding siblings ...)
2025-02-21 7:47 ` [PATCH 7/9] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
@ 2025-02-21 7:47 ` Patrick Steinhardt
2025-02-21 7:47 ` [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
` (2 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
Pull out a common function that allows us to iterate over all objects in
a repository. Right now the logic is trivial and would only require two
function calls, making this refactoring a bit pointless. But in the next
commit we will iterate on this logic to make use of bitmaps, so this is
about to become a bit more complex.
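The refactoring boils down to a small trampoline: two iteration sources with
different callback signatures get adapted to one unified callback. The
following is a minimal sketch of that idea, not the code from this patch.
All names (`unified_fn`, `trampoline`, `from_loose`, `from_packed`,
`sum_offsets`) are hypothetical, and the `pos * 16` translation is merely a
stand-in for a real position-to-offset lookup.

```c
#include <assert.h>
#include <stddef.h>

/* The one callback signature that callers implement. */
typedef int (*unified_fn)(int id, long offset, void *data);

/* Carries the caller's callback and its payload through the adapters. */
struct trampoline {
	unified_fn callback;
	void *payload;
};

/* Adapter for a source that reports only an id (think: loose objects). */
static int from_loose(int id, void *data)
{
	struct trampoline *t = data;
	assert(t && t->callback);
	return t->callback(id, 0, t->payload);
}

/*
 * Adapter for a source that reports an id and a position which has to be
 * translated into an offset first (think: packed objects).
 */
static int from_packed(int id, unsigned pos, void *data)
{
	struct trampoline *t = data;
	assert(t && t->callback);
	return t->callback(id, (long)pos * 16, t->payload);
}

/* Example unified callback: accumulate all offsets in the payload. */
static int sum_offsets(int id, long offset, void *data)
{
	(void)id;
	*(long *)data += offset;
	return 0;
}
```

The caller writes its logic once against `unified_fn`, and adding a third
source later only requires one more adapter.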
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/cat-file.c | 85 ++++++++++++++++++++++++++++++------------------------
1 file changed, 48 insertions(+), 37 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b374c2bb104..25d5429e391 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -622,25 +622,18 @@ static int batch_object_cb(const struct object_id *oid, void *vdata)
return 0;
}
-static int collect_loose_object(const struct object_id *oid,
- const char *path UNUSED,
- void *data)
-{
- oid_array_append(data, oid);
- return 0;
-}
-
-static int collect_packed_object(const struct object_id *oid,
- struct packed_git *pack UNUSED,
- uint32_t pos UNUSED,
- void *data)
+static int collect_object(const struct object_id *oid,
+ struct packed_git *pack UNUSED,
+ off_t offset UNUSED,
+ void *data)
{
oid_array_append(data, oid);
return 0;
}
static int batch_unordered_object(const struct object_id *oid,
- struct packed_git *pack, off_t offset,
+ struct packed_git *pack,
+ off_t offset,
void *vdata)
{
struct object_cb_data *data = vdata;
@@ -654,23 +647,6 @@ static int batch_unordered_object(const struct object_id *oid,
return 0;
}
-static int batch_unordered_loose(const struct object_id *oid,
- const char *path UNUSED,
- void *data)
-{
- return batch_unordered_object(oid, NULL, 0, data);
-}
-
-static int batch_unordered_packed(const struct object_id *oid,
- struct packed_git *pack,
- uint32_t pos,
- void *data)
-{
- return batch_unordered_object(oid, pack,
- nth_packed_object_offset(pack, pos),
- data);
-}
-
typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
struct strbuf *, struct expand_data *);
@@ -803,6 +779,45 @@ static void batch_objects_command(struct batch_options *opt,
#define DEFAULT_FORMAT "%(objectname) %(objecttype) %(objectsize)"
+typedef int (*for_each_object_fn)(const struct object_id *oid, struct packed_git *pack,
+ off_t offset, void *data);
+
+struct for_each_object_payload {
+ for_each_object_fn callback;
+ void *payload;
+};
+
+static int batch_one_object_loose(const struct object_id *oid,
+ const char *path UNUSED,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, NULL, 0, payload->payload);
+}
+
+static int batch_one_object_packed(const struct object_id *oid,
+ struct packed_git *pack,
+ uint32_t pos,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, pack, nth_packed_object_offset(pack, pos),
+ payload->payload);
+}
+
+static void batch_each_object(for_each_object_fn callback,
+ unsigned flags,
+ void *_payload)
+{
+ struct for_each_object_payload payload = {
+ .callback = callback,
+ .payload = _payload,
+ };
+ for_each_loose_object(batch_one_object_loose, &payload, 0);
+ for_each_packed_object(the_repository, batch_one_object_packed,
+ &payload, flags);
+}
+
static int batch_objects(struct batch_options *opt)
{
struct strbuf input = STRBUF_INIT;
@@ -857,18 +872,14 @@ static int batch_objects(struct batch_options *opt)
cb.seen = &seen;
- for_each_loose_object(batch_unordered_loose, &cb, 0);
- for_each_packed_object(the_repository, batch_unordered_packed,
- &cb, FOR_EACH_OBJECT_PACK_ORDER);
+ batch_each_object(batch_unordered_object,
+ FOR_EACH_OBJECT_PACK_ORDER, &cb);
oidset_clear(&seen);
} else {
struct oid_array sa = OID_ARRAY_INIT;
- for_each_loose_object(collect_loose_object, &sa, 0);
- for_each_packed_object(the_repository, collect_packed_object,
- &sa, 0);
-
+ batch_each_object(collect_object, 0, &sa);
oid_array_for_each_unique(&sa, batch_object_cb, &cb);
oid_array_clear(&sa);
--
2.48.1.683.gf705b3209c.dirty
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type
2025-02-21 7:47 [PATCH 0/9] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (7 preceding siblings ...)
2025-02-21 7:47 ` [PATCH 8/9] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
@ 2025-02-21 7:47 ` Patrick Steinhardt
2025-02-27 11:38 ` Karthik Nayak
2025-02-27 23:48 ` Taylor Blau
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
10 siblings, 2 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-02-21 7:47 UTC (permalink / raw)
To: git
While it is now possible to filter objects by type, this mechanism is
for now mostly a convenience. Most importantly, we still have to iterate
through the whole packfile to find all objects of a specific type. This
can be prohibitively expensive depending on the size of the packfiles.
It isn't really possible to do better than this when only considering a
packfile itself, as the order of objects is not fixed. But when we have
a packfile with a corresponding bitmap, either because the packfile
itself has one or because the multi-pack index has a bitmap for it, then
we can use these bitmaps to improve the runtime.
While bitmaps are typically used to compute reachability of objects,
they also contain one bitmap per object type that encodes which object has
what type. So instead of reading through the whole packfile(s), we can
use the bitmaps and iterate through the type-specific bitmap. Typically,
only a subset of packfiles will have a bitmap. But this isn't really
much of a problem: we can use bitmaps when available, and then use the
non-bitmap walk for every packfile that isn't covered by one.
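The fallback strategy described above can be sketched as follows. This is a
toy model under stated assumptions, not the implementation: `struct toy_pack`
and `objects_needing_scan` are hypothetical names, and whether a pack is
covered by a bitmap is reduced to a boolean flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a packfile. */
struct toy_pack {
	const char *name;
	bool covered_by_bitmap;
	size_t nr_objects;
};

/*
 * Return the number of objects that still require a full pack scan, that is
 * all objects in packs which are not covered by a bitmap.
 */
static size_t objects_needing_scan(const struct toy_pack *packs, size_t nr)
{
	size_t total = 0;
	assert(packs || !nr);
	for (size_t i = 0; i < nr; i++) {
		if (packs[i].covered_by_bitmap)
			continue; /* answered via the type-specific bitmap */
		total += packs[i].nr_objects;
	}
	return total;
}
```

In a repository where the bulk of the objects lives in one large bitmapped
pack, only the few recent, small packs would be left for the slow path.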
Overall, this leads to quite a significant speedup depending on how many
objects of a certain type exist. The following benchmarks have been
executed in the Chromium repository, which has a 50GB packfile with
almost 25 million objects:
Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
Time (mean ± σ): 82.806 s ± 6.363 s [User: 30.956 s, System: 8.264 s]
Range (min … max): 73.936 s … 89.690 s 10 runs
Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
Time (mean ± σ): 20.8 ms ± 1.3 ms [User: 6.1 ms, System: 14.5 ms]
Range (min … max): 18.2 ms … 23.6 ms 127 runs
Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
Time (mean ± σ): 1.551 s ± 0.008 s [User: 1.401 s, System: 0.147 s]
Range (min … max): 1.541 s … 1.566 s 10 runs
Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
Time (mean ± σ): 11.169 s ± 0.046 s [User: 10.076 s, System: 1.063 s]
Range (min … max): 11.114 s … 11.245 s 10 runs
Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
Time (mean ± σ): 67.342 s ± 3.368 s [User: 20.318 s, System: 7.787 s]
Range (min … max): 62.836 s … 73.618 s 10 runs
Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
Time (mean ± σ): 13.032 s ± 0.072 s [User: 11.638 s, System: 1.368 s]
Range (min … max): 12.960 s … 13.199 s 10 runs
Summary
git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
Without the bitmap optimization introduced in this commit, all of the
other benchmarks take roughly as long as the first one. What is
noticeable in the benchmarks is that we're I/O-bound, not CPU-bound,
as can be seen from the user/system runtimes, which are often far lower
than the overall benchmarked runtime.
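One way to read the user/system figures is to compute the share of the
wall-clock time that was spent on the CPU. The helper below is a small
sketch for that arithmetic. `cpu_share` is a hypothetical name, and the
interpretation assumes a mostly single-threaded process.

```c
#include <assert.h>

/*
 * Fraction of the wall-clock time spent on the CPU (user + system). Values
 * close to 1.0 suggest a CPU-bound run, values well below 1.0 suggest that
 * the process was mostly waiting, for example on I/O.
 */
static double cpu_share(double user_s, double system_s, double wall_s)
{
	assert(wall_s > 0);
	return (user_s + system_s) / wall_s;
}
```

With the numbers reported above, Benchmark 1 gives
(30.956 + 8.264) / 82.806, or roughly 0.47, which fits the I/O-bound
reading. Benchmark 3 gives (1.401 + 0.147) / 1.551, which is close to 1.0,
so the bitmap-backed commit filter seems to be limited by the CPU instead.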
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/cat-file.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 50 insertions(+), 5 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 25d5429e391..9021fd52f30 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -21,6 +21,7 @@
#include "streaming.h"
#include "oid-array.h"
#include "packfile.h"
+#include "pack-bitmap.h"
#include "object-file.h"
#include "object-name.h"
#include "object-store-ll.h"
@@ -805,7 +806,20 @@ static int batch_one_object_packed(const struct object_id *oid,
payload->payload);
}
-static void batch_each_object(for_each_object_fn callback,
+static int batch_one_object_bitmapped(const struct object_id *oid,
+ enum object_type type UNUSED,
+ int flags UNUSED,
+ uint32_t hash UNUSED,
+ struct packed_git *pack,
+ off_t offset,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, pack, offset, payload->payload);
+}
+
+static void batch_each_object(struct batch_options *opt,
+ for_each_object_fn callback,
unsigned flags,
void *_payload)
{
@@ -813,9 +827,40 @@ static void batch_each_object(for_each_object_fn callback,
.callback = callback,
.payload = _payload,
};
+ struct bitmap_index *bitmap = prepare_bitmap_git(the_repository);
+
for_each_loose_object(batch_one_object_loose, &payload, 0);
- for_each_packed_object(the_repository, batch_one_object_packed,
- &payload, flags);
+
+ if (bitmap &&
+ (opt->objects_filter.choice == LOFC_OBJECT_TYPE ||
+ opt->objects_filter.choice == LOFC_BLOB_NONE)) {
+ struct packed_git *pack;
+
+ if (opt->objects_filter.choice == LOFC_OBJECT_TYPE) {
+ for_each_bitmapped_object(bitmap, opt->objects_filter.object_type,
+ batch_one_object_bitmapped, &payload);
+ } else {
+ for_each_bitmapped_object(bitmap, OBJ_COMMIT,
+ batch_one_object_bitmapped, &payload);
+ for_each_bitmapped_object(bitmap, OBJ_TAG,
+ batch_one_object_bitmapped, &payload);
+ for_each_bitmapped_object(bitmap, OBJ_TREE,
+ batch_one_object_bitmapped, &payload);
+ }
+
+ for (pack = get_all_packs(the_repository); pack; pack = pack->next) {
+ if (bitmap_index_contains_pack(bitmap, pack) ||
+ open_pack_index(pack))
+ continue;
+ for_each_object_in_pack(pack, batch_one_object_packed,
+ &payload, flags);
+ }
+ } else {
+ for_each_packed_object(the_repository, batch_one_object_packed,
+ &payload, flags);
+ }
+
+ free_bitmap_index(bitmap);
}
static int batch_objects(struct batch_options *opt)
@@ -872,14 +917,14 @@ static int batch_objects(struct batch_options *opt)
cb.seen = &seen;
- batch_each_object(batch_unordered_object,
+ batch_each_object(opt, batch_unordered_object,
FOR_EACH_OBJECT_PACK_ORDER, &cb);
oidset_clear(&seen);
} else {
struct oid_array sa = OID_ARRAY_INIT;
- batch_each_object(collect_object, 0, &sa);
+ batch_each_object(opt, collect_object, 0, &sa);
oid_array_for_each_unique(&sa, batch_object_cb, &cb);
oid_array_clear(&sa);
--
2.48.1.683.gf705b3209c.dirty
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
2025-02-21 7:47 ` [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects Patrick Steinhardt
@ 2025-02-24 18:05 ` Junio C Hamano
2025-02-25 6:59 ` Patrick Steinhardt
2025-02-27 23:23 ` Taylor Blau
0 siblings, 2 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-02-24 18:05 UTC (permalink / raw)
To: Patrick Steinhardt, Taylor Blau; +Cc: git
Patrick Steinhardt <ps@pks.im> writes:
> Expose a function that allows the caller to iterate over all bitmapped
> objects of a specific type. This mechanism allows us to use the object
> type-specific bitmaps to enumerate all objects of that type without
> having to scan through a complete packfile.
>
> This functionality will be used in a subsequent commit.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> builtin/pack-objects.c | 3 ++-
> builtin/rev-list.c | 3 ++-
> pack-bitmap.c | 65 +++++++++++++++++++++++++++++++-------------------
> pack-bitmap.h | 12 +++++++++-
> reachable.c | 3 ++-
> 5 files changed, 57 insertions(+), 29 deletions(-)
After 2189649b (pack-bitmap.c: keep track of each layer's type
bitmaps, 2024-11-19) added <type>_all bitmaps to the bitmap_index
struct, this step would need some adjustment, I am afraid.
Taylor Cc'ed.
Thanks.
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 58a9b161262..8f99e2b4fa8 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1735,7 +1735,8 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
> static int add_object_entry_from_bitmap(const struct object_id *oid,
> enum object_type type,
> int flags UNUSED, uint32_t name_hash,
> - struct packed_git *pack, off_t offset)
> + struct packed_git *pack, off_t offset,
> + void *payload UNUSED)
> {
> display_progress(progress_state, ++nr_seen);
>
> diff --git a/builtin/rev-list.c b/builtin/rev-list.c
> index bb26bee0d45..1100dd2abe7 100644
> --- a/builtin/rev-list.c
> +++ b/builtin/rev-list.c
> @@ -429,7 +429,8 @@ static int show_object_fast(
> int exclude UNUSED,
> uint32_t name_hash UNUSED,
> struct packed_git *found_pack UNUSED,
> - off_t found_offset UNUSED)
> + off_t found_offset UNUSED,
> + void *payload UNUSED)
> {
> fprintf(stdout, "%s\n", oid_to_hex(oid));
> return 1;
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 6406953d322..fc92e0aae65 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -1509,50 +1509,45 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
> (obj->type == OBJ_TAG && !revs->tag_objects))
> continue;
>
> - show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0);
> + show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0, NULL);
> }
> }
>
> -static void init_type_iterator(struct ewah_iterator *it,
> - struct bitmap_index *bitmap_git,
> - enum object_type type)
> +static struct ewah_bitmap *ewah_for_type(struct bitmap_index *bitmap_git,
> + enum object_type type)
> {
> switch (type) {
> case OBJ_COMMIT:
> - ewah_iterator_init(it, bitmap_git->commits);
> - break;
> -
> + return bitmap_git->commits;
> case OBJ_TREE:
> - ewah_iterator_init(it, bitmap_git->trees);
> - break;
> -
> + return bitmap_git->trees;
> case OBJ_BLOB:
> - ewah_iterator_init(it, bitmap_git->blobs);
> - break;
> -
> + return bitmap_git->blobs;
> case OBJ_TAG:
> - ewah_iterator_init(it, bitmap_git->tags);
> - break;
> -
> + return bitmap_git->tags;
> default:
> BUG("object type %d not stored by bitmap type index", type);
> - break;
> }
> }
>
> -static void show_objects_for_type(
> - struct bitmap_index *bitmap_git,
> - enum object_type object_type,
> - show_reachable_fn show_reach)
> +static void init_type_iterator(struct ewah_iterator *it,
> + struct bitmap_index *bitmap_git,
> + enum object_type type)
> +{
> + ewah_iterator_init(it, ewah_for_type(bitmap_git, type));
> +}
> +
> +static void for_each_bitmapped_object_internal(struct bitmap_index *bitmap_git,
> + struct bitmap *objects,
> + enum object_type object_type,
> + show_reachable_fn show_reach,
> + void *payload)
> {
> size_t i = 0;
> uint32_t offset;
> -
> struct ewah_iterator it;
> eword_t filter;
>
> - struct bitmap *objects = bitmap_git->result;
> -
> init_type_iterator(&it, bitmap_git, object_type);
>
> for (i = 0; i < objects->word_alloc &&
> @@ -1595,11 +1590,31 @@ static void show_objects_for_type(
> if (bitmap_git->hashes)
> hash = get_be32(bitmap_git->hashes + index_pos);
>
> - show_reach(&oid, object_type, 0, hash, pack, ofs);
> + show_reach(&oid, object_type, 0, hash, pack, ofs, payload);
> }
> }
> }
>
> +static void show_objects_for_type(
> + struct bitmap_index *bitmap_git,
> + enum object_type object_type,
> + show_reachable_fn show_reach)
> +{
> + for_each_bitmapped_object_internal(bitmap_git, bitmap_git->result,
> + object_type, show_reach, NULL);
> +}
> +
> +void for_each_bitmapped_object(struct bitmap_index *bitmap_git,
> + enum object_type object_type,
> + show_reachable_fn show_reach,
> + void *payload)
> +{
> + struct bitmap *bitmap = ewah_to_bitmap(ewah_for_type(bitmap_git, object_type));
> + for_each_bitmapped_object_internal(bitmap_git, bitmap,
> + object_type, show_reach, payload);
> + bitmap_free(bitmap);
> +}
> +
> static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
> struct object_list *roots)
> {
> diff --git a/pack-bitmap.h b/pack-bitmap.h
> index d7f4b8b8e95..3368e79ed5a 100644
> --- a/pack-bitmap.h
> +++ b/pack-bitmap.h
> @@ -50,7 +50,8 @@ typedef int (*show_reachable_fn)(
> int flags,
> uint32_t hash,
> struct packed_git *found_pack,
> - off_t found_offset);
> + off_t found_offset,
> + void *payload);
>
> struct bitmap_index;
>
> @@ -78,6 +79,15 @@ int test_bitmap_pseudo_merges(struct repository *r);
> int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n);
> int test_bitmap_pseudo_merge_objects(struct repository *r, uint32_t n);
>
> +/*
> + * Iterate through all bitmapped objects of the given type and execute the
> + * `show_reach` for each of them.
> + */
> + void for_each_bitmapped_object(struct bitmap_index *bitmap_git,
> + enum object_type object_type,
> + show_reachable_fn show_reach,
> + void *payload);
> +
> #define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \
> "GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL"
>
> diff --git a/reachable.c b/reachable.c
> index ecf7ccf5041..dd33c7f07dd 100644
> --- a/reachable.c
> +++ b/reachable.c
> @@ -337,7 +337,8 @@ static int mark_object_seen(const struct object_id *oid,
> int exclude UNUSED,
> uint32_t name_hash UNUSED,
> struct packed_git *found_pack UNUSED,
> - off_t found_offset UNUSED)
> + off_t found_offset UNUSED,
> + void *payload UNUSED)
> {
> struct object *obj = lookup_object_by_type(the_repository, oid, type);
> if (!obj)
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
2025-02-24 18:05 ` Junio C Hamano
@ 2025-02-25 6:59 ` Patrick Steinhardt
2025-02-25 16:59 ` Junio C Hamano
2025-02-27 23:26 ` Taylor Blau
2025-02-27 23:23 ` Taylor Blau
1 sibling, 2 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-02-25 6:59 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Taylor Blau, git
On Mon, Feb 24, 2025 at 10:05:27AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > Expose a function that allows the caller to iterate over all bitmapped
> > objects of a specific type. This mechanism allows us to use the object
> > type-specific bitmaps to enumerate all objects of that type without
> > having to scan through a complete packfile.
> >
> > This functionality will be used in a subsequent commit.
> >
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> > builtin/pack-objects.c | 3 ++-
> > builtin/rev-list.c | 3 ++-
> > pack-bitmap.c | 65 +++++++++++++++++++++++++++++++-------------------
> > pack-bitmap.h | 12 +++++++++-
> > reachable.c | 3 ++-
> > 5 files changed, 57 insertions(+), 29 deletions(-)
>
> After 2189649b (pack-bitmap.c: keep track of each layer's type
> bitmaps, 2024-11-19) added <type>_all bitmaps to the bitmap_index
> struct, this step would need some adjustment, I am afraid.
Hm, does it? I understand that this commit only makes the bitmaps
accessible individually per bitmapped packfile, but the bitmap indices
part of `struct bitmap_index` would continue to be the union of all of
those bitmaps. Oh, but that changes in the subsequent commits indeed,
where we start to use an `ewah_or_iterator`.
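To illustrate what iterating over a union of per-layer bitmaps means, here
is a minimal sketch. It assumes uncompressed bitmaps of equal length and is
not how `ewah_or_iterator` is implemented. `or_word_at` is a hypothetical
name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return word `i` of the union of `nr_layers` bitmaps, each of which is
 * `nr_words` words long. A bit is set in the result if it is set in any
 * of the layers.
 */
static uint64_t or_word_at(const uint64_t *const *layers, size_t nr_layers,
			   size_t nr_words, size_t i)
{
	uint64_t word = 0;
	assert(i < nr_words);
	for (size_t l = 0; l < nr_layers; l++)
		word |= layers[l][i];
	return word;
}
```

A consumer that only ever reads a single bitmap would miss the bits stored
in the other layers, which is presumably why this step needs adjusting.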
I see that Taylor's series has been sitting in an unreviewed state for a
couple months already. I can review it with the hope of moving it
forward and can then pull it in as a dependency of this series. But I'll
wait for him to chime in first to see whether anything changed about its
current state.
Patrick
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
2025-02-25 6:59 ` Patrick Steinhardt
@ 2025-02-25 16:59 ` Junio C Hamano
2025-02-27 23:26 ` Taylor Blau
1 sibling, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-02-25 16:59 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Taylor Blau, git
Patrick Steinhardt <ps@pks.im> writes:
> I see that Taylor's series has been sitting in an unreviewed state for a
> couple months already. I can review it with the hope of moving it
> forward and can then pull it in as a dependency of this series. But I'll
> wait for him to chime in first to see whether anything changed about its
> current state.
Thanks. Making sure that one hand knows what the other hand's doing
would be a good idea.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 2/9] builtin/cat-file: wire up an option to filter objects
2025-02-21 7:47 ` [PATCH 2/9] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
@ 2025-02-26 15:20 ` Toon Claes
2025-02-28 10:51 ` Patrick Steinhardt
2025-02-27 11:20 ` Karthik Nayak
1 sibling, 1 reply; 72+ messages in thread
From: Toon Claes @ 2025-02-26 15:20 UTC (permalink / raw)
To: Patrick Steinhardt, git
Patrick Steinhardt <ps@pks.im> writes:
> In batch mode, git-cat-file(1) enumerates all objects and prints them
> by iterating through both loose and packed objects. This works without
> considering their reachability at all, and consequently most options to
> filter objects as they exist in e.g. git-rev-list(1) are not applicable.
> In some situations it may still be useful though to filter objects based
> on properties that are inherent to them. This includes the object size
> as well as its type.
>
> Such a filter already exists in git-rev-list(1) with the `--filter=`
> command line option. While this option supports a couple of filters that
> are not applicable to our usecase, some of them are quite a neat fit.
>
> Wire up the filter as an option for git-cat-file(1). This allows us to
> reuse the same syntax as in git-rev-list(1) so that we don't have to
> reinvent the wheel. For now, we die when any of the filter options has
> been passed by the user, but they will be wired up in subsequent
> commits.
>
> Note that we don't use the same `--filter=` name for the option as we use
> in git-rev-list(1). We already have `--filters`, and having both
> `--filter=` and `--filters` would be quite confusing. Instead, the new
> option is called `--objects-filter`.
I'm not sure I agree. I would rather have consistency across commands.
Since `--filters` doesn't accept an argument, I would say having both
`--filters` and `--filter=` is fine. I see in various places
we already use `OPT_PARSE_LIST_OBJECTS_FILTER` which defines the option
as `--filter=`, so it's pretty standard for several commands. I'd
prefer git-cat-file(1) to follow that as well. But that's my 2 cents.
--
Toon
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 3/9] builtin/cat-file: support "blob:none" objects filter
2025-02-21 7:47 ` [PATCH 3/9] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
@ 2025-02-26 15:22 ` Toon Claes
2025-02-27 11:26 ` Karthik Nayak
1 sibling, 0 replies; 72+ messages in thread
From: Toon Claes @ 2025-02-26 15:22 UTC (permalink / raw)
To: Patrick Steinhardt, git
Patrick Steinhardt <ps@pks.im> writes:
> Implement support for the "blob:none" filter in git-cat-file(1), which
> causes us to omit all blobs.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Documentation/git-cat-file.adoc | 2 ++
> builtin/cat-file.c | 11 ++++++++++-
> t/t1006-cat-file.sh | 33 +++++++++++++++++++++++++++++++--
> 3 files changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
> index 7c1c888079a..c11952d9eca 100644
> --- a/Documentation/git-cat-file.adoc
> +++ b/Documentation/git-cat-file.adoc
> @@ -86,6 +86,8 @@ OPTIONS
> Omit objects from the list of printed objects. This can only be used in
> combination with one of the batched modes. The '<filter-spec>' may be
> one of the following:
> ++
> +The form '--filter=blob:none' omits all blobs.
If we choose to use `--objects-filter`, we need to use it here as well.
And same for the following commits.
--
Toon
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 5/9] builtin/cat-file: support "object:type=" objects filter
2025-02-21 7:47 ` [PATCH 5/9] builtin/cat-file: support "object:type=" " Patrick Steinhardt
@ 2025-02-26 15:23 ` Toon Claes
2025-02-28 10:51 ` Patrick Steinhardt
0 siblings, 1 reply; 72+ messages in thread
From: Toon Claes @ 2025-02-26 15:23 UTC (permalink / raw)
To: Patrick Steinhardt, git
Patrick Steinhardt <ps@pks.im> writes:
> Implement support for the "object:type=" filter in git-cat-file(1),
> which causes us to omit all objects that don't match the provided object
> type.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Documentation/git-cat-file.adoc | 3 +++
> builtin/cat-file.c | 8 +++++++-
> t/t1006-cat-file.sh | 6 +++++-
> 3 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
> index 8c474418b52..540d9dffdf9 100644
> --- a/Documentation/git-cat-file.adoc
> +++ b/Documentation/git-cat-file.adoc
> @@ -93,6 +93,9 @@ The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
> bytes or units. n may be zero. The suffixes k, m, and g can be used
> to name units in KiB, MiB, or GiB. For example, 'blob:limit=1k'
> is the same as 'blob:limit=1024'.
> ++
> +The form '--filter=object:type=(tag|commit|tree|blob)' omits all objects
> +which are not of the requested type.
>
> --path=<path>::
> For use with `--textconv` or `--filters`, to allow specifying an object
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index f57bf65cb03..b374c2bb104 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -474,7 +474,8 @@ static void batch_object_write(const char *obj_name,
>
> if (use_mailmap ||
> opt->objects_filter.choice == LOFC_BLOB_NONE ||
> - opt->objects_filter.choice == LOFC_BLOB_LIMIT)
> + opt->objects_filter.choice == LOFC_BLOB_LIMIT ||
> + opt->objects_filter.choice == LOFC_OBJECT_TYPE)
> data->info.typep = &data->type;
> if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
> data->info.sizep = &data->size;
> @@ -505,6 +506,10 @@ static void batch_object_write(const char *obj_name,
> data->size >= opt->objects_filter.blob_limit_value)
> return;
> break;
> + case LOFC_OBJECT_TYPE:
> + if (data->type != opt->objects_filter.object_type)
> + return;
> + break;
> default:
> BUG("unsupported objects filter");
I see we don't support LOFC_COMBINE, so we won't be supporting repeating
the --filter= option, is this intentional? Should we support that too? I
feel it would make sense from the start, unless there are good reasons
not to?
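
To make the question concrete, here is a rough sketch of what combining
could mean: an object is kept only if every given filter keeps it. All
names here (the sketch_* types and functions) are hypothetical stand-ins
and not Git's internal API; the per-filter rules simply mirror the
switch statement in the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not Git's real enums or structs. */
enum sketch_type { SKETCH_COMMIT, SKETCH_TREE, SKETCH_BLOB, SKETCH_TAG };
enum sketch_choice { SKETCH_BLOB_NONE, SKETCH_BLOB_LIMIT, SKETCH_OBJECT_TYPE };

struct sketch_filter {
	enum sketch_choice choice;
	unsigned long blob_limit;     /* used by SKETCH_BLOB_LIMIT */
	enum sketch_type object_type; /* used by SKETCH_OBJECT_TYPE */
};

/* Returns 1 if a single filter keeps the object, 0 if it omits it. */
static int sketch_filter_keeps(const struct sketch_filter *f,
			       enum sketch_type type, unsigned long size)
{
	assert(f);
	switch (f->choice) {
	case SKETCH_BLOB_NONE:
		return type != SKETCH_BLOB;
	case SKETCH_BLOB_LIMIT:
		/* Blobs of size at least the limit are omitted. */
		return type != SKETCH_BLOB || size < f->blob_limit;
	case SKETCH_OBJECT_TYPE:
		return type == f->object_type;
	}
	return 0;
}

/* Combined semantics: keep the object only if all filters keep it. */
static int sketch_combined_keeps(const struct sketch_filter *filters,
				 size_t nr, enum sketch_type type,
				 unsigned long size)
{
	for (size_t i = 0; i < nr; i++)
		if (!sketch_filter_keeps(&filters[i], type, size))
			return 0;
	return 1;
}
```

With "object:type=blob" plus "blob:limit=1024", this would keep only
blobs smaller than 1024 bytes.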
--
Toon
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 2/9] builtin/cat-file: wire up an option to filter objects
2025-02-21 7:47 ` [PATCH 2/9] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
2025-02-26 15:20 ` Toon Claes
@ 2025-02-27 11:20 ` Karthik Nayak
1 sibling, 0 replies; 72+ messages in thread
From: Karthik Nayak @ 2025-02-27 11:20 UTC (permalink / raw)
To: Patrick Steinhardt, git
Patrick Steinhardt <ps@pks.im> writes:
> In batch mode, git-cat-file(1) enumerates all objects and prints them
> by iterating through both loose and packed objects. This works without
> considering their reachability at all, and consequently most options to
> filter objects as they exist in e.g. git-rev-list(1) are not applicable.
> In some situations it may still be useful though to filter objects based
> on properties that are inherent to them. This includes the object size
> as well as its type.
>
> Such a filter already exists in git-rev-list(1) with the `--filter=`
> command line option. While this option supports a couple of filters that
> are not applicable to our usecase, some of them are quite a neat fit.
>
> Wire up the filter as an option for git-cat-file(1). This allows us to
> reuse the same syntax as in git-rev-list(1) so that we don't have to
> reinvent the wheel. For now, we die when any of the filter options has
> been passed by the user, but they will be wired up in subsequent
> commits.
>
> Note that we don't use the same `--filter=` name fo the option as we use
s/fo/for/
> in git-rev-list(1). We already have `--filters`, and having both
> `--filter=` and `--filters` would be quite confusing. Instead, the new
> option is called `--objects-filter`.
>
> Further note that the filters that we are about to introduce don't
> significantly speed up the runtime of git-cat-file(1). While we can skip
> emitting a lot of objects in case they are uninteresting to us, the
> majority of time is spent reading the packfile, which is bottlenecked by
> I/O and not the processor. This will change though once we start to make
> use of bitmaps, which will allow us to skip reading the whole packfile.
>
[snip]
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 3/9] builtin/cat-file: support "blob:none" objects filter
2025-02-21 7:47 ` [PATCH 3/9] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
2025-02-26 15:22 ` Toon Claes
@ 2025-02-27 11:26 ` Karthik Nayak
1 sibling, 0 replies; 72+ messages in thread
From: Karthik Nayak @ 2025-02-27 11:26 UTC (permalink / raw)
To: Patrick Steinhardt, git
Patrick Steinhardt <ps@pks.im> writes:
> Implement support for the "blob:none" filter in git-cat-file(1), which
> causes us to omit all blobs.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Documentation/git-cat-file.adoc | 2 ++
> builtin/cat-file.c | 11 ++++++++++-
> t/t1006-cat-file.sh | 33 +++++++++++++++++++++++++++++++--
> 3 files changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
> index 7c1c888079a..c11952d9eca 100644
> --- a/Documentation/git-cat-file.adoc
> +++ b/Documentation/git-cat-file.adoc
> @@ -86,6 +86,8 @@ OPTIONS
> Omit objects from the list of printed objects. This can only be used in
> combination with one of the batched modes. The '<filter-spec>' may be
> one of the following:
> ++
> +The form '--filter=blob:none' omits all blobs.
>
Shouldn't this be '--objects-filter' ?
> --path=<path>::
> For use with `--textconv` or `--filters`, to allow specifying an object
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 723644fbba8..8e5572ba43e 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -472,7 +472,8 @@ static void batch_object_write(const char *obj_name,
> if (!data->skip_object_info) {
> int ret;
>
> - if (use_mailmap)
> + if (use_mailmap ||
> + opt->objects_filter.choice == LOFC_BLOB_NONE)
So since we support selective filters, we'd have to add this type only
for those filters. In other words, there is no generic way to do this.
> data->info.typep = &data->type;
>
> if (pack)
> @@ -492,6 +493,10 @@ static void batch_object_write(const char *obj_name,
> switch (opt->objects_filter.choice) {
> case LOFC_DISABLED:
> break;
> + case LOFC_BLOB_NONE:
> + if (data->type == OBJ_BLOB)
> + return;
> + break;
> default:
> BUG("unsupported objects filter");
> }
> @@ -1032,6 +1037,10 @@ int cmd_cat_file(int argc,
> switch (batch.objects_filter.choice) {
> case LOFC_DISABLED:
> break;
> + case LOFC_BLOB_NONE:
> + if (!batch.enabled)
> + usage(_("objects filter only supported in batch mode"));
> + break;
> default:
> usagef(_("objects filter not supported: '%s'"),
> list_object_filter_config_name(batch.objects_filter.choice));
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 48840a13561..97533225982 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -1354,7 +1354,22 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
> '
>
> test_expect_success 'setup for objects filter' '
> - git init repo
> + git init repo &&
> + (
> + # Seed the repository with three different sets of objects:
> + #
> + # - The first set is fully packed and has a bitmap.
> + # - The second set is packed, but has no bitmap.
> + # - The third set is loose.
> + #
> + # This ensures that we cover all these types as expected.
> + cd repo &&
> + test_commit first &&
> + git repack -Adb &&
> + test_commit second &&
> + git repack -d &&
> + test_commit third
> + )
> '
>
> test_expect_success 'objects filter with unknown option' '
> @@ -1365,7 +1380,7 @@ test_expect_success 'objects filter with unknown option' '
> test_cmp expect err
> '
>
> -for option in blob:none blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
> +for option in blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
> do
> test_expect_success "objects filter with unsupported option $option" '
> case "$option" in
> @@ -1385,4 +1400,18 @@ do
> '
> done
>
> +test_objects_filter () {
> + filter="$1"
> +
> + test_expect_success "objects filter: $filter" '
> + git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --objects-filter="$filter" >actual &&
> + sort actual >actual.sorted &&
> + git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >expect &&
> + sort expect >expect.sorted &&
> + test_cmp expect.sorted actual.sorted
> + '
> +}
> +
> +test_objects_filter "blob:none"
> +
Nice, this builds up for the upcoming commits too.
> test_done
>
> --
> 2.48.1.683.gf705b3209c.dirty
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type
2025-02-21 7:47 ` [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
@ 2025-02-27 11:38 ` Karthik Nayak
2025-02-27 23:48 ` Taylor Blau
1 sibling, 0 replies; 72+ messages in thread
From: Karthik Nayak @ 2025-02-27 11:38 UTC (permalink / raw)
To: Patrick Steinhardt, git
Patrick Steinhardt <ps@pks.im> writes:
> While it is now possible to filter objects by type, this mechanism is
> for now mostly a convenience. Most importantly, we still have to iterate
> through the whole packfile to find all objects of a specific type. This
> can be prohibitively expensive depending on the size of the packfiles.
>
> It isn't really possible to do better than this when only considering a
> packfile itself, as the order of objects is not fixed. But when we have
> a packfile with a corresponding bitmap, either because the packfile
> itself has one or because the multi-pack index has a bitmap for it, then
> we can use these bitmaps to improve the runtime.
>
> While bitmaps are typically used to compute reachability of objects,
> they also contain one bitmap per object type encodes which object has
perhaps s/type encodes/type that encodes/
> what type. So instead of reading through the whole packfile(s), we can
> use the bitmaps and iterate through the type-specific bitmap. Typically,
> only a subset of packfiles will have a bitmap. But this isn't really
> much of a problem: we can use bitmaps when available, and then use the
> non-bitmap walk for every packfile that isn't covered by one.
>
> Overall, this leads to quite a significant speedup depending on how many
> objects of a certain type exist. The following benchmarks have been
> executed in the Chromium repository, which has a 50GB packfile with
> almost 25 million objects:
>
> Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
> Time (mean ± σ): 82.806 s ± 6.363 s [User: 30.956 s, System: 8.264 s]
> Range (min … max): 73.936 s … 89.690 s 10 runs
>
> Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
> Time (mean ± σ): 20.8 ms ± 1.3 ms [User: 6.1 ms, System: 14.5 ms]
> Range (min … max): 18.2 ms … 23.6 ms 127 runs
>
> Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
> Time (mean ± σ): 1.551 s ± 0.008 s [User: 1.401 s, System: 0.147 s]
> Range (min … max): 1.541 s … 1.566 s 10 runs
>
> Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
> Time (mean ± σ): 11.169 s ± 0.046 s [User: 10.076 s, System: 1.063 s]
> Range (min … max): 11.114 s … 11.245 s 10 runs
>
> Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
> Time (mean ± σ): 67.342 s ± 3.368 s [User: 20.318 s, System: 7.787 s]
> Range (min … max): 62.836 s … 73.618 s 10 runs
>
> Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
> Time (mean ± σ): 13.032 s ± 0.072 s [User: 11.638 s, System: 1.368 s]
> Range (min … max): 12.960 s … 13.199 s 10 runs
>
> Summary
> git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
> 74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
> 538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
> 627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
> 3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
> 3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
>
> The first benchmark is mostly equivalent in runtime compared to all the
> others without the bitmap-optimization introduced in this commit. What
> is noticeable in the benchmarks is that we're I/O-bound, not CPU-bound,
> as can be seen from the user/system runtimes, which is often way lower
> than the overall benchmarked runtime.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> builtin/cat-file.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 50 insertions(+), 5 deletions(-)
>
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 25d5429e391..9021fd52f30 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -21,6 +21,7 @@
> #include "streaming.h"
> #include "oid-array.h"
> #include "packfile.h"
> +#include "pack-bitmap.h"
> #include "object-file.h"
> #include "object-name.h"
> #include "object-store-ll.h"
> @@ -805,7 +806,20 @@ static int batch_one_object_packed(const struct object_id *oid,
> payload->payload);
> }
>
> -static void batch_each_object(for_each_object_fn callback,
> +static int batch_one_object_bitmapped(const struct object_id *oid,
> + enum object_type type UNUSED,
> + int flags UNUSED,
> + uint32_t hash UNUSED,
> + struct packed_git *pack,
> + off_t offset,
> + void *_payload)
> +{
> + struct for_each_object_payload *payload = _payload;
> + return payload->callback(oid, pack, offset, payload->payload);
> +}
> +
> +static void batch_each_object(struct batch_options *opt,
> + for_each_object_fn callback,
> unsigned flags,
> void *_payload)
> {
> @@ -813,9 +827,40 @@ static void batch_each_object(for_each_object_fn callback,
> .callback = callback,
> .payload = _payload,
> };
> + struct bitmap_index *bitmap = prepare_bitmap_git(the_repository);
> +
> for_each_loose_object(batch_one_object_loose, &payload, 0);
> - for_each_packed_object(the_repository, batch_one_object_packed,
> - &payload, flags);
> +
> + if (bitmap &&
> + (opt->objects_filter.choice == LOFC_OBJECT_TYPE ||
> + opt->objects_filter.choice == LOFC_BLOB_NONE)) {
> + struct packed_git *pack;
> +
> + if (opt->objects_filter.choice == LOFC_OBJECT_TYPE) {
> + for_each_bitmapped_object(bitmap, opt->objects_filter.object_type,
> + batch_one_object_bitmapped, &payload);
> + } else {
Nit: while this can be derived from the if statement above, it would be
more readable if this was `else if (opt->objects_filter.choice ==
LOFC_BLOB_NONE)`
> + for_each_bitmapped_object(bitmap, OBJ_COMMIT,
> + batch_one_object_bitmapped, &payload);
> + for_each_bitmapped_object(bitmap, OBJ_TAG,
> + batch_one_object_bitmapped, &payload);
> + for_each_bitmapped_object(bitmap, OBJ_TREE,
> + batch_one_object_bitmapped, &payload);
> + }
> +
> + for (pack = get_all_packs(the_repository); pack; pack = pack->next) {
> + if (bitmap_index_contains_pack(bitmap, pack) ||
> + open_pack_index(pack))
> + continue;
> + for_each_object_in_pack(pack, batch_one_object_packed,
> + &payload, flags);
> + }
> + } else {
> + for_each_packed_object(the_repository, batch_one_object_packed,
> + &payload, flags);
> + }
> +
> + free_bitmap_index(bitmap);
> }
>
> static int batch_objects(struct batch_options *opt)
> @@ -872,14 +917,14 @@ static int batch_objects(struct batch_options *opt)
>
> cb.seen = &seen;
>
> - batch_each_object(batch_unordered_object,
> + batch_each_object(opt, batch_unordered_object,
> FOR_EACH_OBJECT_PACK_ORDER, &cb);
>
> oidset_clear(&seen);
> } else {
> struct oid_array sa = OID_ARRAY_INIT;
>
> - batch_each_object(collect_object, 0, &sa);
> + batch_each_object(opt, collect_object, 0, &sa);
> oid_array_for_each_unique(&sa, batch_object_cb, &cb);
>
> oid_array_clear(&sa);
>
> --
> 2.48.1.683.gf705b3209c.dirty
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
2025-02-24 18:05 ` Junio C Hamano
2025-02-25 6:59 ` Patrick Steinhardt
@ 2025-02-27 23:23 ` Taylor Blau
2025-02-27 23:32 ` Junio C Hamano
1 sibling, 1 reply; 72+ messages in thread
From: Taylor Blau @ 2025-02-27 23:23 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Patrick Steinhardt, git
On Mon, Feb 24, 2025 at 10:05:27AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > Expose a function that allows the caller to iterate over all bitmapped
> > objects of a specific type. This mechanism allows us to use the object
> > type-specific bitmaps to enumerate all objects of that type without
> > having to scan through a complete packfile.
> >
> > This functionality will be used in a subsequent commit.
> >
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> > builtin/pack-objects.c | 3 ++-
> > builtin/rev-list.c | 3 ++-
> > pack-bitmap.c | 65 +++++++++++++++++++++++++++++++-------------------
> > pack-bitmap.h | 12 +++++++++-
> > reachable.c | 3 ++-
> > 5 files changed, 57 insertions(+), 29 deletions(-)
>
> After 2189649b (pack-bitmap.c: keep track of each layer's type
> bitmaps, 2024-11-19) added <type>_all bitmaps to the bitmap_index
> struct, this step would need some adjustment, I am afraid.
>
> Taylor Cc'ed.
Thanks, I was going to respond with the same thing.
I was going to suggest leaving that function as-is to prevent future
breakage and/or a messy integration into 'seen'. But stepping back I am
not sure I understand the purpose of this commit in the first place.
It looks like the aim here is to introduce a function which executes a
callback for each object of some type in a bitmap. That's a thin wrapper
over the ewah_iterator, but it's not clear why we need a wrapper around
that function since it is internal to pack-bitmap.c. Likewise, this is a
performance critical area, so I am not sure I'm in favor of adding a
function pointer to a hot path which executes once per object for some
object type.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
2025-02-25 6:59 ` Patrick Steinhardt
2025-02-25 16:59 ` Junio C Hamano
@ 2025-02-27 23:26 ` Taylor Blau
2025-02-28 10:54 ` Patrick Steinhardt
1 sibling, 1 reply; 72+ messages in thread
From: Taylor Blau @ 2025-02-27 23:26 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Junio C Hamano, git, Jeff King
On Tue, Feb 25, 2025 at 07:59:14AM +0100, Patrick Steinhardt wrote:
> On Mon, Feb 24, 2025 at 10:05:27AM -0800, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> >
> > > Expose a function that allows the caller to iterate over all bitmapped
> > > objects of a specific type. This mechanism allows us to use the object
> > > type-specific bitmaps to enumerate all objects of that type without
> > > having to scan through a complete packfile.
> > >
> > > This functionality will be used in a subsequent commit.
> > >
> > > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > > ---
> > > builtin/pack-objects.c | 3 ++-
> > > builtin/rev-list.c | 3 ++-
> > > pack-bitmap.c | 65 +++++++++++++++++++++++++++++++-------------------
> > > pack-bitmap.h | 12 +++++++++-
> > > reachable.c | 3 ++-
> > > 5 files changed, 57 insertions(+), 29 deletions(-)
> >
> > After 2189649b (pack-bitmap.c: keep track of each layer's type
> > bitmaps, 2024-11-19) added <type>_all bitmaps to the bitmap_index
> > struct, this step would need some adjustment, I am afraid.
>
> Hm, does it? I understand that this commit only makes the bitmaps
> accessible individually per bitmapped packfile, but the bitmap indices
> part of `struct bitmap_index` would continue to be the union of all of
> those bitmaps. Oh, but that changes in the subsequent commits indeed,
> where we start to use an `ewah_or_iterator`.
That's right; the ewah_or_iterator is the mechanism by which we can
combine multiple "layers" of the bitmaps into a single iterator.
(As an aside, that was not the first approach I pursued. Initially the
caller was supposed to chase the 'next' pointer of each bitmap and
enumerate through whatever type iterator they're interested in at each
layer. But that was too error-prone, since you have to remember and
update the offset into the pseudo-pack order across multiple layers.)
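
A simplified picture of that idea, for readers who have not looked at
the series: the sketch below ORs together the words of several layers
at the same position, so the caller sees one combined stream. It works
on plain uncompressed 64-bit words rather than EWAH-compressed data,
and every name in it (sketch_layer, sketch_or_iterator, ...) is a
hypothetical stand-in rather than the real ewah_or_iterator API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bitmap layer, modelled as an array of uncompressed words. */
struct sketch_layer {
	const uint64_t *words;
	size_t nr;
};

/* Walks all layers in lockstep, one word position at a time. */
struct sketch_or_iterator {
	const struct sketch_layer *layers;
	size_t nr_layers;
	size_t pos;
};

/*
 * Stores the OR of every layer's word at the current position in *out
 * and advances. Returns 1 on success, or 0 once all layers are
 * exhausted. Layers that are shorter than others contribute nothing
 * past their end.
 */
static int sketch_or_iterator_next(struct sketch_or_iterator *it,
				   uint64_t *out)
{
	uint64_t word = 0;
	int any = 0;

	assert(it && out);
	for (size_t i = 0; i < it->nr_layers; i++) {
		if (it->pos < it->layers[i].nr) {
			word |= it->layers[i].words[it->pos];
			any = 1;
		}
	}
	if (!any)
		return 0;
	it->pos++;
	*out = word;
	return 1;
}
```

The point of the design is that the caller keeps a single position
instead of tracking one offset per layer.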
> I see that Taylor's series has been sitting in an unreviewed state for a
> couple months already. I can review it with the hope of moving it
> forward and can then pull it in as a dependency of this series. But I'll
> wait for him to chime in first to see whether anything changed about its
> current state.
It would be great to get some review from you on that series. I know
that it has been on Peff's (CC'd) radar for a while, but that he has
likewise had a few off-list things to deal with lately as well.
I am still not sold on introducing a callback here, though, and would
much rather see callers interact with the iterator directly.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
2025-02-27 23:23 ` Taylor Blau
@ 2025-02-27 23:32 ` Junio C Hamano
2025-02-27 23:39 ` Taylor Blau
0 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-02-27 23:32 UTC (permalink / raw)
To: Taylor Blau; +Cc: Patrick Steinhardt, git
Taylor Blau <me@ttaylorr.com> writes:
> It looks like the aim here is to introduce a function which executes a
> callback for each object of some type in a bitmap. That's a thin wrapper
> over the ewah_iterator, but it's not clear why we need a wrapper around
> that function since it is internal to pack-bitmap.c. Likewise, this is a
> performance critical area, so I am not sure I'm in favor of adding a
> function pointer to a hot path which executes once per object for some
> object type.
It internally introduced ewah_for_type(), giving the "struct
bitmap_index" object an abstraction that callers can ask for the
bitmap for any type the caller wants. Before the <type>_all bitmaps
were introduced, there was one ewah-bitmap per type, so it made
sense for a caller to ask "Now, for this bitmap_index, give me the
ewah-bitmap for commits", but with "commits_all" added to the
bitmap_index object, it is no longer clear to me what the answer to
that question should be.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 7/9] pack-bitmap: introduce function to check whether a pack is bitmapped
2025-02-21 7:47 ` [PATCH 7/9] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
@ 2025-02-27 23:33 ` Taylor Blau
0 siblings, 0 replies; 72+ messages in thread
From: Taylor Blau @ 2025-02-27 23:33 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git
On Fri, Feb 21, 2025 at 08:47:32AM +0100, Patrick Steinhardt wrote:
> Introduce a function that allows us to verify whether a pack is
> bitmapped or not. This functionality will be used in a subsequent
> commit.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> pack-bitmap.c | 15 +++++++++++++++
> pack-bitmap.h | 7 +++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index fc92e0aae65..3cbe5bfe909 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -658,6 +658,21 @@ struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
> return NULL;
> }
>
> +int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack)
> +{
> + if (bitmap->pack)
> + return bitmap->pack == pack;
The bitmap_is_midx() function should be useful here. I don't think what
you wrote is wrong per-se, but that function is supposed to "hide" what
exactly constitutes a pack versus multi-pack bitmap.
> + if (!bitmap->midx->chunk_bitmapped_packs)
> + return 0;
What is the purpose of this check? The BTMP chunk was a relatively
recent addition, but it came long after multi-pack bitmaps were first
introduced. The BTMP chunk is necessary for multi-pack reuse, since it
indicates what sections of the bitmap's object order correspond to what
packs.
With or without a BTMP chunk in the MIDX, a multi-pack bitmap is assumed
to cover all of the packs in that MIDX. So I think the above check is at
best not helpful, and at worst will return incorrect results for
pre-BTMP MIDXs.
> + for (size_t i = 0; i < bitmap->midx->num_packs; i++)
> + if (bitmap->midx->packs[i] == pack)
> + return 1;
This part looks good to me. If you end up pulling the incremental
MIDX bitmaps series in as a dependency of this one, this will have to be
rewritten something like:
for (; bitmap; bitmap = bitmap->base) {
if (bitmap_is_midx(bitmap)) {
for (size_t i = 0; i < bitmap->midx->num_packs; i++) {
if (bitmap->midx->packs[i] == pack)
return 1;
}
} else if (bitmap->pack == pack) {
return 1;
}
}
return 0;
Without pulling in that series as a dependency of this one, I think the
function would just contain the body of the above 'for' loop, but not
the loop itself.
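
Spelled out as a standalone sketch, the non-incremental variant could
look roughly like the following. The struct definitions are minimal
hypothetical stand-ins that carry only the fields used here, and
sketch_bitmap_is_midx() merely imitates the intent of bitmap_is_midx();
none of this is the real pack-bitmap.c code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Git's internal types. */
struct packed_git {
	int id;
};

struct multi_pack_index {
	struct packed_git **packs;
	size_t num_packs;
};

struct bitmap_index {
	struct packed_git *pack;       /* set for a single-pack bitmap */
	struct multi_pack_index *midx; /* set for a multi-pack bitmap */
};

/* Hides what distinguishes a pack bitmap from a multi-pack bitmap. */
static int sketch_bitmap_is_midx(const struct bitmap_index *bitmap)
{
	return bitmap->midx != NULL;
}

/* Returns 1 if the bitmap covers the given pack, 0 otherwise. */
static int sketch_bitmap_contains_pack(const struct bitmap_index *bitmap,
				       const struct packed_git *pack)
{
	assert(bitmap && pack);
	if (sketch_bitmap_is_midx(bitmap)) {
		/* A multi-pack bitmap covers every pack in its MIDX. */
		for (size_t i = 0; i < bitmap->midx->num_packs; i++)
			if (bitmap->midx->packs[i] == pack)
				return 1;
		return 0;
	}
	return bitmap->pack == pack;
}
```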
Thanks,
Taylor
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
2025-02-27 23:32 ` Junio C Hamano
@ 2025-02-27 23:39 ` Taylor Blau
0 siblings, 0 replies; 72+ messages in thread
From: Taylor Blau @ 2025-02-27 23:39 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Patrick Steinhardt, git
On Thu, Feb 27, 2025 at 03:32:32PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > It looks like the aim here is to introduce a function which executes a
> > callback for each object of some type in a bitmap. That's a thin wrapper
> > over the ewah_iterator, but it's not clear why we need a wrapper around
> > that function since it is internal to pack-bitmap.c. Likewise, this is a
> > performance critical area, so I am not sure I'm in favor of adding a
> > function pointer to a hot path which executes once per object for some
> > object type.
>
> It internally introduced ewah_for_type(), giving the "struct
> bitmap_index" object an abstraction that callers can ask for the
> bitmap for any type the caller wants. Before the <type>_all bitmaps
> were introduced, there was one ewah-bitmap per type, so it made
> sense for a caller to ask "Now, for this bitmap_index, give me the
> ewah-bitmap for commits", but with "commits_all" added to the
> bitmap_index object, it is no longer clear to me what the answer to
> that question should be.
I think these are orthogonal. (FWIW, I think the correct answer would be
"commits_all" in that world, but that is definitely out of scope for
Patrick's immediate concern). In any event, I see that later on in the
series it is important for callers to enumerate bitmapped objects of a
certain type. So having a callback to do that makes sense.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type
2025-02-21 7:47 ` [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
2025-02-27 11:38 ` Karthik Nayak
@ 2025-02-27 23:48 ` Taylor Blau
1 sibling, 0 replies; 72+ messages in thread
From: Taylor Blau @ 2025-02-27 23:48 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git
On Fri, Feb 21, 2025 at 08:47:34AM +0100, Patrick Steinhardt wrote:
> @@ -813,9 +827,40 @@ static void batch_each_object(for_each_object_fn callback,
> .callback = callback,
> .payload = _payload,
> };
> + struct bitmap_index *bitmap = prepare_bitmap_git(the_repository);
> +
> for_each_loose_object(batch_one_object_loose, &payload, 0);
> - for_each_packed_object(the_repository, batch_one_object_packed,
> - &payload, flags);
> +
> + if (bitmap &&
> + (opt->objects_filter.choice == LOFC_OBJECT_TYPE ||
> + opt->objects_filter.choice == LOFC_BLOB_NONE)) {
Makes sense. I think there is one more case here that we could handle,
which is
opt->objects_filter.choice == LOFC_TREE_DEPTH && opt->objects_filter.depth == 0
where we'd just want to show commits.
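
As a rough illustration, the eligibility check could be pulled into a
small helper like the one sketched here. The enum, the struct and the
`depth` field are hypothetical stand-ins modelled on the condition
above, not the actual list-objects-filter definitions, and the helper
only answers whether the type bitmaps can be used at all.

```c
#include <assert.h>

/* Hypothetical stand-ins for the filter choice and options. */
enum sketch_filter_choice {
	SKETCH_LOFC_DISABLED,
	SKETCH_LOFC_BLOB_NONE,
	SKETCH_LOFC_BLOB_LIMIT,
	SKETCH_LOFC_TREE_DEPTH,
	SKETCH_LOFC_OBJECT_TYPE
};

struct sketch_objects_filter {
	enum sketch_filter_choice choice;
	unsigned long depth; /* only meaningful for SKETCH_LOFC_TREE_DEPTH */
};

/*
 * Returns 1 if the filter can be answered purely from the per-type
 * bitmaps, 0 if a full walk over the packs is still needed.
 */
static int sketch_filter_can_use_type_bitmaps(const struct sketch_objects_filter *filter)
{
	assert(filter);
	switch (filter->choice) {
	case SKETCH_LOFC_OBJECT_TYPE:
	case SKETCH_LOFC_BLOB_NONE:
		return 1;
	case SKETCH_LOFC_TREE_DEPTH:
		/* Only a depth of zero maps onto whole object types. */
		return filter->depth == 0;
	default:
		return 0;
	}
}
```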
I am scratching my head over whether there is a convenient way to unify this
logic with pack-bitmap.c::filter_bitmap(). I think there is, but there
are a couple of wrinkles:
- filter_bitmap() is really designed to work with a whole 'struct
bitmap', and doesn't know how to deal with an ewah_iterator.
- traverse_bitmap_commit_list() is designed to provide a way for
callers to iterate over the set of objects reachable for some
rev-list query.
There we *do* have good facilities for iterating over an
ewah_iterator, which is what you'd want. But that function really
wants to have performed a bitmap walk first (see the
"assert(bitmap_git->result)" call at the beginning of that
function).
The new pieces of batch_each_object() introduced in this patch are
tantalizingly close to much of the existing logic in pack-bitmap.c. I
think there is a way to unify them by introducing a way to traverse over
the bitmap as a whole as if bitmap_git->result were the all-1s bitmap.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 2/9] builtin/cat-file: wire up an option to filter objects
2025-02-26 15:20 ` Toon Claes
@ 2025-02-28 10:51 ` Patrick Steinhardt
2025-02-28 17:44 ` Junio C Hamano
0 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-02-28 10:51 UTC (permalink / raw)
To: Toon Claes; +Cc: git
On Wed, Feb 26, 2025 at 04:20:55PM +0100, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > In batch mode, git-cat-file(1) enumerates all objects and prints them
> > by iterating through both loose and packed objects. This works without
> > considering their reachability at all, and consequently most options to
> > filter objects as they exist in e.g. git-rev-list(1) are not applicable.
> > In some situations it may still be useful though to filter objects based
> > on properties that are inherent to them. This includes the object size
> > as well as its type.
> >
> > Such a filter already exists in git-rev-list(1) with the `--filter=`
> > command line option. While this option supports a couple of filters that
> > are not applicable to our usecase, some of them are quite a neat fit.
> >
> > Wire up the filter as an option for git-cat-file(1). This allows us to
> > reuse the same syntax as in git-rev-list(1) so that we don't have to
> > reinvent the wheel. For now, we die when any of the filter options has
> > been passed by the user, but they will be wired up in subsequent
> > commits.
> >
> > Note that we don't use the same `--filter=` name for the option as we use
> > in git-rev-list(1). We already have `--filters`, and having both
> > `--filter=` and `--filters` would be quite confusing. Instead, the new
> > option is called `--objects-filter`.
>
> I'm not sure I agree. I would rather have consistency in various
> commands. Since `--filters` doesn't accept an argument, I would say
> having both `--filters` and `--filter=` is fine. I see in various places
> we already use `OPT_PARSE_LIST_OBJECTS_FILTER` which defines the option
> as `--filter=`, so it's pretty standard for several commands. I'd
> prefer git-cat-file(1) to follow that as well. But that's my 2 cents.
I'll wait for a third party to chime in as a tie breaker here :) I'm not
feeling overly strong about it, but still think that it's just too easy
to get wrong when those options are so extremely similarly named.
Patrick
* Re: [PATCH 5/9] builtin/cat-file: support "object:type=" objects filter
2025-02-26 15:23 ` Toon Claes
@ 2025-02-28 10:51 ` Patrick Steinhardt
0 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-02-28 10:51 UTC (permalink / raw)
To: Toon Claes; +Cc: git
On Wed, Feb 26, 2025 at 04:23:12PM +0100, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> > index f57bf65cb03..b374c2bb104 100644
> > --- a/builtin/cat-file.c
> > +++ b/builtin/cat-file.c
> > @@ -474,7 +474,8 @@ static void batch_object_write(const char *obj_name,
> >
> > if (use_mailmap ||
> > opt->objects_filter.choice == LOFC_BLOB_NONE ||
> > - opt->objects_filter.choice == LOFC_BLOB_LIMIT)
> > + opt->objects_filter.choice == LOFC_BLOB_LIMIT ||
> > + opt->objects_filter.choice == LOFC_OBJECT_TYPE)
> > data->info.typep = &data->type;
> > if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
> > data->info.sizep = &data->size;
> > @@ -505,6 +506,10 @@ static void batch_object_write(const char *obj_name,
> > data->size >= opt->objects_filter.blob_limit_value)
> > return;
> > break;
> > + case LOFC_OBJECT_TYPE:
> > + if (data->type != opt->objects_filter.object_type)
> > + return;
> > + break;
> > default:
> > BUG("unsupported objects filter");
>
> I see we don't support LOFC_COMBINE, so we won't be supporting repeating
> the --filter= option, is this intentional? Should we support that too? I
> feel it would make sense from the start, unless there are good reasons
> not to?
I think the usefulness of LOFC_COMBINE is quite restricted in our case
because we only support a subset of filters in the first place. There is
only a single combination that does make sense: `blob:limit` plus
`object:type=blob`. All the other combinations are useless as they only
filter based on the object type, and thus they would either be redundant
or yield the empty set.
So given that this isn't that useful and given that it does add quite a
bit of complexity to support, I decided not to support it for now, also
because I don't have any usecase for it.
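To spell out the reasoning, here is a small sketch of how the three
supported filters behave and what a combination would mean. This is not
the code from the series: the `toy_*` types and functions are invented
for illustration, and a combined filter is simply assumed to keep an
object only if every sub-filter keeps it. The per-filter rules mirror
the hunks quoted above (blobs are dropped when their size is at least
the limit, objects are dropped when their type does not match):

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };
enum toy_choice { TOY_BLOB_NONE, TOY_BLOB_LIMIT, TOY_OBJECT_TYPE };

struct toy_filter {
	enum toy_choice choice;
	unsigned long blob_limit;  /* only used by TOY_BLOB_LIMIT */
	enum toy_type object_type; /* only used by TOY_OBJECT_TYPE */
};

/* Returns true if a single filter keeps the object. */
static bool toy_filter_keeps(const struct toy_filter *f,
			     enum toy_type type, unsigned long size)
{
	switch (f->choice) {
	case TOY_BLOB_NONE:
		return type != TOY_BLOB;
	case TOY_BLOB_LIMIT:
		return type != TOY_BLOB || size < f->blob_limit;
	case TOY_OBJECT_TYPE:
		return type == f->object_type;
	}
	return false;
}

/* Assumed semantics of a combination: every sub-filter must keep it. */
static bool toy_combined_keeps(const struct toy_filter *filters, size_t nr,
			       enum toy_type type, unsigned long size)
{
	for (size_t i = 0; i < nr; i++)
		if (!toy_filter_keeps(&filters[i], type, size))
			return false;
	return true;
}
```

Under these assumptions, `blob:limit=1024` plus `object:type=blob` keeps
exactly the blobs smaller than 1024 bytes, which neither filter can
express alone. Pairing `blob:none` with `object:type=blob`, or two
different `object:type=` filters, keeps nothing at all.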
Patrick
* Re: [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects
2025-02-27 23:26 ` Taylor Blau
@ 2025-02-28 10:54 ` Patrick Steinhardt
0 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-02-28 10:54 UTC (permalink / raw)
To: Taylor Blau; +Cc: Junio C Hamano, git, Jeff King
On Thu, Feb 27, 2025 at 06:26:31PM -0500, Taylor Blau wrote:
> On Tue, Feb 25, 2025 at 07:59:14AM +0100, Patrick Steinhardt wrote:
> > I see that Taylor's series has been sitting in an unreviewed state for a
> > couple months already. I can review it with the hope of moving it
> > forward and can then pull it in as a dependency of this series. But I'll
> > wait for him to chime in first to see whether anything changed about its
> > current state.
>
> It would be great to get some review from you on that series. I know
> that it has been on Peff's (CC'd) radar for a while, but that he has
> likewise had a few off-list things to deal with lately as well.
I've done a first review today. I'll delay my patch series a bit until
your series looks like it is close to landing.
Thanks!
Patrick
* Re: [PATCH 2/9] builtin/cat-file: wire up an option to filter objects
2025-02-28 10:51 ` Patrick Steinhardt
@ 2025-02-28 17:44 ` Junio C Hamano
2025-03-03 10:40 ` Patrick Steinhardt
0 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-02-28 17:44 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Toon Claes, git
Patrick Steinhardt <ps@pks.im> writes:
>> > Note that we don't use the same `--filter=` name for the option as we use
>> > in git-rev-list(1). We already have `--filters`, and having both
>> > `--filter=` and `--filters` would be quite confusing. Instead, the new
>> > option is called `--objects-filter`.
>>
>> I'm not sure I agree. I would rather have consistency in various
>> commands. Since `--filters` doesn't accept an argument, I would say
>> having both `--filters` and `--filter=` is fine. I see in various places
>> we already use `OPT_PARSE_LIST_OBJECTS_FILTER` which defines the option
>> as `--filter=`, so it's pretty standard for several commands. I'd
>> prefer git-cat-file(1) to follow that as well. But that's my 2 cents.
>
> I'll wait for a third party to chime in as a tie breaker here :) I'm not
> feeling overly strong about it, but still think that it's just too easy
> to get wrong when those options are so extremely similarly named.
$ git grep '^[^a-z]*--filter' Documentation/
Documentation/config/gc.adoc: `--filter=<filter-spec>` option of linkgit:git-repack[1].
Documentation/git-cat-file.adoc:--filters::
Documentation/git-cat-file.adoc: `--filters`.
Documentation/git-clone.adoc: [--filter=<filter-spec>] [--also-filter-submodules]] [--] <repository>
Documentation/git-clone.adoc:`--filter=<filter-spec>`::
Documentation/git-clone.adoc: `--filter=blob:limit=<size>` will filter out all blobs of size
Documentation/git-pack-objects.adoc:--filter=<filter-spec>::
Documentation/git-repack.adoc:--filter=<filter-spec>::
Documentation/git-repack.adoc:--filter-to=<dir>::
Documentation/rev-list-options.adoc:--filter=<filter-spec>::
Documentation/rev-list-options.adoc:--filter-provided-objects::
Documentation/rev-list-options.adoc:--filter-print-omitted::
The above does make it look as if whoever called their invention
"--filters" when they added it to "cat-file" wasn't paying attention
to make things consistent, but that is not the case. The word
"filter" in the context of existing feature set of "cat-file" has
ALWAYS referred to the act of applying the "smudge" filter chain to
externalize an internal "clean" blob object contents for the working
tree representation. We should thank that somebody for not using
and squatting on a shorter and sweeter "--filter" ;-)
"cat-file" should call the feature "--filter=<filter-spec>" like
everybody else does, or the feature should not be added to
"cat-file" at all. Unless we are willing to rename "--filter="
options for _all_ existing commands to "--object-filter=", that is.
In retrospect, such a longer and more explicit name may have been
nicer. But given that all users of the "--filter=<filter-spec>" are
about object transfer, it is understandable that we didn't invoke
deliberate redundancy when naming the option. Historically,
"cat-file" has always been "give me the contents of the object I
name", and never about "I may ask about many objects but do not
answer requests for objects chosen by these criteria", so it also is
understandable that we didn't redundantly say "--contents-filter",
too.
Am I third-party enough?
* Re: [PATCH 2/9] builtin/cat-file: wire up an option to filter objects
2025-02-28 17:44 ` Junio C Hamano
@ 2025-03-03 10:40 ` Patrick Steinhardt
0 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-03 10:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Toon Claes, git
On Fri, Feb 28, 2025 at 09:44:24AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> >> > Note that we don't use the same `--filter=` name for the option as we use
> >> > in git-rev-list(1). We already have `--filters`, and having both
> >> > `--filter=` and `--filters` would be quite confusing. Instead, the new
> >> > option is called `--objects-filter`.
> >>
> >> I'm not sure I agree. I would rather have consistency in various
> >> commands. Since `--filters` doesn't accept an argument, I would say
> >> having both `--filters` and `--filter=` is fine. I see in various places
> >> we already use `OPT_PARSE_LIST_OBJECTS_FILTER` which defines the option
> >> as `--filter=`, so it's pretty standard for several commands. I'd
> >> prefer git-cat-file(1) to follow that as well. But that's my 2 cents.
> >
> > I'll wait for a third party to chime in as a tie breaker here :) I'm not
> > feeling overly strong about it, but still think that it's just too easy
> > to get wrong when those options are so extremely similarly named.
>
> $ git grep '^[^a-z]*--filter' Documentation/
> Documentation/config/gc.adoc: `--filter=<filter-spec>` option of linkgit:git-repack[1].
> Documentation/git-cat-file.adoc:--filters::
> Documentation/git-cat-file.adoc: `--filters`.
> Documentation/git-clone.adoc: [--filter=<filter-spec>] [--also-filter-submodules]] [--] <repository>
> Documentation/git-clone.adoc:`--filter=<filter-spec>`::
> Documentation/git-clone.adoc: `--filter=blob:limit=<size>` will filter out all blobs of size
> Documentation/git-pack-objects.adoc:--filter=<filter-spec>::
> Documentation/git-repack.adoc:--filter=<filter-spec>::
> Documentation/git-repack.adoc:--filter-to=<dir>::
> Documentation/rev-list-options.adoc:--filter=<filter-spec>::
> Documentation/rev-list-options.adoc:--filter-provided-objects::
> Documentation/rev-list-options.adoc:--filter-print-omitted::
>
> The above does make it look as if whoever called their invention
> "--filters" when they added it to "cat-file" wasn't paying attention
> to make things consistent, but that is not the case. The word
> "filter" in the context of existing feature set of "cat-file" has
> ALWAYS referred to the act of applying the "smudge" filter chain to
> externalize an internal "clean" blob object contents for the working
> tree representation. We should thank that somebody for not using
> and squatting on a shorter and sweeter "--filter" ;-)
>
> "cat-file" should call the feature "--filter=<filter-spec>" like
> everybody else does, or the feature should not be added to
> "cat-file" at all. Unless we are willing to rename "--filter="
> options for _all_ existing commands to "--object-filter=", that is.
>
> In retrospect, such a longer and more explicit name may have been
> nicer. But given that all users of the "--filter=<filter-spec>" are
> about object transfer, it is understandable that we didn't invoke
> deliberate redundancy when naming the option. Historically,
> "cat-file" has always been "give me the contents of the object I
> name", and never about "I may ask about many objects but do not
> answer requests for objects chosen by these criteria", so it also is
> understandable that we didn't redundantly say "--contents-filter",
> too.
>
> Am I third-party enough?
Yup :) I'll adapt then, thanks for chiming in!
Patrick
* [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode
2025-02-21 7:47 [PATCH 0/9] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (8 preceding siblings ...)
2025-02-21 7:47 ` [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
@ 2025-03-27 9:43 ` Patrick Steinhardt
2025-03-27 9:43 ` [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
` (9 more replies)
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
10 siblings, 10 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:43 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Hi,
at GitLab, we sometimes have the need to list all objects regardless of
their reachability. We use git-cat-file(1) with `--batch-all-objects` to
do this, and typically this is quite a good fit. In some cases though,
we only want to list objects of a specific type, where we then basically
have the following pipeline:
    git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
        grep '^commit ' |
        cut -d' ' -f2 |
        git cat-file --batch
This works okayish in medium-sized repositories, but once you reach a
certain size this isn't really an option anymore. In the Chromium
repository for example [1] simply listing all objects in the first
invocation of git-cat-file(1) takes around 80 to 100 seconds. The
workload is completely I/O-bottlenecked: my machine reads at ~500MB/s,
and the packfile is 50GB in size, which matches the 100 seconds that I
observe.
This series addresses the issue by introducing object filters into
git-cat-file(1). These object filters use the exact same syntax as the
filters we have in git-rev-list(1), but only a subset of them is
supported because not all filters can be computed by git-cat-file(1).
Supported are "blob:none", "blob:limit=" as well as "object:type=".
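For readers unfamiliar with the filter syntax, the three supported specs
can be recognized roughly as in the following sketch. This is a
standalone toy parser, not the one git uses: `toy_spec` and
`toy_parse_spec` are hypothetical names, only the lowercase suffixes k,
m and g named in the documentation are handled, and overflow of the
limit is not checked.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

enum toy_choice { TOY_BLOB_NONE, TOY_BLOB_LIMIT, TOY_OBJECT_TYPE };

struct toy_spec {
	enum toy_choice choice;
	unsigned long long blob_limit; /* bytes, for TOY_BLOB_LIMIT */
	char object_type[8];           /* type name, for TOY_OBJECT_TYPE */
};

/* Returns true and fills 'out' if 'spec' is one of the supported forms. */
static bool toy_parse_spec(const char *spec, struct toy_spec *out)
{
	static const char *const types[] = { "tag", "commit", "tree", "blob" };

	if (!strcmp(spec, "blob:none")) {
		out->choice = TOY_BLOB_NONE;
		return true;
	}

	if (!strncmp(spec, "blob:limit=", 11)) {
		const char *num = spec + 11;
		char *end;
		unsigned long long n;

		if (*num < '0' || *num > '9')
			return false;
		n = strtoull(num, &end, 10);

		/* Optional unit suffix: KiB, MiB or GiB. */
		switch (*end) {
		case 'k': n <<= 10; end++; break;
		case 'm': n <<= 20; end++; break;
		case 'g': n <<= 30; end++; break;
		}
		if (*end)
			return false;

		out->choice = TOY_BLOB_LIMIT;
		out->blob_limit = n;
		return true;
	}

	if (!strncmp(spec, "object:type=", 12)) {
		for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
			if (!strcmp(spec + 12, types[i])) {
				out->choice = TOY_OBJECT_TYPE;
				strcpy(out->object_type, types[i]);
				return true;
			}
		}
	}

	/* Anything else, e.g. "tree:0", is rejected by this sketch. */
	return false;
}
```

In this sketch "blob:limit=1k" and "blob:limit=1024" parse to the same
limit, matching the documentation added by the series.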
The filters alone don't really help though: we still have to scan
through the whole packfile in order to find the matching objects. While we
are able to shed a bit of CPU time because we can stop emitting some of
the objects, we're still I/O-bottlenecked.
The second part of the series thus expands the filters so that they can
make use of bitmap indices for some of the filters, if available. This
allows us to efficiently answer the question where to find all objects
of a specific type, and thus we can avoid scanning through the packfile
and instead directly look up relevant objects, leading to a significant
speedup:
Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
Time (mean ± σ): 86.444 s ± 4.081 s [User: 36.830 s, System: 11.312 s]
Range (min … max): 80.305 s … 93.104 s 10 runs
Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
Time (mean ± σ): 2.089 s ± 0.015 s [User: 1.872 s, System: 0.207 s]
Range (min … max): 2.073 s … 2.119 s 10 runs
Summary
cat-file with filter=object:type=commit (revision = HEAD) ran
41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)
We now directly scale with the number of objects of a specific type
contained in the packfile instead of scaling with the overall number of
objects. It's quite fun to see how the math plays out: if you sum up the
times for each of the types you arrive at the time for the unfiltered
case.
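The arithmetic behind these statements can be checked with a few lines
of C. The helpers below are hypothetical and only restate numbers that
appear in this thread: the 50GB packfile read at ~500MB/s, the two mean
runtimes from the benchmark above, and the per-type mean runtimes from
the v1 commit message that is quoted in the range-diff further down.
Decimal units (1GB = 1e9 bytes) are assumed.

```c
#include <stddef.h>

/* Seconds needed to read 'bytes' sequentially at 'bytes_per_second'. */
static double toy_scan_seconds(double bytes, double bytes_per_second)
{
	return bytes / bytes_per_second;
}

/* How many times faster 'after' is compared to 'before'. */
static double toy_speedup(double before_seconds, double after_seconds)
{
	return before_seconds / after_seconds;
}

/* Sum of 'nr' runtimes in seconds. */
static double toy_sum(const double *seconds, size_t nr)
{
	double total = 0;

	for (size_t i = 0; i < nr; i++)
		total += seconds[i];
	return total;
}
```

Plugging in the figures: toy_scan_seconds(50e9, 500e6) gives the 100
seconds mentioned earlier, and toy_speedup(86.444, 2.089) comes out at
roughly 41.38. Summing the v1 means for tags, commits, trees and blobs
(0.0208, 1.551, 11.169 and 67.342 seconds) gives about 80.1 seconds,
which is within the reported deviation of the 82.806 seconds measured
for the unfiltered v1 run.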
Changes in v2:
- The series is now built on top of "master" at 683c54c999c (Git 2.49,
2025-03-14) with "tb/incremental-midx-part-2" at 27afc272c49 (midx:
implement writing incremental MIDX bitmaps, 2025-03-20) merged into
it.
- Rename the filter options to "--filter=" to match
git-pack-objects(1).
- The bitmap-filtering is now reusing existing mechanisms that we
already have in "pack-bitmap.c", as proposed by Taylor.
- Link to v1: https://lore.kernel.org/r/20250221-pks-cat-file-object-type-filter-v1-0-0852530888e2@pks.im
Thanks!
Patrick
[1]: https://github.com/chromium/chromium.git
---
Patrick Steinhardt (10):
builtin/cat-file: rename variable that tracks usage
builtin/cat-file: wire up an option to filter objects
builtin/cat-file: support "blob:none" objects filter
builtin/cat-file: support "blob:limit=" objects filter
builtin/cat-file: support "object:type=" objects filter
pack-bitmap: allow passing payloads to `show_reachable_fn()`
pack-bitmap: add function to iterate over filtered bitmapped objects
pack-bitmap: introduce function to check whether a pack is bitmapped
builtin/cat-file: deduplicate logic to iterate over all objects
builtin/cat-file: use bitmaps to efficiently filter by object type
Documentation/git-cat-file.adoc | 16 +++
builtin/cat-file.c | 212 +++++++++++++++++++++++++++++-----------
builtin/pack-objects.c | 3 +-
builtin/rev-list.c | 3 +-
pack-bitmap.c | 81 +++++++++++++--
pack-bitmap.h | 22 ++++-
reachable.c | 3 +-
t/t1006-cat-file.sh | 77 +++++++++++++++
8 files changed, 346 insertions(+), 71 deletions(-)
Range-diff versus v1:
1: 108b50d8a66 = 1: d16b84702fd builtin/cat-file: rename variable that tracks usage
2: 4a4a22ac465 ! 2: d3259ff034c builtin/cat-file: wire up an option to filter objects
@@ Commit message
been passed by the user, but they will be wired up in subsequent
commits.
- Note that we don't use the same `--filter=` name fo the option as we use
- in git-rev-list(1). We already have `--filters`, and having both
- `--filter=` and `--filters` would be quite confusing. Instead, the new
- option is called `--objects-filter`.
-
Further note that the filters that we are about to introduce don't
significantly speed up the runtime of git-cat-file(1). While we can skip
emitting a lot of objects in case they are uninteresting to us, the
@@ Documentation/git-cat-file.adoc: OPTIONS
end-of-line conversion, etc). In this case, `<object>` has to be of
the form `<tree-ish>:<path>`, or `:<path>`.
-+--objects-filter=<filter-spec>::
-+--no-objects-filter::
++--filter=<filter-spec>::
++--no-filter::
+ Omit objects from the list of printed objects. This can only be used in
+ combination with one of the batched modes. The '<filter-spec>' may be
+ one of the following:
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
N_("run filters on object's content"), 'w'),
OPT_STRING(0, "path", &force_path, N_("blob|tree"),
N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
-+ OPT_CALLBACK(0, "objects-filter", &batch.objects_filter, N_("args"),
++ OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
+ N_("object filtering"), opt_parse_list_objects_filter),
OPT_END()
};
@@ t/t1006-cat-file.sh: test_expect_success PERL '--batch-command info is unbuffere
+ cat >expect <<-EOF &&
+ fatal: invalid filter-spec ${SQ}unknown${SQ}
+ EOF
-+ test_must_fail git -C repo cat-file --objects-filter=unknown 2>err &&
++ test_must_fail git -C repo cat-file --filter=unknown 2>err &&
+ test_cmp expect err
+'
+
@@ t/t1006-cat-file.sh: test_expect_success PERL '--batch-command info is unbuffere
+ printf "usage: objects filter not supported: ${SQ}%s${SQ}\n" "$option_name" >expect
+ ;;
+ esac &&
-+ test_must_fail git -C repo cat-file --objects-filter=$option 2>err &&
++ test_must_fail git -C repo cat-file --filter=$option 2>err &&
+ test_cmp expect err
+ '
+done
3: baddbca6de6 ! 3: 02c7fc38986 builtin/cat-file: support "blob:none" objects filter
@@ t/t1006-cat-file.sh: do
+ filter="$1"
+
+ test_expect_success "objects filter: $filter" '
-+ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --objects-filter="$filter" >actual &&
++ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --filter="$filter" >actual &&
+ sort actual >actual.sorted &&
+ git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >expect &&
+ sort expect >expect.sorted &&
4: e55fa01810d ! 4: 33c5ea58fdc builtin/cat-file: support "blob:limit=" objects filter
@@ Documentation/git-cat-file.adoc: OPTIONS
The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
-+bytes or units. n may be zero. The suffixes k, m, and g can be used
-+to name units in KiB, MiB, or GiB. For example, 'blob:limit=1k'
-+is the same as 'blob:limit=1024'.
++bytes or units. n may be zero. The suffixes k, m, and g can be used to name
++units in KiB, MiB, or GiB. For example, 'blob:limit=1k' is the same as
++'blob:limit=1024'.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
5: 1882ac07c9b ! 5: 40e2f7f2f82 builtin/cat-file: support "object:type=" objects filter
@@ Commit message
## Documentation/git-cat-file.adoc ##
@@ Documentation/git-cat-file.adoc: The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
- bytes or units. n may be zero. The suffixes k, m, and g can be used
- to name units in KiB, MiB, or GiB. For example, 'blob:limit=1k'
- is the same as 'blob:limit=1024'.
+ bytes or units. n may be zero. The suffixes k, m, and g can be used to name
+ units in KiB, MiB, or GiB. For example, 'blob:limit=1k' is the same as
+ 'blob:limit=1024'.
++
-+The form '--filter=object:type=(tag|commit|tree|blob)' omits all objects
-+which are not of the requested type.
++The form '--filter=object:type=(tag|commit|tree|blob)' omits all objects which
++are not of the requested type.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
6: c8ac9481b39 ! 6: 5affa992909 pack-bitmap: expose function to iterate over bitmapped objects
@@ Metadata
Author: Patrick Steinhardt <ps@pks.im>
## Commit message ##
- pack-bitmap: expose function to iterate over bitmapped objects
+ pack-bitmap: allow passing payloads to `show_reachable_fn()`
- Expose a function that allows the caller to iterate over all bitmapped
- objects of a specific type. This mechanism allows us to use the object
- type-specific bitmaps to enumerate all objects of that type without
- having to scan through a complete packfile.
+ The `show_reachable_fn` callback is used by a couple of functions to
+ present reachable objects to the caller. The function does not provide a
+ way for the caller to pass a payload though, which is functionality that
+ we'll require in a subsequent commit.
- This functionality will be used in a subsequent commit.
+ Change the callback type to accept a payload and adapt all callsites
+ accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
@@ pack-bitmap.c: static void show_extended_objects(struct bitmap_index *bitmap_git
}
}
--static void init_type_iterator(struct ewah_iterator *it,
-- struct bitmap_index *bitmap_git,
-- enum object_type type)
-+static struct ewah_bitmap *ewah_for_type(struct bitmap_index *bitmap_git,
-+ enum object_type type)
- {
- switch (type) {
- case OBJ_COMMIT:
-- ewah_iterator_init(it, bitmap_git->commits);
-- break;
--
-+ return bitmap_git->commits;
- case OBJ_TREE:
-- ewah_iterator_init(it, bitmap_git->trees);
-- break;
--
-+ return bitmap_git->trees;
- case OBJ_BLOB:
-- ewah_iterator_init(it, bitmap_git->blobs);
-- break;
--
-+ return bitmap_git->blobs;
- case OBJ_TAG:
-- ewah_iterator_init(it, bitmap_git->tags);
-- break;
--
-+ return bitmap_git->tags;
- default:
- BUG("object type %d not stored by bitmap type index", type);
-- break;
- }
- }
-
--static void show_objects_for_type(
-- struct bitmap_index *bitmap_git,
-- enum object_type object_type,
+@@ pack-bitmap.c: static void init_type_iterator(struct ewah_or_iterator *it,
+ static void show_objects_for_type(
+ struct bitmap_index *bitmap_git,
+ enum object_type object_type,
- show_reachable_fn show_reach)
-+static void init_type_iterator(struct ewah_iterator *it,
-+ struct bitmap_index *bitmap_git,
-+ enum object_type type)
-+{
-+ ewah_iterator_init(it, ewah_for_type(bitmap_git, type));
-+}
-+
-+static void for_each_bitmapped_object_internal(struct bitmap_index *bitmap_git,
-+ struct bitmap *objects,
-+ enum object_type object_type,
-+ show_reachable_fn show_reach,
-+ void *payload)
++ show_reachable_fn show_reach,
++ void *payload)
{
size_t i = 0;
uint32_t offset;
--
- struct ewah_iterator it;
- eword_t filter;
-
-- struct bitmap *objects = bitmap_git->result;
--
- init_type_iterator(&it, bitmap_git, object_type);
-
- for (i = 0; i < objects->word_alloc &&
@@ pack-bitmap.c: static void show_objects_for_type(
if (bitmap_git->hashes)
hash = get_be32(bitmap_git->hashes + index_pos);
@@ pack-bitmap.c: static void show_objects_for_type(
+ show_reach(&oid, object_type, 0, hash, pack, ofs, payload);
}
}
- }
-+static void show_objects_for_type(
-+ struct bitmap_index *bitmap_git,
-+ enum object_type object_type,
-+ show_reachable_fn show_reach)
-+{
-+ for_each_bitmapped_object_internal(bitmap_git, bitmap_git->result,
-+ object_type, show_reach, NULL);
-+}
-+
-+void for_each_bitmapped_object(struct bitmap_index *bitmap_git,
-+ enum object_type object_type,
-+ show_reachable_fn show_reach,
-+ void *payload)
-+{
-+ struct bitmap *bitmap = ewah_to_bitmap(ewah_for_type(bitmap_git, object_type));
-+ for_each_bitmapped_object_internal(bitmap_git, bitmap,
-+ object_type, show_reach, payload);
-+ bitmap_free(bitmap);
-+}
-+
- static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
- struct object_list *roots)
+@@ pack-bitmap.c: void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
{
+ assert(bitmap_git->result);
+
+- show_objects_for_type(bitmap_git, OBJ_COMMIT, show_reachable);
++ show_objects_for_type(bitmap_git, OBJ_COMMIT, show_reachable, NULL);
+ if (revs->tree_objects)
+- show_objects_for_type(bitmap_git, OBJ_TREE, show_reachable);
++ show_objects_for_type(bitmap_git, OBJ_TREE, show_reachable, NULL);
+ if (revs->blob_objects)
+- show_objects_for_type(bitmap_git, OBJ_BLOB, show_reachable);
++ show_objects_for_type(bitmap_git, OBJ_BLOB, show_reachable, NULL);
+ if (revs->tag_objects)
+- show_objects_for_type(bitmap_git, OBJ_TAG, show_reachable);
++ show_objects_for_type(bitmap_git, OBJ_TAG, show_reachable, NULL);
+
+ show_extended_objects(bitmap_git, revs, show_reachable);
+ }
## pack-bitmap.h ##
@@ pack-bitmap.h: typedef int (*show_reachable_fn)(
@@ pack-bitmap.h: typedef int (*show_reachable_fn)(
struct bitmap_index;
-@@ pack-bitmap.h: int test_bitmap_pseudo_merges(struct repository *r);
- int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n);
- int test_bitmap_pseudo_merge_objects(struct repository *r, uint32_t n);
-
-+/*
-+ * Iterate through all bitmapped objects of the given type and execute the
-+ * `show_reach` for each of them.
-+ */
-+ void for_each_bitmapped_object(struct bitmap_index *bitmap_git,
-+ enum object_type object_type,
-+ show_reachable_fn show_reach,
-+ void *payload);
-+
- #define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \
- "GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL"
-
## reachable.c ##
@@ reachable.c: static int mark_object_seen(const struct object_id *oid,
-: ----------- > 7: 9de8dff849c pack-bitmap: add function to iterate over filtered bitmapped objects
7: 86a520477f5 ! 8: 67a3cdf5fb9 pack-bitmap: introduce function to check whether a pack is bitmapped
@@ Commit message
bitmapped or not. This functionality will be used in a subsequent
commit.
+ Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
## pack-bitmap.c ##
@@ pack-bitmap.c: struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_in
+int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack)
+{
-+ if (bitmap->pack)
-+ return bitmap->pack == pack;
-+
-+ if (!bitmap->midx->chunk_bitmapped_packs)
-+ return 0;
-+
-+ for (size_t i = 0; i < bitmap->midx->num_packs; i++)
-+ if (bitmap->midx->packs[i] == pack)
++ for (; bitmap; bitmap = bitmap->base) {
++ if (bitmap_is_midx(bitmap)) {
++ for (size_t i = 0; i < bitmap->midx->num_packs; i++)
++ if (bitmap->midx->packs[i] == pack)
++ return 1;
++ } else if (bitmap->pack == pack) {
+ return 1;
++ }
++ }
+
+ return 0;
+}
8: 56168d29a7c = 9: c4dc2fe1de2 builtin/cat-file: deduplicate logic to iterate over all objects
9: 36d88811991 ! 10: 61904db7d35 builtin/cat-file: use bitmaps to efficiently filter by object type
@@ Commit message
we can use these bitmaps to improve the runtime.
While bitmaps are typically used to compute reachability of objects,
- they also contain one bitmap per object type encodes which object has
- what type. So instead of reading through the whole packfile(s), we can
- use the bitmaps and iterate through the type-specific bitmap. Typically,
- only a subset of packfiles will have a bitmap. But this isn't really
- much of a problem: we can use bitmaps when available, and then use the
- non-bitmap walk for every packfile that isn't covered by one.
+ they also contain one bitmap per object type that encodes which object
+ has what type. So instead of reading through the whole packfile(s), we
+ can use the bitmaps and iterate through the type-specific bitmap.
+ Typically, only a subset of packfiles will have a bitmap. But this isn't
+ really much of a problem: we can use bitmaps when available, and then
+ use the non-bitmap walk for every packfile that isn't covered by one.
Overall, this leads to quite a significant speedup depending on how many
objects of a certain type exist. The following benchmarks have been
executed in the Chromium repository, which has a 50GB packfile with
- almost 25 million objects:
+ almost 25 million objects. As expected, there isn't really much of a
+ change in performance without an object filter:
- Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
- Time (mean ± σ): 82.806 s ± 6.363 s [User: 30.956 s, System: 8.264 s]
- Range (min … max): 73.936 s … 89.690 s 10 runs
+ Benchmark 1: cat-file with no-filter (revision = HEAD~)
+ Time (mean ± σ): 89.675 s ± 4.527 s [User: 40.807 s, System: 10.782 s]
+ Range (min … max): 83.052 s … 96.084 s 10 runs
- Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
- Time (mean ± σ): 20.8 ms ± 1.3 ms [User: 6.1 ms, System: 14.5 ms]
- Range (min … max): 18.2 ms … 23.6 ms 127 runs
+ Benchmark 2: cat-file with no-filter (revision = HEAD)
+ Time (mean ± σ): 88.991 s ± 2.488 s [User: 42.278 s, System: 10.305 s]
+ Range (min … max): 82.843 s … 91.271 s 10 runs
- Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
- Time (mean ± σ): 1.551 s ± 0.008 s [User: 1.401 s, System: 0.147 s]
- Range (min … max): 1.541 s … 1.566 s 10 runs
+ Summary
+ cat-file with no-filter (revision = HEAD) ran
+ 1.01 ± 0.06 times faster than cat-file with no-filter (revision = HEAD~)
+
+ We still have to scan through all objects as we yield all of them, so
+ using the bitmap in this case doesn't really buy us anything. What is
+ noticeable in this benchmark is that we're I/O-bound, not CPU-bound, as
+ can be seen from the user/system runtimes, which combined are way lower
+ than the overall benchmarked runtime.
- Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
- Time (mean ± σ): 11.169 s ± 0.046 s [User: 10.076 s, System: 1.063 s]
- Range (min … max): 11.114 s … 11.245 s 10 runs
+ But when we do use a filter we can see a significant improvement:
- Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
- Time (mean ± σ): 67.342 s ± 3.368 s [User: 20.318 s, System: 7.787 s]
- Range (min … max): 62.836 s … 73.618 s 10 runs
+ Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
+ Time (mean ± σ): 86.444 s ± 4.081 s [User: 36.830 s, System: 11.312 s]
+ Range (min … max): 80.305 s … 93.104 s 10 runs
- Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
- Time (mean ± σ): 13.032 s ± 0.072 s [User: 11.638 s, System: 1.368 s]
- Range (min … max): 12.960 s … 13.199 s 10 runs
+ Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
+ Time (mean ± σ): 2.089 s ± 0.015 s [User: 1.872 s, System: 0.207 s]
+ Range (min … max): 2.073 s … 2.119 s 10 runs
Summary
- git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
- 74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
- 538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
- 627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
- 3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
- 3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
-
- The first benchmark is mostly equivalent in runtime compared to all the
- others without the bitmap-optimization introduced in this commit. What
- is noticeable in the benchmarks is that we're I/O-bound, not CPU-bound,
- as can be seen from the user/system runtimes, which is often way lower
- than the overall benchmarked runtime.
+ cat-file with filter=object:type=commit (revision = HEAD) ran
+ 41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)
+
+ This is because we don't have to scan through all packfiles anymore, but
+ can instead directly look up relevant objects.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
@@ builtin/cat-file.c: static void batch_each_object(for_each_object_fn callback,
- for_each_packed_object(the_repository, batch_one_object_packed,
- &payload, flags);
+
-+ if (bitmap &&
-+ (opt->objects_filter.choice == LOFC_OBJECT_TYPE ||
-+ opt->objects_filter.choice == LOFC_BLOB_NONE)) {
++ if (bitmap && !for_each_bitmapped_object(bitmap, &opt->objects_filter,
++ batch_one_object_bitmapped, &payload)) {
+ struct packed_git *pack;
+
-+ if (opt->objects_filter.choice == LOFC_OBJECT_TYPE) {
-+ for_each_bitmapped_object(bitmap, opt->objects_filter.object_type,
-+ batch_one_object_bitmapped, &payload);
-+ } else {
-+ for_each_bitmapped_object(bitmap, OBJ_COMMIT,
-+ batch_one_object_bitmapped, &payload);
-+ for_each_bitmapped_object(bitmap, OBJ_TAG,
-+ batch_one_object_bitmapped, &payload);
-+ for_each_bitmapped_object(bitmap, OBJ_TREE,
-+ batch_one_object_bitmapped, &payload);
-+ }
-+
+ for (pack = get_all_packs(the_repository); pack; pack = pack->next) {
+ if (bitmap_index_contains_pack(bitmap, pack) ||
+ open_pack_index(pack))
---
base-commit: 003c5f45b8447877015b2a23ceab2297638fe1f1
change-id: 20250220-pks-cat-file-object-type-filter-9140c0ed5ee1
* [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
@ 2025-03-27 9:43 ` Patrick Steinhardt
2025-04-01 9:51 ` Karthik Nayak
2025-03-27 9:43 ` [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
` (8 subsequent siblings)
9 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:43 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
The usage strings for git-cat-file(1) that we pass to `parse_options()`
and `usage_msg_optf()` are stored in a variable called `usage`. This
variable shadows the declaration of `usage()`, which we'll want to use
in a subsequent commit.
Rename the variable to `builtin_catfile_usage`, which is in line with
how the variable is typically named in other builtins.
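The shadowing problem can be sketched roughly as follows. This is an
illustration only: `usage()` here is a simplified stand-in for Git's
function of the same name (the real one exits), and `usage_calls` and
`demo_cmd()` are hypothetical names that do not exist in the tree.

```c
#include <stdio.h>

/* Stand-in for Git's usage(); the real function prints and exits. */
static int usage_calls;

static void usage(const char *msg)
{
	usage_calls++;
	fprintf(stderr, "usage: %s\n", msg);
}

/*
 * If the array below were still named `usage`, the identifier would
 * refer to the array inside this scope, and the call to usage() would
 * fail to compile. Renaming the array keeps the function reachable.
 */
static int demo_cmd(int batch_enabled)
{
	const char * const builtin_catfile_usage[] = {
		"git cat-file <type> <object>",
		NULL,
	};

	if (!batch_enabled) {
		usage("objects filter only supported in batch mode");
		return 1;
	}
	/* The first usage string is set, so this evaluates to 0. */
	return builtin_catfile_usage[0] == NULL;
}
```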
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/cat-file.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b13561cf73b..8e40016dd24 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -941,7 +941,7 @@ int cmd_cat_file(int argc,
int input_nul_terminated = 0;
int nul_terminated = 0;
- const char * const usage[] = {
+ const char * const builtin_catfile_usage[] = {
N_("git cat-file <type> <object>"),
N_("git cat-file (-e | -p) <object>"),
N_("git cat-file (-t | -s) [--allow-unknown-type] <object>"),
@@ -1007,7 +1007,7 @@ int cmd_cat_file(int argc,
batch.buffer_output = -1;
- argc = parse_options(argc, argv, prefix, options, usage, 0);
+ argc = parse_options(argc, argv, prefix, options, builtin_catfile_usage, 0);
opt_cw = (opt == 'c' || opt == 'w');
opt_epts = (opt == 'e' || opt == 'p' || opt == 't' || opt == 's');
@@ -1021,7 +1021,7 @@ int cmd_cat_file(int argc,
/* Option compatibility */
if (force_path && !opt_cw)
usage_msg_optf(_("'%s=<%s>' needs '%s' or '%s'"),
- usage, options,
+ builtin_catfile_usage, options,
"--path", _("path|tree-ish"), "--filters",
"--textconv");
@@ -1029,19 +1029,19 @@ int cmd_cat_file(int argc,
if (batch.enabled)
;
else if (batch.follow_symlinks)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"--follow-symlinks");
else if (batch.buffer_output >= 0)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"--buffer");
else if (batch.all_objects)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"--batch-all-objects");
else if (input_nul_terminated)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"-z");
else if (nul_terminated)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
"-Z");
batch.input_delim = batch.output_delim = '\n';
@@ -1063,9 +1063,9 @@ int cmd_cat_file(int argc,
batch.transform_mode = opt;
else if (opt && opt != 'b')
usage_msg_optf(_("'-%c' is incompatible with batch mode"),
- usage, options, opt);
+ builtin_catfile_usage, options, opt);
else if (argc)
- usage_msg_opt(_("batch modes take no arguments"), usage,
+ usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
options);
return batch_objects(&batch);
@@ -1074,22 +1074,22 @@ int cmd_cat_file(int argc,
if (opt) {
if (!argc && opt == 'c')
usage_msg_optf(_("<rev> required with '%s'"),
- usage, options, "--textconv");
+ builtin_catfile_usage, options, "--textconv");
else if (!argc && opt == 'w')
usage_msg_optf(_("<rev> required with '%s'"),
- usage, options, "--filters");
+ builtin_catfile_usage, options, "--filters");
else if (!argc && opt_epts)
usage_msg_optf(_("<object> required with '-%c'"),
- usage, options, opt);
+ builtin_catfile_usage, options, opt);
else if (argc == 1)
obj_name = argv[0];
else
- usage_msg_opt(_("too many arguments"), usage, options);
+ usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options);
} else if (!argc) {
- usage_with_options(usage, options);
+ usage_with_options(builtin_catfile_usage, options);
} else if (argc != 2) {
usage_msg_optf(_("only two arguments allowed in <type> <object> mode, not %d"),
- usage, options, argc);
+ builtin_catfile_usage, options, argc);
} else if (argc) {
exp_type = argv[0];
obj_name = argv[1];
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
2025-03-27 9:43 ` [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
@ 2025-03-27 9:43 ` Patrick Steinhardt
2025-04-01 11:45 ` Toon Claes
2025-04-01 12:05 ` Karthik Nayak
2025-03-27 9:43 ` [PATCH v2 03/10] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
` (7 subsequent siblings)
9 siblings, 2 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:43 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
In batch mode, git-cat-file(1) enumerates all objects and prints them
by iterating through both loose and packed objects. This works without
considering their reachability at all, and consequently most options to
filter objects as they exist in e.g. git-rev-list(1) are not applicable.
In some situations it may still be useful though to filter objects based
on properties that are inherent to them. This includes the object size
as well as its type.
Such a filter already exists in git-rev-list(1) with the `--filter=`
command line option. While this option supports a couple of filters that
are not applicable to our use case, some of them are a neat fit.
Wire up the filter as an option for git-cat-file(1). This allows us to
reuse the same syntax as in git-rev-list(1) so that we don't have to
reinvent the wheel. For now, we die when any of the filter options has
been passed by the user, but they will be wired up in subsequent
commits.
Further note that the filters that we are about to introduce don't
significantly speed up the runtime of git-cat-file(1). While we can skip
emitting a lot of objects in case they are uninteresting to us, the
majority of time is spent reading the packfile, which is bottlenecked by
I/O and not the processor. This will change though once we start to make
use of bitmaps, which will allow us to skip reading the whole packfile.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 6 ++++++
builtin/cat-file.c | 37 +++++++++++++++++++++++++++++++++----
t/t1006-cat-file.sh | 32 ++++++++++++++++++++++++++++++++
3 files changed, 71 insertions(+), 4 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index d5890ae3686..f7f57b7f538 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -81,6 +81,12 @@ OPTIONS
end-of-line conversion, etc). In this case, `<object>` has to be of
the form `<tree-ish>:<path>`, or `:<path>`.
+--filter=<filter-spec>::
+--no-filter::
+ Omit objects from the list of printed objects. This can only be used in
+ combination with one of the batched modes. The '<filter-spec>' may be
+ one of the following:
+
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
name and a path separately, e.g. when it is difficult to figure out
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 8e40016dd24..940900d92ad 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -15,6 +15,7 @@
#include "gettext.h"
#include "hex.h"
#include "ident.h"
+#include "list-objects-filter-options.h"
#include "parse-options.h"
#include "userdiff.h"
#include "streaming.h"
@@ -35,6 +36,7 @@ enum batch_mode {
};
struct batch_options {
+ struct list_objects_filter_options objects_filter;
int enabled;
int follow_symlinks;
enum batch_mode batch_mode;
@@ -487,6 +489,13 @@ static void batch_object_write(const char *obj_name,
return;
}
+ switch (opt->objects_filter.choice) {
+ case LOFC_DISABLED:
+ break;
+ default:
+ BUG("unsupported objects filter");
+ }
+
if (use_mailmap && (data->type == OBJ_COMMIT || data->type == OBJ_TAG)) {
size_t s = data->size;
char *buf = NULL;
@@ -812,7 +821,8 @@ static int batch_objects(struct batch_options *opt)
struct object_cb_data cb;
struct object_info empty = OBJECT_INFO_INIT;
- if (!memcmp(&data.info, &empty, sizeof(empty)))
+ if (!memcmp(&data.info, &empty, sizeof(empty)) &&
+ opt->objects_filter.choice == LOFC_DISABLED)
data.skip_object_info = 1;
if (repo_has_promisor_remote(the_repository))
@@ -936,10 +946,13 @@ int cmd_cat_file(int argc,
int opt_cw = 0;
int opt_epts = 0;
const char *exp_type = NULL, *obj_name = NULL;
- struct batch_options batch = {0};
+ struct batch_options batch = {
+ .objects_filter = LIST_OBJECTS_FILTER_INIT,
+ };
int unknown_type = 0;
int input_nul_terminated = 0;
int nul_terminated = 0;
+ int ret;
const char * const builtin_catfile_usage[] = {
N_("git cat-file <type> <object>"),
@@ -1000,6 +1013,8 @@ int cmd_cat_file(int argc,
N_("run filters on object's content"), 'w'),
OPT_STRING(0, "path", &force_path, N_("blob|tree"),
N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
+ OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
+ N_("object filtering"), opt_parse_list_objects_filter),
OPT_END()
};
@@ -1014,6 +1029,14 @@ int cmd_cat_file(int argc,
if (use_mailmap)
read_mailmap(&mailmap);
+ switch (batch.objects_filter.choice) {
+ case LOFC_DISABLED:
+ break;
+ default:
+ usagef(_("objects filter not supported: '%s'"),
+ list_object_filter_config_name(batch.objects_filter.choice));
+ }
+
/* --batch-all-objects? */
if (opt == 'b')
batch.all_objects = 1;
@@ -1068,7 +1091,8 @@ int cmd_cat_file(int argc,
usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
options);
- return batch_objects(&batch);
+ ret = batch_objects(&batch);
+ goto out;
}
if (opt) {
@@ -1097,5 +1121,10 @@ int cmd_cat_file(int argc,
if (unknown_type && opt != 't' && opt != 's')
die("git cat-file --allow-unknown-type: use with -s or -t");
- return cat_one_file(opt, exp_type, obj_name, unknown_type);
+
+ ret = cat_one_file(opt, exp_type, obj_name, unknown_type);
+
+out:
+ list_objects_filter_release(&batch.objects_filter);
+ return ret;
}
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 398865d6ebe..1246d3119f8 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1353,4 +1353,36 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
perl -e "$script" -- --batch-command $hello_oid "$expect" "info "
'
+test_expect_success 'setup for objects filter' '
+ git init repo
+'
+
+test_expect_success 'objects filter with unknown option' '
+ cat >expect <<-EOF &&
+ fatal: invalid filter-spec ${SQ}unknown${SQ}
+ EOF
+ test_must_fail git -C repo cat-file --filter=unknown 2>err &&
+ test_cmp expect err
+'
+
+for option in blob:none blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+do
+ test_expect_success "objects filter with unsupported option $option" '
+ case "$option" in
+ tree:1)
+ echo "usage: objects filter not supported: ${SQ}tree${SQ}" >expect
+ ;;
+ sparse:path=x)
+ echo "fatal: sparse:path filters support has been dropped" >expect
+ ;;
+ *)
+ option_name=$(echo "$option" | cut -d= -f1) &&
+ printf "usage: objects filter not supported: ${SQ}%s${SQ}\n" "$option_name" >expect
+ ;;
+ esac &&
+ test_must_fail git -C repo cat-file --filter=$option 2>err &&
+ test_cmp expect err
+ '
+done
+
test_done
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 03/10] builtin/cat-file: support "blob:none" objects filter
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
2025-03-27 9:43 ` [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
2025-03-27 9:43 ` [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
@ 2025-03-27 9:43 ` Patrick Steinhardt
2025-04-01 12:22 ` Karthik Nayak
2025-03-27 9:43 ` [PATCH v2 04/10] builtin/cat-file: support "blob:limit=" " Patrick Steinhardt
` (6 subsequent siblings)
9 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:43 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Implement support for the "blob:none" filter in git-cat-file(1), which
causes us to omit all blobs.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 2 ++
builtin/cat-file.c | 11 ++++++++++-
t/t1006-cat-file.sh | 33 +++++++++++++++++++++++++++++++--
3 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index f7f57b7f538..bb32f715944 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -86,6 +86,8 @@ OPTIONS
Omit objects from the list of printed objects. This can only be used in
combination with one of the batched modes. The '<filter-spec>' may be
one of the following:
++
+The form '--filter=blob:none' omits all blobs.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 940900d92ad..e783dbbad58 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -472,7 +472,8 @@ static void batch_object_write(const char *obj_name,
if (!data->skip_object_info) {
int ret;
- if (use_mailmap)
+ if (use_mailmap ||
+ opt->objects_filter.choice == LOFC_BLOB_NONE)
data->info.typep = &data->type;
if (pack)
@@ -492,6 +493,10 @@ static void batch_object_write(const char *obj_name,
switch (opt->objects_filter.choice) {
case LOFC_DISABLED:
break;
+ case LOFC_BLOB_NONE:
+ if (data->type == OBJ_BLOB)
+ return;
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1032,6 +1037,10 @@ int cmd_cat_file(int argc,
switch (batch.objects_filter.choice) {
case LOFC_DISABLED:
break;
+ case LOFC_BLOB_NONE:
+ if (!batch.enabled)
+ usage(_("objects filter only supported in batch mode"));
+ break;
default:
usagef(_("objects filter not supported: '%s'"),
list_object_filter_config_name(batch.objects_filter.choice));
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 1246d3119f8..d00073f8add 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1354,7 +1354,22 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
'
test_expect_success 'setup for objects filter' '
- git init repo
+ git init repo &&
+ (
+ # Seed the repository with three different sets of objects:
+ #
+ # - The first set is fully packed and has a bitmap.
+ # - The second set is packed, but has no bitmap.
+ # - The third set is loose.
+ #
+ # This ensures that we cover all these types as expected.
+ cd repo &&
+ test_commit first &&
+ git repack -Adb &&
+ test_commit second &&
+ git repack -d &&
+ test_commit third
+ )
'
test_expect_success 'objects filter with unknown option' '
@@ -1365,7 +1380,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in blob:none blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1385,4 +1400,18 @@ do
'
done
+test_objects_filter () {
+ filter="$1"
+
+ test_expect_success "objects filter: $filter" '
+ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --filter="$filter" >actual &&
+ sort actual >actual.sorted &&
+ git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >expect &&
+ sort expect >expect.sorted &&
+ test_cmp expect.sorted actual.sorted
+ '
+}
+
+test_objects_filter "blob:none"
+
test_done
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 04/10] builtin/cat-file: support "blob:limit=" objects filter
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (2 preceding siblings ...)
2025-03-27 9:43 ` [PATCH v2 03/10] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
@ 2025-03-27 9:43 ` Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 05/10] builtin/cat-file: support "object:type=" " Patrick Steinhardt
` (5 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:43 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Implement support for the "blob:limit=" filter in git-cat-file(1), which
causes us to omit all blobs whose size is at least the given limit.
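The limit semantics can be sketched as below. This is not how Git parses
the value internally; `parse_blob_limit()` and `blob_limit_omits()` are
hypothetical helpers, only the lowercase suffixes named in the
documentation are accepted, and overflow is not handled. The point is
that the comparison is inclusive: a 1000-byte blob is omitted by
"blob:limit=1000", and "blob:limit=0" omits every blob.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: parse "<n>[kmg]" into a byte count.
 * Returns 0 on success and -1 on malformed input. Overflow is
 * deliberately ignored in this sketch.
 */
static int parse_blob_limit(const char *spec, unsigned long long *out)
{
	char *end;
	unsigned long long n;

	if (*spec < '0' || *spec > '9')
		return -1;

	n = strtoull(spec, &end, 10);
	switch (*end) {
	case 'k':
		n *= 1024ULL;
		end++;
		break;
	case 'm':
		n *= 1024ULL * 1024ULL;
		end++;
		break;
	case 'g':
		n *= 1024ULL * 1024ULL * 1024ULL;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;

	*out = n;
	return 0;
}

/* Non-blobs are never omitted; blobs are omitted when size >= limit. */
static int blob_limit_omits(int is_blob, unsigned long long size,
			    unsigned long long limit)
{
	return is_blob && size >= limit;
}
```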
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 5 +++++
builtin/cat-file.c | 11 ++++++++++-
t/t1006-cat-file.sh | 18 +++++++++++++++---
3 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index bb32f715944..62bfb00f4b1 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -88,6 +88,11 @@ OPTIONS
one of the following:
+
The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
+bytes or units. n may be zero. The suffixes k, m, and g can be used to name
+units in KiB, MiB, or GiB. For example, 'blob:limit=1k' is the same as
+'blob:limit=1024'.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index e783dbbad58..55755a461bc 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -473,8 +473,11 @@ static void batch_object_write(const char *obj_name,
int ret;
if (use_mailmap ||
- opt->objects_filter.choice == LOFC_BLOB_NONE)
+ opt->objects_filter.choice == LOFC_BLOB_NONE ||
+ opt->objects_filter.choice == LOFC_BLOB_LIMIT)
data->info.typep = &data->type;
+ if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
+ data->info.sizep = &data->size;
if (pack)
ret = packed_object_info(the_repository, pack, offset,
@@ -497,6 +500,11 @@ static void batch_object_write(const char *obj_name,
if (data->type == OBJ_BLOB)
return;
break;
+ case LOFC_BLOB_LIMIT:
+ if (data->type == OBJ_BLOB &&
+ data->size >= opt->objects_filter.blob_limit_value)
+ return;
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1038,6 +1046,7 @@ int cmd_cat_file(int argc,
case LOFC_DISABLED:
break;
case LOFC_BLOB_NONE:
+ case LOFC_BLOB_LIMIT:
if (!batch.enabled)
usage(_("objects filter only supported in batch mode"));
break;
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index d00073f8add..1a0931bd2ca 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1356,11 +1356,12 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
test_expect_success 'setup for objects filter' '
git init repo &&
(
- # Seed the repository with three different sets of objects:
+ # Seed the repository with four different sets of objects:
#
# - The first set is fully packed and has a bitmap.
# - The second set is packed, but has no bitmap.
# - The third set is loose.
+ # - The fourth set is loose and contains big objects.
#
# This ensures that we cover all these types as expected.
cd repo &&
@@ -1368,7 +1369,14 @@ test_expect_success 'setup for objects filter' '
git repack -Adb &&
test_commit second &&
git repack -d &&
- test_commit third
+ test_commit third &&
+
+ for n in 1000 10000
+ do
+ printf "%"$n"s" X >large.$n || return 1
+ done &&
+ git add large.* &&
+ git commit -m fourth
)
'
@@ -1380,7 +1388,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in object:type=tag sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1413,5 +1421,9 @@ test_objects_filter () {
}
test_objects_filter "blob:none"
+test_objects_filter "blob:limit=1"
+test_objects_filter "blob:limit=500"
+test_objects_filter "blob:limit=1000"
+test_objects_filter "blob:limit=1g"
test_done
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 05/10] builtin/cat-file: support "object:type=" objects filter
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (3 preceding siblings ...)
2025-03-27 9:43 ` [PATCH v2 04/10] builtin/cat-file: support "blob:limit=" " Patrick Steinhardt
@ 2025-03-27 9:44 ` Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 06/10] pack-bitmap: allow passing payloads to `show_reachable_fn()` Patrick Steinhardt
` (4 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:44 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Implement support for the "object:type=" filter in git-cat-file(1),
which causes us to omit all objects that don't match the provided object
type.
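Taken together, the three filters supported so far boil down to a single
per-object decision. The following is a simplified sketch of that
decision, assuming stand-in enums and a stand-in struct; the names
`FILTER_*`, `TYPE_*`, `struct demo_filter` and `filter_omits()` are
hypothetical and only mirror the `LOFC_*` and `OBJ_*` values used in the
patch.

```c
/* Stand-ins for the LOFC_* filter choices used in the patch. */
enum filter_choice {
	FILTER_DISABLED,
	FILTER_BLOB_NONE,
	FILTER_BLOB_LIMIT,
	FILTER_OBJECT_TYPE,
};

/* Stand-ins for Git's OBJ_* object types. */
enum demo_type {
	TYPE_COMMIT = 1,
	TYPE_TREE,
	TYPE_BLOB,
	TYPE_TAG,
};

struct demo_filter {
	enum filter_choice choice;
	unsigned long blob_limit;  /* only used by FILTER_BLOB_LIMIT */
	enum demo_type wanted;     /* only used by FILTER_OBJECT_TYPE */
};

/* Returns 1 if the object should be omitted from the output. */
static int filter_omits(const struct demo_filter *f,
			enum demo_type type, unsigned long size)
{
	switch (f->choice) {
	case FILTER_DISABLED:
		return 0;
	case FILTER_BLOB_NONE:
		return type == TYPE_BLOB;
	case FILTER_BLOB_LIMIT:
		return type == TYPE_BLOB && size >= f->blob_limit;
	case FILTER_OBJECT_TYPE:
		return type != f->wanted;
	}
	return 0;
}
```

Note that only the type and, for the size limit, the size of an object
are needed to answer the question, which is why these filters can be
evaluated without any reachability information.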
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 3 +++
builtin/cat-file.c | 8 +++++++-
t/t1006-cat-file.sh | 6 +++++-
3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index 62bfb00f4b1..9931840567b 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -93,6 +93,9 @@ The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
bytes or units. n may be zero. The suffixes k, m, and g can be used to name
units in KiB, MiB, or GiB. For example, 'blob:limit=1k' is the same as
'blob:limit=1024'.
++
+The form '--filter=object:type=(tag|commit|tree|blob)' omits all objects which
+are not of the requested type.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 55755a461bc..430320adfe9 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -474,7 +474,8 @@ static void batch_object_write(const char *obj_name,
if (use_mailmap ||
opt->objects_filter.choice == LOFC_BLOB_NONE ||
- opt->objects_filter.choice == LOFC_BLOB_LIMIT)
+ opt->objects_filter.choice == LOFC_BLOB_LIMIT ||
+ opt->objects_filter.choice == LOFC_OBJECT_TYPE)
data->info.typep = &data->type;
if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
data->info.sizep = &data->size;
@@ -505,6 +506,10 @@ static void batch_object_write(const char *obj_name,
data->size >= opt->objects_filter.blob_limit_value)
return;
break;
+ case LOFC_OBJECT_TYPE:
+ if (data->type != opt->objects_filter.object_type)
+ return;
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1047,6 +1052,7 @@ int cmd_cat_file(int argc,
break;
case LOFC_BLOB_NONE:
case LOFC_BLOB_LIMIT:
+ case LOFC_OBJECT_TYPE:
if (!batch.enabled)
usage(_("objects filter only supported in batch mode"));
break;
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 1a0931bd2ca..9edd3d0c048 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1388,7 +1388,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1425,5 +1425,9 @@ test_objects_filter "blob:limit=1"
test_objects_filter "blob:limit=500"
test_objects_filter "blob:limit=1000"
test_objects_filter "blob:limit=1g"
+test_objects_filter "object:type=blob"
+test_objects_filter "object:type=commit"
+test_objects_filter "object:type=tag"
+test_objects_filter "object:type=tree"
test_done
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 06/10] pack-bitmap: allow passing payloads to `show_reachable_fn()`
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (4 preceding siblings ...)
2025-03-27 9:44 ` [PATCH v2 05/10] builtin/cat-file: support "object:type=" " Patrick Steinhardt
@ 2025-03-27 9:44 ` Patrick Steinhardt
2025-04-01 12:17 ` Toon Claes
2025-03-27 9:44 ` [PATCH v2 07/10] pack-bitmap: add function to iterate over filtered bitmapped objects Patrick Steinhardt
` (3 subsequent siblings)
9 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:44 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
The `show_reachable_fn` callback is used by a couple of functions to
present reachable objects to the caller. The function does not provide a
way for the caller to pass a payload though, which is functionality that
we'll require in a subsequent commit.
Change the callback type to accept a payload and adapt all callsites
accordingly.
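The payload pattern itself can be sketched in isolation as follows. The
callback type below is much smaller than `show_reachable_fn`, and all
names (`demo_show_fn`, `struct demo_counter`, `count_matching()`,
`for_each_demo_object()`) are hypothetical. The idea is merely that the
iterator hands an opaque pointer through to the callback so that the
callback does not have to rely on global state.

```c
#include <stddef.h>

/* Simplified callback type: the last argument is caller-owned state. */
typedef int (*demo_show_fn)(unsigned id, int type, void *payload);

struct demo_counter {
	size_t seen;
	int wanted_type;
};

/* Counts objects of the wanted type via the payload pointer. */
static int count_matching(unsigned id, int type, void *payload)
{
	struct demo_counter *counter = payload;

	(void)id; /* unused in this sketch */
	if (type == counter->wanted_type)
		counter->seen++;
	return 0;
}

/* Invokes the callback for every object, passing the payload through. */
static void for_each_demo_object(const int *types, size_t nr,
				 demo_show_fn fn, void *payload)
{
	size_t i;

	for (i = 0; i < nr; i++)
		fn((unsigned)i, types[i], payload);
}
```

Callers that have no state to pass can hand in NULL, which is what the
existing callsites in this patch do.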
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/pack-objects.c | 3 ++-
builtin/rev-list.c | 3 ++-
pack-bitmap.c | 15 ++++++++-------
pack-bitmap.h | 3 ++-
reachable.c | 3 ++-
5 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a7e4bb79049..38784613fc0 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1736,7 +1736,8 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
static int add_object_entry_from_bitmap(const struct object_id *oid,
enum object_type type,
int flags UNUSED, uint32_t name_hash,
- struct packed_git *pack, off_t offset)
+ struct packed_git *pack, off_t offset,
+ void *payload UNUSED)
{
display_progress(progress_state, ++nr_seen);
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index bb26bee0d45..1100dd2abe7 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -429,7 +429,8 @@ static int show_object_fast(
int exclude UNUSED,
uint32_t name_hash UNUSED,
struct packed_git *found_pack UNUSED,
- off_t found_offset UNUSED)
+ off_t found_offset UNUSED,
+ void *payload UNUSED)
{
fprintf(stdout, "%s\n", oid_to_hex(oid));
return 1;
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 6f7fd94c36f..d192fb87da9 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1625,7 +1625,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
(obj->type == OBJ_TAG && !revs->tag_objects))
continue;
- show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0);
+ show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0, NULL);
}
}
@@ -1663,7 +1663,8 @@ static void init_type_iterator(struct ewah_or_iterator *it,
static void show_objects_for_type(
struct bitmap_index *bitmap_git,
enum object_type object_type,
- show_reachable_fn show_reach)
+ show_reachable_fn show_reach,
+ void *payload)
{
size_t i = 0;
uint32_t offset;
@@ -1715,7 +1716,7 @@ static void show_objects_for_type(
if (bitmap_git->hashes)
hash = get_be32(bitmap_git->hashes + index_pos);
- show_reach(&oid, object_type, 0, hash, pack, ofs);
+ show_reach(&oid, object_type, 0, hash, pack, ofs, payload);
}
}
@@ -2518,13 +2519,13 @@ void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
{
assert(bitmap_git->result);
- show_objects_for_type(bitmap_git, OBJ_COMMIT, show_reachable);
+ show_objects_for_type(bitmap_git, OBJ_COMMIT, show_reachable, NULL);
if (revs->tree_objects)
- show_objects_for_type(bitmap_git, OBJ_TREE, show_reachable);
+ show_objects_for_type(bitmap_git, OBJ_TREE, show_reachable, NULL);
if (revs->blob_objects)
- show_objects_for_type(bitmap_git, OBJ_BLOB, show_reachable);
+ show_objects_for_type(bitmap_git, OBJ_BLOB, show_reachable, NULL);
if (revs->tag_objects)
- show_objects_for_type(bitmap_git, OBJ_TAG, show_reachable);
+ show_objects_for_type(bitmap_git, OBJ_TAG, show_reachable, NULL);
show_extended_objects(bitmap_git, revs, show_reachable);
}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index dd0951088f6..de6bf534fef 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -50,7 +50,8 @@ typedef int (*show_reachable_fn)(
int flags,
uint32_t hash,
struct packed_git *found_pack,
- off_t found_offset);
+ off_t found_offset,
+ void *payload);
struct bitmap_index;
diff --git a/reachable.c b/reachable.c
index 9ee04c89ec6..421d354d3b5 100644
--- a/reachable.c
+++ b/reachable.c
@@ -341,7 +341,8 @@ static int mark_object_seen(const struct object_id *oid,
int exclude UNUSED,
uint32_t name_hash UNUSED,
struct packed_git *found_pack UNUSED,
- off_t found_offset UNUSED)
+ off_t found_offset UNUSED,
+ void *payload UNUSED)
{
struct object *obj = lookup_object_by_type(the_repository, oid, type);
if (!obj)
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 07/10] pack-bitmap: add function to iterate over filtered bitmapped objects
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (5 preceding siblings ...)
2025-03-27 9:44 ` [PATCH v2 06/10] pack-bitmap: allow passing payloads to `show_reachable_fn()` Patrick Steinhardt
@ 2025-03-27 9:44 ` Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 08/10] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
` (2 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:44 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Introduce a function that allows the caller to iterate over all
bitmapped objects that match a given filter. This mechanism will be used
in a subsequent commit to optimize object filters in git-cat-file(1).
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
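Note: the "all-1 bitmap" that the new function starts from can be
sketched in isolation as below. This is only an illustration under
assumptions: `toy_bitmap`, `toy_bitmap_all_ones()` and `BITS_IN_WORD` are
hypothetical stand-ins and not Git's `struct bitmap`, `bitmap_word_alloc()`
or `BITS_IN_EWORD`. The point is that full words can be filled with a single
memset(3), while the trailing partial word has to be set bit by bit so that
no bit beyond the last object is marked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BITS_IN_WORD 64 /* hypothetical stand-in for BITS_IN_EWORD */

/* Hypothetical simplified bitmap; not Git's `struct bitmap`. */
struct toy_bitmap {
	uint64_t *words;
	size_t word_alloc;
};

/* Allocate a bitmap in which exactly the first `objects_nr` bits are set. */
static struct toy_bitmap *toy_bitmap_all_ones(uint32_t objects_nr)
{
	size_t full_words = objects_nr / BITS_IN_WORD;
	size_t word_alloc = full_words + !!(objects_nr % BITS_IN_WORD);
	struct toy_bitmap *bm = malloc(sizeof(*bm));

	assert(bm);
	bm->word_alloc = word_alloc;
	/* Allocate at least one word so that the pointer is always usable. */
	bm->words = calloc(word_alloc ? word_alloc : 1, sizeof(*bm->words));
	assert(bm->words);

	/* Fully populated words can be filled in one go. */
	memset(bm->words, 0xff, full_words * sizeof(*bm->words));

	/* The trailing partial word must only cover objects that exist. */
	for (size_t i = full_words * BITS_IN_WORD; i < objects_nr; i++)
		bm->words[i / BITS_IN_WORD] |= (uint64_t)1 << (i % BITS_IN_WORD);

	return bm;
}

static void toy_bitmap_free(struct toy_bitmap *bm)
{
	if (!bm)
		return;
	free(bm->words);
	free(bm);
}
```

With 70 objects this yields two words: the first has all 64 bits set and
the second only has its lowest six bits set.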
pack-bitmap.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
pack-bitmap.h | 12 ++++++++++++
2 files changed, 65 insertions(+), 6 deletions(-)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d192fb87da9..6adb8aaa1c2 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1662,6 +1662,7 @@ static void init_type_iterator(struct ewah_or_iterator *it,
static void show_objects_for_type(
struct bitmap_index *bitmap_git,
+ struct bitmap *objects,
enum object_type object_type,
show_reachable_fn show_reach,
void *payload)
@@ -1672,8 +1673,6 @@ static void show_objects_for_type(
struct ewah_or_iterator it;
eword_t filter;
- struct bitmap *objects = bitmap_git->result;
-
init_type_iterator(&it, bitmap_git, object_type);
for (i = 0; i < objects->word_alloc &&
@@ -2025,6 +2024,50 @@ static void filter_packed_objects_from_bitmap(struct bitmap_index *bitmap_git,
}
}
+int for_each_bitmapped_object(struct bitmap_index *bitmap_git,
+ struct list_objects_filter_options *filter,
+ show_reachable_fn show_reach,
+ void *payload)
+{
+ struct bitmap *filtered_bitmap = NULL;
+ uint32_t objects_nr;
+ size_t full_word_count;
+ int ret;
+
+ if (!can_filter_bitmap(filter)) {
+ ret = -1;
+ goto out;
+ }
+
+ objects_nr = bitmap_num_objects(bitmap_git);
+ full_word_count = objects_nr / BITS_IN_EWORD;
+
+ /* We start from the all-1 bitmap and then filter down from there. */
+ filtered_bitmap = bitmap_word_alloc(full_word_count + !!(objects_nr % BITS_IN_EWORD));
+ memset(filtered_bitmap->words, 0xff, full_word_count * sizeof(*filtered_bitmap->words));
+ for (size_t i = full_word_count * BITS_IN_EWORD; i < objects_nr; i++)
+ bitmap_set(filtered_bitmap, i);
+
+ if (filter_bitmap(bitmap_git, NULL, filtered_bitmap, filter) < 0) {
+ ret = -1;
+ goto out;
+ }
+
+ show_objects_for_type(bitmap_git, filtered_bitmap,
+ OBJ_COMMIT, show_reach, payload);
+ show_objects_for_type(bitmap_git, filtered_bitmap,
+ OBJ_TREE, show_reach, payload);
+ show_objects_for_type(bitmap_git, filtered_bitmap,
+ OBJ_BLOB, show_reach, payload);
+ show_objects_for_type(bitmap_git, filtered_bitmap,
+ OBJ_TAG, show_reach, payload);
+
+ ret = 0;
+out:
+ bitmap_free(filtered_bitmap);
+ return ret;
+}
+
struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
int filter_provided_objects)
{
@@ -2519,13 +2562,17 @@ void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
{
assert(bitmap_git->result);
- show_objects_for_type(bitmap_git, OBJ_COMMIT, show_reachable, NULL);
+ show_objects_for_type(bitmap_git, bitmap_git->result,
+ OBJ_COMMIT, show_reachable, NULL);
if (revs->tree_objects)
- show_objects_for_type(bitmap_git, OBJ_TREE, show_reachable, NULL);
+ show_objects_for_type(bitmap_git, bitmap_git->result,
+ OBJ_TREE, show_reachable, NULL);
if (revs->blob_objects)
- show_objects_for_type(bitmap_git, OBJ_BLOB, show_reachable, NULL);
+ show_objects_for_type(bitmap_git, bitmap_git->result,
+ OBJ_BLOB, show_reachable, NULL);
if (revs->tag_objects)
- show_objects_for_type(bitmap_git, OBJ_TAG, show_reachable, NULL);
+ show_objects_for_type(bitmap_git, bitmap_git->result,
+ OBJ_TAG, show_reachable, NULL);
show_extended_objects(bitmap_git, revs, show_reachable);
}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index de6bf534fef..079bae32466 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -79,6 +79,18 @@ int test_bitmap_pseudo_merges(struct repository *r);
int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n);
int test_bitmap_pseudo_merge_objects(struct repository *r, uint32_t n);
+struct list_objects_filter_options;
+
+/*
+ * Filter bitmapped objects and iterate through all resulting objects,
+ * executing `show_reach` for each of them. Returns `-1` in case the filter is
+ * not supported, `0` otherwise.
+ */
+int for_each_bitmapped_object(struct bitmap_index *bitmap_git,
+ struct list_objects_filter_options *filter,
+ show_reachable_fn show_reach,
+ void *payload);
+
#define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \
"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL"
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 08/10] pack-bitmap: introduce function to check whether a pack is bitmapped
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (6 preceding siblings ...)
2025-03-27 9:44 ` [PATCH v2 07/10] pack-bitmap: add function to iterate over filtered bitmapped objects Patrick Steinhardt
@ 2025-03-27 9:44 ` Patrick Steinhardt
2025-04-01 11:46 ` Toon Claes
2025-03-27 9:44 ` [PATCH v2 09/10] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 10/10] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
9 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:44 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Introduce a function that allows us to verify whether a pack is
bitmapped or not. This functionality will be used in a subsequent
commit.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
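Note: the lookup can be tried out in isolation with the sketch below. It
is a minimal model under assumptions: the `toy_*` structures are
hypothetical and only mirror the fields that the function touches (a
single pack, an optional multi-pack index, and a `base` pointer to the
next bitmap layer). Packs are compared by pointer identity, which assumes
that every pack is represented by exactly one structure instance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not Git's structures. */
struct toy_pack {
	const char *name;
};

struct toy_midx {
	struct toy_pack **packs;
	size_t num_packs;
};

struct toy_bitmap_index {
	struct toy_pack *pack;         /* set for a single-pack bitmap */
	struct toy_midx *midx;         /* set for a multi-pack bitmap */
	struct toy_bitmap_index *base; /* next layer, or NULL */
};

/* Return 1 if any layer of the bitmap chain covers `pack`, 0 otherwise. */
static int toy_index_contains_pack(const struct toy_bitmap_index *bitmap,
				   const struct toy_pack *pack)
{
	assert(pack);

	for (; bitmap; bitmap = bitmap->base) {
		if (bitmap->midx) {
			/* Multi-pack bitmap: check every pack in the index. */
			for (size_t i = 0; i < bitmap->midx->num_packs; i++)
				if (bitmap->midx->packs[i] == pack)
					return 1;
		} else if (bitmap->pack == pack) {
			/* Single-pack bitmap: only one candidate. */
			return 1;
		}
	}

	return 0;
}
```

A chain whose top layer is a multi-pack bitmap over packs "a" and "b" and
whose base layer is a single-pack bitmap for "c" reports all three packs
as covered, and any other pack as not covered.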
pack-bitmap.c | 15 +++++++++++++++
pack-bitmap.h | 7 +++++++
2 files changed, 22 insertions(+)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 6adb8aaa1c2..edc8f42122d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -745,6 +745,21 @@ struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
return NULL;
}
+int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack)
+{
+ for (; bitmap; bitmap = bitmap->base) {
+ if (bitmap_is_midx(bitmap)) {
+ for (size_t i = 0; i < bitmap->midx->num_packs; i++)
+ if (bitmap->midx->packs[i] == pack)
+ return 1;
+ } else if (bitmap->pack == pack) {
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
struct include_data {
struct bitmap_index *bitmap_git;
struct bitmap *base;
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 079bae32466..55df1b3af5a 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -67,6 +67,13 @@ struct bitmapped_pack {
struct bitmap_index *prepare_bitmap_git(struct repository *r);
struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
+
+/*
+ * Given a bitmap index, determine whether it contains the pack either directly
+ * or via the multi-pack-index.
+ */
+int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack);
+
void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
uint32_t *trees, uint32_t *blobs, uint32_t *tags);
void traverse_bitmap_commit_list(struct bitmap_index *,
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 09/10] builtin/cat-file: deduplicate logic to iterate over all objects
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (7 preceding siblings ...)
2025-03-27 9:44 ` [PATCH v2 08/10] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
@ 2025-03-27 9:44 ` Patrick Steinhardt
2025-04-01 12:13 ` Toon Claes
2025-03-27 9:44 ` [PATCH v2 10/10] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
9 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:44 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Pull out a common function that allows us to iterate over all objects in
a repository. Right now the logic is trivial and would only require two
function calls, making this refactoring a bit pointless. But in the next
commit we will iterate on this logic to make use of bitmaps, so this is
about to become a bit more complex.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
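Note: the refactoring boils down to an adapter pattern, which the sketch
below models without any Git code. Everything prefixed with `toy_` is
hypothetical: the two mock iterators stand in for the loose and packed
object walks, object IDs are plain strings, and the "offset" of a packed
object is made up as `pos * 100`. Two thin adapters translate the
differing iterator signatures into a single callback type, so that callers
only have to provide one function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The single callback type that callers implement. */
typedef int (*toy_object_fn)(const char *oid, const void *pack,
			     int64_t offset, void *data);

/* The differing signatures of the two underlying (mock) iterators. */
typedef int (*toy_loose_fn)(const char *oid, const char *path, void *data);
typedef int (*toy_packed_fn)(const char *oid, const void *pack,
			     uint32_t pos, void *data);

struct toy_payload {
	toy_object_fn callback;
	void *payload;
};

/* Adapter: loose objects have neither a pack nor an offset. */
static int toy_adapt_loose(const char *oid, const char *path, void *_payload)
{
	struct toy_payload *payload = _payload;
	(void)path;
	return payload->callback(oid, NULL, 0, payload->payload);
}

/* Adapter: translate the pack position into a (made-up) offset. */
static int toy_adapt_packed(const char *oid, const void *pack,
			    uint32_t pos, void *_payload)
{
	struct toy_payload *payload = _payload;
	return payload->callback(oid, pack, (int64_t)pos * 100,
				 payload->payload);
}

/* Mock iterator that yields one loose object. */
static int toy_for_each_loose(toy_loose_fn fn, void *data)
{
	return fn("loose-1", "objects/lo/ose-1", data);
}

/* Mock iterator that yields two objects from a single pack. */
static int toy_for_each_packed(toy_packed_fn fn, void *data)
{
	static const int pack_token = 0;
	int ret = fn("packed-1", &pack_token, 0, data);
	if (ret)
		return ret;
	return fn("packed-2", &pack_token, 1, data);
}

/* Single entry point that walks loose objects first, then packed ones. */
static void toy_each_object(toy_object_fn callback, void *data)
{
	struct toy_payload payload = {
		.callback = callback,
		.payload = data,
	};

	assert(callback);
	toy_for_each_loose(toy_adapt_loose, &payload);
	toy_for_each_packed(toy_adapt_packed, &payload);
}

/* Example callback: count all objects and those that live in a pack. */
struct toy_counts {
	int total;
	int packed;
};

static int toy_count_object(const char *oid, const void *pack,
			    int64_t offset, void *data)
{
	struct toy_counts *counts = data;
	(void)oid;
	(void)offset;
	counts->total++;
	if (pack)
		counts->packed++;
	return 0;
}
```

Calling `toy_each_object(toy_count_object, &counts)` with zeroed counters
visits three objects in this model, two of which come from the pack.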
builtin/cat-file.c | 85 ++++++++++++++++++++++++++++++------------------------
1 file changed, 48 insertions(+), 37 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 430320adfe9..6f5dbc821a2 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -622,25 +622,18 @@ static int batch_object_cb(const struct object_id *oid, void *vdata)
return 0;
}
-static int collect_loose_object(const struct object_id *oid,
- const char *path UNUSED,
- void *data)
-{
- oid_array_append(data, oid);
- return 0;
-}
-
-static int collect_packed_object(const struct object_id *oid,
- struct packed_git *pack UNUSED,
- uint32_t pos UNUSED,
- void *data)
+static int collect_object(const struct object_id *oid,
+ struct packed_git *pack UNUSED,
+ off_t offset UNUSED,
+ void *data)
{
oid_array_append(data, oid);
return 0;
}
static int batch_unordered_object(const struct object_id *oid,
- struct packed_git *pack, off_t offset,
+ struct packed_git *pack,
+ off_t offset,
void *vdata)
{
struct object_cb_data *data = vdata;
@@ -654,23 +647,6 @@ static int batch_unordered_object(const struct object_id *oid,
return 0;
}
-static int batch_unordered_loose(const struct object_id *oid,
- const char *path UNUSED,
- void *data)
-{
- return batch_unordered_object(oid, NULL, 0, data);
-}
-
-static int batch_unordered_packed(const struct object_id *oid,
- struct packed_git *pack,
- uint32_t pos,
- void *data)
-{
- return batch_unordered_object(oid, pack,
- nth_packed_object_offset(pack, pos),
- data);
-}
-
typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
struct strbuf *, struct expand_data *);
@@ -803,6 +779,45 @@ static void batch_objects_command(struct batch_options *opt,
#define DEFAULT_FORMAT "%(objectname) %(objecttype) %(objectsize)"
+typedef int (*for_each_object_fn)(const struct object_id *oid, struct packed_git *pack,
+ off_t offset, void *data);
+
+struct for_each_object_payload {
+ for_each_object_fn callback;
+ void *payload;
+};
+
+static int batch_one_object_loose(const struct object_id *oid,
+ const char *path UNUSED,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, NULL, 0, payload->payload);
+}
+
+static int batch_one_object_packed(const struct object_id *oid,
+ struct packed_git *pack,
+ uint32_t pos,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, pack, nth_packed_object_offset(pack, pos),
+ payload->payload);
+}
+
+static void batch_each_object(for_each_object_fn callback,
+ unsigned flags,
+ void *_payload)
+{
+ struct for_each_object_payload payload = {
+ .callback = callback,
+ .payload = _payload,
+ };
+ for_each_loose_object(batch_one_object_loose, &payload, 0);
+ for_each_packed_object(the_repository, batch_one_object_packed,
+ &payload, flags);
+}
+
static int batch_objects(struct batch_options *opt)
{
struct strbuf input = STRBUF_INIT;
@@ -857,18 +872,14 @@ static int batch_objects(struct batch_options *opt)
cb.seen = &seen;
- for_each_loose_object(batch_unordered_loose, &cb, 0);
- for_each_packed_object(the_repository, batch_unordered_packed,
- &cb, FOR_EACH_OBJECT_PACK_ORDER);
+ batch_each_object(batch_unordered_object,
+ FOR_EACH_OBJECT_PACK_ORDER, &cb);
oidset_clear(&seen);
} else {
struct oid_array sa = OID_ARRAY_INIT;
- for_each_loose_object(collect_loose_object, &sa, 0);
- for_each_packed_object(the_repository, collect_packed_object,
- &sa, 0);
-
+ batch_each_object(collect_object, 0, &sa);
oid_array_for_each_unique(&sa, batch_object_cb, &cb);
oid_array_clear(&sa);
--
2.49.0.472.ge94155a9ec.dirty
* [PATCH v2 10/10] builtin/cat-file: use bitmaps to efficiently filter by object type
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (8 preceding siblings ...)
2025-03-27 9:44 ` [PATCH v2 09/10] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
@ 2025-03-27 9:44 ` Patrick Steinhardt
9 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-03-27 9:44 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
While it is now possible to filter objects by type, this mechanism is
for now mostly a convenience. Most importantly, we still have to iterate
through the whole packfile to find all objects of a specific type. This
can be prohibitively expensive depending on the size of the packfiles.
It isn't really possible to do better than this when only considering a
packfile itself, as the order of objects is not fixed. But when we have
a packfile with a corresponding bitmap, either because the packfile
itself has one or because the multi-pack index has a bitmap for it, then
we can use these bitmaps to improve the runtime.
While bitmaps are typically used to compute reachability of objects,
they also contain one bitmap per object type that encodes which object
has what type. So instead of reading through the whole packfile(s), we
can use the bitmaps and iterate through the type-specific bitmap.
Typically, only a subset of packfiles will have a bitmap. But this isn't
really much of a problem: we can use bitmaps when available, and then
use the non-bitmap walk for every packfile that isn't covered by one.
Overall, this leads to quite a significant speedup depending on how many
objects of a certain type exist. The following benchmarks have been
executed in the Chromium repository, which has a 50GB packfile with
almost 25 million objects. As expected, there isn't really much of a
change in performance without an object filter:
Benchmark 1: cat-file with no-filter (revision = HEAD~)
Time (mean ± σ): 89.675 s ± 4.527 s [User: 40.807 s, System: 10.782 s]
Range (min … max): 83.052 s … 96.084 s 10 runs
Benchmark 2: cat-file with no-filter (revision = HEAD)
Time (mean ± σ): 88.991 s ± 2.488 s [User: 42.278 s, System: 10.305 s]
Range (min … max): 82.843 s … 91.271 s 10 runs
Summary
cat-file with no-filter (revision = HEAD) ran
1.01 ± 0.06 times faster than cat-file with no-filter (revision = HEAD~)
We still have to scan through all objects as we yield all of them, so
using the bitmap in this case doesn't really buy us anything. What is
noticeable in this benchmark is that we're I/O-bound, not CPU-bound, as
can be seen from the user/system runtimes, which combined are way lower
than the overall benchmarked runtime.
But when we do use a filter we can see a significant improvement:
Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
Time (mean ± σ): 86.444 s ± 4.081 s [User: 36.830 s, System: 11.312 s]
Range (min … max): 80.305 s … 93.104 s 10 runs
Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
Time (mean ± σ): 2.089 s ± 0.015 s [User: 1.872 s, System: 0.207 s]
Range (min … max): 2.073 s … 2.119 s 10 runs
Summary
cat-file with filter=object:type=commit (revision = HEAD) ran
41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)
This is because we don't have to scan through all packfiles anymore, but
can instead directly look up relevant objects.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
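Note: the reason why the bitmap-backed walk touches so much less data can
be modelled with the toy below. It is a sketch under assumptions and does
not use any Git API: each `toy_pack` holds at most 64 objects, the
per-type bitmap is a single word, and `inspected` merely counts how many
entries had to be looked at. Packs with a bitmap are answered by visiting
only the set bits, while packs without one fall back to scanning every
object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB };

/* Hypothetical pack model; holds at most 64 objects. */
struct toy_pack {
	const enum toy_type *types; /* type of each object in pack order */
	size_t nr;
	uint64_t commit_bits;       /* bit i set if object i is a commit */
	int has_bitmap;             /* `commit_bits` is only valid if set */
};

struct toy_stats {
	size_t emitted;   /* commits that were found */
	size_t inspected; /* entries that had to be looked at */
};

static void toy_list_commits(const struct toy_pack *packs, size_t packs_nr,
			     struct toy_stats *stats)
{
	assert(stats);

	for (size_t p = 0; p < packs_nr; p++) {
		const struct toy_pack *pack = &packs[p];

		if (pack->has_bitmap) {
			/* Bitmapped pack: visit only the set bits. */
			uint64_t bits = pack->commit_bits;
			while (bits) {
				bits &= bits - 1; /* clear lowest set bit */
				stats->inspected++;
				stats->emitted++;
			}
			continue;
		}

		/* No bitmap: every object has to be inspected. */
		for (size_t i = 0; i < pack->nr; i++) {
			stats->inspected++;
			if (pack->types[i] == TOY_COMMIT)
				stats->emitted++;
		}
	}
}
```

For a bitmapped pack with six objects of which two are commits, plus a
non-bitmapped pack with two objects of which one is a commit, the model
emits three commits while inspecting four entries instead of eight.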
builtin/cat-file.c | 42 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 37 insertions(+), 5 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 6f5dbc821a2..eb6f0536c9e 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -21,6 +21,7 @@
#include "streaming.h"
#include "oid-array.h"
#include "packfile.h"
+#include "pack-bitmap.h"
#include "object-file.h"
#include "object-name.h"
#include "object-store-ll.h"
@@ -805,7 +806,20 @@ static int batch_one_object_packed(const struct object_id *oid,
payload->payload);
}
-static void batch_each_object(for_each_object_fn callback,
+static int batch_one_object_bitmapped(const struct object_id *oid,
+ enum object_type type UNUSED,
+ int flags UNUSED,
+ uint32_t hash UNUSED,
+ struct packed_git *pack,
+ off_t offset,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, pack, offset, payload->payload);
+}
+
+static void batch_each_object(struct batch_options *opt,
+ for_each_object_fn callback,
unsigned flags,
void *_payload)
{
@@ -813,9 +827,27 @@ static void batch_each_object(for_each_object_fn callback,
.callback = callback,
.payload = _payload,
};
+ struct bitmap_index *bitmap = prepare_bitmap_git(the_repository);
+
for_each_loose_object(batch_one_object_loose, &payload, 0);
- for_each_packed_object(the_repository, batch_one_object_packed,
- &payload, flags);
+
+ if (bitmap && !for_each_bitmapped_object(bitmap, &opt->objects_filter,
+ batch_one_object_bitmapped, &payload)) {
+ struct packed_git *pack;
+
+ for (pack = get_all_packs(the_repository); pack; pack = pack->next) {
+ if (bitmap_index_contains_pack(bitmap, pack) ||
+ open_pack_index(pack))
+ continue;
+ for_each_object_in_pack(pack, batch_one_object_packed,
+ &payload, flags);
+ }
+ } else {
+ for_each_packed_object(the_repository, batch_one_object_packed,
+ &payload, flags);
+ }
+
+ free_bitmap_index(bitmap);
}
static int batch_objects(struct batch_options *opt)
@@ -872,14 +904,14 @@ static int batch_objects(struct batch_options *opt)
cb.seen = &seen;
- batch_each_object(batch_unordered_object,
+ batch_each_object(opt, batch_unordered_object,
FOR_EACH_OBJECT_PACK_ORDER, &cb);
oidset_clear(&seen);
} else {
struct oid_array sa = OID_ARRAY_INIT;
- batch_each_object(collect_object, 0, &sa);
+ batch_each_object(opt, collect_object, 0, &sa);
oid_array_for_each_unique(&sa, batch_object_cb, &cb);
oid_array_clear(&sa);
--
2.49.0.472.ge94155a9ec.dirty
* Re: [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage
2025-03-27 9:43 ` [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
@ 2025-04-01 9:51 ` Karthik Nayak
2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 1 reply; 72+ messages in thread
From: Karthik Nayak @ 2025-04-01 9:51 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Toon Claes, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> The usage strings for git-cat-file(1) that we pass to `parse_options()`
> and `usage_msg_optf()` are stored in a variable called `usage`. This
> variable shadows the declaration of `usage()`, which we'll want to use
> in a subsequent commit.
>
> Rename the variable to `builtin_catfile_usage`, which is in line with
> how the variable is typically called in other builtins.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> builtin/cat-file.c | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index b13561cf73b..8e40016dd24 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -941,7 +941,7 @@ int cmd_cat_file(int argc,
> int input_nul_terminated = 0;
> int nul_terminated = 0;
>
> - const char * const usage[] = {
> + const char * const builtin_catfile_usage[] = {
Nit (style): we align the pointer asterisk to the right. This is not part
of your change, but it would be nice to fix while at it.
> N_("git cat-file <type> <object>"),
> N_("git cat-file (-e | -p) <object>"),
> N_("git cat-file (-t | -s) [--allow-unknown-type] <object>"),
> @@ -1007,7 +1007,7 @@ int cmd_cat_file(int argc,
>
> batch.buffer_output = -1;
>
> - argc = parse_options(argc, argv, prefix, options, usage, 0);
> + argc = parse_options(argc, argv, prefix, options, builtin_catfile_usage, 0);
> opt_cw = (opt == 'c' || opt == 'w');
> opt_epts = (opt == 'e' || opt == 'p' || opt == 't' || opt == 's');
>
> @@ -1021,7 +1021,7 @@ int cmd_cat_file(int argc,
> /* Option compatibility */
> if (force_path && !opt_cw)
> usage_msg_optf(_("'%s=<%s>' needs '%s' or '%s'"),
> - usage, options,
> + builtin_catfile_usage, options,
> "--path", _("path|tree-ish"), "--filters",
> "--textconv");
>
> @@ -1029,19 +1029,19 @@ int cmd_cat_file(int argc,
> if (batch.enabled)
> ;
> else if (batch.follow_symlinks)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> + usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> "--follow-symlinks");
> else if (batch.buffer_output >= 0)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> + usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> "--buffer");
> else if (batch.all_objects)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> + usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> "--batch-all-objects");
> else if (input_nul_terminated)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> + usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> "-z");
> else if (nul_terminated)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> + usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> "-Z");
>
> batch.input_delim = batch.output_delim = '\n';
> @@ -1063,9 +1063,9 @@ int cmd_cat_file(int argc,
> batch.transform_mode = opt;
> else if (opt && opt != 'b')
> usage_msg_optf(_("'-%c' is incompatible with batch mode"),
> - usage, options, opt);
> + builtin_catfile_usage, options, opt);
> else if (argc)
> - usage_msg_opt(_("batch modes take no arguments"), usage,
> + usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
> options);
>
> return batch_objects(&batch);
> @@ -1074,22 +1074,22 @@ int cmd_cat_file(int argc,
> if (opt) {
> if (!argc && opt == 'c')
> usage_msg_optf(_("<rev> required with '%s'"),
> - usage, options, "--textconv");
> + builtin_catfile_usage, options, "--textconv");
> else if (!argc && opt == 'w')
> usage_msg_optf(_("<rev> required with '%s'"),
> - usage, options, "--filters");
> + builtin_catfile_usage, options, "--filters");
> else if (!argc && opt_epts)
> usage_msg_optf(_("<object> required with '-%c'"),
> - usage, options, opt);
> + builtin_catfile_usage, options, opt);
> else if (argc == 1)
> obj_name = argv[0];
> else
> - usage_msg_opt(_("too many arguments"), usage, options);
> + usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options);
> } else if (!argc) {
> - usage_with_options(usage, options);
> + usage_with_options(builtin_catfile_usage, options);
> } else if (argc != 2) {
> usage_msg_optf(_("only two arguments allowed in <type> <object> mode, not %d"),
> - usage, options, argc);
> + builtin_catfile_usage, options, argc);
> } else if (argc) {
> exp_type = argv[0];
> obj_name = argv[1];
>
> --
> 2.49.0.472.ge94155a9ec.dirty
Nit: Some of these lines could potentially be wrapped. But I think our
wrapping rules are a bit too strict, so I'd leave it as is.
The changes look good.
* Re: [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects
2025-03-27 9:43 ` [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
@ 2025-04-01 11:45 ` Toon Claes
2025-04-02 11:13 ` Patrick Steinhardt
2025-04-01 12:05 ` Karthik Nayak
1 sibling, 1 reply; 72+ messages in thread
From: Toon Claes @ 2025-04-01 11:45 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Karthik Nayak, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> In batch mode, git-cat-file(1) enumerates all objects and prints them
> by iterating through both loose and packed objects. This works without
> considering their reachability at all, and consequently most options to
> filter objects as they exist in e.g. git-rev-list(1) are not applicable.
> In some situations it may still be useful though to filter objects based
> on properties that are inherent to them. This includes the object size
> as well as its type.
>
> Such a filter already exists in git-rev-list(1) with the `--filter=`
> command line option. While this option supports a couple of filters that
> are not applicable to our usecase, some of them are quite a neat fit.
>
> Wire up the filter as an option for git-cat-file(1). This allows us to
> reuse the same syntax as in git-rev-list(1) so that we don't have to
> reinvent the wheel. For now, we die when any of the filter options has
> been passed by the user, but they will be wired up in subsequent
> commits.
>
> Further note that the filters that we are about to introduce don't
> significantly speed up the runtime of git-cat-file(1). While we can skip
> emitting a lot of objects in case they are uninteresting to us, the
> majority of time is spent reading the packfile, which is bottlenecked by
> I/O and not the processor. This will change though once we start to make
> use of bitmaps, which will allow us to skip reading the whole packfile.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Documentation/git-cat-file.adoc | 6 ++++++
> builtin/cat-file.c | 37 +++++++++++++++++++++++++++++++++----
> t/t1006-cat-file.sh | 32 ++++++++++++++++++++++++++++++++
> 3 files changed, 71 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
> index d5890ae3686..f7f57b7f538 100644
> --- a/Documentation/git-cat-file.adoc
> +++ b/Documentation/git-cat-file.adoc
> @@ -81,6 +81,12 @@ OPTIONS
> end-of-line conversion, etc). In this case, `<object>` has to be of
> the form `<tree-ish>:<path>`, or `:<path>`.
>
> +--filter=<filter-spec>::
> +--no-filter::
> + Omit objects from the list of printed objects. This can only be used in
> + combination with one of the batched modes. The '<filter-spec>' may be
> + one of the following:
> +
> --path=<path>::
> For use with `--textconv` or `--filters`, to allow specifying an object
> name and a path separately, e.g. when it is difficult to figure out
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 8e40016dd24..940900d92ad 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -936,10 +946,13 @@ int cmd_cat_file(int argc,
> int opt_cw = 0;
> int opt_epts = 0;
> const char *exp_type = NULL, *obj_name = NULL;
> - struct batch_options batch = {0};
> + struct batch_options batch = {
> + .objects_filter = LIST_OBJECTS_FILTER_INIT,
> + };
> int unknown_type = 0;
> int input_nul_terminated = 0;
> int nul_terminated = 0;
> + int ret;
>
> const char * const builtin_catfile_usage[] = {
> N_("git cat-file <type> <object>"),
> @@ -1000,6 +1013,8 @@ int cmd_cat_file(int argc,
> N_("run filters on object's content"), 'w'),
> OPT_STRING(0, "path", &force_path, N_("blob|tree"),
> N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
> + OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
> + N_("object filtering"), opt_parse_list_objects_filter),
Because we've decided on `--filter` we can use
`OPT_PARSE_LIST_OBJECTS_FILTER` here now.
--
Toon
* Re: [PATCH v2 08/10] pack-bitmap: introduce function to check whether a pack is bitmapped
2025-03-27 9:44 ` [PATCH v2 08/10] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
@ 2025-04-01 11:46 ` Toon Claes
2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 1 reply; 72+ messages in thread
From: Toon Claes @ 2025-04-01 11:46 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Karthik Nayak, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> Introduce a function that allows us to verify whether a pack is
> bitmapped or not. This functionality will be used in a subsequent
> commit.
>
> Helped-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> pack-bitmap.c | 15 +++++++++++++++
> pack-bitmap.h | 7 +++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 6adb8aaa1c2..edc8f42122d 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -745,6 +745,21 @@ struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
> return NULL;
> }
>
> +int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack)
> +{
> + for (; bitmap; bitmap = bitmap->base) {
> + if (bitmap_is_midx(bitmap)) {
> + for (size_t i = 0; i < bitmap->midx->num_packs; i++)
> + if (bitmap->midx->packs[i] == pack)
> + return 1;
> + } else if (bitmap->pack == pack) {
Here, and two lines above, we compare packs by their pointer address.
That doesn't seem to be common practice to me, or is it in the Git
codebase? Do we expect any problems with this, for example when we stop
using `the_repository`?
--
Toon
* Re: [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects
2025-03-27 9:43 ` [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
2025-04-01 11:45 ` Toon Claes
@ 2025-04-01 12:05 ` Karthik Nayak
2025-04-02 11:13 ` Patrick Steinhardt
1 sibling, 1 reply; 72+ messages in thread
From: Karthik Nayak @ 2025-04-01 12:05 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Toon Claes, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> In batch mode, git-cat-file(1) enumerates all objects and prints them
> by iterating through both loose and packed objects. This works without
Nit: I assume you're referring to the `--batch-all-objects` mode. It
would be nice to say so explicitly here.
> considering their reachability at all, and consequently most options to
> filter objects as they exist in e.g. git-rev-list(1) are not applicable.
> In some situations it may still be useful though to filter objects based
> on properties that are inherent to them. This includes the object size
> as well as its type.
>
> Such a filter already exists in git-rev-list(1) with the `--filter=`
> command line option. While this option supports a couple of filters that
> are not applicable to our usecase, some of them are quite a neat fit.
>
> Wire up the filter as an option for git-cat-file(1). This allows us to
> reuse the same syntax as in git-rev-list(1) so that we don't have to
> reinvent the wheel. For now, we die when any of the filter options has
> been passed by the user, but they will be wired up in subsequent
> commits.
>
> Further note that the filters that we are about to introduce don't
> significantly speed up the runtime of git-cat-file(1). While we can skip
> emitting a lot of objects in case they are uninteresting to us, the
> majority of time is spent reading the packfile, which is bottlenecked by
> I/O and not the processor. This will change though once we start to make
> use of bitmaps, which will allow us to skip reading the whole packfile.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Documentation/git-cat-file.adoc | 6 ++++++
> builtin/cat-file.c | 37 +++++++++++++++++++++++++++++++++----
> t/t1006-cat-file.sh | 32 ++++++++++++++++++++++++++++++++
> 3 files changed, 71 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
> index d5890ae3686..f7f57b7f538 100644
> --- a/Documentation/git-cat-file.adoc
> +++ b/Documentation/git-cat-file.adoc
> @@ -81,6 +81,12 @@ OPTIONS
> end-of-line conversion, etc). In this case, `<object>` has to be of
> the form `<tree-ish>:<path>`, or `:<path>`.
>
> +--filter=<filter-spec>::
> +--no-filter::
> + Omit objects from the list of printed objects. This can only be used in
> + combination with one of the batched modes. The '<filter-spec>' may be
> + one of the following:
> +
Shouldn't we say this is specific to `--batch-all-objects`?
> --path=<path>::
> For use with `--textconv` or `--filters`, to allow specifying an object
> name and a path separately, e.g. when it is difficult to figure out
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 8e40016dd24..940900d92ad 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -15,6 +15,7 @@
> #include "gettext.h"
> #include "hex.h"
> #include "ident.h"
> +#include "list-objects-filter-options.h"
> #include "parse-options.h"
> #include "userdiff.h"
> #include "streaming.h"
> @@ -35,6 +36,7 @@ enum batch_mode {
> };
>
> struct batch_options {
> + struct list_objects_filter_options objects_filter;
> int enabled;
> int follow_symlinks;
> enum batch_mode batch_mode;
> @@ -487,6 +489,13 @@ static void batch_object_write(const char *obj_name,
> return;
> }
>
> + switch (opt->objects_filter.choice) {
> + case LOFC_DISABLED:
> + break;
> + default:
> + BUG("unsupported objects filter");
> + }
> +
Okay here it seems like it also applies to other batch modes. So it
would be nice to perhaps clarify how this works when not used with
`--batch-all-objects`?
> if (use_mailmap && (data->type == OBJ_COMMIT || data->type == OBJ_TAG)) {
> size_t s = data->size;
> char *buf = NULL;
> @@ -812,7 +821,8 @@ static int batch_objects(struct batch_options *opt)
> struct object_cb_data cb;
> struct object_info empty = OBJECT_INFO_INIT;
>
> - if (!memcmp(&data.info, &empty, sizeof(empty)))
> + if (!memcmp(&data.info, &empty, sizeof(empty)) &&
> + opt->objects_filter.choice == LOFC_DISABLED)
> data.skip_object_info = 1;
>
> if (repo_has_promisor_remote(the_repository))
> @@ -936,10 +946,13 @@ int cmd_cat_file(int argc,
> int opt_cw = 0;
> int opt_epts = 0;
> const char *exp_type = NULL, *obj_name = NULL;
> - struct batch_options batch = {0};
> + struct batch_options batch = {
> + .objects_filter = LIST_OBJECTS_FILTER_INIT,
> + };
> int unknown_type = 0;
> int input_nul_terminated = 0;
> int nul_terminated = 0;
> + int ret;
>
> const char * const builtin_catfile_usage[] = {
> N_("git cat-file <type> <object>"),
> @@ -1000,6 +1013,8 @@ int cmd_cat_file(int argc,
> N_("run filters on object's content"), 'w'),
> OPT_STRING(0, "path", &force_path, N_("blob|tree"),
> N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
> + OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
> + N_("object filtering"), opt_parse_list_objects_filter),
> OPT_END()
> };
>
> @@ -1014,6 +1029,14 @@ int cmd_cat_file(int argc,
> if (use_mailmap)
> read_mailmap(&mailmap);
>
> + switch (batch.objects_filter.choice) {
> + case LOFC_DISABLED:
> + break;
> + default:
> + usagef(_("objects filter not supported: '%s'"),
> + list_object_filter_config_name(batch.objects_filter.choice));
> + }
> +
> /* --batch-all-objects? */
> if (opt == 'b')
> batch.all_objects = 1;
> @@ -1068,7 +1091,8 @@ int cmd_cat_file(int argc,
> usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
> options);
>
> - return batch_objects(&batch);
> + ret = batch_objects(&batch);
> + goto out;
> }
>
> if (opt) {
> @@ -1097,5 +1121,10 @@ int cmd_cat_file(int argc,
>
> if (unknown_type && opt != 't' && opt != 's')
> die("git cat-file --allow-unknown-type: use with -s or -t");
> - return cat_one_file(opt, exp_type, obj_name, unknown_type);
> +
> + ret = cat_one_file(opt, exp_type, obj_name, unknown_type);
> +
> +out:
> + list_objects_filter_release(&batch.objects_filter);
> + return ret;
> }
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 398865d6ebe..1246d3119f8 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -1353,4 +1353,36 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
> perl -e "$script" -- --batch-command $hello_oid "$expect" "info "
> '
>
> +test_expect_success 'setup for objects filter' '
> + git init repo
> +'
> +
> +test_expect_success 'objects filter with unknown option' '
> + cat >expect <<-EOF &&
> + fatal: invalid filter-spec ${SQ}unknown${SQ}
> + EOF
> + test_must_fail git -C repo cat-file --filter=unknown 2>err &&
> + test_cmp expect err
> +'
> +
Would it be also worthwhile to test the `--no-filter` option?
> +for option in blob:none blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
> +do
> + test_expect_success "objects filter with unsupported option $option" '
> + case "$option" in
> + tree:1)
> + echo "usage: objects filter not supported: ${SQ}tree${SQ}" >expect
> + ;;
> + sparse:path=x)
> + echo "fatal: sparse:path filters support has been dropped" >expect
> + ;;
> + *)
> + option_name=$(echo "$option" | cut -d= -f1) &&
> + printf "usage: objects filter not supported: ${SQ}%s${SQ}\n" "$option_name" >expect
> + ;;
> + esac &&
> + test_must_fail git -C repo cat-file --filter=$option 2>err &&
> + test_cmp expect err
> + '
> +done
> +
> test_done
>
> --
> 2.49.0.472.ge94155a9ec.dirty
* Re: [PATCH v2 09/10] builtin/cat-file: deduplicate logic to iterate over all objects
2025-03-27 9:44 ` [PATCH v2 09/10] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
@ 2025-04-01 12:13 ` Toon Claes
2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 1 reply; 72+ messages in thread
From: Toon Claes @ 2025-04-01 12:13 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Karthik Nayak, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> Pull out a common function that allows us to iterate over all objects in
> a repository. Right now the logic is trivial and would only require two
> function calls, making this refactoring a bit pointless. But in the next
> commit we will iterate on this logic to make use of bitmaps, so this is
> about to become a bit more complex.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> builtin/cat-file.c | 85 ++++++++++++++++++++++++++++++------------------------
> 1 file changed, 48 insertions(+), 37 deletions(-)
>
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 430320adfe9..6f5dbc821a2 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -622,25 +622,18 @@ static int batch_object_cb(const struct object_id *oid, void *vdata)
> return 0;
> }
>
> -static int collect_loose_object(const struct object_id *oid,
> - const char *path UNUSED,
> - void *data)
> -{
> - oid_array_append(data, oid);
> - return 0;
> -}
> -
> -static int collect_packed_object(const struct object_id *oid,
> - struct packed_git *pack UNUSED,
> - uint32_t pos UNUSED,
> - void *data)
> +static int collect_object(const struct object_id *oid,
> + struct packed_git *pack UNUSED,
> + off_t offset UNUSED,
> + void *data)
> {
> oid_array_append(data, oid);
> return 0;
> }
>
> static int batch_unordered_object(const struct object_id *oid,
> - struct packed_git *pack, off_t offset,
> + struct packed_git *pack,
> + off_t offset,
> void *vdata)
> {
> struct object_cb_data *data = vdata;
> @@ -654,23 +647,6 @@ static int batch_unordered_object(const struct object_id *oid,
> return 0;
> }
>
> -static int batch_unordered_loose(const struct object_id *oid,
> - const char *path UNUSED,
> - void *data)
> -{
> - return batch_unordered_object(oid, NULL, 0, data);
> -}
> -
> -static int batch_unordered_packed(const struct object_id *oid,
> - struct packed_git *pack,
> - uint32_t pos,
> - void *data)
> -{
> - return batch_unordered_object(oid, pack,
> - nth_packed_object_offset(pack, pos),
> - data);
> -}
> -
> typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
> struct strbuf *, struct expand_data *);
>
> @@ -803,6 +779,45 @@ static void batch_objects_command(struct batch_options *opt,
>
> #define DEFAULT_FORMAT "%(objectname) %(objecttype) %(objectsize)"
>
> +typedef int (*for_each_object_fn)(const struct object_id *oid, struct packed_git *pack,
> + off_t offset, void *data);
> +
> +struct for_each_object_payload {
> + for_each_object_fn callback;
> + void *payload;
> +};
> +
> +static int batch_one_object_loose(const struct object_id *oid,
> + const char *path UNUSED,
> + void *_payload)
> +{
> + struct for_each_object_payload *payload = _payload;
> + return payload->callback(oid, NULL, 0, payload->payload);
> +}
> +
> +static int batch_one_object_packed(const struct object_id *oid,
> + struct packed_git *pack,
> + uint32_t pos,
> + void *_payload)
> +{
> + struct for_each_object_payload *payload = _payload;
> + return payload->callback(oid, pack, nth_packed_object_offset(pack, pos),
> + payload->payload);
> +}
> +
> +static void batch_each_object(for_each_object_fn callback,
> + unsigned flags,
> + void *_payload)
Why is this `_payload` typeless? I see it only getting passed in
`struct object_cb_data`, is there a reason to hide this type? With
payload being wrapped in payload I think it's beneficial to keep type
info where possible.
--
Toon
* Re: [PATCH v2 06/10] pack-bitmap: allow passing payloads to `show_reachable_fn()`
2025-03-27 9:44 ` [PATCH v2 06/10] pack-bitmap: allow passing payloads to `show_reachable_fn()` Patrick Steinhardt
@ 2025-04-01 12:17 ` Toon Claes
2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 1 reply; 72+ messages in thread
From: Toon Claes @ 2025-04-01 12:17 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Karthik Nayak, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> The `show_reachable_fn` callback is used by a couple of functions to
> present reachable objects to the caller. The function does not provide a
> way for the caller to pass a payload though, which is functionality that
> we'll require in a subsequent commit.
>
> Change the callback type to accept a payload and adapt all callsites
> accordingly.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> builtin/pack-objects.c | 3 ++-
> builtin/rev-list.c | 3 ++-
> pack-bitmap.c | 15 ++++++++-------
> pack-bitmap.h | 3 ++-
> reachable.c | 3 ++-
> 5 files changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a7e4bb79049..38784613fc0 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1736,7 +1736,8 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
> static int add_object_entry_from_bitmap(const struct object_id *oid,
> enum object_type type,
> int flags UNUSED, uint32_t name_hash,
> - struct packed_git *pack, off_t offset)
> + struct packed_git *pack, off_t offset,
> + void *payload UNUSED)
> {
> display_progress(progress_state, ++nr_seen);
>
> diff --git a/builtin/rev-list.c b/builtin/rev-list.c
> index bb26bee0d45..1100dd2abe7 100644
> --- a/builtin/rev-list.c
> +++ b/builtin/rev-list.c
> @@ -429,7 +429,8 @@ static int show_object_fast(
> int exclude UNUSED,
> uint32_t name_hash UNUSED,
> struct packed_git *found_pack UNUSED,
> - off_t found_offset UNUSED)
> + off_t found_offset UNUSED,
> + void *payload UNUSED)
> {
> fprintf(stdout, "%s\n", oid_to_hex(oid));
> return 1;
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 6f7fd94c36f..d192fb87da9 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -1625,7 +1625,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
> (obj->type == OBJ_TAG && !revs->tag_objects))
> continue;
>
> - show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0);
> + show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0, NULL);
> }
> }
>
> @@ -1663,7 +1663,8 @@ static void init_type_iterator(struct ewah_or_iterator *it,
> static void show_objects_for_type(
> struct bitmap_index *bitmap_git,
> enum object_type object_type,
> - show_reachable_fn show_reach)
> + show_reachable_fn show_reach,
What would you think about adding the `_fn` to `show_reach`? Because the
function is passed on to `show_objects_for_type()`, I think it improves
the readability if it's called `show_reach_fn` or something?
--
Toon
* Re: [PATCH v2 03/10] builtin/cat-file: support "blob:none" objects filter
2025-03-27 9:43 ` [PATCH v2 03/10] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
@ 2025-04-01 12:22 ` Karthik Nayak
2025-04-01 12:31 ` Karthik Nayak
0 siblings, 1 reply; 72+ messages in thread
From: Karthik Nayak @ 2025-04-01 12:22 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Toon Claes, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> Implement support for the "blob:none" filter in git-cat-file(1), which
> causes us to omit all blobs.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Documentation/git-cat-file.adoc | 2 ++
> builtin/cat-file.c | 11 ++++++++++-
> t/t1006-cat-file.sh | 33 +++++++++++++++++++++++++++++++--
> 3 files changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
> index f7f57b7f538..bb32f715944 100644
> --- a/Documentation/git-cat-file.adoc
> +++ b/Documentation/git-cat-file.adoc
> @@ -86,6 +86,8 @@ OPTIONS
> Omit objects from the list of printed objects. This can only be used in
> combination with one of the batched modes. The '<filter-spec>' may be
> one of the following:
> ++
> +The form '--filter=blob:none' omits all blobs.
>
> --path=<path>::
> For use with `--textconv` or `--filters`, to allow specifying an object
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 940900d92ad..e783dbbad58 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -472,7 +472,8 @@ static void batch_object_write(const char *obj_name,
> if (!data->skip_object_info) {
> int ret;
>
> - if (use_mailmap)
> + if (use_mailmap ||
> + opt->objects_filter.choice == LOFC_BLOB_NONE)
> data->info.typep = &data->type;
>
I didn't understand why we need to do this, below we only check for
`data->type`. The only other place we use `data->info.typep` going
forward seems to be `print_object_or_die()`, but that flow is only
followed for `opt->batch_mode == BATCH_MODE_CONTENTS`. We already have
if (opt->batch_mode == BATCH_MODE_CONTENTS)
data.info.typep = &data.type;
in `batch_objects()` before this, shouldn't that cover this scenario
too? Maybe we can add a comment with the reasoning
[snip]
* Re: [PATCH v2 03/10] builtin/cat-file: support "blob:none" objects filter
2025-04-01 12:22 ` Karthik Nayak
@ 2025-04-01 12:31 ` Karthik Nayak
2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 1 reply; 72+ messages in thread
From: Karthik Nayak @ 2025-04-01 12:31 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Toon Claes, Taylor Blau, Junio C Hamano
Karthik Nayak <karthik.188@gmail.com> writes:
> Patrick Steinhardt <ps@pks.im> writes:
>
>> Implement support for the "blob:none" filter in git-cat-file(1), which
>> causes us to omit all blobs.
>>
>> Signed-off-by: Patrick Steinhardt <ps@pks.im>
>> ---
>> Documentation/git-cat-file.adoc | 2 ++
>> builtin/cat-file.c | 11 ++++++++++-
>> t/t1006-cat-file.sh | 33 +++++++++++++++++++++++++++++++--
>> 3 files changed, 43 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
>> index f7f57b7f538..bb32f715944 100644
>> --- a/Documentation/git-cat-file.adoc
>> +++ b/Documentation/git-cat-file.adoc
>> @@ -86,6 +86,8 @@ OPTIONS
>> Omit objects from the list of printed objects. This can only be used in
>> combination with one of the batched modes. The '<filter-spec>' may be
>> one of the following:
>> ++
>> +The form '--filter=blob:none' omits all blobs.
>>
>> --path=<path>::
>> For use with `--textconv` or `--filters`, to allow specifying an object
>> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
>> index 940900d92ad..e783dbbad58 100644
>> --- a/builtin/cat-file.c
>> +++ b/builtin/cat-file.c
>> @@ -472,7 +472,8 @@ static void batch_object_write(const char *obj_name,
>> if (!data->skip_object_info) {
>> int ret;
>>
>> - if (use_mailmap)
>> + if (use_mailmap ||
>> + opt->objects_filter.choice == LOFC_BLOB_NONE)
>> data->info.typep = &data->type;
>>
>
> I didn't understand why we need to do this, below we only check for
> `data->type`. The only other place we use `data->info.typep` going
> forward seems to be `print_object_or_die()`, but that flow is only
> followed for `opt->batch_mode == BATCH_MODE_CONTENTS`. We already have
>
> if (opt->batch_mode == BATCH_MODE_CONTENTS)
> data.info.typep = &data.type;
>
> in `batch_objects()` before this, shouldn't that cover this scenario
> too? Maybe we can add a comment with the reasoning
>
> [snip]
After playing around more, I understand now: we set the pointer
`data->info.typep` to point to `data->type`, so when the data is parsed
in `packed_object_info()` or `oid_object_info_extended()`, that
information would be set into `data->type`. So we can skip as needed.
All good here!
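
For illustration, the out-pointer pattern described above can be sketched
roughly as follows. This is a simplified mock, not Git's actual code: the
names `info_request`, `expand`, `mock_object_info()` and `skip_blob()` are
made up for this example and only stand in for `struct object_info`,
`struct expand_data` and the real lookup functions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for Git's types; names are illustrative only. */
enum obj_type { TYPE_UNKNOWN = 0, TYPE_COMMIT, TYPE_TREE, TYPE_BLOB, TYPE_TAG };

struct info_request {
	enum obj_type *typep;  /* NULL means "caller does not need the type" */
	unsigned long *sizep;  /* NULL means "caller does not need the size" */
};

struct expand {
	enum obj_type type;
	unsigned long size;
	struct info_request info;
};

/* Mock lookup: fills in only the fields whose pointers were set. */
static int mock_object_info(enum obj_type actual_type,
			    unsigned long actual_size,
			    struct info_request *req)
{
	if (req->typep)
		*req->typep = actual_type;
	if (req->sizep)
		*req->sizep = actual_size;
	return 0;
}

/* Returns 1 if a "blob:none"-style filter would skip this object. */
static int skip_blob(struct expand *data,
		     enum obj_type actual_type,
		     unsigned long actual_size)
{
	/* Request the type; without this, data->type is never written. */
	data->info.typep = &data->type;

	if (mock_object_info(actual_type, actual_size, &data->info) < 0)
		return 0;

	return data->type == TYPE_BLOB;
}
```

Calling `skip_blob()` on a zero-initialized `struct expand` leaves
`data->size` untouched because `sizep` was never set, which is the reason
the filter has to request the type explicitly.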
* Re: [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage
2025-04-01 9:51 ` Karthik Nayak
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-07 20:25 ` Junio C Hamano
0 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: Karthik Nayak; +Cc: git, Toon Claes, Taylor Blau, Junio C Hamano
On Tue, Apr 01, 2025 at 02:51:01AM -0700, Karthik Nayak wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > The usage strings for git-cat-file(1) that we pass to `parse_options()`
> > and `usage_msg_optf()` are stored in a variable called `usage`. This
> > variable shadows the declaration of `usage()`, which we'll want to use
> > in a subsequent commit.
> >
> > Rename the variable to `builtin_catfile_usage`, which is in line with
> > how the variable is typically called in other builtins.
> >
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> > builtin/cat-file.c | 32 ++++++++++++++++----------------
> > 1 file changed, 16 insertions(+), 16 deletions(-)
> >
> > diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> > index b13561cf73b..8e40016dd24 100644
> > --- a/builtin/cat-file.c
> > +++ b/builtin/cat-file.c
> > @@ -941,7 +941,7 @@ int cmd_cat_file(int argc,
> > int input_nul_terminated = 0;
> > int nul_terminated = 0;
> >
> > - const char * const usage[] = {
> > + const char * const builtin_catfile_usage[] = {
>
> Nit: Style: we use a right pointer alignment, while it is not part of
> your code change, would be nice to fix.
Not in this case though:
$ git grep 'const char \*const' | wc -l
85
$ git grep 'const char \* const' | wc -l
180
It's mixed, but we do have more cases of the latter.
Patrick
* Re: [PATCH v2 06/10] pack-bitmap: allow passing payloads to `show_reachable_fn()`
2025-04-01 12:17 ` Toon Claes
@ 2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Karthik Nayak, Taylor Blau, Junio C Hamano
On Tue, Apr 01, 2025 at 02:17:03PM +0200, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > diff --git a/pack-bitmap.c b/pack-bitmap.c
> > index 6f7fd94c36f..d192fb87da9 100644
> > --- a/pack-bitmap.c
> > +++ b/pack-bitmap.c
> > @@ -1663,7 +1663,8 @@ static void init_type_iterator(struct ewah_or_iterator *it,
> > static void show_objects_for_type(
> > struct bitmap_index *bitmap_git,
> > enum object_type object_type,
> > - show_reachable_fn show_reach)
> > + show_reachable_fn show_reach,
>
> What would you think about adding the `_fn` to `show_reach`? Because the
> function is passed on to `show_objects_for_type()`, I think it improves
> the readability if it's called `show_reach_fn` or something?
We don't have that suffix anywhere else where `show_reachable_fn` is
accepted. So for the sake of consistency I'd rather leave it this way.
Patrick
* Re: [PATCH v2 09/10] builtin/cat-file: deduplicate logic to iterate over all objects
2025-04-01 12:13 ` Toon Claes
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-03 18:24 ` Toon Claes
0 siblings, 1 reply; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Karthik Nayak, Taylor Blau, Junio C Hamano
On Tue, Apr 01, 2025 at 02:13:57PM +0200, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> > index 430320adfe9..6f5dbc821a2 100644
> > --- a/builtin/cat-file.c
> > +++ b/builtin/cat-file.c
> > @@ -803,6 +779,45 @@ static void batch_objects_command(struct batch_options *opt,
> >
> > #define DEFAULT_FORMAT "%(objectname) %(objecttype) %(objectsize)"
> >
> > +typedef int (*for_each_object_fn)(const struct object_id *oid, struct packed_git *pack,
> > + off_t offset, void *data);
> > +
> > +struct for_each_object_payload {
> > + for_each_object_fn callback;
> > + void *payload;
> > +};
> > +
> > +static int batch_one_object_loose(const struct object_id *oid,
> > + const char *path UNUSED,
> > + void *_payload)
> > +{
> > + struct for_each_object_payload *payload = _payload;
> > + return payload->callback(oid, NULL, 0, payload->payload);
> > +}
> > +
> > +static int batch_one_object_packed(const struct object_id *oid,
> > + struct packed_git *pack,
> > + uint32_t pos,
> > + void *_payload)
> > +{
> > + struct for_each_object_payload *payload = _payload;
> > + return payload->callback(oid, pack, nth_packed_object_offset(pack, pos),
> > + payload->payload);
> > +}
> > +
> > +static void batch_each_object(for_each_object_fn callback,
> > + unsigned flags,
> > + void *_payload)
>
> Why is this `_payload` typeless? I see it only getting passed in
> `struct object_cb_data`, is there a reason to hide this type? With
> payload being wrapped in payload I think it's beneficial to keep type
> info where possible.
Because the payload gets forwarded to the callback, and that callback
accepts arbitrary types. You can already see this now: we call the
function once with a `struct object_cb_data` pointer, and once with a
`struct oid_array` pointer.
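
As a rough sketch of why the payload stays a `void *`, assuming heavily
simplified types: object IDs are plain integers here, and `each_fn`,
`each_payload`, `each_object()`, `collect()` and `count()` are hypothetical
names for this example, not functions from the patch.

```c
#include <stddef.h>

/* Simplified callback type: the payload is opaque to the iterator. */
typedef int (*each_fn)(int oid, void *data);

struct each_payload {
	each_fn callback;
	void *payload;
};

/* Adapter in the shape of batch_one_object_loose(): unwrap and forward. */
static int forward_one(int oid, void *_payload)
{
	struct each_payload *payload = _payload;
	return payload->callback(oid, payload->payload);
}

/* Iterate over a fixed set of fake object IDs; stop on non-zero return. */
static void each_object(const int *oids, size_t nr,
			each_fn callback, void *_payload)
{
	struct each_payload payload = {
		.callback = callback,
		.payload = _payload,
	};

	for (size_t i = 0; i < nr; i++)
		if (forward_one(oids[i], &payload))
			break;
}

/* First kind of caller: collects IDs into an array. */
struct id_array {
	int ids[8];
	size_t nr;
};

static int collect(int oid, void *data)
{
	struct id_array *arr = data;
	if (arr->nr == 8)
		return 1; /* array is full, stop iterating */
	arr->ids[arr->nr++] = oid;
	return 0;
}

/* Second kind of caller: only counts what it sees. */
struct counter {
	size_t seen;
};

static int count(int oid, void *data)
{
	struct counter *c = data;
	(void)oid;
	c->seen++;
	return 0;
}
```

The same `each_object()` serves both `collect()` with a `struct id_array *`
and `count()` with a `struct counter *`, so the iterator itself cannot
commit to either type.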
Patrick
* Re: [PATCH v2 08/10] pack-bitmap: introduce function to check whether a pack is bitmapped
2025-04-01 11:46 ` Toon Claes
@ 2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Karthik Nayak, Taylor Blau, Junio C Hamano
On Tue, Apr 01, 2025 at 01:46:09PM +0200, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > Introduce a function that allows us to verify whether a pack is
> > bitmapped or not. This functionality will be used in a subsequent
> > commit.
> >
> > Helped-by: Taylor Blau <me@ttaylorr.com>
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> > pack-bitmap.c | 15 +++++++++++++++
> > pack-bitmap.h | 7 +++++++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/pack-bitmap.c b/pack-bitmap.c
> > index 6adb8aaa1c2..edc8f42122d 100644
> > --- a/pack-bitmap.c
> > +++ b/pack-bitmap.c
> > @@ -745,6 +745,21 @@ struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
> > return NULL;
> > }
> >
> > +int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack)
> > +{
> > + for (; bitmap; bitmap = bitmap->base) {
> > + if (bitmap_is_midx(bitmap)) {
> > + for (size_t i = 0; i < bitmap->midx->num_packs; i++)
> > + if (bitmap->midx->packs[i] == pack)
> > + return 1;
> > + } else if (bitmap->pack == pack) {
>
> Here, and two lines above, we compare packs by their pointer address,
> this doesn't seem to be common practice to me. Or is it in the Git
> codebase? Do we expect any problems with this, for example when we stop
> using `the_repository`?
I don't expect any problems unless we have multiple `struct repository`
instances pointing to the same underlying repository. We never do that
to the best of my knowledge though, and it would feel somewhat broken if
we ever started to do that.
If we had structs pointing to different repositories though this will do
the right thing as a packfile from repository A shouldn't be indexed by
repository B.
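
To make the identity comparison concrete, here is a small sketch under
simplified assumptions. `struct pack`, `struct index` and
`index_contains_pack()` are hypothetical stand-ins and do not model the
MIDX case separately:

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real structs carry far more state. */
struct pack {
	const char *name;
};

struct index {
	struct pack **packs;
	size_t num_packs;
	struct index *base; /* next layer in the chain, or NULL */
};

/*
 * Returns 1 if `pack` is covered by `idx` or any of its base layers.
 * Packs are compared by address, not by name or contents.
 */
static int index_contains_pack(const struct index *idx,
			       const struct pack *pack)
{
	for (; idx; idx = idx->base)
		for (size_t i = 0; i < idx->num_packs; i++)
			if (idx->packs[i] == pack)
				return 1;
	return 0;
}
```

Two `struct pack` instances with the same `name` are treated as different
packs here, which is only a problem if the same on-disk pack were ever
loaded twice.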
Patrick
* Re: [PATCH v2 03/10] builtin/cat-file: support "blob:none" objects filter
2025-04-01 12:31 ` Karthik Nayak
@ 2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: Karthik Nayak; +Cc: git, Toon Claes, Taylor Blau, Junio C Hamano
On Tue, Apr 01, 2025 at 05:31:24AM -0700, Karthik Nayak wrote:
> Karthik Nayak <karthik.188@gmail.com> writes:
> > Patrick Steinhardt <ps@pks.im> writes:
> >> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> >> index 940900d92ad..e783dbbad58 100644
> >> --- a/builtin/cat-file.c
> >> +++ b/builtin/cat-file.c
> >> @@ -472,7 +472,8 @@ static void batch_object_write(const char *obj_name,
> >> if (!data->skip_object_info) {
> >> int ret;
> >>
> >> - if (use_mailmap)
> >> + if (use_mailmap ||
> >> + opt->objects_filter.choice == LOFC_BLOB_NONE)
> >> data->info.typep = &data->type;
> >>
> >
> > I didn't understand why we need to do this, below we only check for
> > `data->type`. The only other place we use `data->info.typep` going
> > forward seems to be `print_object_or_die()`, but that flow is only
> > followed for `opt->batch_mode == BATCH_MODE_CONTENTS`. We already have
> >
> > if (opt->batch_mode == BATCH_MODE_CONTENTS)
> > data.info.typep = &data.type;
> >
> > in `batch_objects()` before this, shouldn't that cover this scenario
> > too? Maybe we can add a comment with the reasoning
> >
> > [snip]
>
> After playing around more, I understand now: we set the pointer
> `data->info.typep` to point to `data->type`, so when the data is parsed
> in `packed_object_info()` or `oid_object_info_extended()`, that
> information would be set into `data->type`. So we can skip as needed.
>
> All good here!
I've adapted the commit message to better explain this.
Patrick
* Re: [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects
2025-04-01 12:05 ` Karthik Nayak
@ 2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: Karthik Nayak; +Cc: git, Toon Claes, Taylor Blau, Junio C Hamano
On Tue, Apr 01, 2025 at 05:05:17AM -0700, Karthik Nayak wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > In batch mode, git-cat-file(1) enumerates all objects and prints them
> > by iterating through both loose and packed objects. This works without
>
> Nit: I assume you're referring to the `--batch-all-objects` mode. It
> would be nice to specify that here, perhaps?
It's not though, the filter works with all batch modes.
`--batch-all-objects` of course is the most likely usecase as it may not
make much sense to use a filter in other contexts. But they still work.
> > diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> > index 8e40016dd24..940900d92ad 100644
> > --- a/builtin/cat-file.c
> > +++ b/builtin/cat-file.c
> > @@ -15,6 +15,7 @@
> > #include "gettext.h"
> > #include "hex.h"
> > #include "ident.h"
> > +#include "list-objects-filter-options.h"
> > #include "parse-options.h"
> > #include "userdiff.h"
> > #include "streaming.h"
> > @@ -35,6 +36,7 @@ enum batch_mode {
> > };
> >
> > struct batch_options {
> > + struct list_objects_filter_options objects_filter;
> > int enabled;
> > int follow_symlinks;
> > enum batch_mode batch_mode;
> > @@ -487,6 +489,13 @@ static void batch_object_write(const char *obj_name,
> > return;
> > }
> >
> > + switch (opt->objects_filter.choice) {
> > + case LOFC_DISABLED:
> > + break;
> > + default:
> > + BUG("unsupported objects filter");
> > + }
> > +
>
> Okay here it seems like it also applies to other batch modes. So it
> would be nice to perhaps clarify how this works when not used with
> `--batch-all-objects`?
The filter works the same in all batch modes: if the filter says that an
object should be excluded, it's just not printed at all. So none of the
modes get treated specially, and the documentation already says that the
filters apply to all batched modes.
But this does raise an interesting question: when using `--batch` we
basically follow a request-response schema. So when the object gets
filtered, we'd skip the response altogether. That leaves open how the
script would learn about this in the first place, because it would
continue to wait for the output.
Maybe we should do something similar to what we do for missing objects
and also print excluded objects.
I'll implement this and update the documentation accordingly.
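
A minimal sketch of that idea, assuming an "excluded" marker modelled on
the existing "<object> missing" response. The marker text, the enum and the
helper `format_response()` are illustrative assumptions, not the actual
implementation:

```c
#include <stdio.h>

/* Hypothetical outcome of looking up one requested object. */
enum lookup_result {
	LOOKUP_FOUND,
	LOOKUP_MISSING,
	LOOKUP_EXCLUDED, /* object exists but the filter rejected it */
};

/*
 * Write exactly one response line per request into `buf`, so that a
 * client in request-response mode never waits for output that is not
 * coming. Returns the value of snprintf().
 */
static int format_response(char *buf, size_t len,
			   const char *obj_name,
			   enum lookup_result res)
{
	switch (res) {
	case LOOKUP_MISSING:
		return snprintf(buf, len, "%s missing\n", obj_name);
	case LOOKUP_EXCLUDED:
		/* Assumed marker, analogous to "missing". */
		return snprintf(buf, len, "%s excluded\n", obj_name);
	default:
		/* Placeholder; the real output depends on the format. */
		return snprintf(buf, len, "%s ok\n", obj_name);
	}
}
```

The point is only that every request produces a line, so a reader that
counts responses stays in sync with its requests.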
> > diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> > index 398865d6ebe..1246d3119f8 100755
> > --- a/t/t1006-cat-file.sh
> > +++ b/t/t1006-cat-file.sh
> > @@ -1353,4 +1353,36 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
> > perl -e "$script" -- --batch-command $hello_oid "$expect" "info "
> > '
> >
> > +test_expect_success 'setup for objects filter' '
> > + git init repo
> > +'
> > +
> > +test_expect_success 'objects filter with unknown option' '
> > + cat >expect <<-EOF &&
> > + fatal: invalid filter-spec ${SQ}unknown${SQ}
> > + EOF
> > + test_must_fail git -C repo cat-file --filter=unknown 2>err &&
> > + test_cmp expect err
> > +'
> > +
>
> Would it be also worthwhile to test the `--no-filter` option?
Yeah, let's.
Patrick
* Re: [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects
2025-04-01 11:45 ` Toon Claes
@ 2025-04-02 11:13 ` Patrick Steinhardt
0 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Karthik Nayak, Taylor Blau, Junio C Hamano
On Tue, Apr 01, 2025 at 01:45:46PM +0200, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> > index 8e40016dd24..940900d92ad 100644
> > --- a/builtin/cat-file.c
> > +++ b/builtin/cat-file.c
> > @@ -1000,6 +1013,8 @@ int cmd_cat_file(int argc,
> > N_("run filters on object's content"), 'w'),
> > OPT_STRING(0, "path", &force_path, N_("blob|tree"),
> > N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
> > + OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
> > + N_("object filtering"), opt_parse_list_objects_filter),
>
> Because we've decided on `--filter` we can use
> `OPT_PARSE_LIST_OBJECTS_FILTER` here now.
Ah, indeed, well-spotted.
Patrick
* [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode
2025-02-21 7:47 [PATCH 0/9] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (9 preceding siblings ...)
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 01/11] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
` (11 more replies)
10 siblings, 12 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Hi,
at GitLab, we sometimes have the need to list all objects regardless of
their reachability. We use git-cat-file(1) with `--batch-all-objects` to
do this, and typically this is quite a good fit. In some cases though,
we only want to list objects of a specific type, where we then basically
have the following pipeline:
git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
grep '^commit ' |
cut -d' ' -f2 |
git cat-file --batch
This works okayish in medium-sized repositories, but once you reach a
certain size this isn't really an option anymore. In the Chromium
repository for example [1] simply listing all objects in the first
invocation of git-cat-file(1) takes around 80 to 100 seconds. The
workload is completely I/O-bottlenecked: my machine reads at ~500MB/s,
and the packfile is 50GB in size, which matches the 100 seconds that I
observe.
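The back-of-the-envelope estimate above can be written down as a tiny sketch. The helper name `sketch_min_scan_seconds()` is made up for illustration and is not part of Git; it only computes the lower bound that sequential read throughput imposes on a full packfile scan:

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of Git: lower bound in seconds for
 * reading `bytes` sequentially at `bytes_per_sec`.
 */
static double sketch_min_scan_seconds(double bytes, double bytes_per_sec)
{
	assert(bytes_per_sec > 0);
	return bytes / bytes_per_sec;
}
```

With the numbers quoted above, `sketch_min_scan_seconds(50e9, 500e6)` evaluates to 100 seconds, which is why no amount of CPU-side optimization helps as long as the whole packfile has to be read.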
This series addresses the issue by introducing object filters into
git-cat-file(1). These object filters use the exact same syntax as the
filters we have in git-rev-list(1), but only a subset of them is
supported because not all filters can be computed by git-cat-file(1).
Supported are "blob:none", "blob:limit=" as well as "object:type=".
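To make the semantics of the three supported filters concrete, here is a minimal standalone sketch of the per-object decision. All names prefixed with `sketch_`/`SKETCH_` are invented for this example and are not Git's types; the conditions mirror what the patches in this series check (blobs are dropped by "blob:none", blobs whose size is at or above the limit are dropped by "blob:limit=", and everything that is not of the requested type is dropped by "object:type="):

```c
#include <assert.h>

/* Hypothetical stand-ins for Git's object types and filter choices. */
enum sketch_type {
	SKETCH_COMMIT,
	SKETCH_TREE,
	SKETCH_BLOB,
	SKETCH_TAG
};

enum sketch_filter {
	SKETCH_FILTER_DISABLED,
	SKETCH_FILTER_BLOB_NONE,
	SKETCH_FILTER_BLOB_LIMIT,
	SKETCH_FILTER_OBJECT_TYPE
};

struct sketch_filter_options {
	enum sketch_filter choice;
	unsigned long blob_limit;     /* only used by SKETCH_FILTER_BLOB_LIMIT */
	enum sketch_type object_type; /* only used by SKETCH_FILTER_OBJECT_TYPE */
};

/* Returns 1 if an object of the given type and size is filtered out. */
static int sketch_is_excluded(const struct sketch_filter_options *f,
			      enum sketch_type type, unsigned long size)
{
	assert(f);
	switch (f->choice) {
	case SKETCH_FILTER_DISABLED:
		return 0;
	case SKETCH_FILTER_BLOB_NONE:
		return type == SKETCH_BLOB;
	case SKETCH_FILTER_BLOB_LIMIT:
		return type == SKETCH_BLOB && size >= f->blob_limit;
	case SKETCH_FILTER_OBJECT_TYPE:
		return type != f->object_type;
	}
	return 0;
}
```

Note that all three decisions only need the object's type and size, which is what makes them computable without walking the object graph.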
The filters alone don't really help though: we still have to scan
through the whole packfile in order to compute the filters. While we
are able to shed a bit of CPU time because we can stop emitting some of
the objects, we're still I/O-bottlenecked.
The second part of the series thus expands the filters so that they can
make use of bitmap indices for some of the filters, if available. This
allows us to efficiently answer the question where to find all objects
of a specific type, and thus we can avoid scanning through the packfile
and instead directly look up relevant objects, leading to a significant
speedup:
Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
Time (mean ± σ): 86.444 s ± 4.081 s [User: 36.830 s, System: 11.312 s]
Range (min … max): 80.305 s … 93.104 s 10 runs
Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
Time (mean ± σ): 2.089 s ± 0.015 s [User: 1.872 s, System: 0.207 s]
Range (min … max): 2.073 s … 2.119 s 10 runs
Summary
cat-file with filter=object:type=commit (revision = HEAD) ran
41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)
We now directly scale with the number of objects of a specific type
contained in the packfile instead of scaling with the overall number of
objects. It's quite fun to see how the math plays out: if you sum up the
times for each of the types you arrive at the time for the unfiltered
case.
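The reason why the bitmap-based lookup scales with the number of matching objects can be illustrated with a simplified sketch. It assumes a plain, uncompressed bitmap in which bit `i` is set when object `i` has the requested type; Git's on-disk bitmaps are EWAH-compressed and are accessed through "pack-bitmap.c", so the function below (an invented name) is an illustration of the idea rather than the actual implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: walk an uncompressed bitmap of `nwords` 64-bit
 * words and store the positions of set bits in `out` (at most `out_cap`
 * entries). Returns the total number of set bits found.
 */
static size_t sketch_collect_set_bits(const uint64_t *words, size_t nwords,
				      size_t *out, size_t out_cap)
{
	size_t n = 0;

	assert(words && out);
	for (size_t w = 0; w < nwords; w++) {
		uint64_t word = words[w];

		/* Shift the word right until no set bits remain. */
		for (unsigned bit = 0; word; bit++, word >>= 1) {
			if (!(word & 1))
				continue;
			if (n < out_cap)
				out[n] = w * 64 + bit;
			n++;
		}
	}
	return n;
}
```

Each position found this way corresponds to one object of the wanted type, so only those objects need to be looked up in the packfile instead of reading all of it.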
Changes in v2:
- The series is now built on top of "master" at 683c54c999c (Git 2.49,
2025-03-14) with "tb/incremental-midx-part-2" at 27afc272c49 (midx:
implement writing incremental MIDX bitmaps, 2025-03-20) merged into
it.
- Rename the filter options to "--filter=" to match
git-pack-objects(1).
- The bitmap-filtering is now reusing existing mechanisms that we
already have in "pack-bitmap.c", as proposed by Taylor.
- Link to v1: https://lore.kernel.org/r/20250221-pks-cat-file-object-type-filter-v1-0-0852530888e2@pks.im
Changes in v3:
- Wrap some overly long lines.
- Better describe how filters interact with the different batch modes.
- Adapt the format with `--batch` and `--batch-check` so that we tell
the user that the object has been excluded.
- Add a test for "--no-filter".
- Use `OPT_PARSE_LIST_OBJECTS_FILTER()`.
- Link to v2: https://lore.kernel.org/r/20250327-pks-cat-file-object-type-filter-v2-0-4bbc7085d7c5@pks.im
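For consumers of `--batch` or `--batch-check` output, the new "excluded" status can be detected in the same way as "missing". The sketch below assumes the line has already been stripped of its trailing delimiter and that the chosen output format cannot itself produce a line ending in " excluded" (true for a format like "%(objectname)", but not guaranteed for arbitrary formats); the function name is invented for this example:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `line` has the form
 * "<object> excluded", 0 otherwise.
 */
static int sketch_line_is_excluded(const char *line)
{
	static const char suffix[] = " excluded";
	size_t suffix_len = sizeof(suffix) - 1;
	size_t len;

	assert(line);
	len = strlen(line);
	/* Require at least one character of object name before the suffix. */
	return len > suffix_len &&
	       !strcmp(line + len - suffix_len, suffix);
}
```

A caller reading responses one line at a time could use this to tell apart objects that were dropped by the filter from objects that do not exist at all.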
Thanks!
Patrick
[1]: https://github.com/chromium/chromium.git
---
Patrick Steinhardt (11):
builtin/cat-file: rename variable that tracks usage
builtin/cat-file: introduce function to report object status
builtin/cat-file: wire up an option to filter objects
builtin/cat-file: support "blob:none" objects filter
builtin/cat-file: support "blob:limit=" objects filter
builtin/cat-file: support "object:type=" objects filter
pack-bitmap: allow passing payloads to `show_reachable_fn()`
pack-bitmap: add function to iterate over filtered bitmapped objects
pack-bitmap: introduce function to check whether a pack is bitmapped
builtin/cat-file: deduplicate logic to iterate over all objects
builtin/cat-file: use bitmaps to efficiently filter by object type
Documentation/git-cat-file.adoc | 26 ++++
builtin/cat-file.c | 256 +++++++++++++++++++++++++++++-----------
builtin/pack-objects.c | 3 +-
builtin/rev-list.c | 3 +-
pack-bitmap.c | 81 +++++++++++--
pack-bitmap.h | 22 +++-
reachable.c | 3 +-
t/t1006-cat-file.sh | 99 ++++++++++++++++
8 files changed, 411 insertions(+), 82 deletions(-)
Range-diff versus v2:
1: a75888e0bf4 ! 1: b0642b6c495 builtin/cat-file: rename variable that tracks usage
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
;
else if (batch.follow_symlinks)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
-+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
- "--follow-symlinks");
+- "--follow-symlinks");
++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
++ options, "--follow-symlinks");
else if (batch.buffer_output >= 0)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
-+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
- "--buffer");
+- "--buffer");
++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
++ options, "--buffer");
else if (batch.all_objects)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
-+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
- "--batch-all-objects");
+- "--batch-all-objects");
++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
++ options, "--batch-all-objects");
else if (input_nul_terminated)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
-+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
- "-z");
+- "-z");
++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
++ options, "-z");
else if (nul_terminated)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
-+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
- "-Z");
+- "-Z");
++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
++ options, "-Z");
batch.input_delim = batch.output_delim = '\n';
+ if (input_nul_terminated)
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
batch.transform_mode = opt;
else if (opt && opt != 'b')
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
+ builtin_catfile_usage, options, opt);
else if (argc)
- usage_msg_opt(_("batch modes take no arguments"), usage,
-+ usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
- options);
+- options);
++ usage_msg_opt(_("batch modes take no arguments"),
++ builtin_catfile_usage, options);
return batch_objects(&batch);
+ }
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
if (opt) {
if (!argc && opt == 'c')
usage_msg_optf(_("<rev> required with '%s'"),
- usage, options, "--textconv");
-+ builtin_catfile_usage, options, "--textconv");
++ builtin_catfile_usage, options,
++ "--textconv");
else if (!argc && opt == 'w')
usage_msg_optf(_("<rev> required with '%s'"),
- usage, options, "--filters");
-+ builtin_catfile_usage, options, "--filters");
++ builtin_catfile_usage, options,
++ "--filters");
else if (!argc && opt_epts)
usage_msg_optf(_("<object> required with '-%c'"),
- usage, options, opt);
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
obj_name = argv[0];
else
- usage_msg_opt(_("too many arguments"), usage, options);
-+ usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options);
++ usage_msg_opt(_("too many arguments"), builtin_catfile_usage,
++ options);
} else if (!argc) {
- usage_with_options(usage, options);
+ usage_with_options(builtin_catfile_usage, options);
-: ----------- > 2: 18353ba706d builtin/cat-file: introduce function to report object status
2: bee9407c1a9 ! 3: 1e46af5d07b builtin/cat-file: wire up an option to filter objects
@@ Documentation/git-cat-file.adoc: OPTIONS
+--filter=<filter-spec>::
+--no-filter::
+ Omit objects from the list of printed objects. This can only be used in
-+ combination with one of the batched modes. The '<filter-spec>' may be
-+ one of the following:
++ combination with one of the batched modes. Excluded objects that have
++ been explicitly requested via any of the batch modes that read objects
++ via standard input (`--batch`, `--batch-check`) will be reported as
++ "filtered". Excluded objects in `--batch-all-objects` mode will not be
++ printed at all. No filters are supported yet.
+
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
name and a path separately, e.g. when it is difficult to figure out
+@@ Documentation/git-cat-file.adoc: the repository, then `cat-file` will ignore any custom format and print:
+ <object> SP missing LF
+ ------------
+
++If a name is specified on stdin that is filtered out via `--filter=`,
++then `cat-file` will ignore any custom format and print:
++
++------------
++<object> SP excluded LF
++------------
++
+ If a name is specified that might refer to more than one object (an ambiguous short sha), then `cat-file` will ignore any custom format and print:
+
+ ------------
## builtin/cat-file.c ##
@@
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
N_("run filters on object's content"), 'w'),
OPT_STRING(0, "path", &force_path, N_("blob|tree"),
N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
-+ OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
-+ N_("object filtering"), opt_parse_list_objects_filter),
++ OPT_PARSE_LIST_OBJECTS_FILTER(&batch.objects_filter),
OPT_END()
};
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
if (opt == 'b')
batch.all_objects = 1;
@@ builtin/cat-file.c: int cmd_cat_file(int argc,
- usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
- options);
+ usage_msg_opt(_("batch modes take no arguments"),
+ builtin_catfile_usage, options);
- return batch_objects(&batch);
+ ret = batch_objects(&batch);
@@ t/t1006-cat-file.sh: test_expect_success PERL '--batch-command info is unbuffere
+ test_cmp expect err
+ '
+done
++
++test_expect_success 'objects filter: disabled' '
++ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --no-filter >actual &&
++ sort actual >actual.sorted &&
++ git -C repo rev-list --objects --no-object-names --all >expect &&
++ sort expect >expect.sorted &&
++ test_cmp expect.sorted actual.sorted
++'
+
test_done
3: ec1d0c63de6 ! 4: 878ae8e2a76 builtin/cat-file: support "blob:none" objects filter
@@ Commit message
Implement support for the "blob:none" filter in git-cat-file(1), which
causes us to omit all blobs.
+ Note that this new filter requires us to read the object type via
+ `oid_object_info_extended()` in `batch_object_write()`. But as we try to
+ optimize away reading objects from the database the `data->info.typep`
+ pointer may not be set. We thus have to adapt the logic to conditionally
+ set the pointer in cases where the filter is given.
+
Signed-off-by: Patrick Steinhardt <ps@pks.im>
## Documentation/git-cat-file.adoc ##
@@ Documentation/git-cat-file.adoc: OPTIONS
- Omit objects from the list of printed objects. This can only be used in
- combination with one of the batched modes. The '<filter-spec>' may be
- one of the following:
+ been explicitly requested via any of the batch modes that read objects
+ via standard input (`--batch`, `--batch-check`) will be reported as
+ "filtered". Excluded objects in `--batch-all-objects` mode will not be
+- printed at all. No filters are supported yet.
++ printed at all. The '<filter-spec>' may be one of the following:
++
+The form '--filter=blob:none' omits all blobs.
@@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
case LOFC_DISABLED:
break;
+ case LOFC_BLOB_NONE:
-+ if (data->type == OBJ_BLOB)
++ if (data->type == OBJ_BLOB) {
++ if (!opt->all_objects)
++ report_object_status(opt, obj_name,
++ &data->oid, "excluded");
+ return;
++ }
+ break;
default:
BUG("unsupported objects filter");
@@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
-@@ t/t1006-cat-file.sh: do
- '
- done
+@@ t/t1006-cat-file.sh: test_expect_success 'objects filter: disabled' '
+ test_cmp expect.sorted actual.sorted
+ '
+test_objects_filter () {
+ filter="$1"
@@ t/t1006-cat-file.sh: do
+ sort expect >expect.sorted &&
+ test_cmp expect.sorted actual.sorted
+ '
++
++ test_expect_success "objects filter prints excluded objects: $filter" '
++ # Find all objects that would be excluded by the current filter.
++ git -C repo rev-list --objects --no-object-names --all >all &&
++ git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >filtered &&
++ sort all >all.sorted &&
++ sort filtered >filtered.sorted &&
++ comm -23 all.sorted filtered.sorted >expected.excluded &&
++ test_line_count -gt 0 expected.excluded &&
++
++ git -C repo cat-file --batch-check="%(objectname)" --filter="$filter" <expected.excluded >actual &&
++ awk "/excluded/{ print \$1 }" actual | sort >actual.excluded &&
++ test_cmp expected.excluded actual.excluded
++ '
+}
+
+test_objects_filter "blob:none"
4: a3ed054994d ! 5: a88d5d4b60a builtin/cat-file: support "blob:limit=" objects filter
@@ Commit message
## Documentation/git-cat-file.adoc ##
@@ Documentation/git-cat-file.adoc: OPTIONS
- one of the following:
+ printed at all. The '<filter-spec>' may be one of the following:
+
The form '--filter=blob:none' omits all blobs.
++
@@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
if (pack)
ret = packed_object_info(the_repository, pack, offset,
@@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
- if (data->type == OBJ_BLOB)
return;
+ }
break;
+ case LOFC_BLOB_LIMIT:
+ if (data->type == OBJ_BLOB &&
-+ data->size >= opt->objects_filter.blob_limit_value)
++ data->size >= opt->objects_filter.blob_limit_value) {
++ if (!opt->all_objects)
++ report_object_status(opt, obj_name,
++ &data->oid, "excluded");
+ return;
++ }
+ break;
default:
BUG("unsupported objects filter");
@@ t/t1006-cat-file.sh: test_objects_filter () {
+test_objects_filter "blob:limit=1"
+test_objects_filter "blob:limit=500"
+test_objects_filter "blob:limit=1000"
-+test_objects_filter "blob:limit=1g"
++test_objects_filter "blob:limit=1k"
test_done
5: 8e39cd218c2 ! 6: 13be54300c9 builtin/cat-file: support "object:type=" objects filter
@@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
data->info.sizep = &data->size;
@@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
- data->size >= opt->objects_filter.blob_limit_value)
return;
+ }
break;
+ case LOFC_OBJECT_TYPE:
-+ if (data->type != opt->objects_filter.object_type)
++ if (data->type != opt->objects_filter.object_type) {
++ if (!opt->all_objects)
++ report_object_status(opt, obj_name,
++ &data->oid, "excluded");
+ return;
++ }
+ break;
default:
BUG("unsupported objects filter");
@@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
@@ t/t1006-cat-file.sh: test_objects_filter "blob:limit=1"
test_objects_filter "blob:limit=500"
test_objects_filter "blob:limit=1000"
- test_objects_filter "blob:limit=1g"
+ test_objects_filter "blob:limit=1k"
+test_objects_filter "object:type=blob"
+test_objects_filter "object:type=commit"
+test_objects_filter "object:type=tag"
6: a0655de3ace = 7: d525a5bc2ef pack-bitmap: allow passing payloads to `show_reachable_fn()`
7: e1e44303dac = 8: e3cc1ae3a87 pack-bitmap: add function to iterate over filtered bitmapped objects
8: 23bc040bb15 = 9: c0fc0e4ce0c pack-bitmap: introduce function to check whether a pack is bitmapped
9: 4eba2a70619 = 10: 28ef93dceec builtin/cat-file: deduplicate logic to iterate over all objects
10: d40f1924ef5 = 11: 842a6002c50 builtin/cat-file: use bitmaps to efficiently filter by object type
---
base-commit: 003c5f45b8447877015b2a23ceab2297638fe1f1
change-id: 20250220-pks-cat-file-object-type-filter-9140c0ed5ee1
* [PATCH v3 01/11] builtin/cat-file: rename variable that tracks usage
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 02/11] builtin/cat-file: introduce function to report object status Patrick Steinhardt
` (10 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
The usage strings for git-cat-file(1) that we pass to `parse_options()`
and `usage_msg_optf()` are stored in a variable called `usage`. This
variable shadows the declaration of `usage()`, which we'll want to use
in a subsequent commit.
Rename the variable to `builtin_catfile_usage`, which is in line with
how the variable is typically named in other builtins.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/cat-file.c | 47 +++++++++++++++++++++++++----------------------
1 file changed, 25 insertions(+), 22 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b13561cf73b..b158b3acef9 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -941,7 +941,7 @@ int cmd_cat_file(int argc,
int input_nul_terminated = 0;
int nul_terminated = 0;
- const char * const usage[] = {
+ const char * const builtin_catfile_usage[] = {
N_("git cat-file <type> <object>"),
N_("git cat-file (-e | -p) <object>"),
N_("git cat-file (-t | -s) [--allow-unknown-type] <object>"),
@@ -1007,7 +1007,7 @@ int cmd_cat_file(int argc,
batch.buffer_output = -1;
- argc = parse_options(argc, argv, prefix, options, usage, 0);
+ argc = parse_options(argc, argv, prefix, options, builtin_catfile_usage, 0);
opt_cw = (opt == 'c' || opt == 'w');
opt_epts = (opt == 'e' || opt == 'p' || opt == 't' || opt == 's');
@@ -1021,7 +1021,7 @@ int cmd_cat_file(int argc,
/* Option compatibility */
if (force_path && !opt_cw)
usage_msg_optf(_("'%s=<%s>' needs '%s' or '%s'"),
- usage, options,
+ builtin_catfile_usage, options,
"--path", _("path|tree-ish"), "--filters",
"--textconv");
@@ -1029,20 +1029,20 @@ int cmd_cat_file(int argc,
if (batch.enabled)
;
else if (batch.follow_symlinks)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
- "--follow-symlinks");
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
+ options, "--follow-symlinks");
else if (batch.buffer_output >= 0)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
- "--buffer");
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
+ options, "--buffer");
else if (batch.all_objects)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
- "--batch-all-objects");
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
+ options, "--batch-all-objects");
else if (input_nul_terminated)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
- "-z");
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
+ options, "-z");
else if (nul_terminated)
- usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
- "-Z");
+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
+ options, "-Z");
batch.input_delim = batch.output_delim = '\n';
if (input_nul_terminated)
@@ -1063,10 +1063,10 @@ int cmd_cat_file(int argc,
batch.transform_mode = opt;
else if (opt && opt != 'b')
usage_msg_optf(_("'-%c' is incompatible with batch mode"),
- usage, options, opt);
+ builtin_catfile_usage, options, opt);
else if (argc)
- usage_msg_opt(_("batch modes take no arguments"), usage,
- options);
+ usage_msg_opt(_("batch modes take no arguments"),
+ builtin_catfile_usage, options);
return batch_objects(&batch);
}
@@ -1074,22 +1074,25 @@ int cmd_cat_file(int argc,
if (opt) {
if (!argc && opt == 'c')
usage_msg_optf(_("<rev> required with '%s'"),
- usage, options, "--textconv");
+ builtin_catfile_usage, options,
+ "--textconv");
else if (!argc && opt == 'w')
usage_msg_optf(_("<rev> required with '%s'"),
- usage, options, "--filters");
+ builtin_catfile_usage, options,
+ "--filters");
else if (!argc && opt_epts)
usage_msg_optf(_("<object> required with '-%c'"),
- usage, options, opt);
+ builtin_catfile_usage, options, opt);
else if (argc == 1)
obj_name = argv[0];
else
- usage_msg_opt(_("too many arguments"), usage, options);
+ usage_msg_opt(_("too many arguments"), builtin_catfile_usage,
+ options);
} else if (!argc) {
- usage_with_options(usage, options);
+ usage_with_options(builtin_catfile_usage, options);
} else if (argc != 2) {
usage_msg_optf(_("only two arguments allowed in <type> <object> mode, not %d"),
- usage, options, argc);
+ builtin_catfile_usage, options, argc);
} else if (argc) {
exp_type = argv[0];
obj_name = argv[1];
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 02/11] builtin/cat-file: introduce function to report object status
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 01/11] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 03/11] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
` (9 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
We have multiple callsites that report the status of an object, for
example when the object is missing or its name is ambiguous. We're about
to add a couple more such callsites to report on "excluded" objects.
Prepare for this by introducing a new function `report_object_status()`
that encapsulates the functionality.
Note that this function also flushes stdout, which is a requirement so
that request-response style batched modes can learn about the status
before proceeding to the next object. We already flush correctly at all
existing callsites, even though the flush in `batch_one_object()` only
comes after the switch statement. That flush is now redundant, and we
could in theory deduplicate it by moving it into all branches that don't
use `report_object_status()`. But that doesn't quite feel sensible:
- The duplicate flush should ultimately just be a no-op for us and
thus shouldn't impact performance significantly.
- By keeping the flush in `report_object_status()` we ensure that all
future callers get the semantics right.
So let's just be pragmatic and live with the duplicated flush.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/cat-file.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b158b3acef9..1261a3ce352 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -455,6 +455,16 @@ static void print_default_format(struct strbuf *scratch, struct expand_data *dat
(uintmax_t)data->size, opt->output_delim);
}
+static void report_object_status(struct batch_options *opt,
+ const char *obj_name,
+ const struct object_id *oid,
+ const char *status)
+{
+ printf("%s %s%c", obj_name ? obj_name : oid_to_hex(oid),
+ status, opt->output_delim);
+ fflush(stdout);
+}
+
/*
* If "pack" is non-NULL, then "offset" is the byte offset within the pack from
* which the object may be accessed (though note that we may also rely on
@@ -481,9 +491,7 @@ static void batch_object_write(const char *obj_name,
&data->oid, &data->info,
OBJECT_INFO_LOOKUP_REPLACE);
if (ret < 0) {
- printf("%s missing%c",
- obj_name ? obj_name : oid_to_hex(&data->oid), opt->output_delim);
- fflush(stdout);
+ report_object_status(opt, obj_name, &data->oid, "missing");
return;
}
@@ -535,10 +543,10 @@ static void batch_one_object(const char *obj_name,
if (result != FOUND) {
switch (result) {
case MISSING_OBJECT:
- printf("%s missing%c", obj_name, opt->output_delim);
+ report_object_status(opt, obj_name, &data->oid, "missing");
break;
case SHORT_NAME_AMBIGUOUS:
- printf("%s ambiguous%c", obj_name, opt->output_delim);
+ report_object_status(opt, obj_name, &data->oid, "ambiguous");
break;
case DANGLING_SYMLINK:
printf("dangling %"PRIuMAX"%c%s%c",
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 03/11] builtin/cat-file: wire up an option to filter objects
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 01/11] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 02/11] builtin/cat-file: introduce function to report object status Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 04/11] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
` (8 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
In batch mode, git-cat-file(1) enumerates all objects and prints them
by iterating through both loose and packed objects. This works without
considering their reachability at all, and consequently most options to
filter objects as they exist in e.g. git-rev-list(1) are not applicable.
In some situations it may still be useful though to filter objects based
on properties that are inherent to them. This includes the object size
as well as its type.
Such a filter already exists in git-rev-list(1) with the `--filter=`
command line option. While this option supports a couple of filters that
are not applicable to our use case, some of them are quite a neat fit.
Wire up the filter as an option for git-cat-file(1). This allows us to
reuse the same syntax as in git-rev-list(1) so that we don't have to
reinvent the wheel. For now, we die when any of the filter options has
been passed by the user, but they will be wired up in subsequent
commits.
Further note that the filters that we are about to introduce don't
significantly speed up the runtime of git-cat-file(1). While we can skip
emitting a lot of objects in case they are uninteresting to us, the
majority of time is spent reading the packfile, which is bottlenecked by
I/O and not the processor. This will change though once we start to make
use of bitmaps, which will allow us to skip reading the whole packfile.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 16 ++++++++++++++++
builtin/cat-file.c | 36 ++++++++++++++++++++++++++++++++----
t/t1006-cat-file.sh | 40 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 88 insertions(+), 4 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index d5890ae3686..da92eed1170 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -81,6 +81,15 @@ OPTIONS
end-of-line conversion, etc). In this case, `<object>` has to be of
the form `<tree-ish>:<path>`, or `:<path>`.
+--filter=<filter-spec>::
+--no-filter::
+ Omit objects from the list of printed objects. This can only be used in
+ combination with one of the batched modes. Excluded objects that have
+ been explicitly requested via any of the batch modes that read objects
+ via standard input (`--batch`, `--batch-check`) will be reported as
+ "filtered". Excluded objects in `--batch-all-objects` mode will not be
+ printed at all. No filters are supported yet.
+
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
name and a path separately, e.g. when it is difficult to figure out
@@ -340,6 +349,13 @@ the repository, then `cat-file` will ignore any custom format and print:
<object> SP missing LF
------------
+If a name is specified on stdin that is filtered out via `--filter=`,
+then `cat-file` will ignore any custom format and print:
+
+------------
+<object> SP excluded LF
+------------
+
If a name is specified that might refer to more than one object (an ambiguous short sha), then `cat-file` will ignore any custom format and print:
------------
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 1261a3ce352..0e2176c4491 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -15,6 +15,7 @@
#include "gettext.h"
#include "hex.h"
#include "ident.h"
+#include "list-objects-filter-options.h"
#include "parse-options.h"
#include "userdiff.h"
#include "streaming.h"
@@ -35,6 +36,7 @@ enum batch_mode {
};
struct batch_options {
+ struct list_objects_filter_options objects_filter;
int enabled;
int follow_symlinks;
enum batch_mode batch_mode;
@@ -495,6 +497,13 @@ static void batch_object_write(const char *obj_name,
return;
}
+ switch (opt->objects_filter.choice) {
+ case LOFC_DISABLED:
+ break;
+ default:
+ BUG("unsupported objects filter");
+ }
+
if (use_mailmap && (data->type == OBJ_COMMIT || data->type == OBJ_TAG)) {
size_t s = data->size;
char *buf = NULL;
@@ -820,7 +829,8 @@ static int batch_objects(struct batch_options *opt)
struct object_cb_data cb;
struct object_info empty = OBJECT_INFO_INIT;
- if (!memcmp(&data.info, &empty, sizeof(empty)))
+ if (!memcmp(&data.info, &empty, sizeof(empty)) &&
+ opt->objects_filter.choice == LOFC_DISABLED)
data.skip_object_info = 1;
if (repo_has_promisor_remote(the_repository))
@@ -944,10 +954,13 @@ int cmd_cat_file(int argc,
int opt_cw = 0;
int opt_epts = 0;
const char *exp_type = NULL, *obj_name = NULL;
- struct batch_options batch = {0};
+ struct batch_options batch = {
+ .objects_filter = LIST_OBJECTS_FILTER_INIT,
+ };
int unknown_type = 0;
int input_nul_terminated = 0;
int nul_terminated = 0;
+ int ret;
const char * const builtin_catfile_usage[] = {
N_("git cat-file <type> <object>"),
@@ -1008,6 +1021,7 @@ int cmd_cat_file(int argc,
N_("run filters on object's content"), 'w'),
OPT_STRING(0, "path", &force_path, N_("blob|tree"),
N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
+ OPT_PARSE_LIST_OBJECTS_FILTER(&batch.objects_filter),
OPT_END()
};
@@ -1022,6 +1036,14 @@ int cmd_cat_file(int argc,
if (use_mailmap)
read_mailmap(&mailmap);
+ switch (batch.objects_filter.choice) {
+ case LOFC_DISABLED:
+ break;
+ default:
+ usagef(_("objects filter not supported: '%s'"),
+ list_object_filter_config_name(batch.objects_filter.choice));
+ }
+
/* --batch-all-objects? */
if (opt == 'b')
batch.all_objects = 1;
@@ -1076,7 +1098,8 @@ int cmd_cat_file(int argc,
usage_msg_opt(_("batch modes take no arguments"),
builtin_catfile_usage, options);
- return batch_objects(&batch);
+ ret = batch_objects(&batch);
+ goto out;
}
if (opt) {
@@ -1108,5 +1131,10 @@ int cmd_cat_file(int argc,
if (unknown_type && opt != 't' && opt != 's')
die("git cat-file --allow-unknown-type: use with -s or -t");
- return cat_one_file(opt, exp_type, obj_name, unknown_type);
+
+ ret = cat_one_file(opt, exp_type, obj_name, unknown_type);
+
+out:
+ list_objects_filter_release(&batch.objects_filter);
+ return ret;
}
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 398865d6ebe..9ce4eda6e68 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1353,4 +1353,44 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
perl -e "$script" -- --batch-command $hello_oid "$expect" "info "
'
+test_expect_success 'setup for objects filter' '
+ git init repo
+'
+
+test_expect_success 'objects filter with unknown option' '
+ cat >expect <<-EOF &&
+ fatal: invalid filter-spec ${SQ}unknown${SQ}
+ EOF
+ test_must_fail git -C repo cat-file --filter=unknown 2>err &&
+ test_cmp expect err
+'
+
+for option in blob:none blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+do
+ test_expect_success "objects filter with unsupported option $option" '
+ case "$option" in
+ tree:1)
+ echo "usage: objects filter not supported: ${SQ}tree${SQ}" >expect
+ ;;
+ sparse:path=x)
+ echo "fatal: sparse:path filters support has been dropped" >expect
+ ;;
+ *)
+ option_name=$(echo "$option" | cut -d= -f1) &&
+ printf "usage: objects filter not supported: ${SQ}%s${SQ}\n" "$option_name" >expect
+ ;;
+ esac &&
+ test_must_fail git -C repo cat-file --filter=$option 2>err &&
+ test_cmp expect err
+ '
+done
+
+test_expect_success 'objects filter: disabled' '
+ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --no-filter >actual &&
+ sort actual >actual.sorted &&
+ git -C repo rev-list --objects --no-object-names --all >expect &&
+ sort expect >expect.sorted &&
+ test_cmp expect.sorted actual.sorted
+'
+
test_done
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 04/11] builtin/cat-file: support "blob:none" objects filter
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (2 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 03/11] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 05/11] builtin/cat-file: support "blob:limit=" " Patrick Steinhardt
` (7 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Implement support for the "blob:none" filter in git-cat-file(1), which
causes us to omit all blobs.
Note that this new filter requires us to read the object type via
`oid_object_info_extended()` in `batch_object_write()`. But as we try to
optimize away reading objects from the database the `data->info.typep`
pointer may not be set. We thus have to adapt the logic to conditionally
set the pointer in cases where the filter is given.
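The per-object decision described above can be sketched in isolation.
This is a minimal standalone illustration, not the code from this patch:
`needs_type()`, `apply_filter()` and the enums are hypothetical stand-ins
for the logic in `batch_object_write()` and for Git's internal types.

```c
#include <assert.h>

/* Hypothetical stand-ins; the real types live in Git's object layer. */
enum obj_type { TYPE_COMMIT, TYPE_TREE, TYPE_BLOB, TYPE_TAG };
enum filter_choice { FILTER_DISABLED, FILTER_BLOB_NONE };
enum verdict { VERDICT_PRINT, VERDICT_REPORT_EXCLUDED, VERDICT_SKIP };

/* The type only needs to be looked up when something will consume it. */
int needs_type(enum filter_choice filter, int use_mailmap)
{
	return use_mailmap || filter == FILTER_BLOB_NONE;
}

/*
 * Decide what to do with one object. Explicitly requested objects get an
 * "excluded" status line; objects found by enumerating everything are
 * dropped silently.
 */
enum verdict apply_filter(enum filter_choice filter, enum obj_type type,
			  int all_objects)
{
	/* Mirrors the idea of rejecting filters that are not wired up. */
	assert(filter == FILTER_DISABLED || filter == FILTER_BLOB_NONE);

	if (filter == FILTER_BLOB_NONE && type == TYPE_BLOB)
		return all_objects ? VERDICT_SKIP : VERDICT_REPORT_EXCLUDED;
	return VERDICT_PRINT;
}
```

The point of `needs_type()` is that the cheap path, where no object info
is requested at all, stays available whenever no filter is in use.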
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 4 +++-
builtin/cat-file.c | 15 ++++++++++++-
t/t1006-cat-file.sh | 47 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 62 insertions(+), 4 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index da92eed1170..afcdb0a4738 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -88,7 +88,9 @@ OPTIONS
been explicitly requested via any of the batch modes that read objects
via standard input (`--batch`, `--batch-check`) will be reported as
"filtered". Excluded objects in `--batch-all-objects` mode will not be
- printed at all. No filters are supported yet.
+ printed at all. The '<filter-spec>' may be one of the following:
++
+The form '--filter=blob:none' omits all blobs.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 0e2176c4491..bcceb646f85 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -482,7 +482,8 @@ static void batch_object_write(const char *obj_name,
if (!data->skip_object_info) {
int ret;
- if (use_mailmap)
+ if (use_mailmap ||
+ opt->objects_filter.choice == LOFC_BLOB_NONE)
data->info.typep = &data->type;
if (pack)
@@ -500,6 +501,14 @@ static void batch_object_write(const char *obj_name,
switch (opt->objects_filter.choice) {
case LOFC_DISABLED:
break;
+ case LOFC_BLOB_NONE:
+ if (data->type == OBJ_BLOB) {
+ if (!opt->all_objects)
+ report_object_status(opt, obj_name,
+ &data->oid, "excluded");
+ return;
+ }
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1039,6 +1048,10 @@ int cmd_cat_file(int argc,
switch (batch.objects_filter.choice) {
case LOFC_DISABLED:
break;
+ case LOFC_BLOB_NONE:
+ if (!batch.enabled)
+ usage(_("objects filter only supported in batch mode"));
+ break;
default:
usagef(_("objects filter not supported: '%s'"),
list_object_filter_config_name(batch.objects_filter.choice));
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 9ce4eda6e68..7404c135b1e 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1354,7 +1354,22 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
'
test_expect_success 'setup for objects filter' '
- git init repo
+ git init repo &&
+ (
+ # Seed the repository with three different sets of objects:
+ #
+ # - The first set is fully packed and has a bitmap.
+ # - The second set is packed, but has no bitmap.
+ # - The third set is loose.
+ #
+ # This ensures that we cover all these types as expected.
+ cd repo &&
+ test_commit first &&
+ git repack -Adb &&
+ test_commit second &&
+ git repack -d &&
+ test_commit third
+ )
'
test_expect_success 'objects filter with unknown option' '
@@ -1365,7 +1380,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in blob:none blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1393,4 +1408,32 @@ test_expect_success 'objects filter: disabled' '
test_cmp expect.sorted actual.sorted
'
+test_objects_filter () {
+ filter="$1"
+
+ test_expect_success "objects filter: $filter" '
+ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --filter="$filter" >actual &&
+ sort actual >actual.sorted &&
+ git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >expect &&
+ sort expect >expect.sorted &&
+ test_cmp expect.sorted actual.sorted
+ '
+
+ test_expect_success "objects filter prints excluded objects: $filter" '
+ # Find all objects that would be excluded by the current filter.
+ git -C repo rev-list --objects --no-object-names --all >all &&
+ git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >filtered &&
+ sort all >all.sorted &&
+ sort filtered >filtered.sorted &&
+ comm -23 all.sorted filtered.sorted >expected.excluded &&
+ test_line_count -gt 0 expected.excluded &&
+
+ git -C repo cat-file --batch-check="%(objectname)" --filter="$filter" <expected.excluded >actual &&
+ awk "/excluded/{ print \$1 }" actual | sort >actual.excluded &&
+ test_cmp expected.excluded actual.excluded
+ '
+}
+
+test_objects_filter "blob:none"
+
test_done
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 05/11] builtin/cat-file: support "blob:limit=" objects filter
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (3 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 04/11] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 06/11] builtin/cat-file: support "object:type=" " Patrick Steinhardt
` (6 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Implement support for the "blob:limit=" filter in git-cat-file(1), which
causes us to omit all blobs whose size is at least the given limit.
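The boundary semantics are easy to get wrong, so here is a small
standalone sketch of them. Both functions are hypothetical helpers
written for illustration; Git uses its own option parser, and this
sketch does not guard against overflow of very large values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "<n>[kmg]" into a byte count. Hypothetical helper, not Git's
 * parser. Returns 0 on success and -1 on malformed input.
 */
int parse_blob_limit(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long n;

	assert(s && out);
	if (*s < '0' || *s > '9')
		return -1;
	n = strtoull(s, &end, 10);
	switch (*end) {
	case 'k': n <<= 10; end++; break;
	case 'm': n <<= 20; end++; break;
	case 'g': n <<= 30; end++; break;
	}
	if (*end)
		return -1; /* trailing garbage after the number or suffix */
	*out = n;
	return 0;
}

/* Non-blobs are never excluded; blobs are excluded at size >= limit. */
int blob_limit_excludes(int is_blob, uint64_t size, uint64_t limit)
{
	return is_blob && size >= limit;
}
```

With these semantics a limit of zero excludes every blob, and a blob of
exactly 1000 bytes is excluded by "blob:limit=1000" but kept by
"blob:limit=1k".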
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 5 +++++
builtin/cat-file.c | 15 ++++++++++++++-
t/t1006-cat-file.sh | 18 +++++++++++++++---
3 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index afcdb0a4738..48e05e1af52 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -91,6 +91,11 @@ OPTIONS
printed at all. The '<filter-spec>' may be one of the following:
+
The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
+bytes or units. n may be zero. The suffixes k, m, and g can be used to name
+units in KiB, MiB, or GiB. For example, 'blob:limit=1k' is the same as
+'blob:limit=1024'.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index bcceb646f85..629c6cddcb2 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -483,8 +483,11 @@ static void batch_object_write(const char *obj_name,
int ret;
if (use_mailmap ||
- opt->objects_filter.choice == LOFC_BLOB_NONE)
+ opt->objects_filter.choice == LOFC_BLOB_NONE ||
+ opt->objects_filter.choice == LOFC_BLOB_LIMIT)
data->info.typep = &data->type;
+ if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
+ data->info.sizep = &data->size;
if (pack)
ret = packed_object_info(the_repository, pack, offset,
@@ -509,6 +512,15 @@ static void batch_object_write(const char *obj_name,
return;
}
break;
+ case LOFC_BLOB_LIMIT:
+ if (data->type == OBJ_BLOB &&
+ data->size >= opt->objects_filter.blob_limit_value) {
+ if (!opt->all_objects)
+ report_object_status(opt, obj_name,
+ &data->oid, "excluded");
+ return;
+ }
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1049,6 +1061,7 @@ int cmd_cat_file(int argc,
case LOFC_DISABLED:
break;
case LOFC_BLOB_NONE:
+ case LOFC_BLOB_LIMIT:
if (!batch.enabled)
usage(_("objects filter only supported in batch mode"));
break;
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 7404c135b1e..4f14840b71a 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1356,11 +1356,12 @@ test_expect_success PERL '--batch-command info is unbuffered by default' '
test_expect_success 'setup for objects filter' '
git init repo &&
(
- # Seed the repository with three different sets of objects:
+ # Seed the repository with four different sets of objects:
#
# - The first set is fully packed and has a bitmap.
# - The second set is packed, but has no bitmap.
# - The third set is loose.
+ # - The fourth set is loose and contains big objects.
#
# This ensures that we cover all these types as expected.
cd repo &&
@@ -1368,7 +1369,14 @@ test_expect_success 'setup for objects filter' '
git repack -Adb &&
test_commit second &&
git repack -d &&
- test_commit third
+ test_commit third &&
+
+ for n in 1000 10000
+ do
+ printf "%"$n"s" X >large.$n || return 1
+ done &&
+ git add large.* &&
+ git commit -m fourth
)
'
@@ -1380,7 +1388,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in blob:limit=1 object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in object:type=tag sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1435,5 +1443,9 @@ test_objects_filter () {
}
test_objects_filter "blob:none"
+test_objects_filter "blob:limit=1"
+test_objects_filter "blob:limit=500"
+test_objects_filter "blob:limit=1000"
+test_objects_filter "blob:limit=1k"
test_done
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 06/11] builtin/cat-file: support "object:type=" objects filter
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (4 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 05/11] builtin/cat-file: support "blob:limit=" " Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 07/11] pack-bitmap: allow passing payloads to `show_reachable_fn()` Patrick Steinhardt
` (5 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Implement support for the "object:type=" filter in git-cat-file(1),
which causes us to omit all objects that don't match the provided object
type.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Documentation/git-cat-file.adoc | 3 +++
builtin/cat-file.c | 12 +++++++++++-
t/t1006-cat-file.sh | 6 +++++-
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-cat-file.adoc b/Documentation/git-cat-file.adoc
index 48e05e1af52..74d71c3282e 100644
--- a/Documentation/git-cat-file.adoc
+++ b/Documentation/git-cat-file.adoc
@@ -96,6 +96,9 @@ The form '--filter=blob:limit=<n>[kmg]' omits blobs of size at least n
bytes or units. n may be zero. The suffixes k, m, and g can be used to name
units in KiB, MiB, or GiB. For example, 'blob:limit=1k' is the same as
'blob:limit=1024'.
++
+The form '--filter=object:type=(tag|commit|tree|blob)' omits all objects which
+are not of the requested type.
--path=<path>::
For use with `--textconv` or `--filters`, to allow specifying an object
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 629c6cddcb2..0f17175a549 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -484,7 +484,8 @@ static void batch_object_write(const char *obj_name,
if (use_mailmap ||
opt->objects_filter.choice == LOFC_BLOB_NONE ||
- opt->objects_filter.choice == LOFC_BLOB_LIMIT)
+ opt->objects_filter.choice == LOFC_BLOB_LIMIT ||
+ opt->objects_filter.choice == LOFC_OBJECT_TYPE)
data->info.typep = &data->type;
if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
data->info.sizep = &data->size;
@@ -521,6 +522,14 @@ static void batch_object_write(const char *obj_name,
return;
}
break;
+ case LOFC_OBJECT_TYPE:
+ if (data->type != opt->objects_filter.object_type) {
+ if (!opt->all_objects)
+ report_object_status(opt, obj_name,
+ &data->oid, "excluded");
+ return;
+ }
+ break;
default:
BUG("unsupported objects filter");
}
@@ -1062,6 +1071,7 @@ int cmd_cat_file(int argc,
break;
case LOFC_BLOB_NONE:
case LOFC_BLOB_LIMIT:
+ case LOFC_OBJECT_TYPE:
if (!batch.enabled)
usage(_("objects filter only supported in batch mode"));
break;
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 4f14840b71a..98638fa2b9c 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1388,7 +1388,7 @@ test_expect_success 'objects filter with unknown option' '
test_cmp expect err
'
-for option in object:type=tag sparse:oid=1234 tree:1 sparse:path=x
+for option in sparse:oid=1234 tree:1 sparse:path=x
do
test_expect_success "objects filter with unsupported option $option" '
case "$option" in
@@ -1447,5 +1447,9 @@ test_objects_filter "blob:limit=1"
test_objects_filter "blob:limit=500"
test_objects_filter "blob:limit=1000"
test_objects_filter "blob:limit=1k"
+test_objects_filter "object:type=blob"
+test_objects_filter "object:type=commit"
+test_objects_filter "object:type=tag"
+test_objects_filter "object:type=tree"
test_done
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 07/11] pack-bitmap: allow passing payloads to `show_reachable_fn()`
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (5 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 06/11] builtin/cat-file: support "object:type=" " Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 08/11] pack-bitmap: add function to iterate over filtered bitmapped objects Patrick Steinhardt
` (4 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
The `show_reachable_fn` callback is used by a couple of functions to
present reachable objects to the caller. The function does not provide a
way for the caller to pass a payload though, which is functionality that
we'll require in a subsequent commit.
Change the callback type to accept a payload and adapt all callsites
accordingly.
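For readers unfamiliar with the pattern, the following standalone sketch
shows what passing a payload through a callback buys the caller. The
names `show_fn`, `show_all()` and `count_cb()` are hypothetical and far
simpler than `show_reachable_fn`; they only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: the trailing pointer is caller-owned state. */
typedef int (*show_fn)(unsigned id, void *payload);

/* Invoke `show` for each id, forwarding `payload` untouched. */
void show_all(const unsigned *ids, size_t nr, show_fn show, void *payload)
{
	assert(show);
	for (size_t i = 0; i < nr; i++)
		show(ids[i], payload);
}

/* Example callback: counts invocations in a caller-provided counter. */
int count_cb(unsigned id, void *payload)
{
	size_t *count = payload;

	(void)id;
	(*count)++;
	return 0;
}
```

Without the extra parameter a callback like `count_cb()` would have to
keep its counter in a global variable. Callers that need no state can
pass NULL, which is what the existing callsites do after this change.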
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/pack-objects.c | 3 ++-
builtin/rev-list.c | 3 ++-
pack-bitmap.c | 15 ++++++++-------
pack-bitmap.h | 3 ++-
reachable.c | 3 ++-
5 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a7e4bb79049..38784613fc0 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1736,7 +1736,8 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
static int add_object_entry_from_bitmap(const struct object_id *oid,
enum object_type type,
int flags UNUSED, uint32_t name_hash,
- struct packed_git *pack, off_t offset)
+ struct packed_git *pack, off_t offset,
+ void *payload UNUSED)
{
display_progress(progress_state, ++nr_seen);
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index bb26bee0d45..1100dd2abe7 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -429,7 +429,8 @@ static int show_object_fast(
int exclude UNUSED,
uint32_t name_hash UNUSED,
struct packed_git *found_pack UNUSED,
- off_t found_offset UNUSED)
+ off_t found_offset UNUSED,
+ void *payload UNUSED)
{
fprintf(stdout, "%s\n", oid_to_hex(oid));
return 1;
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 6f7fd94c36f..d192fb87da9 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1625,7 +1625,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
(obj->type == OBJ_TAG && !revs->tag_objects))
continue;
- show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0);
+ show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0, NULL);
}
}
@@ -1663,7 +1663,8 @@ static void init_type_iterator(struct ewah_or_iterator *it,
static void show_objects_for_type(
struct bitmap_index *bitmap_git,
enum object_type object_type,
- show_reachable_fn show_reach)
+ show_reachable_fn show_reach,
+ void *payload)
{
size_t i = 0;
uint32_t offset;
@@ -1715,7 +1716,7 @@ static void show_objects_for_type(
if (bitmap_git->hashes)
hash = get_be32(bitmap_git->hashes + index_pos);
- show_reach(&oid, object_type, 0, hash, pack, ofs);
+ show_reach(&oid, object_type, 0, hash, pack, ofs, payload);
}
}
@@ -2518,13 +2519,13 @@ void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
{
assert(bitmap_git->result);
- show_objects_for_type(bitmap_git, OBJ_COMMIT, show_reachable);
+ show_objects_for_type(bitmap_git, OBJ_COMMIT, show_reachable, NULL);
if (revs->tree_objects)
- show_objects_for_type(bitmap_git, OBJ_TREE, show_reachable);
+ show_objects_for_type(bitmap_git, OBJ_TREE, show_reachable, NULL);
if (revs->blob_objects)
- show_objects_for_type(bitmap_git, OBJ_BLOB, show_reachable);
+ show_objects_for_type(bitmap_git, OBJ_BLOB, show_reachable, NULL);
if (revs->tag_objects)
- show_objects_for_type(bitmap_git, OBJ_TAG, show_reachable);
+ show_objects_for_type(bitmap_git, OBJ_TAG, show_reachable, NULL);
show_extended_objects(bitmap_git, revs, show_reachable);
}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index dd0951088f6..de6bf534fef 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -50,7 +50,8 @@ typedef int (*show_reachable_fn)(
int flags,
uint32_t hash,
struct packed_git *found_pack,
- off_t found_offset);
+ off_t found_offset,
+ void *payload);
struct bitmap_index;
diff --git a/reachable.c b/reachable.c
index 9ee04c89ec6..421d354d3b5 100644
--- a/reachable.c
+++ b/reachable.c
@@ -341,7 +341,8 @@ static int mark_object_seen(const struct object_id *oid,
int exclude UNUSED,
uint32_t name_hash UNUSED,
struct packed_git *found_pack UNUSED,
- off_t found_offset UNUSED)
+ off_t found_offset UNUSED,
+ void *payload UNUSED)
{
struct object *obj = lookup_object_by_type(the_repository, oid, type);
if (!obj)
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 08/11] pack-bitmap: add function to iterate over filtered bitmapped objects
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (6 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 07/11] pack-bitmap: allow passing payloads to `show_reachable_fn()` Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 09/11] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
` (3 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Introduce a function that allows the caller to iterate over all
bitmapped objects that match a given filter. This mechanism will be used
in a subsequent commit to optimize object filters in git-cat-file(1).
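The new function starts from a bitmap in which every object is selected
and then filters it down. Building such an all-1 bitmap needs some care
for the last, partially used word. Here is a minimal standalone sketch
of that step, assuming 64-bit words and a plain `uint64_t` array instead
of Git's `struct bitmap`; `all_ones_bitmap()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BITS_IN_WORD 64

/*
 * Allocate a bitmap with exactly the first `nr` bits set. The caller
 * frees the result. Returns NULL on allocation failure.
 */
uint64_t *all_ones_bitmap(size_t nr, size_t *word_count)
{
	size_t full = nr / BITS_IN_WORD;
	size_t words = full + !!(nr % BITS_IN_WORD);
	uint64_t *bits;

	assert(word_count);
	bits = calloc(words ? words : 1, sizeof(*bits));
	if (!bits)
		return NULL;

	/* Whole words can be filled in one go. */
	memset(bits, 0xff, full * sizeof(*bits));
	/* The trailing partial word must not claim bits past `nr`. */
	for (size_t i = full * BITS_IN_WORD; i < nr; i++)
		bits[i / BITS_IN_WORD] |= (uint64_t)1 << (i % BITS_IN_WORD);

	*word_count = words;
	return bits;
}
```

Setting the trailing bits one by one avoids marking positions that do
not correspond to any object, which would otherwise show up as bogus
entries when the bitmap is iterated.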
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
pack-bitmap.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
pack-bitmap.h | 12 ++++++++++++
2 files changed, 65 insertions(+), 6 deletions(-)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d192fb87da9..6adb8aaa1c2 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1662,6 +1662,7 @@ static void init_type_iterator(struct ewah_or_iterator *it,
static void show_objects_for_type(
struct bitmap_index *bitmap_git,
+ struct bitmap *objects,
enum object_type object_type,
show_reachable_fn show_reach,
void *payload)
@@ -1672,8 +1673,6 @@ static void show_objects_for_type(
struct ewah_or_iterator it;
eword_t filter;
- struct bitmap *objects = bitmap_git->result;
-
init_type_iterator(&it, bitmap_git, object_type);
for (i = 0; i < objects->word_alloc &&
@@ -2025,6 +2024,50 @@ static void filter_packed_objects_from_bitmap(struct bitmap_index *bitmap_git,
}
}
+int for_each_bitmapped_object(struct bitmap_index *bitmap_git,
+ struct list_objects_filter_options *filter,
+ show_reachable_fn show_reach,
+ void *payload)
+{
+ struct bitmap *filtered_bitmap = NULL;
+ uint32_t objects_nr;
+ size_t full_word_count;
+ int ret;
+
+ if (!can_filter_bitmap(filter)) {
+ ret = -1;
+ goto out;
+ }
+
+ objects_nr = bitmap_num_objects(bitmap_git);
+ full_word_count = objects_nr / BITS_IN_EWORD;
+
+ /* We start from the all-1 bitmap and then filter down from there. */
+ filtered_bitmap = bitmap_word_alloc(full_word_count + !!(objects_nr % BITS_IN_EWORD));
+ memset(filtered_bitmap->words, 0xff, full_word_count * sizeof(*filtered_bitmap->words));
+ for (size_t i = full_word_count * BITS_IN_EWORD; i < objects_nr; i++)
+ bitmap_set(filtered_bitmap, i);
+
+ if (filter_bitmap(bitmap_git, NULL, filtered_bitmap, filter) < 0) {
+ ret = -1;
+ goto out;
+ }
+
+ show_objects_for_type(bitmap_git, filtered_bitmap,
+ OBJ_COMMIT, show_reach, payload);
+ show_objects_for_type(bitmap_git, filtered_bitmap,
+ OBJ_TREE, show_reach, payload);
+ show_objects_for_type(bitmap_git, filtered_bitmap,
+ OBJ_BLOB, show_reach, payload);
+ show_objects_for_type(bitmap_git, filtered_bitmap,
+ OBJ_TAG, show_reach, payload);
+
+ ret = 0;
+out:
+ bitmap_free(filtered_bitmap);
+ return ret;
+}
+
struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
int filter_provided_objects)
{
@@ -2519,13 +2562,17 @@ void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
{
assert(bitmap_git->result);
- show_objects_for_type(bitmap_git, OBJ_COMMIT, show_reachable, NULL);
+ show_objects_for_type(bitmap_git, bitmap_git->result,
+ OBJ_COMMIT, show_reachable, NULL);
if (revs->tree_objects)
- show_objects_for_type(bitmap_git, OBJ_TREE, show_reachable, NULL);
+ show_objects_for_type(bitmap_git, bitmap_git->result,
+ OBJ_TREE, show_reachable, NULL);
if (revs->blob_objects)
- show_objects_for_type(bitmap_git, OBJ_BLOB, show_reachable, NULL);
+ show_objects_for_type(bitmap_git, bitmap_git->result,
+ OBJ_BLOB, show_reachable, NULL);
if (revs->tag_objects)
- show_objects_for_type(bitmap_git, OBJ_TAG, show_reachable, NULL);
+ show_objects_for_type(bitmap_git, bitmap_git->result,
+ OBJ_TAG, show_reachable, NULL);
show_extended_objects(bitmap_git, revs, show_reachable);
}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index de6bf534fef..079bae32466 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -79,6 +79,18 @@ int test_bitmap_pseudo_merges(struct repository *r);
int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n);
int test_bitmap_pseudo_merge_objects(struct repository *r, uint32_t n);
+struct list_objects_filter_options;
+
+/*
+ * Filter bitmapped objects and iterate through all resulting objects,
+ * executing `show_reach` for each of them. Returns `-1` in case the filter is
+ * not supported, `0` otherwise.
+ */
+int for_each_bitmapped_object(struct bitmap_index *bitmap_git,
+ struct list_objects_filter_options *filter,
+ show_reachable_fn show_reach,
+ void *payload);
+
#define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \
"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL"
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 09/11] pack-bitmap: introduce function to check whether a pack is bitmapped
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (7 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 08/11] pack-bitmap: add function to iterate over filtered bitmapped objects Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 10/11] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
` (2 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Introduce a function that allows us to verify whether a pack is
bitmapped or not. This functionality will be used in a subsequent
commit.
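The lookup amounts to walking a chain of bitmap layers and checking each
one for the pack. The sketch below models that with hypothetical,
heavily simplified structures (`struct bitmap_layer`, `struct pack`);
the real `struct bitmap_index` and multi-pack-index are more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a chain of bitmap layers. */
struct pack { int id; };
struct bitmap_layer {
	struct bitmap_layer *base; /* next layer in the chain, or NULL */
	struct pack **packs;       /* set when the layer covers several packs */
	size_t num_packs;
	struct pack *pack;         /* set when the layer covers a single pack */
};

/* Return 1 if any layer in the chain covers `pack`, 0 otherwise. */
int layer_contains_pack(const struct bitmap_layer *layer,
			const struct pack *pack)
{
	assert(pack);
	for (; layer; layer = layer->base) {
		if (layer->packs) {
			for (size_t i = 0; i < layer->num_packs; i++)
				if (layer->packs[i] == pack)
					return 1;
		} else if (layer->pack == pack) {
			return 1;
		}
	}
	return 0;
}
```

Packs are compared by pointer identity rather than by name, which is
sufficient as long as each pack is represented by exactly one structure
in memory.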
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
pack-bitmap.c | 15 +++++++++++++++
pack-bitmap.h | 7 +++++++
2 files changed, 22 insertions(+)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 6adb8aaa1c2..edc8f42122d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -745,6 +745,21 @@ struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
return NULL;
}
+int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack)
+{
+ for (; bitmap; bitmap = bitmap->base) {
+ if (bitmap_is_midx(bitmap)) {
+ for (size_t i = 0; i < bitmap->midx->num_packs; i++)
+ if (bitmap->midx->packs[i] == pack)
+ return 1;
+ } else if (bitmap->pack == pack) {
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
struct include_data {
struct bitmap_index *bitmap_git;
struct bitmap *base;
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 079bae32466..55df1b3af5a 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -67,6 +67,13 @@ struct bitmapped_pack {
struct bitmap_index *prepare_bitmap_git(struct repository *r);
struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx);
+
+/*
+ * Given a bitmap index, determine whether it contains the pack either directly
+ * or via the multi-pack-index.
+ */
+int bitmap_index_contains_pack(struct bitmap_index *bitmap, struct packed_git *pack);
+
void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
uint32_t *trees, uint32_t *blobs, uint32_t *tags);
void traverse_bitmap_commit_list(struct bitmap_index *,
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 10/11] builtin/cat-file: deduplicate logic to iterate over all objects
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (8 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 09/11] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 11/11] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
2025-04-03 8:17 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Karthik Nayak
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
Pull out a common function that allows us to iterate over all objects in
a repository. Right now the logic is trivial and would only require two
function calls, making this refactoring a bit pointless. But in the next
commit we will iterate on this logic to make use of bitmaps, so this is
about to become a bit more complex.
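The refactoring boils down to adapting two differently shaped iteration
callbacks onto one unified callback. A minimal standalone sketch of that
adapter pattern follows. All names are hypothetical, object IDs are
reduced to plain integers, and the offset table stands in for the
position-to-offset lookup that Git performs on a real pack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unified callback: `pack_id` is -1 for loose objects. */
typedef int (*each_object_fn)(unsigned oid, int pack_id, long offset,
			      void *data);

struct pack {
	int id;
	const long *offsets; /* offset of the object at each index position */
};

struct adapter {
	each_object_fn callback;
	void *data;
};

/* Loose iteration has no pack or offset, so pass neutral values. */
int adapt_loose(unsigned oid, void *vadapter)
{
	struct adapter *a = vadapter;
	return a->callback(oid, -1, 0, a->data);
}

/* Packed iteration yields an index position; translate it to an offset. */
int adapt_packed(unsigned oid, const struct pack *pack, unsigned pos,
		 void *vadapter)
{
	struct adapter *a = vadapter;

	assert(pack);
	return a->callback(oid, pack->id, pack->offsets[pos], a->data);
}

/* Example unified callback: tally loose and packed objects. */
struct tally { int loose, packed; long last_offset; };

int tally_cb(unsigned oid, int pack_id, long offset, void *data)
{
	struct tally *t = data;

	(void)oid;
	if (pack_id < 0)
		t->loose++;
	else
		t->packed++;
	t->last_offset = offset;
	return 0;
}
```

The user-visible callback only has to be written once, and the place
that drives the iteration can later be changed without touching any of
its callers.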
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/cat-file.c | 85 ++++++++++++++++++++++++++++++------------------------
1 file changed, 48 insertions(+), 37 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 0f17175a549..b0c758eca02 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -642,25 +642,18 @@ static int batch_object_cb(const struct object_id *oid, void *vdata)
return 0;
}
-static int collect_loose_object(const struct object_id *oid,
- const char *path UNUSED,
- void *data)
-{
- oid_array_append(data, oid);
- return 0;
-}
-
-static int collect_packed_object(const struct object_id *oid,
- struct packed_git *pack UNUSED,
- uint32_t pos UNUSED,
- void *data)
+static int collect_object(const struct object_id *oid,
+ struct packed_git *pack UNUSED,
+ off_t offset UNUSED,
+ void *data)
{
oid_array_append(data, oid);
return 0;
}
static int batch_unordered_object(const struct object_id *oid,
- struct packed_git *pack, off_t offset,
+ struct packed_git *pack,
+ off_t offset,
void *vdata)
{
struct object_cb_data *data = vdata;
@@ -674,23 +667,6 @@ static int batch_unordered_object(const struct object_id *oid,
return 0;
}
-static int batch_unordered_loose(const struct object_id *oid,
- const char *path UNUSED,
- void *data)
-{
- return batch_unordered_object(oid, NULL, 0, data);
-}
-
-static int batch_unordered_packed(const struct object_id *oid,
- struct packed_git *pack,
- uint32_t pos,
- void *data)
-{
- return batch_unordered_object(oid, pack,
- nth_packed_object_offset(pack, pos),
- data);
-}
-
typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
struct strbuf *, struct expand_data *);
@@ -823,6 +799,45 @@ static void batch_objects_command(struct batch_options *opt,
#define DEFAULT_FORMAT "%(objectname) %(objecttype) %(objectsize)"
+typedef int (*for_each_object_fn)(const struct object_id *oid, struct packed_git *pack,
+ off_t offset, void *data);
+
+struct for_each_object_payload {
+ for_each_object_fn callback;
+ void *payload;
+};
+
+static int batch_one_object_loose(const struct object_id *oid,
+ const char *path UNUSED,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, NULL, 0, payload->payload);
+}
+
+static int batch_one_object_packed(const struct object_id *oid,
+ struct packed_git *pack,
+ uint32_t pos,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, pack, nth_packed_object_offset(pack, pos),
+ payload->payload);
+}
+
+static void batch_each_object(for_each_object_fn callback,
+ unsigned flags,
+ void *_payload)
+{
+ struct for_each_object_payload payload = {
+ .callback = callback,
+ .payload = _payload,
+ };
+ for_each_loose_object(batch_one_object_loose, &payload, 0);
+ for_each_packed_object(the_repository, batch_one_object_packed,
+ &payload, flags);
+}
+
static int batch_objects(struct batch_options *opt)
{
struct strbuf input = STRBUF_INIT;
@@ -877,18 +892,14 @@ static int batch_objects(struct batch_options *opt)
cb.seen = &seen;
- for_each_loose_object(batch_unordered_loose, &cb, 0);
- for_each_packed_object(the_repository, batch_unordered_packed,
- &cb, FOR_EACH_OBJECT_PACK_ORDER);
+ batch_each_object(batch_unordered_object,
+ FOR_EACH_OBJECT_PACK_ORDER, &cb);
oidset_clear(&seen);
} else {
struct oid_array sa = OID_ARRAY_INIT;
- for_each_loose_object(collect_loose_object, &sa, 0);
- for_each_packed_object(the_repository, collect_packed_object,
- &sa, 0);
-
+ batch_each_object(collect_object, 0, &sa);
oid_array_for_each_unique(&sa, batch_object_cb, &cb);
oid_array_clear(&sa);
--
2.49.0.604.gff1f9ca942.dirty
* [PATCH v3 11/11] builtin/cat-file: use bitmaps to efficiently filter by object type
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (9 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 10/11] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
@ 2025-04-02 11:13 ` Patrick Steinhardt
2025-04-03 8:17 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Karthik Nayak
11 siblings, 0 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-04-02 11:13 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Karthik Nayak, Taylor Blau, Junio C Hamano
While it is now possible to filter objects by type, this mechanism is
for now mostly a convenience. Most importantly, we still have to iterate
through the whole packfile to find all objects of a specific type. This
can be prohibitively expensive depending on the size of the packfiles.
It isn't really possible to do better than this when only considering a
packfile itself, as the order of objects is not fixed. But when we have
a packfile with a corresponding bitmap, either because the packfile
itself has one or because the multi-pack index has a bitmap for it, then
we can use these bitmaps to improve the runtime.
While bitmaps are typically used to compute reachability of objects,
they also contain one bitmap per object type that records which objects
are of that type. So instead of reading through the whole packfile(s), we
can use the bitmaps and iterate through the type-specific bitmap.
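As a rough illustration of the idea (not code from this patch), a
per-type bitmap can be walked so that only objects of the requested type
are ever visited. Every name below (`toy_type`, `toy_type_bitmaps`,
`toy_for_each_of_type`) is hypothetical, and the single 64-bit word per
type is an assumption made to keep the sketch short. Real bitmaps are
compressed and cover far more objects.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object types; real Git uses enum object_type. */
enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG, TOY_NR_TYPES };

/* One 64-bit word per type: bit i is set if object i has that type. */
struct toy_type_bitmaps {
	uint64_t bits[TOY_NR_TYPES];
};

typedef void (*toy_show_fn)(unsigned pos, void *data);

/*
 * Visit only the positions whose bit is set in the bitmap for `type`.
 * Objects of other types are never touched. Returns the number of
 * objects visited.
 */
static size_t toy_for_each_of_type(const struct toy_type_bitmaps *bm,
				   enum toy_type type,
				   toy_show_fn show, void *data)
{
	uint64_t word = bm->bits[type];
	size_t n = 0;

	for (unsigned pos = 0; word; pos++, word >>= 1) {
		if (word & 1) {
			show(pos, data);
			n++;
		}
	}
	return n;
}

/* Example callback: sums up the visited positions. */
static void toy_sum_positions(unsigned pos, void *data)
{
	*(unsigned *)data += pos;
}
```

The work done is proportional to the number of set bits in the chosen
bitmap, which mirrors why filtering by type no longer depends on the
total object count.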
Typically, only a subset of packfiles will have a bitmap. But this isn't
really much of a problem: we can use bitmaps when available, and then
use the non-bitmap walk for every packfile that isn't covered by one.
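A simplified cost model of this mixed strategy might look as follows.
It is only a sketch: `toy_pack` and `toy_objects_inspected` are
hypothetical names, and the assumption that a bitmapped pack costs
exactly one step per matching object is an idealization.

```c
#include <stddef.h>

/* Hypothetical stand-in for a packfile; not Git's struct packed_git. */
struct toy_pack {
	const char *name;
	int covered_by_bitmap;   /* nonzero if a bitmap covers this pack */
	size_t nr_objects;       /* total objects stored in the pack */
	size_t nr_matching;      /* objects that pass the filter */
};

/*
 * Returns how many objects have to be inspected in total: only the
 * matching ones for bitmapped packs, every object for the others.
 */
static size_t toy_objects_inspected(const struct toy_pack *packs, size_t nr)
{
	size_t inspected = 0;

	for (size_t i = 0; i < nr; i++) {
		if (packs[i].covered_by_bitmap)
			inspected += packs[i].nr_matching;
		else
			inspected += packs[i].nr_objects;
	}
	return inspected;
}
```

Under this model the uncovered packs dominate the cost only if they are
large, which is usually not the case when the main pack has a bitmap.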
Overall, this leads to quite a significant speedup depending on how many
objects of a certain type exist. The following benchmarks have been
executed in the Chromium repository, which has a 50GB packfile with
almost 25 million objects. As expected, there isn't really much of a
change in performance without an object filter:
Benchmark 1: cat-file with no-filter (revision = HEAD~)
Time (mean ± σ): 89.675 s ± 4.527 s [User: 40.807 s, System: 10.782 s]
Range (min … max): 83.052 s … 96.084 s 10 runs
Benchmark 2: cat-file with no-filter (revision = HEAD)
Time (mean ± σ): 88.991 s ± 2.488 s [User: 42.278 s, System: 10.305 s]
Range (min … max): 82.843 s … 91.271 s 10 runs
Summary
cat-file with no-filter (revision = HEAD) ran
1.01 ± 0.06 times faster than cat-file with no-filter (revision = HEAD~)
We still have to scan through all objects as we yield all of them, so
using the bitmap in this case doesn't really buy us anything. What is
noticeable in this benchmark is that we're I/O-bound, not CPU-bound, as
can be seen from the user/system runtimes, which combined are way lower
than the overall benchmarked runtime.
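The reasoning can be checked with a small helper. This is a sketch only:
`cpu_fraction` is a hypothetical name, and the interpretation assumes a
single-threaded process.

```c
/*
 * Fraction of the wall-clock time during which the process was actually
 * on the CPU (user + system). A value well below 1.0 for a
 * single-threaded workload means it spent much of its time waiting,
 * for example on disk reads.
 */
static double cpu_fraction(double user_s, double system_s, double wall_s)
{
	return (user_s + system_s) / wall_s;
}
```

With the HEAD~ numbers above, `cpu_fraction(40.807, 10.782, 89.675)`
comes out at roughly 0.58, so a bit over 40% of the runtime was spent
off the CPU.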
But when we do use a filter we can see a significant improvement:
Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
Time (mean ± σ): 86.444 s ± 4.081 s [User: 36.830 s, System: 11.312 s]
Range (min … max): 80.305 s … 93.104 s 10 runs
Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
Time (mean ± σ): 2.089 s ± 0.015 s [User: 1.872 s, System: 0.207 s]
Range (min … max): 2.073 s … 2.119 s 10 runs
Summary
cat-file with filter=object:type=commit (revision = HEAD) ran
41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)
This is because we don't have to scan through all packfiles anymore, but
can instead directly look up relevant objects.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
builtin/cat-file.c | 42 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 37 insertions(+), 5 deletions(-)
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b0c758eca02..ead7554a57a 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -21,6 +21,7 @@
#include "streaming.h"
#include "oid-array.h"
#include "packfile.h"
+#include "pack-bitmap.h"
#include "object-file.h"
#include "object-name.h"
#include "object-store-ll.h"
@@ -825,7 +826,20 @@ static int batch_one_object_packed(const struct object_id *oid,
payload->payload);
}
-static void batch_each_object(for_each_object_fn callback,
+static int batch_one_object_bitmapped(const struct object_id *oid,
+ enum object_type type UNUSED,
+ int flags UNUSED,
+ uint32_t hash UNUSED,
+ struct packed_git *pack,
+ off_t offset,
+ void *_payload)
+{
+ struct for_each_object_payload *payload = _payload;
+ return payload->callback(oid, pack, offset, payload->payload);
+}
+
+static void batch_each_object(struct batch_options *opt,
+ for_each_object_fn callback,
unsigned flags,
void *_payload)
{
@@ -833,9 +847,27 @@ static void batch_each_object(for_each_object_fn callback,
.callback = callback,
.payload = _payload,
};
+ struct bitmap_index *bitmap = prepare_bitmap_git(the_repository);
+
for_each_loose_object(batch_one_object_loose, &payload, 0);
- for_each_packed_object(the_repository, batch_one_object_packed,
- &payload, flags);
+
+ if (bitmap && !for_each_bitmapped_object(bitmap, &opt->objects_filter,
+ batch_one_object_bitmapped, &payload)) {
+ struct packed_git *pack;
+
+ for (pack = get_all_packs(the_repository); pack; pack = pack->next) {
+ if (bitmap_index_contains_pack(bitmap, pack) ||
+ open_pack_index(pack))
+ continue;
+ for_each_object_in_pack(pack, batch_one_object_packed,
+ &payload, flags);
+ }
+ } else {
+ for_each_packed_object(the_repository, batch_one_object_packed,
+ &payload, flags);
+ }
+
+ free_bitmap_index(bitmap);
}
static int batch_objects(struct batch_options *opt)
@@ -892,14 +924,14 @@ static int batch_objects(struct batch_options *opt)
cb.seen = &seen;
- batch_each_object(batch_unordered_object,
+ batch_each_object(opt, batch_unordered_object,
FOR_EACH_OBJECT_PACK_ORDER, &cb);
oidset_clear(&seen);
} else {
struct oid_array sa = OID_ARRAY_INIT;
- batch_each_object(collect_object, 0, &sa);
+ batch_each_object(opt, collect_object, 0, &sa);
oid_array_for_each_unique(&sa, batch_object_cb, &cb);
oid_array_clear(&sa);
--
2.49.0.604.gff1f9ca942.dirty
* Re: [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
` (10 preceding siblings ...)
2025-04-02 11:13 ` [PATCH v3 11/11] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
@ 2025-04-03 8:17 ` Karthik Nayak
2025-04-08 0:32 ` Junio C Hamano
11 siblings, 1 reply; 72+ messages in thread
From: Karthik Nayak @ 2025-04-03 8:17 UTC (permalink / raw)
To: Patrick Steinhardt, git; +Cc: Toon Claes, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> Hi,
>
> at GitLab, we sometimes have the need to list all objects regardless of
> their reachability. We use git-cat-file(1) with `--batch-all-objects` to
> do this, and typically this is quite a good fit. In some cases though,
> we only want to list objects of a specific type, where we then basically
> have the following pipeline:
>
> git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
> grep '^commit ' |
> cut -d' ' -f2 |
> git cat-file --batch
>
> This works okayish in medium-sized repositories, but once you reach a
> certain size this isn't really an option anymore. In the Chromium
> repository for example [1] simply listing all objects in the first
> invocation of git-cat-file(1) takes around 80 to 100 seconds. The
> workload is completely I/O-bottlenecked: my machine reads at ~500MB/s,
> and the packfile is 50GB in size, which matches the 100 seconds that I
> observe.
>
> This series addresses the issue by introducing object filters into
> git-cat-file(1). These object filters use the exact same syntax as the
> filters we have in git-rev-list(1), but only a subset of them is
> supported because not all filters can be computed by git-cat-file(1).
> Supported are "blob:none", "blob:limit=" as well as "object:type=".
>
> The filters alone don't really help though: we still have to scan
> through the whole packfile in order to apply the filters. While we
> are able to shed a bit of CPU time because we can stop emitting some of
> the objects, we're still I/O-bottlenecked.
>
> The second part of the series thus expands the filters so that they can
> make use of bitmap indices for some of the filters, if available. This
> allows us to efficiently answer the question where to find all objects
> of a specific type, and thus we can avoid scanning through the packfile
> and instead directly look up relevant objects, leading to a significant
> speedup:
>
> Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
> Time (mean ± σ): 86.444 s ± 4.081 s [User: 36.830 s, System: 11.312 s]
> Range (min … max): 80.305 s … 93.104 s 10 runs
>
> Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
> Time (mean ± σ): 2.089 s ± 0.015 s [User: 1.872 s, System: 0.207 s]
> Range (min … max): 2.073 s … 2.119 s 10 runs
>
> Summary
> cat-file with filter=object:type=commit (revision = HEAD) ran
> 41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)
>
> We now directly scale with the number of objects of a specific type
> contained in the packfile instead of scaling with the overall number of
> objects. It's quite fun to see how the math plays out: if you sum up the
> times for each of the types you arrive at the time for the unfiltered
> case.
>
> Changes in v2:
> - The series is now built on top of "master" at 683c54c999c (Git 2.49,
> 2025-03-14) with "tb/incremental-midx-part-2" at 27afc272c49 (midx:
> implement writing incremental MIDX bitmaps, 2025-03-20) merged into
> it.
> - Rename the filter options to "--filter=" to match
> git-pack-objects(1).
> - The bitmap-filtering is now reusing existing mechanisms that we
> already have in "pack-bitmap.c", as proposed by Taylor.
> - Link to v1: https://lore.kernel.org/r/20250221-pks-cat-file-object-type-filter-v1-0-0852530888e2@pks.im
>
> Changes in v3:
> - Wrap some overly long lines.
> - Better describe how filters interact with the different batch modes.
> - Adapt the format with `--batch` and `--batch-check` so that we tell
> the user that the object has been excluded.
> - Add a test for "--no-filter".
> - Use `OPT_PARSE_LIST_OBJECTS_FILTER()`.
> - Link to v2: https://lore.kernel.org/r/20250327-pks-cat-file-object-type-filter-v2-0-4bbc7085d7c5@pks.im
>
> Thanks!
>
> Patrick
>
> [1]: https://github.com/chromium/chromium.git
>
> ---
> Patrick Steinhardt (11):
> builtin/cat-file: rename variable that tracks usage
> builtin/cat-file: introduce function to report object status
> builtin/cat-file: wire up an option to filter objects
> builtin/cat-file: support "blob:none" objects filter
> builtin/cat-file: support "blob:limit=" objects filter
> builtin/cat-file: support "object:type=" objects filter
> pack-bitmap: allow passing payloads to `show_reachable_fn()`
> pack-bitmap: add function to iterate over filtered bitmapped objects
> pack-bitmap: introduce function to check whether a pack is bitmapped
> builtin/cat-file: deduplicate logic to iterate over all objects
> builtin/cat-file: use bitmaps to efficiently filter by object type
>
> Documentation/git-cat-file.adoc | 26 ++++
> builtin/cat-file.c | 256 +++++++++++++++++++++++++++++-----------
> builtin/pack-objects.c | 3 +-
> builtin/rev-list.c | 3 +-
> pack-bitmap.c | 81 +++++++++++--
> pack-bitmap.h | 22 +++-
> reachable.c | 3 +-
> t/t1006-cat-file.sh | 99 ++++++++++++++++
> 8 files changed, 411 insertions(+), 82 deletions(-)
>
> Range-diff versus v2:
>
> 1: a75888e0bf4 ! 1: b0642b6c495 builtin/cat-file: rename variable that tracks usage
> @@ builtin/cat-file.c: int cmd_cat_file(int argc,
> ;
> else if (batch.follow_symlinks)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> - "--follow-symlinks");
> +- "--follow-symlinks");
> ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
> ++ options, "--follow-symlinks");
> else if (batch.buffer_output >= 0)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> - "--buffer");
> +- "--buffer");
> ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
> ++ options, "--buffer");
> else if (batch.all_objects)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> - "--batch-all-objects");
> +- "--batch-all-objects");
> ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
> ++ options, "--batch-all-objects");
> else if (input_nul_terminated)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> - "-z");
> +- "-z");
> ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
> ++ options, "-z");
> else if (nul_terminated)
> - usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
> - "-Z");
> +- "-Z");
> ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
> ++ options, "-Z");
>
> batch.input_delim = batch.output_delim = '\n';
> + if (input_nul_terminated)
> @@ builtin/cat-file.c: int cmd_cat_file(int argc,
> batch.transform_mode = opt;
> else if (opt && opt != 'b')
> @@ builtin/cat-file.c: int cmd_cat_file(int argc,
> + builtin_catfile_usage, options, opt);
> else if (argc)
> - usage_msg_opt(_("batch modes take no arguments"), usage,
> -+ usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
> - options);
> +- options);
> ++ usage_msg_opt(_("batch modes take no arguments"),
> ++ builtin_catfile_usage, options);
>
> return batch_objects(&batch);
> + }
> @@ builtin/cat-file.c: int cmd_cat_file(int argc,
> if (opt) {
> if (!argc && opt == 'c')
> usage_msg_optf(_("<rev> required with '%s'"),
> - usage, options, "--textconv");
> -+ builtin_catfile_usage, options, "--textconv");
> ++ builtin_catfile_usage, options,
> ++ "--textconv");
> else if (!argc && opt == 'w')
> usage_msg_optf(_("<rev> required with '%s'"),
> - usage, options, "--filters");
> -+ builtin_catfile_usage, options, "--filters");
> ++ builtin_catfile_usage, options,
> ++ "--filters");
> else if (!argc && opt_epts)
> usage_msg_optf(_("<object> required with '-%c'"),
> - usage, options, opt);
> @@ builtin/cat-file.c: int cmd_cat_file(int argc,
> obj_name = argv[0];
> else
> - usage_msg_opt(_("too many arguments"), usage, options);
> -+ usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options);
> ++ usage_msg_opt(_("too many arguments"), builtin_catfile_usage,
> ++ options);
> } else if (!argc) {
> - usage_with_options(usage, options);
> + usage_with_options(builtin_catfile_usage, options);
> -: ----------- > 2: 18353ba706d builtin/cat-file: introduce function to report object status
> 2: bee9407c1a9 ! 3: 1e46af5d07b builtin/cat-file: wire up an option to filter objects
> @@ Documentation/git-cat-file.adoc: OPTIONS
> +--filter=<filter-spec>::
> +--no-filter::
> + Omit objects from the list of printed objects. This can only be used in
> -+ combination with one of the batched modes. The '<filter-spec>' may be
> -+ one of the following:
> ++ combination with one of the batched modes. Excluded objects that have
> ++ been explicitly requested via any of the batch modes that read objects
> ++ via standard input (`--batch`, `--batch-check`) will be reported as
> ++ "filtered". Excluded objects in `--batch-all-objects` mode will not be
> ++ printed at all. No filters are supported yet.
> +
> --path=<path>::
> For use with `--textconv` or `--filters`, to allow specifying an object
> name and a path separately, e.g. when it is difficult to figure out
> +@@ Documentation/git-cat-file.adoc: the repository, then `cat-file` will ignore any custom format and print:
> + <object> SP missing LF
> + ------------
> +
> ++If a name is specified on stdin that is filtered out via `--filter=`,
> ++then `cat-file` will ignore any custom format and print:
> ++
> ++------------
> ++<object> SP excluded LF
> ++------------
> ++
> + If a name is specified that might refer to more than one object (an ambiguous short sha), then `cat-file` will ignore any custom format and print:
> +
> + ------------
>
> ## builtin/cat-file.c ##
> @@
> @@ builtin/cat-file.c: int cmd_cat_file(int argc,
> N_("run filters on object's content"), 'w'),
> OPT_STRING(0, "path", &force_path, N_("blob|tree"),
> N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
> -+ OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
> -+ N_("object filtering"), opt_parse_list_objects_filter),
> ++ OPT_PARSE_LIST_OBJECTS_FILTER(&batch.objects_filter),
> OPT_END()
> };
>
> @@ builtin/cat-file.c: int cmd_cat_file(int argc,
> if (opt == 'b')
> batch.all_objects = 1;
> @@ builtin/cat-file.c: int cmd_cat_file(int argc,
> - usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
> - options);
> + usage_msg_opt(_("batch modes take no arguments"),
> + builtin_catfile_usage, options);
>
> - return batch_objects(&batch);
> + ret = batch_objects(&batch);
> @@ t/t1006-cat-file.sh: test_expect_success PERL '--batch-command info is unbuffere
> + test_cmp expect err
> + '
> +done
> ++
> ++test_expect_success 'objects filter: disabled' '
> ++ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --no-filter >actual &&
> ++ sort actual >actual.sorted &&
> ++ git -C repo rev-list --objects --no-object-names --all >expect &&
> ++ sort expect >expect.sorted &&
> ++ test_cmp expect.sorted actual.sorted
> ++'
> +
> test_done
> 3: ec1d0c63de6 ! 4: 878ae8e2a76 builtin/cat-file: support "blob:none" objects filter
> @@ Commit message
> Implement support for the "blob:none" filter in git-cat-file(1), which
> causes us to omit all blobs.
>
> + Note that this new filter requires us to read the object type via
> + `oid_object_info_extended()` in `batch_object_write()`. But as we try to
> + optimize away reading objects from the database the `data->info.typep`
> + pointer may not be set. We thus have to adapt the logic to conditionally
> + set the pointer in cases where the filter is given.
> +
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
>
> ## Documentation/git-cat-file.adoc ##
> @@ Documentation/git-cat-file.adoc: OPTIONS
> - Omit objects from the list of printed objects. This can only be used in
> - combination with one of the batched modes. The '<filter-spec>' may be
> - one of the following:
> + been explicitly requested via any of the batch modes that read objects
> + via standard input (`--batch`, `--batch-check`) will be reported as
> + "filtered". Excluded objects in `--batch-all-objects` mode will not be
> +- printed at all. No filters are supported yet.
> ++ printed at all. The '<filter-spec>' may be one of the following:
> ++
> +The form '--filter=blob:none' omits all blobs.
>
> @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
> case LOFC_DISABLED:
> break;
> + case LOFC_BLOB_NONE:
> -+ if (data->type == OBJ_BLOB)
> ++ if (data->type == OBJ_BLOB) {
> ++ if (!opt->all_objects)
> ++ report_object_status(opt, obj_name,
> ++ &data->oid, "excluded");
> + return;
> ++ }
> + break;
> default:
> BUG("unsupported objects filter");
> @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
> do
> test_expect_success "objects filter with unsupported option $option" '
> case "$option" in
> -@@ t/t1006-cat-file.sh: do
> - '
> - done
> +@@ t/t1006-cat-file.sh: test_expect_success 'objects filter: disabled' '
> + test_cmp expect.sorted actual.sorted
> + '
>
> +test_objects_filter () {
> + filter="$1"
> @@ t/t1006-cat-file.sh: do
> + sort expect >expect.sorted &&
> + test_cmp expect.sorted actual.sorted
> + '
> ++
> ++ test_expect_success "objects filter prints excluded objects: $filter" '
> ++ # Find all objects that would be excluded by the current filter.
> ++ git -C repo rev-list --objects --no-object-names --all >all &&
> ++ git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >filtered &&
> ++ sort all >all.sorted &&
> ++ sort filtered >filtered.sorted &&
> ++ comm -23 all.sorted filtered.sorted >expected.excluded &&
> ++ test_line_count -gt 0 expected.excluded &&
> ++
> ++ git -C repo cat-file --batch-check="%(objectname)" --filter="$filter" <expected.excluded >actual &&
> ++ awk "/excluded/{ print \$1 }" actual | sort >actual.excluded &&
> ++ test_cmp expected.excluded actual.excluded
> ++ '
> +}
> +
> +test_objects_filter "blob:none"
> 4: a3ed054994d ! 5: a88d5d4b60a builtin/cat-file: support "blob:limit=" objects filter
> @@ Commit message
>
> ## Documentation/git-cat-file.adoc ##
> @@ Documentation/git-cat-file.adoc: OPTIONS
> - one of the following:
> + printed at all. The '<filter-spec>' may be one of the following:
> +
> The form '--filter=blob:none' omits all blobs.
> ++
> @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
> if (pack)
> ret = packed_object_info(the_repository, pack, offset,
> @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
> - if (data->type == OBJ_BLOB)
> return;
> + }
> break;
> + case LOFC_BLOB_LIMIT:
> + if (data->type == OBJ_BLOB &&
> -+ data->size >= opt->objects_filter.blob_limit_value)
> ++ data->size >= opt->objects_filter.blob_limit_value) {
> ++ if (!opt->all_objects)
> ++ report_object_status(opt, obj_name,
> ++ &data->oid, "excluded");
> + return;
> ++ }
> + break;
> default:
> BUG("unsupported objects filter");
> @@ t/t1006-cat-file.sh: test_objects_filter () {
> +test_objects_filter "blob:limit=1"
> +test_objects_filter "blob:limit=500"
> +test_objects_filter "blob:limit=1000"
> -+test_objects_filter "blob:limit=1g"
> ++test_objects_filter "blob:limit=1k"
>
> test_done
> 5: 8e39cd218c2 ! 6: 13be54300c9 builtin/cat-file: support "object:type=" objects filter
> @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
> if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
> data->info.sizep = &data->size;
> @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
> - data->size >= opt->objects_filter.blob_limit_value)
> return;
> + }
> break;
> + case LOFC_OBJECT_TYPE:
> -+ if (data->type != opt->objects_filter.object_type)
> ++ if (data->type != opt->objects_filter.object_type) {
> ++ if (!opt->all_objects)
> ++ report_object_status(opt, obj_name,
> ++ &data->oid, "excluded");
> + return;
> ++ }
> + break;
> default:
> BUG("unsupported objects filter");
> @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
> @@ t/t1006-cat-file.sh: test_objects_filter "blob:limit=1"
> test_objects_filter "blob:limit=500"
> test_objects_filter "blob:limit=1000"
> - test_objects_filter "blob:limit=1g"
> + test_objects_filter "blob:limit=1k"
> +test_objects_filter "object:type=blob"
> +test_objects_filter "object:type=commit"
> +test_objects_filter "object:type=tag"
> 6: a0655de3ace = 7: d525a5bc2ef pack-bitmap: allow passing payloads to `show_reachable_fn()`
> 7: e1e44303dac = 8: e3cc1ae3a87 pack-bitmap: add function to iterate over filtered bitmapped objects
> 8: 23bc040bb15 = 9: c0fc0e4ce0c pack-bitmap: introduce function to check whether a pack is bitmapped
> 9: 4eba2a70619 = 10: 28ef93dceec builtin/cat-file: deduplicate logic to iterate over all objects
> 10: d40f1924ef5 = 11: 842a6002c50 builtin/cat-file: use bitmaps to efficiently filter by object type
>
Thanks for the new version, the range-diff looks good. Good that you
also added a test for the "excluded" message.
> ---
> base-commit: 003c5f45b8447877015b2a23ceab2297638fe1f1
> change-id: 20250220-pks-cat-file-object-type-filter-9140c0ed5ee1
* Re: [PATCH v2 09/10] builtin/cat-file: deduplicate logic to iterate over all objects
2025-04-02 11:13 ` Patrick Steinhardt
@ 2025-04-03 18:24 ` Toon Claes
0 siblings, 0 replies; 72+ messages in thread
From: Toon Claes @ 2025-04-03 18:24 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Karthik Nayak, Taylor Blau, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> Because the payload gets forwarded to the callback, and that callback
> accepts arbitrary types. You can already see this now: we call the
> function once with a `struct object_cb_data` pointer, and once with a
> `struct oid_array` pointer.
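A standalone sketch of the pattern described in the quoted text (not
code from the series) is shown below. All `toy_` names are hypothetical.
The iterator forwards an opaque `void *` payload, and each callback
casts it back to the type it expects.

```c
#include <stddef.h>

/* Hypothetical callback type: receives an item and an opaque payload. */
typedef int (*toy_each_fn)(int item, void *payload);

/* The iterator never looks inside `payload`; it only passes it along. */
static int toy_for_each(const int *items, size_t nr,
			toy_each_fn fn, void *payload)
{
	for (size_t i = 0; i < nr; i++) {
		int ret = fn(items[i], payload);
		if (ret)
			return ret;
	}
	return 0;
}

/* First caller: the payload is a plain counter. */
static int toy_count(int item, void *payload)
{
	(void)item;
	(*(size_t *)payload)++;
	return 0;
}

/* Second caller: the payload is a struct that accumulates a sum. */
struct toy_sum {
	long total;
};

static int toy_add(int item, void *payload)
{
	struct toy_sum *sum = payload;
	sum->total += item;
	return 0;
}
```

The same `toy_for_each()` serves both callers, which is the reason the
payload has to stay untyped in the shared iterator.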
Thanks for your replies, it all makes sense now. I've got no further
comments.
I approve.
--
Toon
* Re: [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage
2025-04-02 11:13 ` Patrick Steinhardt
@ 2025-04-07 20:25 ` Junio C Hamano
0 siblings, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-04-07 20:25 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Karthik Nayak, git, Toon Claes, Taylor Blau
Patrick Steinhardt <ps@pks.im> writes:
>> > - const char * const usage[] = {
>> > + const char * const builtin_catfile_usage[] = {
>>
>> Nit: Style: we use a right pointer alignment, while it is not part of
>> your code change, would be nice to fix.
Sorry, but I do not get the "right pointer alignment" here. There
is a rule to say that the asterisk sticks to variable (and member in
a struct/union) rather than to type, and since <type> comes before
the <variable> being declared in C, it would be <type> *<variable>,
but that is different from "write asterisk stuck to the right
identifier".
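To make the distinction concrete, here is a small sketch with
hypothetical names (`toy_first`, `toy_usage`). It shows the asterisk
attached to the declared name, and the separate question of how to space
`* const`, which the rule does not settle.

```c
#include <stddef.h>

/*
 * The asterisk sticks to the name being declared: `*list`, not
 * `const char* list`. The same applies to the function name when the
 * return type is a pointer.
 */
static const char *toy_first(const char * const *list)
{
	return list[0];
}

/*
 * Array of constant pointers to constant strings. Here the asterisk is
 * followed by a qualifier rather than an identifier, so both
 * `* const` and `*const` are seen in practice.
 */
static const char * const toy_usage[] = {
	"toy [<options>]",
	NULL
};
```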
> Not in this case though:
>
> $ git grep 'const char \*const' | wc -l
> 85
> $ git grep 'const char \* const' | wc -l
> 180
>
> It's mixed, but we do have more cases of the latter.
I think what you wrote is fine. Thanks.
* Re: [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode
2025-04-03 8:17 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Karthik Nayak
@ 2025-04-08 0:32 ` Junio C Hamano
0 siblings, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-04-08 0:32 UTC (permalink / raw)
To: Karthik Nayak; +Cc: Patrick Steinhardt, git, Toon Claes, Taylor Blau
Karthik Nayak <karthik.188@gmail.com> writes:
> Thanks for the new version, the range-diff looks good. Good that you
> also added a test for the "excluded" message.
Thanks, both of you. Will replace and queue.
Let me mark the topic for 'next'.
Thread overview: 72+ messages
2025-02-21 7:47 [PATCH 0/9] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
2025-02-21 7:47 ` [PATCH 1/9] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
2025-02-21 7:47 ` [PATCH 2/9] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
2025-02-26 15:20 ` Toon Claes
2025-02-28 10:51 ` Patrick Steinhardt
2025-02-28 17:44 ` Junio C Hamano
2025-03-03 10:40 ` Patrick Steinhardt
2025-02-27 11:20 ` Karthik Nayak
2025-02-21 7:47 ` [PATCH 3/9] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
2025-02-26 15:22 ` Toon Claes
2025-02-27 11:26 ` Karthik Nayak
2025-02-21 7:47 ` [PATCH 4/9] builtin/cat-file: support "blob:limit=" " Patrick Steinhardt
2025-02-21 7:47 ` [PATCH 5/9] builtin/cat-file: support "object:type=" " Patrick Steinhardt
2025-02-26 15:23 ` Toon Claes
2025-02-28 10:51 ` Patrick Steinhardt
2025-02-21 7:47 ` [PATCH 6/9] pack-bitmap: expose function to iterate over bitmapped objects Patrick Steinhardt
2025-02-24 18:05 ` Junio C Hamano
2025-02-25 6:59 ` Patrick Steinhardt
2025-02-25 16:59 ` Junio C Hamano
2025-02-27 23:26 ` Taylor Blau
2025-02-28 10:54 ` Patrick Steinhardt
2025-02-27 23:23 ` Taylor Blau
2025-02-27 23:32 ` Junio C Hamano
2025-02-27 23:39 ` Taylor Blau
2025-02-21 7:47 ` [PATCH 7/9] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
2025-02-27 23:33 ` Taylor Blau
2025-02-21 7:47 ` [PATCH 8/9] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
2025-02-21 7:47 ` [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
2025-02-27 11:38 ` Karthik Nayak
2025-02-27 23:48 ` Taylor Blau
2025-03-27 9:43 ` [PATCH v2 00/10] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
2025-03-27 9:43 ` [PATCH v2 01/10] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
2025-04-01 9:51 ` Karthik Nayak
2025-04-02 11:13 ` Patrick Steinhardt
2025-04-07 20:25 ` Junio C Hamano
2025-03-27 9:43 ` [PATCH v2 02/10] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
2025-04-01 11:45 ` Toon Claes
2025-04-02 11:13 ` Patrick Steinhardt
2025-04-01 12:05 ` Karthik Nayak
2025-04-02 11:13 ` Patrick Steinhardt
2025-03-27 9:43 ` [PATCH v2 03/10] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
2025-04-01 12:22 ` Karthik Nayak
2025-04-01 12:31 ` Karthik Nayak
2025-04-02 11:13 ` Patrick Steinhardt
2025-03-27 9:43 ` [PATCH v2 04/10] builtin/cat-file: support "blob:limit=" " Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 05/10] builtin/cat-file: support "object:type=" " Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 06/10] pack-bitmap: allow passing payloads to `show_reachable_fn()` Patrick Steinhardt
2025-04-01 12:17 ` Toon Claes
2025-04-02 11:13 ` Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 07/10] pack-bitmap: add function to iterate over filtered bitmapped objects Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 08/10] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
2025-04-01 11:46 ` Toon Claes
2025-04-02 11:13 ` Patrick Steinhardt
2025-03-27 9:44 ` [PATCH v2 09/10] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
2025-04-01 12:13 ` Toon Claes
2025-04-02 11:13 ` Patrick Steinhardt
2025-04-03 18:24 ` Toon Claes
2025-03-27 9:44 ` [PATCH v2 10/10] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 01/11] builtin/cat-file: rename variable that tracks usage Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 02/11] builtin/cat-file: introduce function to report object status Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 03/11] builtin/cat-file: wire up an option to filter objects Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 04/11] builtin/cat-file: support "blob:none" objects filter Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 05/11] builtin/cat-file: support "blob:limit=" " Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 06/11] builtin/cat-file: support "object:type=" " Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 07/11] pack-bitmap: allow passing payloads to `show_reachable_fn()` Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 08/11] pack-bitmap: add function to iterate over filtered bitmapped objects Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 09/11] pack-bitmap: introduce function to check whether a pack is bitmapped Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 10/11] builtin/cat-file: deduplicate logic to iterate over all objects Patrick Steinhardt
2025-04-02 11:13 ` [PATCH v3 11/11] builtin/cat-file: use bitmaps to efficiently filter by object type Patrick Steinhardt
2025-04-03 8:17 ` [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode Karthik Nayak
2025-04-08 0:32 ` Junio C Hamano