git.vger.kernel.org archive mirror
* [Question] Can git cat-file have a type filtering option?
@ 2023-04-07 14:24 ZheNing Hu
  2023-04-07 16:30 ` Junio C Hamano
  2023-04-09  1:23 ` Taylor Blau
  0 siblings, 2 replies; 23+ messages in thread
From: ZheNing Hu @ 2023-04-07 14:24 UTC (permalink / raw)
  To: Git List; +Cc: Junio C Hamano, johncai86

Sometimes when we use `git cat-file --batch-all-objects`, we only want
data of type "blob". In order to filter for them, we may need an
additional process (such as `git rev-list --objects
--filter=blob:none --filter-provided-objects`) to obtain the SHAs of
all blobs, and then use `git cat-file --batch` to retrieve them. This
is not very elegant; it might be better to have the filtering
implemented inside `git cat-file --batch-all-objects` itself.
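
Concretely, the extra-process filtering can also be spelled with two
`cat-file` invocations (a sketch; it keys off the `%(objecttype)`
column rather than a rev-list filter, and assumes a POSIX shell with
awk available):

```shell
# Enumerate every object (loose and packed) with its type, keep only
# the blob ids, then feed those back to a second cat-file to print
# the blob contents.
git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
  awk '/^blob / { print $2 }' |
  git cat-file --batch
```

Run inside any repository; the middle stage is where a built-in
type filter would go.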

However, `git cat-file` already has a `--filters` option, which is
used to "show content as transformed by filters". I'm not sure
what a better option name for filtering by type would be; perhaps
`--type-filter`?

Thanks,
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-07 14:24 [Question] Can git cat-file have a type filtering option? ZheNing Hu
@ 2023-04-07 16:30 ` Junio C Hamano
  2023-04-08  6:27   ` ZheNing Hu
  2023-04-09  1:26   ` Taylor Blau
  2023-04-09  1:23 ` Taylor Blau
  1 sibling, 2 replies; 23+ messages in thread
From: Junio C Hamano @ 2023-04-07 16:30 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Git List, johncai86

ZheNing Hu <adlternative@gmail.com> writes:

> all blobs, and then use `git cat-file --batch` to retrieve them. This
> is not very elegant; it might be better to have the filtering
> implemented inside `git cat-file --batch-all-objects` itself.

It does sound eminently elegant to have each tool do one task
and do it well, and to be able to flexibly combine them to achieve
a larger task.

Once that approach is working well, it may still make sense to give
a special case codepath that bundles a specific combination of these
primitive features, if use cases for the specific combination appear
often.  But I do not know if the particular one, "we do not want to
feed a specific list of objects to check to 'cat-file --batch'",
qualifies as one.

> For example, `--type-filter`?

Is the object type the only thing that people often would want to
base their filtering decision on?  Will we then see somebody else
request a "--size-filter", and then somebody else realizes that the
filtering criteria based on size need to be different between blobs
(most likely counted in bytes) and trees (it may be more convenient
to count the tree entries, not bytes)?  It sounds rather messy and
we may be better off having such extensible logic in one place.

Like rev-list's object list filtering, that is.

Is the logic that implements rev-list's object list filtering
something that is easily called from the side, as if it were a
library routine?  Refactoring that and teaching cat-file an option
to activate that logic might be more palatable.

Thanks.


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-07 16:30 ` Junio C Hamano
@ 2023-04-08  6:27   ` ZheNing Hu
  2023-04-09  1:28     ` Taylor Blau
  2023-04-09  1:26   ` Taylor Blau
  1 sibling, 1 reply; 23+ messages in thread
From: ZheNing Hu @ 2023-04-08  6:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, johncai86

Junio C Hamano <gitster@pobox.com> wrote on Sat, Apr 8, 2023 at 00:30:
>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > all blobs, and then use `git cat-file --batch` to retrieve them. This
> > is not very elegant; it might be better to have the filtering
> > implemented inside `git cat-file --batch-all-objects` itself.
>
> It does sound eminently elegant to have each tool do one task
> and do it well, and to be able to flexibly combine them to achieve
> a larger task.
>

Okay, you're right. It's not "ungraceful" to have each tool do its own thing.
I should clarify that for a command like `git cat-file --batch-all-objects`,
which traverses all objects anyway, it would be better to have a built-in
filter. It might be more performant than using `git rev-list --filter |
git cat-file --batch`?


> Once that approach is working well, it may still make sense to give
> a special case codepath that bundles a specific combination of these
> primitive features, if use cases for the specific combination appear
> often.  But I do not know if the particular one, "we do not want to
> feed a specific list of objects to check to 'cat-file --batch'",
> qualifies as one.
>
> > For example, `--type-filter`?
>
> Is the object type the only thing that people often would want to
> base their filtering decision on?  Will we then see somebody else
> request a "--size-filter", and then somebody else realizes that the
> filtering criteria based on size need to be different between blobs
> (most likely counted in bytes) and trees (it may be more convenient
> to count the tree entries, not bytes)?  It sounds rather messy and
> we may be better off having such extensible logic in one place.
>

Yes, having a generic filter for `git cat-file` would be better.

> Like rev-list's object list filtering, that is.
>
> Is the logic that implements rev-list's object list filtering
> something that is easily called from the side, as if it were a
> library routine?  Refactoring that and teaching cat-file an option
> to activate that logic might be more palatable.
>

I don't think so. While `git rev-list` traverses objects and performs
filtering within a revision, `git cat-file --batch-all-objects` traverses
all loose and packed objects. It might be difficult to perfectly
extract the filtering from `git rev-list` and apply it to `git cat-file`.

> Thanks.


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-07 14:24 [Question] Can git cat-file have a type filtering option? ZheNing Hu
  2023-04-07 16:30 ` Junio C Hamano
@ 2023-04-09  1:23 ` Taylor Blau
  1 sibling, 0 replies; 23+ messages in thread
From: Taylor Blau @ 2023-04-09  1:23 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Git List, Junio C Hamano, johncai86

On Fri, Apr 07, 2023 at 10:24:22PM +0800, ZheNing Hu wrote:
> However, `git cat-file` already has a `--filters` option, which is
> used to "show content as transformed by filters". I'm not sure
> what a better option name for filtering by type would be; perhaps
> `--type-filter`?

There is a `--filter='object:type=blob'` option to `git rev-list` that
should do what you're looking for.

In other words, if you wanted to dump the contents of all blobs in your
repository, this should do the trick:

  $ git rev-list --all --objects --filter='object:type=blob' |
    git cat-file --batch[=<format>]

Thanks,
Taylor


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-07 16:30 ` Junio C Hamano
  2023-04-08  6:27   ` ZheNing Hu
@ 2023-04-09  1:26   ` Taylor Blau
  1 sibling, 0 replies; 23+ messages in thread
From: Taylor Blau @ 2023-04-09  1:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: ZheNing Hu, Git List, johncai86

On Fri, Apr 07, 2023 at 09:30:18AM -0700, Junio C Hamano wrote:
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > all blobs, and then use `git cat-file --batch` to retrieve them. This
> > is not very elegant; it might be better to have the filtering
> > implemented inside `git cat-file --batch-all-objects` itself.
>
> It does sound eminently elegant to have each tool do one task
> and do it well, and to be able to flexibly combine them to achieve
> a larger task.

Yeah, agreed. It may be *convenient* to have an easy-to-reach option in
cat-file like '--exclude-type=tree,commit,tag' or something. But the
argument falls on a pretty slippery slope, as I think you note below.

> Is the object type the only thing that people often would want to
> base their filtering decision on?  Will we then see somebody else
> request a "--size-filter", and then somebody else realizes that the
> filtering criteria based on size need to be different between blobs
> (most likely counted in bytes) and trees (it may be more convenient
> to count the tree entries, not bytes)?  It sounds rather messy and
> we may be better off having such extensible logic in one place.
>
> Like rev-list's object list filtering, that is.

Yes, exactly. This definitely feels like a "do one thing and do it
well" case. `rev-list` is the tool we have for listing revisions and
objects, and it can produce output that is compatible with the kind of
input that other tools (like `cat-file`) can interpret.

Thanks,
Taylor


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-08  6:27   ` ZheNing Hu
@ 2023-04-09  1:28     ` Taylor Blau
  2023-04-09  2:19       ` Taylor Blau
  2023-04-09  6:47       ` ZheNing Hu
  0 siblings, 2 replies; 23+ messages in thread
From: Taylor Blau @ 2023-04-09  1:28 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Junio C Hamano, Git List, johncai86

On Sat, Apr 08, 2023 at 02:27:53PM +0800, ZheNing Hu wrote:
> Okay, you're right. It's not "ungraceful" to have each tool do its own thing.
> I should clarify that for a command like `git cat-file --batch-all-objects`,
> which traverses all objects anyway, it would be better to have a built-in
> filter. It might be more performant than using `git rev-list --filter |
> git cat-file --batch`?

Perhaps slightly so, since there is naturally going to be some
duplicated effort spawning processes, loading any shared libraries,
initializing the repository and reading its configuration, etc.

But I'd wager that these are all a negligible cost when compared to the
time we'll have to spend reading, inflating, and printing out all of the
objects in your repository.

Hopefully any task(s) where that cost *wouldn't* be negligible relative
to the rest of the job would be small enough that they could fit into a
single process.

> I don't think so. While `git rev-list` traverses objects and performs
> filtering within a revision, `git cat-file --batch-all-objects` traverses
> all loose and packed objects. It might be difficult to perfectly
> extract the filtering from `git rev-list` and apply it to `git cat-file`.

`rev-list`'s `--all` option does exactly the former: it looks at all
loose and packed objects instead of doing a traditional object walk.

Thanks,
Taylor


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-09  1:28     ` Taylor Blau
@ 2023-04-09  2:19       ` Taylor Blau
  2023-04-09  2:26         ` Taylor Blau
  2023-04-09  6:47       ` ZheNing Hu
  1 sibling, 1 reply; 23+ messages in thread
From: Taylor Blau @ 2023-04-09  2:19 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Junio C Hamano, Git List, johncai86

On Sat, Apr 08, 2023 at 09:28:28PM -0400, Taylor Blau wrote:
> > I don't think so. While `git rev-list` traverses objects and performs
> > filtering within a revision, `git cat-file --batch-all-objects` traverses
> > all loose and packed objects. It might be difficult to perfectly
> > extract the filtering from `git rev-list` and apply it to `git cat-file`.
>
> `rev-list`'s `--all` option does exactly the former: it looks at all
> loose and packed objects instead of doing a traditional object walk.

Sorry, this isn't right: --all pretends as if you passed all references
to it over argv, not to just look at the individual loose and packed
objects.

Thanks,
Taylor


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-09  2:19       ` Taylor Blau
@ 2023-04-09  2:26         ` Taylor Blau
  2023-04-09  6:51           ` ZheNing Hu
  0 siblings, 1 reply; 23+ messages in thread
From: Taylor Blau @ 2023-04-09  2:26 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Junio C Hamano, Git List, johncai86

On Sat, Apr 08, 2023 at 10:19:52PM -0400, Taylor Blau wrote:
> On Sat, Apr 08, 2023 at 09:28:28PM -0400, Taylor Blau wrote:
> > > I don't think so. While `git rev-list` traverses objects and performs
> > > filtering within a revision, `git cat-file --batch-all-objects` traverses
> > > all loose and packed objects. It might be difficult to perfectly
> > > extract the filtering from `git rev-list` and apply it to `git cat-file`.
> >
> > `rev-list`'s `--all` option does exactly the former: it looks at all
> > loose and packed objects instead of doing a traditional object walk.
>
> Sorry, this isn't right: --all pretends as if you passed all references
> to it over argv, not to just look at the individual loose and packed
> objects.

The right thing to do here if you wanted to get a listing of all blobs
in your repository regardless of their reachability or whether they are
loose or packed is:

    git cat-file --batch-check='%(objectname)' --batch-all-objects |
    git rev-list --objects --stdin --no-walk --filter='object:type=blob'

Or, if your filter is as straightforward as "is this object a blob or
not", you could write something like:

    git cat-file --batch-check --batch-all-objects |
      awk '$2 == "blob" { print $0 }'

Or you could tighten up the AWK expression by doing something like:

    git cat-file --batch-check='%(objecttype) %(objectname)' \
      --batch-all-objects | awk '/^blob / { print $2 }'

Sorry for the brain fart.

Thanks,
Taylor


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-09  1:28     ` Taylor Blau
  2023-04-09  2:19       ` Taylor Blau
@ 2023-04-09  6:47       ` ZheNing Hu
  2023-04-10 20:14         ` Jeff King
  1 sibling, 1 reply; 23+ messages in thread
From: ZheNing Hu @ 2023-04-09  6:47 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, Git List, johncai86

Taylor Blau <me@ttaylorr.com> wrote on Sun, Apr 9, 2023 at 09:28:
>
> On Sat, Apr 08, 2023 at 02:27:53PM +0800, ZheNing Hu wrote:
> > Okay, you're right. It's not "ungraceful" to have each tool do its own thing.
> > I should clarify that for a command like `git cat-file --batch-all-objects`,
> > which traverses all objects anyway, it would be better to have a built-in
> > filter. It might be more performant than using `git rev-list --filter |
> > git cat-file --batch`?
>
> Perhaps slightly so, since there is naturally going to be some
> duplicated effort spawning processes, loading any shared libraries,
> initializing the repository and reading its configuration, etc.
>
> But I'd wager that these are all a negligible cost when compared to the
> time we'll have to spend reading, inflating, and printing out all of the
> objects in your repository.
>

What you said makes sense. I implemented the --type-filter option for
git cat-file and compared the performance of outputting all blobs in the
git repository with and without using the type-filter. I found that the
difference was not significant.

time git cat-file --batch-all-objects --batch-check="%(objectname)
%(objecttype)" |
awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
17.10s user 0.27s system 102% cpu 16.987 total

time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
16.74s user 0.19s system 95% cpu 17.655 total

At first, I thought the processes that provide all blob oids by using
git rev-list or git cat-file --batch-all-objects --batch-check might waste
cpu, io, memory resources because they need to read a large number
of objects, and then they are read again by git cat-file --batch.
However, it seems that this is not actually the bottleneck in performance.

> Hopefully any task(s) where that cost *wouldn't* be negligible relative
> to the rest of the job would be small enough that they could fit into a
> single process.
>
> > I don't think so. While `git rev-list` traverses objects and performs
> > filtering within a revision, `git cat-file --batch-all-objects` traverses
> > all loose and packed objects. It might be difficult to perfectly
> > extract the filtering from `git rev-list` and apply it to `git cat-file`.
>
> `rev-list`'s `--all` option does exactly the former: it looks at all
> loose and packed objects instead of doing a traditional object walk.
>
> Thanks,
> Taylor


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-09  2:26         ` Taylor Blau
@ 2023-04-09  6:51           ` ZheNing Hu
  2023-04-10 20:01             ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: ZheNing Hu @ 2023-04-09  6:51 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, Git List, johncai86

Taylor Blau <me@ttaylorr.com> wrote on Sun, Apr 9, 2023 at 10:26:
>
> On Sat, Apr 08, 2023 at 10:19:52PM -0400, Taylor Blau wrote:
> > On Sat, Apr 08, 2023 at 09:28:28PM -0400, Taylor Blau wrote:
> > > > I don't think so. While `git rev-list` traverses objects and performs
> > > > filtering within a revision, `git cat-file --batch-all-objects` traverses
> > > > all loose and packed objects. It might be difficult to perfectly
> > > > extract the filtering from `git rev-list` and apply it to `git cat-file`.
> > >
> > > `rev-list`'s `--all` option does exactly the former: it looks at all
> > > loose and packed objects instead of doing a traditional object walk.
> >
> > Sorry, this isn't right: --all pretends as if you passed all references
> > to it over argv, not to just look at the individual loose and packed
> > objects.
>
> The right thing to do here if you wanted to get a listing of all blobs
> in your repository regardless of their reachability or whether they are
> loose or packed is:
>
>     git cat-file --batch-check='%(objectname)' --batch-all-objects |
>     git rev-list --objects --stdin --no-walk --filter='object:type=blob'
>

This looks like a mistake. Try passing a tree oid to git rev-list:

git rev-list --objects --stdin --no-walk --filter='object:type=blob'
<<< HEAD^{tree}
27f9fa75c6d8cdae7834f38006b631522c6a5ac3
4860bebd32f8d3f34c2382f097ac50c0b972d3a0 .cirrus.yml
c592dda681fecfaa6bf64fb3f539eafaf4123ed8 .clang-format
f9d819623d832113014dd5d5366e8ee44ac9666a .editorconfig
b0044cf272fec9b987e99c600d6a95bc357261c3 .gitattributes
...

> Or, if your filter is as straightforward as "is this object a blob or
> not", you could write something like:
>
>     git cat-file --batch-check --batch-all-objects |
>       awk '$2 == "blob" { print $0 }'
>
> Or you could tighten up the AWK expression by doing something like:
>
>     git cat-file --batch-check='%(objecttype) %(objectname)' \
>       --batch-all-objects | awk '/^blob / { print $2 }'
>
> Sorry for the brain fart.
>
> Thanks,
> Taylor


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-09  6:51           ` ZheNing Hu
@ 2023-04-10 20:01             ` Jeff King
  2023-04-10 23:20               ` Taylor Blau
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2023-04-10 20:01 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86

On Sun, Apr 09, 2023 at 02:51:34PM +0800, ZheNing Hu wrote:

> > The right thing to do here if you wanted to get a listing of all blobs
> > in your repository regardless of their reachability or whether they are
> > loose or packed is:
> >
> >     git cat-file --batch-check='%(objectname)' --batch-all-objects |
> >     git rev-list --objects --stdin --no-walk --filter='object:type=blob'
> >
> 
> This looks like a mistake. Try passing a tree oid to git rev-list:
> 
> git rev-list --objects --stdin --no-walk --filter='object:type=blob'
> <<< HEAD^{tree}
> 27f9fa75c6d8cdae7834f38006b631522c6a5ac3
> 4860bebd32f8d3f34c2382f097ac50c0b972d3a0 .cirrus.yml
> c592dda681fecfaa6bf64fb3f539eafaf4123ed8 .clang-format
> f9d819623d832113014dd5d5366e8ee44ac9666a .editorconfig
> b0044cf272fec9b987e99c600d6a95bc357261c3 .gitattributes
> ...

This is the expected behavior. The filter options are meant to support
partial clones, and the behavior is really "filter things we'd traverse
to". It is intentional that objects the caller directly asks for will
always be included in the output.

I certainly found that convention confusing, but I imagine it solves
some problems with the lazy-fetch requests themselves. Regardless,
that's how it works and it's not going to change anytime soon. :)

For that reason, and just for general flexibility, I think you are
mostly better off piping cat-file through an external filter program
(and then back to cat-file to get more data on each object).

-Peff


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-09  6:47       ` ZheNing Hu
@ 2023-04-10 20:14         ` Jeff King
  2023-04-11 14:09           ` ZheNing Hu
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2023-04-10 20:14 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86

On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:

> > Perhaps slightly so, since there is naturally going to be some
> > duplicated effort spawning processes, loading any shared libraries,
> > initializing the repository and reading its configuration, etc.
> >
> > But I'd wager that these are all a negligible cost when compared to the
> > time we'll have to spend reading, inflating, and printing out all of the
> > objects in your repository.
> 
> What you said makes sense. I implemented the --type-filter option for
> git cat-file and compared the performance of outputting all blobs in the
> git repository with and without using the type-filter. I found that the
> difference was not significant.
> 
> time git cat-file --batch-all-objects --batch-check="%(objectname)
> %(objecttype)" |
> awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
> 17.10s user 0.27s system 102% cpu 16.987 total
> 
> time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
> 16.74s user 0.19s system 95% cpu 17.655 total
> 
> At first, I thought the processes that provide all blob oids by using
> git rev-list or git cat-file --batch-all-objects --batch-check might waste
> cpu, io, memory resources because they need to read a large number
> of objects, and then they are read again by git cat-file --batch.
> However, it seems that this is not actually the bottleneck in performance.

Yeah, I think most of your time there is spent on the --batch command
itself, which is just putting through a lot of bytes. You might also try
with "--unordered". The default ordering for --batch-all-objects is in
sha1 order, which has pretty bad locality characteristics for delta
caching. Using --unordered goes in pack-order, which should be optimal.

E.g., in git.git, running:

  time \
    git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
    perl -lne 'print $1 if /^blob (.*)/' |
    git cat-file --batch >/dev/null

takes:

  real	0m29.961s
  user	0m29.128s
  sys	0m1.461s

Adding "--unordered" to the initial cat-file gives:

  real	0m1.970s
  user	0m2.170s
  sys	0m0.126s

So reducing the size of the actual --batch printing may make the
relative cost of using multiple processes much higher (I didn't apply
your --type-filter patches to test myself).

In general, I do think having a processing pipeline like this is OK, as
it's pretty flexible. But especially for smaller queries (even ones that
don't ask for the whole object contents), the per-object lookup costs
can start to dominate (especially in a repository that hasn't been
recently packed). Right now, even your "--batch --type-filter" example
is probably making at least two lookups per object, because we don't
have a way to open a "handle" to an object to check its type, and then
extract the contents conditionally. And of course with multiple
processes, we're naturally doing a separate lookup in each one.

So a nice thing about being able to do the filtering in one process is
that we could _eventually_ do it all with one object lookup. But I'd
probably wait on adding something like --type-filter until we have an
internal single-lookup API, and then we could time it to see how much
speedup we can get.

-Peff


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-10 20:01             ` Jeff King
@ 2023-04-10 23:20               ` Taylor Blau
  0 siblings, 0 replies; 23+ messages in thread
From: Taylor Blau @ 2023-04-10 23:20 UTC (permalink / raw)
  To: Jeff King; +Cc: ZheNing Hu, Junio C Hamano, Git List, johncai86

On Mon, Apr 10, 2023 at 04:01:41PM -0400, Jeff King wrote:
> For that reason, and just for general flexibility, I think you are
> mostly better off piping cat-file through an external filter program
> (and then back to cat-file to get more data on each object).

Yeah, agreed. The convention of printing objects listed on the
command-line regardless of whether they would pass through the object
filter is confusing to me, too.

But using `rev-list --no-walk` to accomplish the same job for a filter
as trivial as the type-level one feels overkill anyway, so I agree that
just relying on `cat-file` to produce the list of objects, filtering it
yourself, and then handing it back to `cat-file` is the easiest thing to
do.

Thanks,
Taylor


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-10 20:14         ` Jeff King
@ 2023-04-11 14:09           ` ZheNing Hu
  2023-04-12  7:43             ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: ZheNing Hu @ 2023-04-11 14:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86

Jeff King <peff@peff.net> wrote on Tue, Apr 11, 2023 at 04:14:
>
> On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:
>
> > > Perhaps slightly so, since there is naturally going to be some
> > > duplicated effort spawning processes, loading any shared libraries,
> > > initializing the repository and reading its configuration, etc.
> > >
> > > But I'd wager that these are all a negligible cost when compared to the
> > > time we'll have to spend reading, inflating, and printing out all of the
> > > objects in your repository.
> >
> > What you said makes sense. I implemented the --type-filter option for
> > git cat-file and compared the performance of outputting all blobs in the
> > git repository with and without using the type-filter. I found that the
> > difference was not significant.
> >
> > time git cat-file --batch-all-objects --batch-check="%(objectname)
> > %(objecttype)" |
> > awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
> > 17.10s user 0.27s system 102% cpu 16.987 total
> >
> > time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
> > 16.74s user 0.19s system 95% cpu 17.655 total
> >
> > At first, I thought the processes that provide all blob oids by using
> > git rev-list or git cat-file --batch-all-objects --batch-check might waste
> > cpu, io, memory resources because they need to read a large number
> > of objects, and then they are read again by git cat-file --batch.
> > However, it seems that this is not actually the bottleneck in performance.
>
> Yeah, I think most of your time there is spent on the --batch command
> itself, which is just putting through a lot of bytes. You might also try
> with "--unordered". The default ordering for --batch-all-objects is in
> sha1 order, which has pretty bad locality characteristics for delta
> caching. Using --unordered goes in pack-order, which should be optimal.
>
> E.g., in git.git, running:
>
>   time \
>     git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
>     perl -lne 'print $1 if /^blob (.*)/' |
>     git cat-file --batch >/dev/null
>
> takes:
>
>   real  0m29.961s
>   user  0m29.128s
>   sys   0m1.461s
>
> Adding "--unordered" to the initial cat-file gives:
>
>   real  0m1.970s
>   user  0m2.170s
>   sys   0m0.126s
>
> So reducing the size of the actual --batch printing may make the
> relative cost of using multiple processes much higher (I didn't apply
> your --type-filter patches to test myself).
>

You are right. Adding the --unordered option can keep the
time-consuming sorting process from affecting the test results.

time git cat-file --unordered --batch-all-objects \
--batch-check="%(objectname) %(objecttype)" | \
awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null

4.17s user 0.23s system 109% cpu 4.025 total

time git cat-file --unordered --batch-all-objects --batch
--type-filter=blob >/dev/null

3.84s user 0.17s system 97% cpu 4.099 total

It looks like the difference is not significant either.

After all, the truly time-consuming part is reading the entire
contents of each blob, whereas git cat-file --batch-check only
reads the first few bytes of each object.
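
For instance, in a scratch repository (a sketch; the oid below is the
well-known id of the six-byte blob "hello\n"):

```shell
# --batch-check reports type and size from the object header alone;
# the blob contents are never fully inflated.
echo hello | git hash-object -w --stdin       # writes a loose blob
echo ce013625030ba8dba906f756967f9e9ca394464a |
  git cat-file --batch-check
# prints: ce013625030ba8dba906f756967f9e9ca394464a blob 6
```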

> In general, I do think having a processing pipeline like this is OK, as
> it's pretty flexible. But especially for smaller queries (even ones that
> don't ask for the whole object contents), the per-object lookup costs
> can start to dominate (especially in a repository that hasn't been
> recently packed). Right now, even your "--batch --type-filter" example
> is probably making at least two lookups per object, because we don't
> have a way to open a "handle" to an object to check its type, and then
> extract the contents conditionally. And of course with multiple
> processes, we're naturally doing a separate lookup in each one.
>

Yes, the type of the object is encapsulated in the header of the loose
object file or the object entry header of the pack file. We have to read
it to get the object type. This is a lingering question I have had:
why does git put the type/size in the file data instead of storing it as some
kind of metadata elsewhere?

> So a nice thing about being able to do the filtering in one process is
> that we could _eventually_ do it all with one object lookup. But I'd
> probably wait on adding something like --type-filter until we have an
> internal single-lookup API, and then we could time it to see how much
> speedup we can get.
>

I am highly skeptical of this "internal single-lookup API". Do we really
need an extra metadata table to record all objects?
Something like: metadata: {oid: type, size}?

> -Peff

ZheNing Hu


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-11 14:09           ` ZheNing Hu
@ 2023-04-12  7:43             ` Jeff King
  2023-04-12  9:57               ` ZheNing Hu
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2023-04-12  7:43 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86

On Tue, Apr 11, 2023 at 10:09:33PM +0800, ZheNing Hu wrote:

> > So reducing the size of the actual --batch printing may make the
> > relative cost of using multiple processes much higher (I didn't apply
> > your --type-filter patches to test myself).
> >
> 
> You are right. Adding the --unordered option can keep the
> time-consuming sorting process from affecting the test results.

Just to be clear: it's not the cost of sorting, but rather that
accessing the object contents in a sub-optimal order is much worse (and
that sub-optimal order happens to be "sorted by sha1", since that is
effectively random with respect to the contents).

> time git cat-file --unordered --batch-all-objects \
> --batch-check="%(objectname) %(objecttype)" | \
> awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
> 
> 4.17s user 0.23s system 109% cpu 4.025 total
> 
> time git cat-file --unordered --batch-all-objects --batch
> --type-filter=blob >/dev/null
> 
> 3.84s user 0.17s system 97% cpu 4.099 total
> 
> It looks like the difference is not significant either.

OK, good, that means we can probably not worry about it. :)

> > In general, I do think having a processing pipeline like this is OK, as
> > it's pretty flexible. But especially for smaller queries (even ones that
> > don't ask for the whole object contents), the per-object lookup costs
> > can start to dominate (especially in a repository that hasn't been
> > recently packed). Right now, even your "--batch --type-filter" example
> > is probably making at least two lookups per object, because we don't
> > have a way to open a "handle" to an object to check its type, and then
> > extract the contents conditionally. And of course with multiple
> > processes, we're naturally doing a separate lookup in each one.
> >
> 
> Yes, the type of the object is encapsulated in the header of the loose
> object file or the object entry header of the pack file. We have to read
> it to get the object type. This may be a lingering question I have had:
> why does git put the type/size in the file data instead of storing it as some
> kind of metadata elsewhere?

It's not just metadata; it's actually part of what we hash to get the
object id (though of course it doesn't _have_ to be stored in a linear
buffer, as the pack storage shows). But for loose objects, where would
such metadata be? And accessing it isn't too expensive; we only zlib
inflate the first few bytes (the main cost is in the syscalls to find
and open the file).

For packed object, it effectively is metadata, just stuck at the front
of the object contents, rather than in a separate table. That lets us
use the same .idx file for finding that metadata as we do for the
contents themselves (at the slight cost that if you're _just_ accessing
metadata, the results are sparser within the file, which has worse
behavior for cold-cache disks).

But when I say that lookup costs dominate, what I mean is that we'd
spend a lot of our time binary searching within the pack .idx file, or
falling back to syscalls to look for loose objects.

> > So a nice thing about being able to do the filtering in one process is
> > that we could _eventually_ do it all with one object lookup. But I'd
> > probably wait on adding something like --type-filter until we have an
> > internal single-lookup API, and then we could time it to see how much
> > speedup we can get.
> 
> I am highly skeptical of this "internal single-lookup API". Do we really
> need an extra metadata table to record all objects?
> Something like: metadata: {oid: type, size}?

No, I don't mean changing the storage at all. I mean that rather than
doing this:

  /* get type, size, etc, for --batch format */
  type = oid_object_info(&oid, &size);

  /* now get the contents for --batch to write them itself; but note
   * that this searches for the entry again within all packs, etc */
  contents = read_object_file(oid, &type, &size);

as the cat-file code now does (because the first call is in
batch_object_write(), and the latter in print_object_or_die()), they
could be a single call that does the lookup once.

We could actually do that today, since the object contents are
eventually fed from oid_object_info_extended(), and we know ahead of
time that we want both the metadata and the contents. But that wouldn't
work if we filtered by type, etc.

I'm not sure how much of a speedup it would yield in practice, though.
If you're printing the object contents, then the extra lookup is
probably not that expensive by comparison.

-Peff


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-12  7:43             ` Jeff King
@ 2023-04-12  9:57               ` ZheNing Hu
  2023-04-14  7:30                 ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: ZheNing Hu @ 2023-04-12  9:57 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86

Jeff King <peff@peff.net> 于2023年4月12日周三 15:43写道:
>
> On Tue, Apr 11, 2023 at 10:09:33PM +0800, ZheNing Hu wrote:
>
> > > So reducing the size of the actual --batch printing may make the
> > > relative cost of using multiple processes much higher (I didn't apply
> > > your --type-filter patches to test myself).
> > >
> >
> > You are right. Adding the --unordered option can keep the
> > time-consuming sorting process from affecting the test results.
>
> Just to be clear: it's not the cost of sorting, but rather that
> accessing the object contents in a sub-optimal order is much worse (and
> that sub-optimal order happens to be "sorted by sha1", since that is
> effectively random with respect to the contents).
>

Okay, thanks for correcting me. Reading the packfile in SHA1 order is
effectively a random-access pattern, which incurs additional overhead.

> > time git cat-file --unordered --batch-all-objects \
> > --batch-check="%(objectname) %(objecttype)" | \
> > awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
> >
> > 4.17s user 0.23s system 109% cpu 4.025 total
> >
> > time git cat-file --unordered --batch-all-objects --batch
> > --type-filter=blob >/dev/null
> >
> > 3.84s user 0.17s system 97% cpu 4.099 total
> >
> > It looks like the difference is not significant either.
>
> OK, good, that means we can probably not worry about it. :)
>
> > > In general, I do think having a processing pipeline like this is OK, as
> > > it's pretty flexible. But especially for smaller queries (even ones that
> > > don't ask for the whole object contents), the per-object lookup costs
> > > can start to dominate (especially in a repository that hasn't been
> > > recently packed). Right now, even your "--batch --type-filter" example
> > > is probably making at least two lookups per object, because we don't
> > > have a way to open a "handle" to an object to check its type, and then
> > > extract the contents conditionally. And of course with multiple
> > > processes, we're naturally doing a separate lookup in each one.
> > >
> >
> > Yes, the type of the object is encapsulated in the header of the loose
> > object file or the object entry header of the pack file. We have to read
> > it to get the object type. This may be a lingering question I have had:
> > why does git put the type/size in the file data instead of storing it as some
> > kind of metadata elsewhere?
>
> It's not just metadata; it's actually part of what we hash to get the
> object id (though of course it doesn't _have_ to be stored in a linear
> buffer, as the pack storage shows).

I'm still puzzled why git calculated the object id based on {type, size, data}
 together instead of just {data}?

> But for loose objects, where would
> such metadata be? And accessing it isn't too expensive; we only zlib
> inflate the first few bytes (the main cost is in the syscalls to find
> and open the file).
>

I may not have much experience in this area. It looks like I should
go ahead and do some performance testing to compare the cost of finding
and opening loose objects vs. reading and inflating them.

> For packed object, it effectively is metadata, just stuck at the front
> of the object contents, rather than in a separate table. That lets us
> use the same .idx file for finding that metadata as we do for the
> contents themselves (at the slight cost that if you're _just_ accessing
> metadata, the results are sparser within the file, which has worse
> behavior for cold-cache disks).
>

Agree. But what if there is a metadata table in the .idx file?
We can even know the type and size of the object without accessing
the packfile.

> But when I say that lookup costs dominate, what I mean is that we'd
> spend a lot of our time binary searching within the pack .idx file, or
> falling back to syscalls to look for loose objects.
>

Alright, binary search in .idx may indeed be more time-consuming than
reading type and size from the packfile.
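Conceptually, the .idx lookup discussed above is a binary search over a
table of entries sorted by oid that yields an offset into the packfile.
A toy Python model, with made-up oids and offsets rather than git's
actual fixed-size binary records:

```python
import bisect
import hashlib

# Toy .idx: (oid, pack offset) pairs sorted by oid; the offsets here
# are fabricated for illustration.
entries = sorted(
    (hashlib.sha1(bytes([i])).hexdigest(), 100 * i) for i in range(32)
)
oids = [oid for oid, _ in entries]

def lookup_offset(oid):
    """Binary-search the sorted oid table, as a pack .idx lookup does."""
    i = bisect.bisect_left(oids, oid)
    if i < len(oids) and oids[i] == oid:
        return entries[i][1]
    return None  # not in this pack; git would fall back to loose objects

print(lookup_offset(hashlib.sha1(bytes([7])).hexdigest()))  # -> 700
```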

> > > So a nice thing about being able to do the filtering in one process is
> > > that we could _eventually_ do it all with one object lookup. But I'd
> > > probably wait on adding something like --type-filter until we have an
> > > internal single-lookup API, and then we could time it to see how much
> > > speedup we can get.
> >
> > I am highly skeptical of this "internal single-lookup API". Do we really
> > need an extra metadata table to record all objects?
> > Something like: metadata: {oid: type, size}?
>
> No, I don't mean changing the storage at all. I mean that rather than
> doing this:
>
>   /* get type, size, etc, for --batch format */
>   type = oid_object_info(&oid, &size);
>
>   /* now get the contents for --batch to write them itself; but note
>    * that this searches for the entry again within all packs, etc */
>   contents = read_object_file(oid, &type, &size);
>
> as the cat-file code now does (because the first call is in
> batch_object_write(), and the latter in print_object_or_die()), they
> could be a single call that does the lookup once.
>
> We could actually do that today, since the object contents are
> eventually fed from oid_object_info_extended(), and we know ahead of
> time that we want both the metadata and the contents. But that wouldn't
> work if we filtered by type, etc.
>

So the single read you mentioned earlier refers to combining the two
read operations, getting the type/size and getting the content, into
one when we know for certain that we need to retrieve the content (in
order to save the overhead of one binary search).

> I'm not sure how much of a speedup it would yield in practice, though.
> If you're printing the object contents, then the extra lookup is
> probably not that expensive by comparison.
>

I feel like this solution may not be feasible. After we get the type and size
for the first time, we go through different output processes for different types
of objects: use `stream_blob()` for blobs, and `read_object_file()` with
`batch_write()` for other objects. If we obtain the content of a blob in one
single read operation, then the performance optimization provided by
`stream_blob()` would be invalidated.

> -Peff


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-12  9:57               ` ZheNing Hu
@ 2023-04-14  7:30                 ` Jeff King
  2023-04-14 12:17                   ` ZheNing Hu
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2023-04-14  7:30 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86

On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:

> > It's not just metadata; it's actually part of what we hash to get the
> > object id (though of course it doesn't _have_ to be stored in a linear
> > buffer, as the pack storage shows).
> 
> I'm still puzzled why git calculated the object id based on {type, size, data}
>  together instead of just {data}?

You'd have to ask Linus for the original reasoning. ;)

But one nice thing about including these, especially the type, in the
hash, is that the object id gives the complete context for an object.
So if another object claims to point to a tree, say, and points to a blob
instead, we can detect that problem immediately.

Or worse, think about something like "git show 1234abcd". If the
metadata was not part of the object, then how would we know if you
wanted to show a commit, or a blob (that happens to look like a commit),
etc? That metadata could be carried outside the hash, but then it has to
be stored somewhere, and is subject to ending up mismatched to the
contents. Hashing all of it (including the size) makes consistency
checking much easier.
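The point about aliasing can be shown directly: the object id is the
hash of "<type> <size>\0<payload>", so the same bytes stored under a
different type get a different id. A Python sketch, assuming the
default SHA-1 object format:

```python
import hashlib

def git_object_id(obj_type, payload):
    """Hash "<type> <size>\\0<payload>", as git does for object ids."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()

data = b"hello\n"
# The well-known id of the "hello\n" blob:
assert git_object_id("blob", data) == "ce013625030ba8dba906f756967f9e9ca394464a"
# The same bytes under a different type hash to a different id, so
# one object type can never be mistaken for (or alias) another:
assert git_object_id("tag", data) != git_object_id("blob", data)
```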

> > For packed object, it effectively is metadata, just stuck at the front
> > of the object contents, rather than in a separate table. That lets us
> > use the same .idx file for finding that metadata as we do for the
> > contents themselves (at the slight cost that if you're _just_ accessing
> > metadata, the results are sparser within the file, which has worse
> > behavior for cold-cache disks).
> 
> Agree. But what if there is a metadata table in the .idx file?
> We can even know the type and size of the object without accessing
> the packfile.

I'm not sure it would be any faster than accessing the packfile. If you
stick the metadata in the .idx file's oid lookup table, then lookups
perform a bit worse because you're wasting memory cache. If you make a
separate table in the .idx file that's OK, but I'm not sure it's
consistently better than finding the data in the packfile.

The oid lookup table gives you a way to index the table in
constant-time (if you store the table as fixed-size entries in sha1
order), but we can also access the packfile in constant-time (the idx
table gives us offsets). The idx metadata table would have better cache
behavior if you're only looking at metadata, and not contents. But
otherwise it's worse (since you have to hit the packfile, too). And I
cheated a bit to say "fixed-size" above; the packfile metadata is in a
variable-length encoding, so in some ways it's more efficient.
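The variable-length entry header mentioned here packs a 3-bit type and
the size into as few bytes as possible: the first byte holds the type
plus the low 4 bits of the size, and each continuation byte (MSB set)
contributes 7 more size bits. A Python sketch modeled on git's pack
format, not the actual implementation:

```python
OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG = 1, 2, 3, 4

def encode_entry_header(obj_type, size):
    """First byte: type (bits 4-6) + low 4 size bits; continuation
    bytes (MSB set) each carry 7 more size bits, low bits first."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)

def decode_entry_header(data):
    byte = data[0]
    obj_type = (byte >> 4) & 0x07
    size = byte & 0x0F
    shift, i = 4, 1
    while byte & 0x80:
        byte = data[i]
        size |= (byte & 0x7F) << shift
        shift += 7
        i += 1
    return obj_type, size, i   # i = header length in bytes

# Round-trip a few sizes; small sizes need only one header byte.
for size in (0, 15, 16, 12345, 1 << 30):
    hdr = encode_entry_header(OBJ_BLOB, size)
    assert decode_entry_header(hdr) == (OBJ_BLOB, size, len(hdr))
```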

So I doubt it would make any operations appreciably faster, and even if
it did, you'd possibly be trading off versus other operations. I think
the more interesting metadata is not type/size, but properties such as
those stored by the commit graph. And there we do have separate tables
for fast access (and it's a _lot_ faster, because it's helping us avoid
inflating the object contents).

> > I'm not sure how much of a speedup it would yield in practice, though.
> > If you're printing the object contents, then the extra lookup is
> > probably not that expensive by comparison.
> >
> 
> I feel like this solution may not be feasible. After we get the type and size
> for the first time, we go through different output processes for different types
> of objects: use `stream_blob()` for blobs, and `read_object_file()` with
> `batch_write()` for other objects. If we obtain the content of a blob in one
> single read operation, then the performance optimization provided by
> `stream_blob()` would be invalidated.

Good point. So yeah, even to use it in today's code you'd need something
conditional. A few years ago I played with an option for object_info
that would let the caller say "please give me the object contents if
they are smaller than N bytes, otherwise don't".

And that would let many call-sites get type, size, and content together
most of the time (for small objects), and then stream only when
necessary. I still have the patches, and running them now it looks like
there's about a 10% speedup running:

  git cat-file --unordered --batch-all-objects --batch >/dev/null

Other code paths dealing with blobs would likewise get a small speedup,
I'd think. I don't remember why I didn't send it. I think there was some
ugly refactoring that I needed to double-check, and my attention just
got pulled elsewhere. The messy patches are at:

  https://github.com/peff/git jk/object-info-round-trip

if you're interested.

-Peff


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-14  7:30                 ` Jeff King
@ 2023-04-14 12:17                   ` ZheNing Hu
  2023-04-14 15:58                     ` Junio C Hamano
  2023-04-14 17:04                     ` Linus Torvalds
  0 siblings, 2 replies; 23+ messages in thread
From: ZheNing Hu @ 2023-04-14 12:17 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, Junio C Hamano, Git List, johncai86, Linus Torvalds

Jeff King <peff@peff.net> 于2023年4月14日周五 15:30写道:
>
> On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
>
> > > It's not just metadata; it's actually part of what we hash to get the
> > > object id (though of course it doesn't _have_ to be stored in a linear
> > > buffer, as the pack storage shows).
> >
> > I'm still puzzled why git calculated the object id based on {type, size, data}
> >  together instead of just {data}?
>
> You'd have to ask Linus for the original reasoning. ;)
>
> But one nice thing about including these, especially the type, in the
> hash, is that the object id gives the complete context for an object.
> So if another object claims to point to a tree, say, and points to a blob
> instead, we can detect that problem immediately.
>
> Or worse, think about something like "git show 1234abcd". If the
> metadata was not part of the object, then how would we know if you
> wanted to show a commit, or a blob (that happens to look like a commit),
> etc? That metadata could be carried outside the hash, but then it has to
> be stored somewhere, and is subject to ending up mismatched to the
> contents. Hashing all of it (including the size) makes consistency
> checking much easier.
>

Oh, you are right, this could be to prevent conflicts between Git objects
with identical content but different types. However, I always associate
Git with the file system, where metadata such as file type and size is
stored in the inode, while the file data is stored in separate chunks.

> > > For packed object, it effectively is metadata, just stuck at the front
> > > of the object contents, rather than in a separate table. That lets us
> > > use the same .idx file for finding that metadata as we do for the
> > > contents themselves (at the slight cost that if you're _just_ accessing
> > > metadata, the results are sparser within the file, which has worse
> > > behavior for cold-cache disks).
> >
> > Agree. But what if there is a metadata table in the .idx file?
> > We can even know the type and size of the object without accessing
> > the packfile.
>
> I'm not sure it would be any faster than accessing the packfile. If you
> stick the metadata in the .idx file's oid lookup table, then lookups
> perform a bit worse because you're wasting memory cache. If you make a
> separate table in the .idx file that's OK, but I'm not sure it's
> consistently better than finding the data in the packfile.
>

Yes, but it may be very convenient if we need to filter by object
type or size.

> The oid lookup table gives you a way to index the table in
> constant-time (if you store the table as fixed-size entries in sha1
> order), but we can also access the packfile in constant-time (the idx
> table gives us offsets). The idx metadata table would have better cache
> behavior if you're only looking at metadata, and not contents. But
> otherwise it's worse (since you have to hit the packfile, too). And I
> cheated a bit to say "fixed-size" above; the packfile metadata is in a
> variable-length encoding, so in some ways it's more efficient.
>

Yes, if we only use git cat-file --batch-check, we may be able to improve
performance by avoiding access to the pack file. Additionally, I think
this metadata table is very suitable for filtering and aggregating operations.

> So I doubt it would make any operations appreciably faster, and even if
> it did, you'd possibly be trading off versus other operations. I think
> the more interesting metadata is not type/size, but properties such as
> those stored by the commit graph. And there we do have separate tables
> for fast access (and it's a _lot_ faster, because it's helping us avoid
> inflating the object contents).
>

Yeah, optimizing the retrieval of metadata such as type/size may not provide
as much benefit as recording the commit properties in the metadata table,
like the commit graph optimization does.

> > > I'm not sure how much of a speedup it would yield in practice, though.
> > > If you're printing the object contents, then the extra lookup is
> > > probably not that expensive by comparison.
> > >
> >
> > I feel like this solution may not be feasible. After we get the type and size
> > for the first time, we go through different output processes for different types
> > of objects: use `stream_blob()` for blobs, and `read_object_file()` with
> > `batch_write()` for other objects. If we obtain the content of a blob in one
> > single read operation, then the performance optimization provided by
> > `stream_blob()` would be invalidated.
>
> Good point. So yeah, even to use it in today's code you'd need something
> conditional. A few years ago I played with an option for object_info
> that would let the caller say "please give me the object contents if
> they are smaller than N bytes, otherwise don't".
>
> And that would let many call-sites get type, size, and content together
> most of the time (for small objects), and then stream only when
> necessary. I still have the patches, and running them now it looks like
> there's about a 10% speedup running:
>
>   git cat-file --unordered --batch-all-objects --batch >/dev/null
>
> Other code paths dealing with blobs would likewise get a small speedup,
> I'd think. I don't remember why I didn't send it. I think there was some
> ugly refactoring that I needed to double-check, and my attention just
> got pulled elsewhere. The messy patches are at:
>
>   https://github.com/peff/git jk/object-info-round-trip
>
> if you're interested.
>

Alright, this does feel a bit hackish: fetching the content of most
objects on the first read, while letting blobs larger than N be
streamed via stream_blob().

I feel like this single-read optimization is a bit off-topic; to quote
your earlier message:

> So a nice thing about being able to do the filtering in one process is
> that we could _eventually_ do it all with one object lookup. But I'd
> probably wait on adding something like --type-filter until we have an
> internal single-lookup API, and then we could time it to see how much
> speedup we can get.

This single-read optimization doesn't seem to provide much benefit for
implementing object filters, because by that point we have already read
the content of the object anyway, no?

ZheNing Hu


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-14 12:17                   ` ZheNing Hu
@ 2023-04-14 15:58                     ` Junio C Hamano
  2023-04-16 11:15                       ` ZheNing Hu
  2023-04-14 17:04                     ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Junio C Hamano @ 2023-04-14 15:58 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Jeff King, Taylor Blau, Git List, johncai86, Linus Torvalds

ZheNing Hu <adlternative@gmail.com> writes:

> Oh, you are right, this could be to prevent conflicts between Git objects
> with identical content but different types. However, I always associate
> Git with the file system, where metadata such as file type and size is
> stored in the inode, while the file data is stored in separate chunks.

I am afraid the presentation order Peff used caused a bit of
confusion.  The true reason is what Peff brought up as "Or worse".
We need to be able to tell, given only the name of an object,
everything that we need to know about the object, and for that, we
need the type information when we ask for an object by its name.
Having size embedded in the data that comes back to us when we
consult object database with an object name helps the implementation
to pre-allocate a buffer and then inflate into it--there is no
fundamental reason why it should be there.

It is a secondary problem created by the design choice that we store
type together with contents, that the object type recorded in a tree
entry may contradict the actual type of the object recorded in the
tree entry.  We could have declared that the object type found in a
tree entry is to be trusted, if we didn't record the type in the
object database together with the object contents.

I think your original question was not "why do we store type and
size together with the contents?", but was "why do we include in the
hash computation?", and all of the above discuss related tangent
without touching the original question.

The need to have type or size available when we ask the object
database for data associated with the object does not necessarily
mean they must be hashed together with the contents.  It was done
merely because "why not? that way, we do not have to worry about
catching corrupt values for type and size information we want to
store together with the contents".  IOW, we could have checksummed
these two pieces of information separately, but why bother?


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-14 12:17                   ` ZheNing Hu
  2023-04-14 15:58                     ` Junio C Hamano
@ 2023-04-14 17:04                     ` Linus Torvalds
  2023-04-16 12:06                       ` Felipe Contreras
  2023-04-16 12:43                       ` ZheNing Hu
  1 sibling, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2023-04-14 17:04 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Jeff King, Taylor Blau, Junio C Hamano, Git List, johncai86

On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Jeff King <peff@peff.net> 于2023年4月14日周五 15:30写道:
> >
> > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
> > >
> > > I'm still puzzled why git calculated the object id based on {type, size, data}
> > >  together instead of just {data}?
> >
> > You'd have to ask Linus for the original reasoning. ;)

I originally thought of the git object store as "tagged pointers".

That actually caused confusion initially when I tried to explain this
to SCM people, because "tag" means something very different in an SCM
environment than it means in computer architecture.

And the implication of a tagged pointer is that you have two parts of
it - the "tag" and the "address". Both are relevant at all points.

This isn't quite as obvious in everyday modern git usage, because a lot
of uses end up _only_ using the "address" (aka SHA1), but it's very
much part of the object store design. Internally, the object layout
never uses just the SHA1, it's all "type:SHA1", even if sometimes the
types are implied (ie the tree object doesn't spell out "blob", but
it's still explicit in the mode bits).

This is very very obvious in "git cat-file", which was one of the
original scripts in the first commit (but even there the tag/type has
changed meaning over time: the very first version didn't use it as
input at all, then it started verifying it, and then later it got the
more subtle context of "peel the tags until you find this type").

You can also see this in the original README (again, go look at that
first git commit): the README talks about the "tag of their type".

Of course, in practice git then walked away from having to specify the
type all the time. It started even in that original release, in that
the HEAD file never contained the type - because it was implicit (a
HEAD is always a commit).

So we ended up having a lot of situations like that where the actual
tag part was implicit from context, and these days people basically
never refer to the "full" object name with tag, but only the SHA1
address.

So now we have situations where the type really has to be looked up
dynamically, because it's not explicitly encoded anywhere. While HEAD
is supposed to always be a commit, other refs can be pretty much
anything, and can point to a tag object, a commit, a tree or a blob.
So then you actually have to look up the type based on the address.

End result: these days people don't even think of git objects as
"tagged pointers".  Even internally in git, lots of code just passes
the "object name" along without any tag/type, just the raw SHA1 / OID.

So that original "everything is a tagged pointer" model is much less true
than it used to be, and now, instead of having tagged pointers, you
mostly end up with just "bare pointers" and look up the type
dynamically from there.

And that "look up the type in the object" is possible because even
originally, I did *not* want any kind of "object type aliasing".

So even when looking up the object with the full "tag:pointer", the
encoding of the object itself then also contains that object type, so
that you can cross-check that you used the right tag.

That said, you *can* see some of the effects of this "tagged pointers"
in how the internals do things like

    struct commit *commit = lookup_commit(repo, &oid);

which conceptually very much is about tagged pointers. And the fact
that two objects cannot alias is actually somewhat encoded in that: a
"struct commit" contains a "struct object" as a member. But so does
"struct blob" - and the two "struct object" cases are never the same
"object".

So there's never any worry about "could blob.object be the same object
as commit.object"?

That is actually inherent in the code, in how "lookup_commit()"
actually does lookup_object() and then does object_as_type(OBJ_COMMIT)
on the result.
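A toy model of that type-checked lookup (hypothetical names and a dict
store purely for illustration; git's real code works on struct object):

```python
# Toy object store: every object carries its type, and a typed lookup
# verifies the "tag" half of the tagged pointer on each access, like
# lookup_object() followed by object_as_type().
store = {
    "c0ffee": ("commit", b"tree 1234...\n"),
    "b10b00": ("blob", b"hello\n"),
}

class TypeMismatch(Exception):
    pass

def lookup_as_type(oid, expected_type):
    """Return the object's data, but only if the type tag matches."""
    obj_type, data = store[oid]
    if obj_type != expected_type:
        raise TypeMismatch(f"{oid} is a {obj_type}, not a {expected_type}")
    return data

assert lookup_as_type("b10b00", "blob") == b"hello\n"
try:
    lookup_as_type("b10b00", "commit")   # wrong tag: must not alias
    raise AssertionError("should have raised")
except TypeMismatch:
    pass
```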

> Oh, you are right, this could be to prevent conflicts between Git objects
> with identical content but different types. However, I always associate
> Git with the file system, where metadata such as file type and size is
> stored in the inode, while the file data is stored in separate chunks.

See above: yes, git design was *also* influenced heavily by
filesystems, but that was mostly in the sense of "this is how to
encode these things without undue pain".

The object database being immutable was partly a security and safety
measure, but it was also very much partly a "rewriting files is going
to be a major pain from a filesystem consistency standpoint - don't do
it".

But even more than a filesystem design, it's an "computer
architecture" design. Think of the git object store as a very abstract
computer architecture that has tagged pointers, stable storage, and no
aliasing - and where the tag is actually verified at each lookup.

The "no aliasing" means that no two distinct pointers can point to the
same data. So a tagged pointer of type "commit" can not point to the
same object as a tagged pointer of type "blob". They are distinct
pointers, even if (maybe) the commit object encoding ends up then
being identical to a blob object.

And as mentioned, that "verified at each lookup" has mostly gone away,
and "each lookup" has become more of a "can be verified by fsck", but
it's probably still a good thing to think that way.

You still have "lookup_object_by_type()" internally in git that takes
the full tagged pointer, but almost nobody uses it any more. The
closest you get is those "lookup_commit()" things (which are fairly
common, still).

              Linus


* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-14 15:58                     ` Junio C Hamano
@ 2023-04-16 11:15                       ` ZheNing Hu
  0 siblings, 0 replies; 23+ messages in thread
From: ZheNing Hu @ 2023-04-16 11:15 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Taylor Blau, Git List, johncai86, Linus Torvalds

Junio C Hamano <gitster@pobox.com> 于2023年4月14日周五 23:58写道:
>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > Oh, you are right, this could be to prevent conflicts between Git objects
> > with identical content but different types. However, I always associate
> > Git with the file system, where metadata such as file type and size is
> > stored in the inode, while the file data is stored in separate chunks.
>
> I am afraid the presentation order Peff used caused a bit of
> confusion.  The true reason is what Peff brought up as "Or worse".
> We need to be able to tell, given only the name of an object,
> everything that we need to know about the object, and for that, we
> need the type information when we ask for an object by its name.
> Having size embedded in the data that comes back to us when we
> consult object database with an object name helps the implementation
> to pre-allocate a buffer and then inflate into it--there is no
> fundamental reason why it should be there.
>

Yes, I think I understand the point now. Since Git addresses objects
by their content, if the type information were not included in the object,
we could not easily tell what type of Git object corresponds to a given
object ID. Moreover, without type and size information in Git objects,
we would need to maintain a large number of external tables to record
this information just to inflate an object and identify its type.
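
[Editorial aside: the "no external tables" point is visible in the
loose-object encoding itself - everything needed to interpret an object
travels with it. A hedged sketch (not Git's actual code) of writing and
re-reading a loose object:]

```python
import zlib

def encode_loose(obj_type, data):
    # A loose object is zlib-deflated "<type> <size>\0<data>".
    return zlib.compress(f"{obj_type} {len(data)}".encode() + b"\x00" + data)

def decode_loose(stored):
    # Inflate and split the header: type and size come back with the data,
    # so no external table is needed to interpret the object.
    raw = zlib.decompress(stored)
    header, _, data = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    return obj_type, int(size), data

stored = encode_loose("blob", b"hello world\n")
print(decode_loose(stored))  # ('blob', 12, b'hello world\n')
```

The size in the header also lets a reader pre-allocate the inflate buffer,
as Junio notes above.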

> It is a secondary problem created by the design choice that we store
> type together with contents, that the object type recorded in a tree
> entry may contradict the actual type of the object recorded in the
> tree entry.  We could have declared that the object type found in a
> tree entry is to be trusted, if we didn't record the type in the
> object database together with the object contents.
>

Yes, that may not be crucial, but including the type information in
Git objects helps better validate the correctness of tree entries.

> I think your original question was not "why do we store type and
> size together with the contents?", but was "why do we include them
> in the hash computation?", and all of the above discuss related
> tangents without touching the original question.
>

Yes, but I think these two questions are closely related.

> The need to have type or size available when we ask the object
> database for data associated with the object does not necessarily
> mean they must be hashed together with the contents.  It was done
> merely because "why not? that way, we do not have to worry about
> catching corrupt values for type and size information we want to
> store together with the contents".  IOW, we could have checksummed
> these two pieces of information separately, but why bother?

Thank you. I think I roughly understand now.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-14 17:04                     ` Linus Torvalds
@ 2023-04-16 12:06                       ` Felipe Contreras
  2023-04-16 12:43                       ` ZheNing Hu
  1 sibling, 0 replies; 23+ messages in thread
From: Felipe Contreras @ 2023-04-16 12:06 UTC (permalink / raw)
  To: Linus Torvalds, ZheNing Hu
  Cc: Jeff King, Taylor Blau, Junio C Hamano, Git List, johncai86

Linus Torvalds wrote:
> On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Jeff King <peff@peff.net> wrote on Fri, Apr 14, 2023 at 15:30:
> > >
> > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
> > > >
> > > > I'm still puzzled why git calculated the object id based on {type, size, data}
> > > >  together instead of just {data}?
> > >
> > > You'd have to ask Linus for the original reasoning. ;)
> 
> I originally thought of the git object store as "tagged pointers".
> 
> That actually caused confusion initially when I tried to explain this
> to SCM people, because "tag" means something very different in an SCM
> environment than it means in computer architecture.
> 
> And the implication of a tagged pointer is that you have two parts of
> it - the "tag" and the "address". Both are relevant at all points.
> 
> This isn't quite as obvious in everyday modern git usage, because a lot
> of uses end up _only_ using the "address" (aka SHA1), but it's very
> much part of the object store design. Internally, the object layout
> never uses just the SHA1, it's all "type:SHA1", even if sometimes the
> types are implied (ie the tree object doesn't spell out "blob", but
> it's still explicit in the mode bits).
> 
> This is very very obvious in "git cat-file", which was one of the
> original scripts in the first commit (but even there the tag/type has
> changed meaning over time: the very first version didn't use it as
> input at all, then it started verifying it, and then later it got the
> more subtle context of "peel the tags until you find this type").
> 
> You can also see this in the original README (again, go look at that
> first git commit): the README talks about the "tag of their type".
> 
> Of course, in practice git then walked away from having to specify the
> type all the time. It started even in that original release, in that
> the HEAD file never contained the type - because it was implicit (a
> HEAD is always a commit).
> 
> So we ended up having a lot of situations like that where the actual
> tag part was implicit from context, and these days people basically
> never refer to the "full" object name with tag, but only the SHA1
> address.
> 
> So now we have situations where the type really has to be looked up
> dynamically, because it's not explicitly encoded anywhere. While HEAD
> is supposed to always be a commit, other refs can be pretty much
> anything, and can point to a tag object, a commit, a tree or a blob.
> So then you actually have to look up the type based on the address.
> 
> End result: these days people don't even think of git objects as
> "tagged pointers".  Even internally in git, lots of code just passes
> the "object name" along without any tag/type, just the raw SHA1 / OID.
> 
> So that originally "everything is a tagged pointer" is much less true
> than it used to be, and now, instead of having tagged pointers, you
> mostly end up with just "bare pointers" and look up the type
> dynamically from there.
> 
> And that "look up the type in the object" is possible because even
> originally, I did *not* want any kind of "object type aliasing".
> 
> So even when looking up the object with the full "tag:pointer", the
> encoding of the object itself then also contains that object type, so
> that you can cross-check that you used the right tag.
> 
> That said, you *can* see some of the effects of this "tagged pointers"
> in how the internals do things like
> 
>     struct commit *commit = lookup_commit(repo, &oid);
> 
> which conceptually very much is about tagged pointers. And the fact
> that two objects cannot alias is actually somewhat encoded in that: a
> "struct commit" contains a "struct object" as a member. But so does
> "struct blob" - and the two "struct object" cases are never the same
> "object".
> 
> So there's never any worry about "could blob.object be the same object
> as commit.object"?
> 
> That is actually inherent in the code, in how "lookup_commit()"
> actually does lookup_object() and then does object_as_type(OBJ_COMMIT)
> on the result.

This explains rather well why the object type is used in the calculation, and
it makes sense.

But I don't see anything about the object size. Isn't that unnecessary?

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question] Can git cat-file have a type filtering option?
  2023-04-14 17:04                     ` Linus Torvalds
  2023-04-16 12:06                       ` Felipe Contreras
@ 2023-04-16 12:43                       ` ZheNing Hu
  1 sibling, 0 replies; 23+ messages in thread
From: ZheNing Hu @ 2023-04-16 12:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff King, Taylor Blau, Junio C Hamano, Git List, johncai86

Linus Torvalds <torvalds@linux-foundation.org> wrote on Sat, Apr 15, 2023 at 01:05:
>
> On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Jeff King <peff@peff.net> wrote on Fri, Apr 14, 2023 at 15:30:
> > >
> > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
> > > >
> > > > I'm still puzzled why git calculated the object id based on {type, size, data}
> > > >  together instead of just {data}?
> > >
> > > You'd have to ask Linus for the original reasoning. ;)
>
> I originally thought of the git object store as "tagged pointers".
>
> That actually caused confusion initially when I tried to explain this
> to SCM people, because "tag" means something very different in an SCM
> environment than it means in computer architecture.
>
> And the implication of a tagged pointer is that you have two parts of
> it - the "tag" and the "address". Both are relevant at all points.
>
> This isn't quite as obvious in everyday modern git usage, because a lot
> of uses end up _only_ using the "address" (aka SHA1), but it's very
> much part of the object store design. Internally, the object layout
> never uses just the SHA1, it's all "type:SHA1", even if sometimes the
> types are implied (ie the tree object doesn't spell out "blob", but
> it's still explicit in the mode bits).
>
> This is very very obvious in "git cat-file", which was one of the
> original scripts in the first commit (but even there the tag/type has
> changed meaning over time: the very first version didn't use it as
> input at all, then it started verifying it, and then later it got the
> more subtle context of "peel the tags until you find this type").
>

Yes, in the initial commit of Git, "git cat-file" only needed the object
ID to obtain both the content and the type of the object. However, the
modern "git cat-file" by default requires specifying both the expected
object type and the object ID, e.g. "git cat-file commit v2.9.1". This
must be what you meant by the "tag" and the "address" appearing together.
That design is not very user-friendly, so nowadays people prefer
"git cat-file -p", which behaves much like the initial version of
git cat-file.
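
[Editorial aside: the two calling conventions can be modeled roughly like
this - a toy in-memory store, not Git's implementation; the OIDs and names
are made up for illustration:]

```python
# Toy object store: oid -> (type, data). Illustrative only.
store = {
    "ce01": ("blob", b"hello\n"),
    "aa11": ("commit", b"tree ...\n"),
}

def cat_file_typed(expected_type, oid):
    # "git cat-file <type> <oid>" style: the caller supplies the tag,
    # and the store cross-checks it against the recorded type.
    obj_type, data = store[oid]
    if obj_type != expected_type:
        raise TypeError(f"expected {expected_type}, found {obj_type}")
    return data

def cat_file_p(oid):
    # "git cat-file -p" style: no tag needed; the type is looked up
    # dynamically from the object itself.
    return store[oid]

print(cat_file_p("ce01"))              # ('blob', b'hello\n')
print(cat_file_typed("blob", "ce01"))  # b'hello\n'
```

The first form is the "full tagged pointer"; the second resolves the tag
from the bare address, which is why it feels friendlier.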

> You can also see this in the original README (again, go look at that
> first git commit): the README talks about the "tag of their type".
>
> Of course, in practice git then walked away from having to specify the
> type all the time. It started even in that original release, in that
> the HEAD file never contained the type - because it was implicit (a
> HEAD is always a commit).
>
> So we ended up having a lot of situations like that where the actual
> tag part was implicit from context, and these days people basically
> never refer to the "full" object name with tag, but only the SHA1
> address.
>
> So now we have situations where the type really has to be looked up
> dynamically, because it's not explicitly encoded anywhere. While HEAD
> is supposed to always be a commit, other refs can be pretty much
> anything, and can point to a tag object, a commit, a tree or a blob.
> So then you actually have to look up the type based on the address.
>
> End result: these days people don't even think of git objects as
> "tagged pointers".  Even internally in git, lots of code just passes
> the "object name" along without any tag/type, just the raw SHA1 / OID.
>
> So that originally "everything is a tagged pointer" is much less true
> than it used to be, and now, instead of having tagged pointers, you
> mostly end up with just "bare pointers" and look up the type
> dynamically from there.
>

I feel that if the type had not been included in the objects initially,
people would need to specify both the "tag" and the "address" together
to interpret an object. As it is, the "tag" is only used for checking
the type, which is unnecessary in most cases.

> And that "look up the type in the object" is possible because even
> originally, I did *not* want any kind of "object type aliasing".
>
> So even when looking up the object with the full "tag:pointer", the
> encoding of the object itself then also contains that object type, so
> that you can cross-check that you used the right tag.
>
> That said, you *can* see some of the effects of this "tagged pointers"
> in how the internals do things like
>
>     struct commit *commit = lookup_commit(repo, &oid);
>
> which conceptually very much is about tagged pointers. And the fact
> that two objects cannot alias is actually somewhat encoded in that: a
> "struct commit" contains a "struct object" as a member. But so does
> "struct blob" - and the two "struct object" cases are never the same
> "object".
>
> So there's never any worry about "could blob.object be the same object
> as commit.object"?
>

Yes, if an object could be interpreted as multiple types, it would
certainly make it very difficult for Git's higher-level logic to handle.

> That is actually inherent in the code, in how "lookup_commit()"
> actually does lookup_object() and then does object_as_type(OBJ_COMMIT)
> on the result.
>
> > Oh, you are right, this could be to prevent conflicts between Git objects
> > with identical content but different types. However, I always associate
> > Git with the file system, where metadata such as file type and size is
> > stored in the inode, while the file data is stored in separate chunks.
>
> See above: yes, git design was *also* influenced heavily by
> filesystems, but that was mostly in the sense of "this is how to
> encode these things without undue pain".
>
> The object database being immutable was partly a security and safety
> measure, but it was also very much partly a "rewriting files is going
> to be a major pain from a filesystem consistency standpoint - don't do
> it".
>

You're right. Git objects are immutable, while data in a file system
is mutable, so Git's design doesn't need to follow the file system completely.

> But even more than a filesystem design, it's an "computer
> architecture" design. Think of the git object store as a very abstract
> computer architecture that has tagged pointers, stable storage, and no
> aliasing - and where the tag is actually verified at each lookup.
>
> The "no aliasing" means that no two distinct pointers can point to the
> same data. So a tagged pointer of type "commit" can not point to the
> same object as a tagged pointer of type "blob". They are distinct
> pointers, even if (maybe) the commit object encoding ends up then
> being identical to a blob object.
>
> And as mentioned, that "verified at each lookup" has mostly gone away,
> and "each lookup" has become more of a "can be verified by fsck", but
> it's probably still a good thing to think that way.
>
> You still have "lookup_object_by_type()" internally in git that takes
> the full tagged pointer, but almost nobody uses it any more. The
> closest you get is those "lookup_commit()" things (which are fairly
> common, still).
>

Well, now I understand: everything in Git's architecture is "tag:pointer".
Tags are used to verify object types (although that is no longer strictly
necessary), and pointers are used for addressing. This is also one of the
reasons why Git initially included the type in its objects.

>               Linus

Thank you for your wonderful answer regarding the design concept
of "tagged pointers"! This deepens my understanding of Git's design. :-)

ZheNing Hu
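
[Editorial aside: the lookup_commit() pattern Linus describes - a bare
OID lookup followed by a tag check, i.e. lookup_object() plus
object_as_type() - can be sketched in miniature. A Python analogy of the
C internals, with made-up OIDs:]

```python
# Miniature analogue of lookup_object() + object_as_type(). Illustrative only.
objects = {
    "c0ffee": ("commit", b"tree deadbeef\n"),
    "b10b00": ("blob", b"hello\n"),
}

def lookup_object(oid):
    # "Bare pointer" lookup: no tag required to find the object.
    obj_type, data = objects[oid]
    return (oid, obj_type, data)

def object_as_type(obj, wanted):
    # Verify the tag half of the tagged pointer; no aliasing is possible.
    oid, obj_type, data = obj
    if obj_type != wanted:
        raise TypeError(f"{oid} is a {obj_type}, not a {wanted}")
    return obj

def lookup_commit(oid):
    # The tagged-pointer discipline: look up by address, then check the tag.
    return object_as_type(lookup_object(oid), "commit")

print(lookup_commit("c0ffee")[1])  # 'commit'
```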

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-04-16 12:43 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-07 14:24 [Question] Can git cat-file have a type filtering option? ZheNing Hu
2023-04-07 16:30 ` Junio C Hamano
2023-04-08  6:27   ` ZheNing Hu
2023-04-09  1:28     ` Taylor Blau
2023-04-09  2:19       ` Taylor Blau
2023-04-09  2:26         ` Taylor Blau
2023-04-09  6:51           ` ZheNing Hu
2023-04-10 20:01             ` Jeff King
2023-04-10 23:20               ` Taylor Blau
2023-04-09  6:47       ` ZheNing Hu
2023-04-10 20:14         ` Jeff King
2023-04-11 14:09           ` ZheNing Hu
2023-04-12  7:43             ` Jeff King
2023-04-12  9:57               ` ZheNing Hu
2023-04-14  7:30                 ` Jeff King
2023-04-14 12:17                   ` ZheNing Hu
2023-04-14 15:58                     ` Junio C Hamano
2023-04-16 11:15                       ` ZheNing Hu
2023-04-14 17:04                     ` Linus Torvalds
2023-04-16 12:06                       ` Felipe Contreras
2023-04-16 12:43                       ` ZheNing Hu
2023-04-09  1:26   ` Taylor Blau
2023-04-09  1:23 ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).