* [Question] Can git cat-file have a type filtering option? @ 2023-04-07 14:24 ZheNing Hu 2023-04-07 16:30 ` Junio C Hamano 2023-04-09 1:23 ` Taylor Blau 0 siblings, 2 replies; 23+ messages in thread From: ZheNing Hu @ 2023-04-07 14:24 UTC (permalink / raw) To: Git List; +Cc: Junio C Hamano, johncai86 Sometimes when we use `git cat-file --batch-all-objects`, we only want data of type "blob". To select them, we may need to use an additional process (such as `git rev-list --objects --filter=object:type=blob --filter-provided-objects`) to obtain the SHAs of all blobs, and then use `git cat-file --batch` to retrieve them. This is not very elegant; in other words, it might be better to have an internal implementation of filtering within `git cat-file --batch-all-objects`. However, `git cat-file` already has a `--filters` option, which is used to "show content as transformed by filters", so that name is taken. I'm not sure what a good name for a type-filtering option would be. For example, `--type-filter`? Thanks, -- ZheNing Hu ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-07 14:24 [Question] Can git cat-file have a type filtering option? ZheNing Hu @ 2023-04-07 16:30 ` Junio C Hamano 2023-04-08 6:27 ` ZheNing Hu 2023-04-09 1:26 ` Taylor Blau 2023-04-09 1:23 ` Taylor Blau 1 sibling, 2 replies; 23+ messages in thread From: Junio C Hamano @ 2023-04-07 16:30 UTC (permalink / raw) To: ZheNing Hu; +Cc: Git List, johncai86 ZheNing Hu <adlternative@gmail.com> writes: > all blobs, and then use `git cat-file --batch` to retrieve them. This > is not very elegant, or in other words, it might be better to have an > internal implementation of filtering within `git cat-file > --batch-all-objects`. It does sound eminently elegant to have each tool do one task and do it well, and being able to flexibly combine them to achieve a larger task. Once that approach is working well, it may still make sense to give a special case codepath that bundles a specific combination of these primitive features, if use cases for the specific combination appear often. But I do not know if the particular one, "we do not want to feed a specific list of objects to check to 'cat-file --batch'", qualifies as one. > For example, `--type-filter`? Is the object type the only thing that people often would want to base their filtering decision on? Will we then see somebody else request a "--size-filter", and then somebody else realize that the filtering criteria based on size need to be different between blobs (most likely counted in bytes) and trees (it may be more convenient to count the tree entries, not bytes)?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-07 16:30 ` Junio C Hamano @ 2023-04-08 6:27 ` ZheNing Hu 2023-04-09 1:28 ` Taylor Blau 2023-04-09 1:26 ` Taylor Blau 1 sibling, 1 reply; 23+ messages in thread From: ZheNing Hu @ 2023-04-08 6:27 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git List, johncai86 Junio C Hamano <gitster@pobox.com> wrote on Sat, Apr 8, 2023 at 00:30: > > ZheNing Hu <adlternative@gmail.com> writes: > > > all blobs, and then use `git cat-file --batch` to retrieve them. This > > is not very elegant, or in other words, it might be better to have an > > internal implementation of filtering within `git cat-file > > --batch-all-objects`. > > It does sound eminently elegant to have each tool do one task > and do it well, and being able to flexibly combine them to achieve > a larger task. > Okay, you're right. It's not "ungraceful" to have each task do its own thing. I should clarify that for a command like `git cat-file --batch-all-objects`, which traverses all objects, it would be better to have a filter. It might be more performant than using `git rev-list --filter | git cat-file --batch`? > Once that approach is working well, it may still make sense to give > a special case codepath that bundles a specific combination of these > primitive features, if use cases for the specific combination appear > often. But I do not know if the particular one, "we do not want to > feed a specific list of objects to check to 'cat-file --batch'", > qualifies as one. > > > For example, `--type-filter`? > > Is the object type the only thing that people often would want to > base their filtering decision on? Will we then see somebody else > request a "--size-filter", and then somebody else realize that the > filtering criteria based on size need to be different between blobs > (most likely counted in bytes) and trees (it may be more convenient > to count the tree entries, not bytes)?
It sounds rather messy and > we may be better off having such an extensible logic in one place. > Yes, having a generic filter for `git cat-file` would be better. > Like rev-list's object list filtering, that is. > > Is the logic that implements rev-list's object list filtering > something that is easily called from the side, as if it were a > library routine? Refactoring that and teaching cat-file an option > to activate that logic might be more palatable. > I don't think so. While `git rev-list` traverses objects and performs filtering within a revision, `git cat-file --batch-all-objects` traverses all loose and packed objects. It might be difficult to perfectly extract the filtering from `git rev-list` and apply it to `git cat-file`. > Thanks. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-08 6:27 ` ZheNing Hu @ 2023-04-09 1:28 ` Taylor Blau 2023-04-09 2:19 ` Taylor Blau 2023-04-09 6:47 ` ZheNing Hu 0 siblings, 2 replies; 23+ messages in thread From: Taylor Blau @ 2023-04-09 1:28 UTC (permalink / raw) To: ZheNing Hu; +Cc: Junio C Hamano, Git List, johncai86 On Sat, Apr 08, 2023 at 02:27:53PM +0800, ZheNing Hu wrote: > Okay, you're right. It's not "ungraceful" to have each task do its own thing. > I should clarify that for a command like `git cat-file --batch-all-objects`, > which traverses all objects, it would be better to have a filter. It might be > more performant than using `git rev-list --filter | git cat-file --batch`? Perhaps slightly so, since there is naturally going to be some duplicated effort spawning processes, loading any shared libraries, initializing the repository and reading its configuration, etc. But I'd wager that these are all a negligible cost when compared to the time we'll have to spend reading, inflating, and printing out all of the objects in your repository. Hopefully any task(s) where that cost *wouldn't* be negligible relative to the rest of the job would be small enough that they could fit into a single process. > I don't think so. While `git rev-list` traverses objects and performs > filtering within a revision, `git cat-file --batch-all-objects` traverses > all loose and packed objects. It might be difficult to perfectly > extract the filtering from `git rev-list` and apply it to `git cat-file`. `rev-list`'s `--all` option does exactly the former: it looks at all loose and packed objects instead of doing a traditional object walk. Thanks, Taylor ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-09 1:28 ` Taylor Blau @ 2023-04-09 2:19 ` Taylor Blau 2023-04-09 2:26 ` Taylor Blau 2023-04-09 6:47 ` ZheNing Hu 1 sibling, 1 reply; 23+ messages in thread From: Taylor Blau @ 2023-04-09 2:19 UTC (permalink / raw) To: ZheNing Hu; +Cc: Junio C Hamano, Git List, johncai86 On Sat, Apr 08, 2023 at 09:28:28PM -0400, Taylor Blau wrote: > > I don't think so. While `git rev-list` traverses objects and performs > > filtering within a revision, `git cat-file --batch-all-objects` traverses > > all loose and packed objects. It might be difficult to perfectly > > extract the filtering from `git rev-list` and apply it to `git cat-file`. > > `rev-list`'s `--all` option does exactly the former: it looks at all > loose and packed objects instead of doing a traditional object walk. Sorry, this isn't right: --all pretends as if you passed all references to it over argv, not to just look at the individual loose and packed objects. Thanks, Taylor ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-09 2:19 ` Taylor Blau @ 2023-04-09 2:26 ` Taylor Blau 2023-04-09 6:51 ` ZheNing Hu 0 siblings, 1 reply; 23+ messages in thread From: Taylor Blau @ 2023-04-09 2:26 UTC (permalink / raw) To: ZheNing Hu; +Cc: Junio C Hamano, Git List, johncai86 On Sat, Apr 08, 2023 at 10:19:52PM -0400, Taylor Blau wrote: > On Sat, Apr 08, 2023 at 09:28:28PM -0400, Taylor Blau wrote: > > > I don't think so. While `git rev-list` traverses objects and performs > > > filtering within a revision, `git cat-file --batch-all-objects` traverses > > > all loose and packed objects. It might be difficult to perfectly > > > extract the filtering from `git rev-list` and apply it to `git cat-file`. > > > > `rev-list`'s `--all` option does exactly the former: it looks at all > > loose and packed objects instead of doing a traditional object walk. > > Sorry, this isn't right: --all pretends as if you passed all references > to it over argv, not to just look at the individual loose and packed > objects. The right thing to do here if you wanted to get a listing of all blobs in your repository regardless of their reachability or whether they are loose or packed is: git cat-file --batch-check='%(objectname)' --batch-all-objects | git rev-list --objects --stdin --no-walk --filter='object:type=blob' Or, if your filter is as straightforward as "is this object a blob or not", you could write something like: git cat-file --batch-check --batch-all-objects | awk '{ if ($2 == "blob") print $0 }' Or you could tighten up the AWK expression by doing something like: git cat-file --batch-check='%(objecttype) %(objectname)' \ --batch-all-objects | awk '/^blob / { print $2 }' Sorry for the brain fart. Thanks, Taylor ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-09 2:26 ` Taylor Blau @ 2023-04-09 6:51 ` ZheNing Hu 2023-04-10 20:01 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: ZheNing Hu @ 2023-04-09 6:51 UTC (permalink / raw) To: Taylor Blau; +Cc: Junio C Hamano, Git List, johncai86 Taylor Blau <me@ttaylorr.com> wrote on Sun, Apr 9, 2023 at 10:26: > > On Sat, Apr 08, 2023 at 10:19:52PM -0400, Taylor Blau wrote: > > On Sat, Apr 08, 2023 at 09:28:28PM -0400, Taylor Blau wrote: > > > > I don't think so. While `git rev-list` traverses objects and performs > > > > filtering within a revision, `git cat-file --batch-all-objects` traverses > > > > all loose and packed objects. It might be difficult to perfectly > > > > extract the filtering from `git rev-list` and apply it to `git cat-file`. > > > > > > `rev-list`'s `--all` option does exactly the former: it looks at all > > > loose and packed objects instead of doing a traditional object walk. > > > > Sorry, this isn't right: --all pretends as if you passed all references > > to it over argv, not to just look at the individual loose and packed > > objects. > > The right thing to do here if you wanted to get a listing of all blobs > in your repository regardless of their reachability or whether they are > loose or packed is: > > git cat-file --batch-check='%(objectname)' --batch-all-objects | > git rev-list --objects --stdin --no-walk --filter='object:type=blob' > This looks like a mistake. Try passing a tree oid to git rev-list: git rev-list --objects --stdin --no-walk --filter='object:type=blob' <<< HEAD^{tree} 27f9fa75c6d8cdae7834f38006b631522c6a5ac3 4860bebd32f8d3f34c2382f097ac50c0b972d3a0 .cirrus.yml c592dda681fecfaa6bf64fb3f539eafaf4123ed8 .clang-format f9d819623d832113014dd5d5366e8ee44ac9666a .editorconfig b0044cf272fec9b987e99c600d6a95bc357261c3 .gitattributes ...
> Or, if your filter is as straightforward as "is this object a blob or > not", you could write something like: > > git cat-file --batch-check --batch-all-objects | awk '{ if ($2 == "blob") print $0 }' > > Or you could tighten up the AWK expression by doing something like: > > git cat-file --batch-check='%(objecttype) %(objectname)' \ > --batch-all-objects | awk '/^blob / { print $2 }' > > Sorry for the brain fart. > > Thanks, > Taylor ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-09 6:51 ` ZheNing Hu @ 2023-04-10 20:01 ` Jeff King 2023-04-10 23:20 ` Taylor Blau 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2023-04-10 20:01 UTC (permalink / raw) To: ZheNing Hu; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86 On Sun, Apr 09, 2023 at 02:51:34PM +0800, ZheNing Hu wrote: > > The right thing to do here if you wanted to get a listing of all blobs > > in your repository regardless of their reachability or whether they are > > loose or packed is: > > > > git cat-file --batch-check='%(objectname)' --batch-all-objects | > > git rev-list --objects --stdin --no-walk --filter='object:type=blob' > > > > This looks like a mistake. Try passing a tree oid to git rev-list: > > git rev-list --objects --stdin --no-walk --filter='object:type=blob' > <<< HEAD^{tree} > 27f9fa75c6d8cdae7834f38006b631522c6a5ac3 > 4860bebd32f8d3f34c2382f097ac50c0b972d3a0 .cirrus.yml > c592dda681fecfaa6bf64fb3f539eafaf4123ed8 .clang-format > f9d819623d832113014dd5d5366e8ee44ac9666a .editorconfig > b0044cf272fec9b987e99c600d6a95bc357261c3 .gitattributes > ... This is the expected behavior. The filter options are meant to support partial clones, and the behavior is really "filter things we'd traverse to". It is intentional that objects the caller directly asks for will always be included in the output. I certainly found that convention confusing, but I imagine it solves some problems with the lazy-fetch requests themselves. Regardless, that's how it works and it's not going to change anytime soon. :) For that reason, and just for general flexibility, I think you are mostly better off piping cat-file through an external filter program (and then back to cat-file to get more data on each object). -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-10 20:01 ` Jeff King @ 2023-04-10 23:20 ` Taylor Blau 0 siblings, 0 replies; 23+ messages in thread From: Taylor Blau @ 2023-04-10 23:20 UTC (permalink / raw) To: Jeff King; +Cc: ZheNing Hu, Junio C Hamano, Git List, johncai86 On Mon, Apr 10, 2023 at 04:01:41PM -0400, Jeff King wrote: > For that reason, and just for general flexibility, I think you are > mostly better off piping cat-file through an external filter program > (and then back to cat-file to get more data on each object). Yeah, agreed. The convention of printing objects listed on the command-line regardless of whether they would pass through the object filter is confusing to me, too. But using `rev-list --no-walk` to accomplish the same job for a filter as trivial as the type-level one feels overkill anyway, so I agree that just relying on `cat-file` to produce the list of objects, filtering it yourself, and then handing it back to `cat-file` is the easiest thing to do. Thanks, Taylor ^ permalink raw reply [flat|nested] 23+ messages in thread
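The approach the thread converges on above can be sketched end to end. This is a minimal, self-contained sketch in a throwaway repository (file name and demo identity are placeholders), combining the tightened AWK expression from earlier in the thread with a second cat-file invocation:

```shell
# Build a throwaway repository so the sketch is self-contained.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
echo hello >f.txt && git add f.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm init

# First cat-file lists "<type> <oid>" for every object, loose or packed;
# awk keeps only the blob IDs; the second cat-file prints their contents.
git cat-file --batch-check='%(objecttype) %(objectname)' --batch-all-objects |
    awk '/^blob / { print $2 }' |
    git cat-file --batch
```

Because the filtering happens in the shell, swapping in a different criterion (size, type, a combination) only means changing the AWK expression, not waiting for a new cat-file option.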
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-09 1:28 ` Taylor Blau 2023-04-09 2:19 ` Taylor Blau @ 2023-04-09 6:47 ` ZheNing Hu 2023-04-10 20:14 ` Jeff King 1 sibling, 1 reply; 23+ messages in thread From: ZheNing Hu @ 2023-04-09 6:47 UTC (permalink / raw) To: Taylor Blau; +Cc: Junio C Hamano, Git List, johncai86 Taylor Blau <me@ttaylorr.com> wrote on Sun, Apr 9, 2023 at 09:28: > > On Sat, Apr 08, 2023 at 02:27:53PM +0800, ZheNing Hu wrote: > > Okay, you're right. It's not "ungraceful" to have each task do its own thing. > > I should clarify that for a command like `git cat-file --batch-all-objects`, > > which traverses all objects, it would be better to have a filter. It might be > > more performant than using `git rev-list --filter | git cat-file --batch`? > > Perhaps slightly so, since there is naturally going to be some > duplicated effort spawning processes, loading any shared libraries, > initializing the repository and reading its configuration, etc. > > But I'd wager that these are all a negligible cost when compared to the > time we'll have to spend reading, inflating, and printing out all of the > objects in your repository. > What you said makes sense. I implemented the --type-filter option for git cat-file and compared the performance of outputting all blobs in the git repository with and without using the type-filter. I found that the difference was not significant.
time git cat-file --batch-all-objects --batch-check="%(objectname) %(objecttype)" | awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null 17.10s user 0.27s system 102% cpu 16.987 total time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null 16.74s user 0.19s system 95% cpu 17.655 total At first, I thought the processes that provide all blob oids by using git rev-list or git cat-file --batch-all-objects --batch-check might waste cpu, io, memory resources because they need to read a large number of objects, and then they are read again by git cat-file --batch. However, it seems that this is not actually the bottleneck in performance. > Hopefully any task(s) where that cost *wouldn't* be negligible relative > to the rest of the job would be small enough that they could fit into a > single process. > > > I don't think so. While `git rev-list` traverses objects and performs > > filtering within a revision, `git cat-file --batch-all-objects` traverses > > all loose and packed objects. It might be difficult to perfectly > > extract the filtering from `git rev-list` and apply it to `git cat-file`. > > `rev-list`'s `--all` option does exactly the former: it looks at all > loose and packed objects instead of doing a traditional object walk. > > Thanks, > Taylor ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-09 6:47 ` ZheNing Hu @ 2023-04-10 20:14 ` Jeff King 2023-04-11 14:09 ` ZheNing Hu 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2023-04-10 20:14 UTC (permalink / raw) To: ZheNing Hu; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86 On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote: > > Perhaps slightly so, since there is naturally going to be some > > duplicated effort spawning processes, loading any shared libraries, > > initializing the repository and reading its configuration, etc. > > > > But I'd wager that these are all a negligible cost when compared to the > > time we'll have to spend reading, inflating, and printing out all of the > > objects in your repository. > > "What you said makes sense. I implemented the --type-filter option for > git cat-file and compared the performance of outputting all blobs in the > git repository with and without using the type-filter. I found that the > difference was not significant. > > time git cat-file --batch-all-objects --batch-check="%(objectname) > %(objecttype)" | > awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null > 17.10s user 0.27s system 102% cpu 16.987 total > > time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null > 16.74s user 0.19s system 95% cpu 17.655 total > > At first, I thought the processes that provide all blob oids by using > git rev-list or git cat-file --batch-all-objects --batch-check might waste > cpu, io, memory resources because they need to read a large number > of objects, and then they are read again by git cat-file --batch. > However, it seems that this is not actually the bottleneck in performance. Yeah, I think most of your time there is spent on the --batch command itself, which is just putting through a lot of bytes. You might also try with "--unordered". 
The default ordering for --batch-all-objects is in sha1 order, which has pretty bad locality characteristics for delta caching. Using --unordered goes in pack-order, which should be optimal. E.g., in git.git, running: time \ git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' | perl -lne 'print $1 if /^blob (.*)/' | git cat-file --batch >/dev/null takes: real 0m29.961s user 0m29.128s sys 0m1.461s Adding "--unordered" to the initial cat-file gives: real 0m1.970s user 0m2.170s sys 0m0.126s So reducing the size of the actual --batch printing may make the relative cost of using multiple processes much higher (I didn't apply your --type-filter patches to test myself). In general, I do think having a processing pipeline like this is OK, as it's pretty flexible. But especially for smaller queries (even ones that don't ask for the whole object contents), the per-object lookup costs can start to dominate (especially in a repository that hasn't been recently packed). Right now, even your "--batch --type-filter" example is probably making at least two lookups per object, because we don't have a way to open a "handle" to an object to check its type, and then extract the contents conditionally. And of course with multiple processes, we're naturally doing a separate lookup in each one. So a nice thing about being able to do the filtering in one process is that we could _eventually_ do it all with one object lookup. But I'd probably wait on adding something like --type-filter until we have an internal single-lookup API, and then we could time it to see how much speedup we can get. -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-10 20:14 ` Jeff King @ 2023-04-11 14:09 ` ZheNing Hu 2023-04-12 7:43 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: ZheNing Hu @ 2023-04-11 14:09 UTC (permalink / raw) To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86 Jeff King <peff@peff.net> wrote on Tue, Apr 11, 2023 at 04:14: > > On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote: > > > > Perhaps slightly so, since there is naturally going to be some > > > duplicated effort spawning processes, loading any shared libraries, > > > initializing the repository and reading its configuration, etc. > > > > > > But I'd wager that these are all a negligible cost when compared to the > > > time we'll have to spend reading, inflating, and printing out all of the > > > objects in your repository. > > > > What you said makes sense. I implemented the --type-filter option for > > git cat-file and compared the performance of outputting all blobs in the > > git repository with and without using the type-filter. I found that the > > difference was not significant. > > > > time git cat-file --batch-all-objects --batch-check="%(objectname) > > %(objecttype)" | > > awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null > > 17.10s user 0.27s system 102% cpu 16.987 total > > > > time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null > > 16.74s user 0.19s system 95% cpu 17.655 total > > > > At first, I thought the processes that provide all blob oids by using > > git rev-list or git cat-file --batch-all-objects --batch-check might waste > > cpu, io, memory resources because they need to read a large number > > of objects, and then they are read again by git cat-file --batch. > > However, it seems that this is not actually the bottleneck in performance. > > Yeah, I think most of your time there is spent on the --batch command > itself, which is just putting through a lot of bytes.
You might also try > with "--unordered". The default ordering for --batch-all-objects is in > sha1 order, which has pretty bad locality characteristics for delta > caching. Using --unordered goes in pack-order, which should be optimal. > > E.g., in git.git, running: > > time \ > git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' | > perl -lne 'print $1 if /^blob (.*)/' | > git cat-file --batch >/dev/null > > takes: > > real 0m29.961s > user 0m29.128s > sys 0m1.461s > > Adding "--unordered" to the initial cat-file gives: > > real 0m1.970s > user 0m2.170s > sys 0m0.126s > > So reducing the size of the actual --batch printing may make the > relative cost of using multiple processes much higher (I didn't apply > your --type-filter patches to test myself). > You are right. Adding the --unordered option can avoid the time-consuming sorting process from affecting the test results. time git cat-file --unordered --batch-all-objects \ --batch-check="%(objectname) %(objecttype)" | \ awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null 4.17s user 0.23s system 109% cpu 4.025 total time git cat-file --unordered --batch-all-objects --batch --type-filter=blob >/dev/null 3.84s user 0.17s system 97% cpu 4.099 total It looks like the difference is not significant either. After all, the truly time-consuming process is reading the entire data of the blob, whereas git cat-file --batch-check only reads the first few bytes of the object in comparison. > In general, I do think having a processing pipeline like this is OK, as > it's pretty flexible. But especially for smaller queries (even ones that > don't ask for the whole object contents), the per-object lookup costs > can start to dominate (especially in a repository that hasn't been > recently packed). 
Right now, even your "--batch --type-filter" example > is probably making at least two lookups per object, because we don't > have a way to open a "handle" to an object to check its type, and then > extract the contents conditionally. And of course with multiple > processes, we're naturally doing a separate lookup in each one. > Yes, the type of the object is encapsulated in the header of the loose object file or the object entry header of the pack file. We have to read it to get the object type. This may be a lingering question I have had: why does git put the type/size in the file data instead of storing it as some kind of metadata elsewhere? > So a nice thing about being able to do the filtering in one process is > that we could _eventually_ do it all with one object lookup. But I'd > probably wait on adding something like --type-filter until we have an > internal single-lookup API, and then we could time it to see how much > speedup we can get. > I am highly skeptical of this "internal single-lookup API". Do we really need an extra metadata table to record all objects? Something like: metadata: {oid: type, size}? > -Peff ZheNing Hu ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-11 14:09 ` ZheNing Hu @ 2023-04-12 7:43 ` Jeff King 2023-04-12 9:57 ` ZheNing Hu 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2023-04-12 7:43 UTC (permalink / raw) To: ZheNing Hu; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86 On Tue, Apr 11, 2023 at 10:09:33PM +0800, ZheNing Hu wrote: > > So reducing the size of the actual --batch printing may make the > > relative cost of using multiple processes much higher (I didn't apply > > your --type-filter patches to test myself). > > > > You are right. Adding the --unordered option can avoid the > time-consuming sorting process from affecting the test results. Just to be clear: it's not the cost of sorting, but rather that accessing the object contents in a sub-optimal order is much worse (and that sub-optimal order happens to be "sorted by sha1", since that is effectively random with respect to the contents). > time git cat-file --unordered --batch-all-objects \ > --batch-check="%(objectname) %(objecttype)" | \ > awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null > > 4.17s user 0.23s system 109% cpu 4.025 total > > time git cat-file --unordered --batch-all-objects --batch > --type-filter=blob >/dev/null > > 3.84s user 0.17s system 97% cpu 4.099 total > > It looks like the difference is not significant either. OK, good, that means we can probably not worry about it. :) > > In general, I do think having a processing pipeline like this is OK, as > > it's pretty flexible. But especially for smaller queries (even ones that > > don't ask for the whole object contents), the per-object lookup costs > > can start to dominate (especially in a repository that hasn't been > > recently packed). 
Right now, even your "--batch --type-filter" example > > is probably making at least two lookups per object, because we don't > > have a way to open a "handle" to an object to check its type, and then > > extract the contents conditionally. And of course with multiple > > processes, we're naturally doing a separate lookup in each one. > > > > Yes, the type of the object is encapsulated in the header of the loose > object file or the object entry header of the pack file. We have to read > it to get the object type. This may be a lingering question I have had: > why does git put the type/size in the file data instead of storing it as some > kind of metadata elsewhere? It's not just metadata; it's actually part of what we hash to get the object id (though of course it doesn't _have_ to be stored in a linear buffer, as the pack storage shows). But for loose objects, where would such metadata be? And accessing it isn't too expensive; we only zlib inflate the first few bytes (the main cost is in the syscalls to find and open the file). For packed object, it effectively is metadata, just stuck at the front of the object contents, rather than in a separate table. That lets us use the same .idx file for finding that metadata as we do for the contents themselves (at the slight cost that if you're _just_ accessing metadata, the results are sparser within the file, which has worse behavior for cold-cache disks). But when I say that lookup costs dominate, what I mean is that we'd spend a lot of our time binary searching within the pack .idx file, or falling back to syscalls to look for loose objects. > > So a nice thing about being able to do the filtering in one process is > > that we could _eventually_ do it all with one object lookup. But I'd > > probably wait on adding something like --type-filter until we have an > > internal single-lookup API, and then we could time it to see how much > > speedup we can get. > > I am highly skeptical of this "internal single-lookup API". 
Do we really > need an extra metadata table to record all objects? > Something like: metadata: {oid: type, size}? No, I don't mean changing the storage at all. I mean that rather than doing this: /* get type, size, etc, for --batch format */ type = oid_object_info(&oid, &size); /* now get the contents for --batch to write them itself; but note * that this searches for the entry again within all packs, etc */ contents = read_object_file(oid, &type, &size); as the cat-file code now does (because the first call is in batch_object_write(), and the latter in print_object_or_die()), they could be a single call that does the lookup once. We could actually do that today, since the object contents are eventually fed from oid_object_info_extended(), and we know ahead of time that we want both the metadata and the contents. But that wouldn't work if we filtered by type, etc. I'm not sure how much of a speedup it would yield in practice, though. If you're printing the object contents, then the extra lookup is probably not that expensive by comparison. -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-12 7:43 ` Jeff King @ 2023-04-12 9:57 ` ZheNing Hu 2023-04-14 7:30 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: ZheNing Hu @ 2023-04-12 9:57 UTC (permalink / raw) To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86 Jeff King <peff@peff.net> 于2023年4月12日周三 15:43写道: > > On Tue, Apr 11, 2023 at 10:09:33PM +0800, ZheNing Hu wrote: > > > > So reducing the size of the actual --batch printing may make the > > > relative cost of using multiple processes much higher (I didn't apply > > > your --type-filter patches to test myself). > > > > > > > You are right. Adding the --unordered option can avoid the > > time-consuming sorting process from affecting the test results. > > Just to be clear: it's not the cost of sorting, but rather that > accessing the object contents in a sub-optimal order is much worse (and > that sub-optimal order happens to be "sorted by sha1", since that is > effectively random with respect to the contents). > Okay, thanks for correcting me. Reading the packfile in SHA1 order is actually a type of random read, and it should cause additional overhead. > > time git cat-file --unordered --batch-all-objects \ > > --batch-check="%(objectname) %(objecttype)" | \ > > awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null > > > > 4.17s user 0.23s system 109% cpu 4.025 total > > > > time git cat-file --unordered --batch-all-objects --batch > > --type-filter=blob >/dev/null > > > > 3.84s user 0.17s system 97% cpu 4.099 total > > > > It looks like the difference is not significant either. > > OK, good, that means we can probably not worry about it. :) > > > > In general, I do think having a processing pipeline like this is OK, as > > > it's pretty flexible. 
But especially for smaller queries (even ones that > > > don't ask for the whole object contents), the per-object lookup costs > > > can start to dominate (especially in a repository that hasn't been > > > recently packed). Right now, even your "--batch --type-filter" example > > > is probably making at least two lookups per object, because we don't > > > have a way to open a "handle" to an object to check its type, and then > > > extract the contents conditionally. And of course with multiple > > > processes, we're naturally doing a separate lookup in each one. > > > > > > > Yes, the type of the object is encapsulated in the header of the loose > > object file or the object entry header of the pack file. We have to read > > it to get the object type. This may be a lingering question I have had: > > why does git put the type/size in the file data instead of storing it as some > > kind of metadata elsewhere? > > It's not just metadata; it's actually part of what we hash to get the > object id (though of course it doesn't _have_ to be stored in a linear > buffer, as the pack storage shows). I'm still puzzled why git calculates the object id based on {type, size, data} together instead of just {data}? > But for loose objects, where would > such metadata be? And accessing it isn't too expensive; we only zlib > inflate the first few bytes (the main cost is in the syscalls to find > and open the file). > I don't have much experience with this area. It looks like I should go ahead and do some performance testing to compare the cost of searching and opening loose objects vs. reading and inflating loose objects. > For packed object, it effectively is metadata, just stuck at the front > of the object contents, rather than in a separate table.
That lets us > use the same .idx file for finding that metadata as we do for the > contents themselves (at the slight cost that if you're _just_ accessing > metadata, the results are sparser within the file, which has worse > behavior for cold-cache disks). > Agree. But what if there is a metadata table in the .idx file? We can even know the type and size of the object without accessing the packfile. > But when I say that lookup costs dominate, what I mean is that we'd > spend a lot of our time binary searching within the pack .idx file, or > falling back to syscalls to look for loose objects. > Alright, binary search in .idx may indeed be more time-consuming than reading type and size from the packfile. > > > So a nice thing about being able to do the filtering in one process is > > > that we could _eventually_ do it all with one object lookup. But I'd > > > probably wait on adding something like --type-filter until we have an > > > internal single-lookup API, and then we could time it to see how much > > > speedup we can get. > > > > I am highly skeptical of this "internal single-lookup API". Do we really > > need an extra metadata table to record all objects? > > Something like: metadata: {oid: type, size}? > > No, I don't mean changing the storage at all. I mean that rather than > doing this: > > /* get type, size, etc, for --batch format */ > type = oid_object_info(&oid, &size); > > /* now get the contents for --batch to write them itself; but note > * that this searches for the entry again within all packs, etc */ > contents = read_object_file(oid, &type, &size); > > as the cat-file code now does (because the first call is in > batch_object_write(), and the latter in print_object_or_die()), they > could be a single call that does the lookup once. > > We could actually do that today, since the object contents are > eventually fed from oid_object_info_extended(), and we know ahead of > time that we want both the metadata and the contents. 
But that wouldn't > work if we filtered by type, etc. > So the single read you mentioned earlier refers to combining the two read operations - getting the type and size, and getting the content - into one call when we know for certain that we need the content, in order to save the cost of the second binary search. > I'm not sure how much of a speedup it would yield in practice, though. > If you're printing the object contents, then the extra lookup is > probably not that expensive by comparison. > I feel like this solution may not be feasible. After we get the type and size for the first time, we go through different output processes for different types of objects: use `stream_blob()` for blobs, and `read_object_file()` with `batch_write()` for other objects. If we obtain the content of a blob in one single read operation, then the performance optimization provided by `stream_blob()` would be lost. > -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
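The two output paths discussed above (`stream_blob()` for blobs, `read_object_file()` plus `batch_write()` for everything else) amount to a dispatch on object type: buffer small non-blob objects whole, but stream blob contents in chunks so a huge blob never has to fit in memory. A rough sketch, with `FakeStore` and its methods as hypothetical stand-ins rather than git's real API:

```python
import io

def write_object(oid, store, out, chunk_size=8192):
    """Mimic cat-file's split: stream blobs, buffer everything else."""
    objtype = store.object_type(oid)
    if objtype == "blob":
        # analogous to stream_blob(): never hold the whole blob in memory
        with store.open_stream(oid) as f:
            while chunk := f.read(chunk_size):
                out.write(chunk)
    else:
        # analogous to read_object_file() followed by batch_write()
        out.write(store.read_all(oid))
    return objtype

class FakeStore:
    def __init__(self, objects):          # objects: {oid: (type, bytes)}
        self.objects = objects
    def object_type(self, oid):
        return self.objects[oid][0]
    def open_stream(self, oid):
        return io.BytesIO(self.objects[oid][1])
    def read_all(self, oid):
        return self.objects[oid][1]

store = FakeStore({"abc": ("blob", b"x" * 20000), "def": ("commit", b"tree ...")})
out = io.BytesIO()
assert write_object("abc", store, out) == "blob"
assert len(out.getvalue()) == 20000
```

The sketch also shows why a "fetch contents together with the metadata" lookup conflicts with streaming: the blob branch deliberately never materializes the full contents.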
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-12 9:57 ` ZheNing Hu @ 2023-04-14 7:30 ` Jeff King 2023-04-14 12:17 ` ZheNing Hu 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2023-04-14 7:30 UTC (permalink / raw) To: ZheNing Hu; +Cc: Taylor Blau, Junio C Hamano, Git List, johncai86 On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote: > > It's not just metadata; it's actually part of what we hash to get the > > object id (though of course it doesn't _have_ to be stored in a linear > > buffer, as the pack storage shows). > > I'm still puzzled why git calculated the object id based on {type, size, data} > together instead of just {data}? You'd have to ask Linus for the original reasoning. ;) But one nice thing about including these, especially the type, in the hash, is that the object id gives the complete context for an object. So if another object claims to point to a tree, say, and points to a blob instead, we can detect that problem immediately. Or worse, think about something like "git show 1234abcd". If the metadata was not part of the object, then how would we know if you wanted to show a commit, or a blob (that happens to look like a commit), etc? That metadata could be carried outside the hash, but then it has to be stored somewhere, and is subject to ending up mismatched to the contents. Hashing all of it (including the size) makes consistency checking much easier. > > For packed object, it effectively is metadata, just stuck at the front > > of the object contents, rather than in a separate table. That lets us > > use the same .idx file for finding that metadata as we do for the > > contents themselves (at the slight cost that if you're _just_ accessing > > metadata, the results are sparser within the file, which has worse > > behavior for cold-cache disks). > > Agree. But what if there is a metadata table in the .idx file? > We can even know the type and size of the object without accessing > the packfile. 
I'm not sure it would be any faster than accessing the packfile. If you stick the metadata in the .idx file's oid lookup table, then lookups perform a bit worse because you're wasting memory cache. If you make a separate table in the .idx file that's OK, but I'm not sure it's consistently better than finding the data in the packfile. The oid lookup table gives you a way to index the table in constant-time (if you store the table as fixed-size entries in sha1 order), but we can also access the packfile in constant-time (the idx table gives us offsets). The idx metadata table would have better cache behavior if you're only looking at metadata, and not contents. But otherwise it's worse (since you have to hit the packfile, too). And I cheated a bit to say "fixed-size" above; the packfile metadata is in a variable-length encoding, so in some ways it's more efficient. So I doubt it would make any operations appreciably faster, and even if it did, you'd possibly be trading off versus other operations. I think the more interesting metadata is not type/size, but properties such as those stored by the commit graph. And there we do have separate tables for fast access (and it's a _lot_ faster, because it's helping us avoid inflating the object contents). > > I'm not sure how much of a speedup it would yield in practice, though. > > If you're printing the object contents, then the extra lookup is > > probably not that expensive by comparison. > > > > I feel like this solution may not be feasible. After we get the type and size > for the first time, we go through different output processes for different types > of objects: use `stream_blob()` for blobs, and `read_object_file()` with > `batch_write()` for other objects. If we obtain the content of a blob in one > single read operation, then the performance optimization provided by > `stream_blob()` would be invalidated. Good point. So yeah, even to use it in today's code you'd need something conditional. 
A few years ago I played with an option for object_info that would let the caller say "please give me the object contents if they are smaller than N bytes, otherwise don't". And that would let many call-sites get type, size, and content together most of the time (for small objects), and then stream only when necessary. I still have the patches, and running them now it looks like there's about a 10% speedup running: git cat-file --unordered --batch-all-objects --batch >/dev/null Other code paths dealing with blobs would likewise get a small speedup, I'd think. I don't remember why I didn't send it. I think there was some ugly refactoring that I needed to double-check, and my attention just got pulled elsewhere. The messy patches are at: https://github.com/peff/git jk/object-info-round-trip if you're interested. -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
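The "give me the object contents if they are smaller than N bytes, otherwise don't" option described above can be sketched as a single lookup that conditionally bundles the contents with the metadata. This is a toy model of the idea, not the actual code on the jk/object-info-round-trip branch:

```python
def object_info(store, oid, content_limit):
    """Single lookup: always return type and size; include the contents
    only when the object is small enough, so large blobs can still be
    streamed by the caller."""
    objtype, data = store[oid]
    info = {"type": objtype, "size": len(data)}
    if len(data) <= content_limit:
        info["content"] = data    # small object: one round trip gets everything
    return info

store = {"small": ("blob", b"tiny"), "big": ("blob", b"x" * 1_000_000)}
assert object_info(store, "small", 4096)["content"] == b"tiny"
assert "content" not in object_info(store, "big", 4096)  # caller falls back to streaming
```

Most objects in a typical repository are small, which is why this shape of API can recover most of the duplicated-lookup cost while leaving `stream_blob()` intact for the large ones.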
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-14 7:30 ` Jeff King @ 2023-04-14 12:17 ` ZheNing Hu 2023-04-14 15:58 ` Junio C Hamano 2023-04-14 17:04 ` Linus Torvalds 0 siblings, 2 replies; 23+ messages in thread From: ZheNing Hu @ 2023-04-14 12:17 UTC (permalink / raw) To: Jeff King Cc: Taylor Blau, Junio C Hamano, Git List, johncai86, Linus Torvalds Jeff King <peff@peff.net> 于2023年4月14日周五 15:30写道: > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote: > > > > It's not just metadata; it's actually part of what we hash to get the > > > object id (though of course it doesn't _have_ to be stored in a linear > > > buffer, as the pack storage shows). > > > > I'm still puzzled why git calculated the object id based on {type, size, data} > > together instead of just {data}? > > You'd have to ask Linus for the original reasoning. ;) > > But one nice thing about including these, especially the type, in the > hash, is that the object id gives the complete context for an object. > So if another object claims to point to a tree, say, and points to a blob > instead, we can detect that problem immediately. > > Or worse, think about something like "git show 1234abcd". If the > metadata was not part of the object, then how would we know if you > wanted to show a commit, or a blob (that happens to look like a commit), > etc? That metadata could be carried outside the hash, but then it has to > be stored somewhere, and is subject to ending up mismatched to the > contents. Hashing all of it (including the size) makes consistency > checking much easier. > Oh, you are right, this could be to prevent conflicts between Git objects with identical content but different types. However, I always associate Git with the file system, where metadata such as file type and size is stored in the inode, while the file data is stored in separate chunks. 
> > > For packed object, it effectively is metadata, just stuck at the front > > > of the object contents, rather than in a separate table. That lets us > > > use the same .idx file for finding that metadata as we do for the > > > contents themselves (at the slight cost that if you're _just_ accessing > > > metadata, the results are sparser within the file, which has worse > > > behavior for cold-cache disks). > > > > Agree. But what if there is a metadata table in the .idx file? > > We can even know the type and size of the object without accessing > > the packfile. > > I'm not sure it would be any faster than accessing the packfile. If you > stick the metadata in the .idx file's oid lookup table, then lookups > perform a bit worse because you're wasting memory cache. If you make a > separate table in the .idx file that's OK, but I'm not sure it's > consistently better than finding the data in the packfile. > Yes, but it may be very convenient if we need to filter by object type or size. > The oid lookup table gives you a way to index the table in > constant-time (if you store the table as fixed-size entries in sha1 > order), but we can also access the packfile in constant-time (the idx > table gives us offsets). The idx metadata table would have better cache > behavior if you're only looking at metadata, and not contents. But > otherwise it's worse (since you have to hit the packfile, too). And I > cheated a bit to say "fixed-size" above; the packfile metadata is in a > variable-length encoding, so in some ways it's more efficient. > Yes, if we only use git cat-file --batch-check, we may be able to improve performance by avoiding access to the pack file. Additionally, I think this metadata table is very suitable for filtering and aggregating operations. > So I doubt it would make any operations appreciably faster, and even if > it did, you'd possibly be trading off versus other operations.
I think > the more interesting metadata is not type/size, but properties such as > those stored by the commit graph. And there we do have separate tables > for fast access (and it's a _lot_ faster, because it's helping us avoid > inflating the object contents). > Yeah, optimizing the retrieval of metadata such as type/size may not provide as much benefit as recording the commit properties in the metadata table, like the commit graph optimization does. > > > I'm not sure how much of a speedup it would yield in practice, though. > > > If you're printing the object contents, then the extra lookup is > > > probably not that expensive by comparison. > > > > > > > I feel like this solution may not be feasible. After we get the type and size > > for the first time, we go through different output processes for different types > > of objects: use `stream_blob()` for blobs, and `read_object_file()` with > > `batch_write()` for other objects. If we obtain the content of a blob in one > > single read operation, then the performance optimization provided by > > `stream_blob()` would be invalidated. > > Good point. So yeah, even to use it in today's code you'd need something > conditional. A few years ago I played with an option for object_info > that would let the caller say "please give me the object contents if > they are smaller than N bytes, otherwise don't". > > And that would let many call-sites get type, size, and content together > most of the time (for small objects), and then stream only when > necessary. I still have the patches, and running them now it looks like > there's about a 10% speedup running: > > git cat-file --unordered --batch-all-objects --batch >/dev/null > > Other code paths dealing with blobs would likewise get a small speedup, > I'd think. I don't remember why I didn't send it. I think there was some > ugly refactoring that I needed to double-check, and my attention just > got pulled elsewhere. 
The messy patches are at: > > https://github.com/peff/git jk/object-info-round-trip > > if you're interested. > Alright, this does feel a bit hackish, allowing most objects to fetch the content on the first read while blobs larger than N are still streamed via stream_blob(). I feel like this optimization for single reads is a bit off-topic; to quote your previous sentence: > So a nice thing about being able to do the filtering in one process is > that we could _eventually_ do it all with one object lookup. But I'd > probably wait on adding something like --type-filter until we have an > internal single-lookup API, and then we could time it to see how much > speedup we can get. This optimization for single reads doesn't seem to provide much benefit for implementing object filters, because we would have already read the object's content in advance? ZheNing Hu ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-14 12:17 ` ZheNing Hu @ 2023-04-14 15:58 ` Junio C Hamano 2023-04-16 11:15 ` ZheNing Hu 2023-04-14 17:04 ` Linus Torvalds 0 siblings, 2 replies; 23+ messages in thread From: Junio C Hamano @ 2023-04-14 15:58 UTC (permalink / raw) To: ZheNing Hu; +Cc: Jeff King, Taylor Blau, Git List, johncai86, Linus Torvalds ZheNing Hu <adlternative@gmail.com> writes: > Oh, you are right, this could be to prevent conflicts between Git objects > with identical content but different types. However, I always associate > Git with the file system, where metadata such as file type and size is > stored in the inode, while the file data is stored in separate chunks. I am afraid the presentation order Peff used caused a bit of confusion. The true reason is what Peff brought up as "Or worse". We need to be able to tell, given only the name of an object, everything that we need to know about the object, and for that, we need the type information when we ask for an object by its name. Having size embedded in the data that comes back to us when we consult the object database with an object name helps the implementation to pre-allocate a buffer and then inflate into it--there is no fundamental reason why it should be there. It is a secondary problem created by the design choice that we store type together with contents, that the object type recorded in a tree entry may contradict the actual type of the object recorded in the tree entry. We could have declared that the object type found in a tree entry is to be trusted, if we didn't record the type in the object database together with the object contents. I think your original question was not "why do we store type and size together with the contents?", but was "why do we include them in the hash computation?", and all of the above discusses a related tangent without touching the original question.
The need to have type or size available when we ask the object database for data associated with the object does not necessarily mean they must be hashed together with the contents. It was done merely because "why not? that way, we do not have to worry about catching corrupt values for type and size information we want to store together with the contents". IOW, we could have checksummed these two pieces of information separately, but why bother? ^ permalink raw reply [flat|nested] 23+ messages in thread
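The point above, that type and size are hashed together with the contents, is easy to verify from first principles: a git object id is the SHA-1 of `<type> <size>\0<payload>`, so identical payloads with different types get different ids. The expected value below is the same one `echo 'test content' | git hash-object --stdin` prints:

```python
import hashlib

def object_id(objtype: str, payload: bytes) -> str:
    """Compute a git (SHA-1) object id: hash of header + payload."""
    header = b"%s %d\x00" % (objtype.encode(), len(payload))
    return hashlib.sha1(header + payload).hexdigest()

data = b"test content\n"
print(object_id("blob", data))
# d670460b4b4aece5915caf5c68d12f560a9fe3e4 -- matches `git hash-object`
assert object_id("blob", data) != object_id("tag", data)  # the type changes the id
```

This is exactly the "no object type aliasing" property discussed in the thread: since the type participates in the hash, the same bytes can never be addressed as two different object types.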
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-14 15:58 ` Junio C Hamano @ 2023-04-16 11:15 ` ZheNing Hu 0 siblings, 0 replies; 23+ messages in thread From: ZheNing Hu @ 2023-04-16 11:15 UTC (permalink / raw) To: Junio C Hamano Cc: Jeff King, Taylor Blau, Git List, johncai86, Linus Torvalds Junio C Hamano <gitster@pobox.com> 于2023年4月14日周五 23:58写道: > > ZheNing Hu <adlternative@gmail.com> writes: > > > > Oh, you are right, this could be to prevent conflicts between Git objects > > with identical content but different types. However, I always associate > > Git with the file system, where metadata such as file type and size is > > stored in the inode, while the file data is stored in separate chunks. > > I am afraid the presentation order Peff used caused a bit of > confusion. The true reason is what Peff brought up as "Or worse". > We need to be able to tell, given only the name of an object, > everything that we need to know about the object, and for that, we > need the type information when we ask for an object by its name. > Having size embedded in the data that comes back to us when we > consult object database with an object name helps the implementation > to pre-allocate a buffer and then inflate into it--there is no > fundamental reason why it should be there. > Yes, I think I understand the point now. Since Git addresses objects based on their content, if type information is not included in the object, we cannot easily tell what type of Git object corresponds to a given object ID. Moreover, if we don't include type and size information in Git objects, we would need to maintain a large number of external tables to record this information, in order to inflate and identify the type. > It is a secondary problem created by the design choice that we store > type together with contents, that the object type recorded in a tree > entry may contradict the actual type of the object recorded in the tree > entry.
We could have declared that the object type found in a > tree entry is to be trusted, if we didn't record the type in the > object database together with the object contents. > Yes, that may not be crucial, but including type information in Git objects can help validate the correctness of tree entries better. > I think your original question was not "why do we store type and > size together with the contents?", but was "why do we include in the > hash computation?", and all of the above discuss related tangent > without touching the original question. > Yes, but I think these two problems should be similar. > The need to have type or size available when we ask the object > database for data associated with the object does not necessarily > mean they must be hashed together with the contents. It was done > merely because "why not? that way, we do not have to worry about > catching corrupt values for type and size information we want to > store together with the contents". IOW, we could have checksummed > these two pieces of information separately, but why bother? Thank you. I think I roughly understand. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-14 12:17 ` ZheNing Hu 2023-04-14 15:58 ` Junio C Hamano @ 2023-04-14 17:04 ` Linus Torvalds 2023-04-16 12:06 ` Felipe Contreras 2023-04-16 12:43 ` ZheNing Hu 1 sibling, 2 replies; 23+ messages in thread From: Linus Torvalds @ 2023-04-14 17:04 UTC (permalink / raw) To: ZheNing Hu; +Cc: Jeff King, Taylor Blau, Junio C Hamano, Git List, johncai86 On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@gmail.com> wrote: > > Jeff King <peff@peff.net> 于2023年4月14日周五 15:30写道: > > > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote: > > > > > > I'm still puzzled why git calculated the object id based on {type, size, data} > > > together instead of just {data}? > > > > You'd have to ask Linus for the original reasoning. ;) I originally thought of the git object store as "tagged pointers". That actually caused confusion initially when I tried to explain this to SCM people, because "tag" means something very different in an SCM environment than it means in computer architecture. And the implication of a tagged pointer is that you have two parts of it - the "tag" and the "address". Both are relevant at all points. This isn't quite as obvious in everyday modern git usage, because a lot of uses end up _only_ using the "address" (aka SHA1), but it's very much part of the object store design. Internally, the object layout never uses just the SHA1, it's all "type:SHA1", even if sometimes the types are implied (ie the tree object doesn't spell out "blob", but it's still explicit in the mode bits). This is very very obvious in "git cat-file", which was one of the original scripts in the first commit (but even there the tag/type has changed meaning over time: the very first version didn't use it as input at all, then it started verifying it, and then later it got the more subtle context of "peel the tags until you find this type").
You can also see this in the original README (again, go look at that first git commit): the README talks about the "tag of their type". Of course, in practice git then walked away from having to specify the type all the time. It started even in that original release, in that the HEAD file never contained the type - because it was implicit (a HEAD is always a commit). So we ended up having a lot of situations like that where the actual tag part was implicit from context, and these days people basically never refer to the "full" object name with tag, but only the SHA1 address. So now we have situations where the type really has to be looked up dynamically, because it's not explicitly encoded anywhere. While HEAD is supposed to always be a commit, other refs can be pretty much anything, and can point to a tag object, a commit, a tree or a blob. So then you actually have to look up the type based on the address. End result: these days people don't even think of git objects as "tagged pointers". Even internally in git, lots of code just passes the "object name" along without any tag/type, just the raw SHA1 / OID. So that originally "everything is a tagged pointer" is much less true than it used to be, and now, instead of having tagged pointers, you mostly end up with just "bare pointers" and look up the type dynamically from there. And that "look up the type in the object" is possible because even originally, I did *not* want any kind of "object type aliasing". So even when looking up the object with the full "tag:pointer", the encoding of the object itself then also contains that object type, so that you can cross-check that you used the right tag. That said, you *can* see some of the effects of this "tagged pointers" in how the internals do things like struct commit *commit = lookup_commit(repo, &oid); which conceptually very much is about tagged pointers. 
And the fact that two objects cannot alias is actually somewhat encoded in that: a "struct commit" contains a "struct object" as a member. But so does "struct blob" - and the two "struct object" cases are never the same "object". So there's never any worry about "could blob.object be the same object as commit.object"? That is actually inherent in the code, in how "lookup_commit()" actually does lookup_object() and then does object_as_type(OBJ_COMMIT) on the result. > Oh, you are right, this could be to prevent conflicts between Git objects > with identical content but different types. However, I always associate > Git with the file system, where metadata such as file type and size is > stored in the inode, while the file data is stored in separate chunks. See above: yes, git design was *also* influenced heavily by filesystems, but that was mostly in the sense of "this is how to encode these things without undue pain". The object database being immutable was partly a security and safety measure, but it was also very much partly a "rewriting files is going to be a major pain from a filesystem consistency standpoint - don't do it". But even more than a filesystem design, it's an "computer architecture" design. Think of the git object store as a very abstract computer architecture that has tagged pointers, stable storage, and no aliasing - and where the tag is actually verified at each lookup. The "no aliasing" means that no two distinct pointers can point to the same data. So a tagged pointer of type "commit" can not point to the same object as a tagged pointer of type "blob". They are distinct pointers, even if (maybe) the commit object encoding ends up then being identical to a blob object. And as mentioned, that "verified at each lookup" has mostly gone away, and "each lookup" has become more of a "can be verified by fsck", but it's probably still a good thing to think that way. 
You still have "lookup_object_by_type()" internally in git that takes the full tagged pointer, but almost nobody uses it any more. The closest you get is those "lookup_commit()" things (which are fairly common, still). Linus ^ permalink raw reply [flat|nested] 23+ messages in thread
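The `lookup_commit()` pattern described above - a bare lookup by address followed by an `object_as_type()` cross-check, so that two types can never alias the same object - can be sketched as a toy model. None of this is git's actual object.c code; the names merely echo the functions mentioned in the thread:

```python
class GitObject:
    def __init__(self, oid, objtype):
        self.oid, self.objtype = oid, objtype

def lookup_object(repo, oid):
    """Bare pointer lookup: just the address, no type attached."""
    return repo[oid]

def object_as_type(obj, wanted):
    """The cross-check: the type stored in the object must match the
    tag we looked it up with, so types can never alias."""
    if obj.objtype != wanted:
        raise TypeError(f"object {obj.oid} is a {obj.objtype}, not a {wanted}")
    return obj

def lookup_commit(repo, oid):
    """Tagged-pointer lookup: tag ('commit') plus address."""
    return object_as_type(lookup_object(repo, oid), "commit")

repo = {"1234abcd": GitObject("1234abcd", "blob")}
try:
    lookup_commit(repo, "1234abcd")
except TypeError as e:
    print(e)   # object 1234abcd is a blob, not a commit
```

The check is exactly what makes "git show 1234abcd" able to complain when an id of the wrong type is used in a context that demands a commit.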
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-14 17:04 ` Linus Torvalds @ 2023-04-16 12:06 ` Felipe Contreras 2023-04-16 12:43 ` ZheNing Hu 1 sibling, 0 replies; 23+ messages in thread From: Felipe Contreras @ 2023-04-16 12:06 UTC (permalink / raw) To: Linus Torvalds, ZheNing Hu Cc: Jeff King, Taylor Blau, Junio C Hamano, Git List, johncai86 Linus Torvalds wrote: > On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@gmail.com> wrote: > > > > Jeff King <peff@peff.net> 于2023年4月14日周五 15:30写道: > > > > > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote: > > > > > > > > I'm still puzzled why git calculated the object id based on {type, size, data} > > > > together instead of just {data}? > > > > > > You'd have to ask Linus for the original reasoning. ;) > > I originally thought of the git object store as "tagged pointers". > > That actually caused confusion initially when I tried to explain this > to SCM people, because "tag" means something very different in an SCM > environment than it means in computer architecture. > > And the implication of a tagged pointer is that you have two parts of > it - the "tag" and the "address". Both are relevant at all points. > > This isn't quite as obvious in everyday moden git usage, because a lot > of uses end up _only_ using the "address" (aka SHA1), but it's very > much part of the object store design. Internally, the object layout > never uses just the SHA1, it's all "type:SHA1", even if sometimes the > types are implied (ie the tree object doesn't spell out "blob", but > it's still explicit in the mode bits). > > This is very very obvious in "git cat-file", which was one of the > original scripts in the first commit (but even there the tag/type has > changed meaning over time: the very first version didn't use it as > input at all, then it started verifying it, and then later it got the > more subtle context of "peel the tags until you find this type"). 
> > You can also see this in the original README (again, go look at that > first git commit): the README talks about the "tag of their type". > > Of course, in practice git then walked away from having to specify the > type all the time. It started even in that original release, in that > the HEAD file never contained the type - because it was implicit (a > HEAD is always a commit). > > So we ended up having a lot of situations like that where the actual > tag part was implicit from context, and these days people basically > never refer to the "full" object name with tag, but only the SHA1 > address. > > So now we have situations where the type really has to be looked up > dynamically, because it's not explicitly encoded anywhere. While HEAD > is supposed to always be a commit, other refs can be pretty much > anything, and can point to a tag object, a commit, a tree or a blob. > So then you actually have to look up the type based on the address. > > End result: these days people don't even think of git objects as > "tagged pointers". Even internally in git, lots of code just passes > the "object name" along without any tag/type, just the raw SHA1 / OID. > > So that originally "everything is a tagged pointer" is much less true > than it used to be, and now, instead of having tagged pointers, you > mostly end up with just "bare pointers" and look up the type > dynamically from there. > > And that "look up the type in the object" is possible because even > originally, I did *not* want any kind of "object type aliasing". > > So even when looking up the object with the full "tag:pointer", the > encoding of the object itself then also contains that object type, so > that you can cross-check that you used the right tag. > > That said, you *can* see some of the effects of this "tagged pointers" > in how the internals do things like > > struct commit *commit = lookup_commit(repo, &oid); > > which conceptually very much is about tagged pointers. 
And the fact > that two objects cannot alias is actually somewhat encoded in that: a > "struct commit" contains a "struct object" as a member. But so does > "struct blob" - and the two "struct object" cases are never the same > "object". > > So there's never any worry about "could blob.object be the same object > as commit.object"? > > That is actually inherent in the code, in how "lookup_commit()" > actually does lookup_object() and then does object_as_type(OBJ_COMMIT) > on the result. This explains rather well why the object type is used in the calculation, and it makes sense. But I don't see anything about the object size. Isn't that unnecessary? -- Felipe Contreras ^ permalink raw reply [flat|nested] 23+ messages in thread
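The size question can be answered by looking at how loose objects are named: the object ID is the hash of a header "<type> <size>\0" followed by the raw data, so both the type and the size feed into the ID. A minimal shell sketch (assuming SHA-1, Git's historical default hash, and coreutils `sha1sum`):

```shell
# Git names an object by hashing "<type> <size>\0<data>".
# "hello\n" is 6 bytes, so as a blob it is hashed with header "blob 6\0";
# this reproduces what `echo hello | git hash-object --stdin` prints.
printf 'blob 6\0hello\n' | sha1sum

# Hashing the same bytes under a different type header yields a
# different ID, so objects of different types can never collide.
printf 'tag 6\0hello\n' | sha1sum
```

The size is arguably redundant with the data's own length, but having it in the hashed header means a truncated or corrupted loose object is detectable as soon as it is inflated.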
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-14 17:04 ` Linus Torvalds 2023-04-16 12:06 ` Felipe Contreras @ 2023-04-16 12:43 ` ZheNing Hu 1 sibling, 0 replies; 23+ messages in thread From: ZheNing Hu @ 2023-04-16 12:43 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff King, Taylor Blau, Junio C Hamano, Git List, johncai86 Linus Torvalds <torvalds@linux-foundation.org> 于2023年4月15日周六 01:05写道: > > On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@gmail.com> wrote: > > > > Jeff King <peff@peff.net> 于2023年4月14日周五 15:30写道: > > > > > > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote: > > > > > > > > I'm still puzzled why git calculated the object id based on {type, size, data} > > > > together instead of just {data}? > > > > > > You'd have to ask Linus for the original reasoning. ;) > > I originally thought of the git object store as "tagged pointers". > > That actually caused confusion initially when I tried to explain this > to SCM people, because "tag" means something very different in an SCM > environment than it means in computer architecture. > > And the implication of a tagged pointer is that you have two parts of > it - the "tag" and the "address". Both are relevant at all points. > > This isn't quite as obvious in everyday moden git usage, because a lot > of uses end up _only_ using the "address" (aka SHA1), but it's very > much part of the object store design. Internally, the object layout > never uses just the SHA1, it's all "type:SHA1", even if sometimes the > types are implied (ie the tree object doesn't spell out "blob", but > it's still explicit in the mode bits). > > This is very very obvious in "git cat-file", which was one of the > original scripts in the first commit (but even there the tag/type has > changed meaning over time: the very first version didn't use it as > input at all, then it started verifying it, and then later it got the > more subtle context of "peel the tags until you find this type"). 
> Yes, in Git's initial commit, "git cat-file" only needed the object ID to obtain both the content and the type of an object. Modern "git cat-file", however, requires specifying both the expected object type and the object ID by default, e.g. "git cat-file commit v2.9.1". This should be the simultaneous appearance of "tag" and "address" that you mentioned. That design is not very user-friendly, so nowadays people prefer "git cat-file -p", which may be quite similar to the initial version of git cat-file. > You can also see this in the original README (again, go look at that > first git commit): the README talks about the "tag of their type". > > Of course, in practice git then walked away from having to specify the > type all the time. It started even in that original release, in that > the HEAD file never contained the type - because it was implicit (a > HEAD is always a commit). > > So we ended up having a lot of situations like that where the actual > tag part was implicit from context, and these days people basically > never refer to the "full" object name with tag, but only the SHA1 > address. > > So now we have situations where the type really has to be looked up > dynamically, because it's not explicitly encoded anywhere. While HEAD > is supposed to always be a commit, other refs can be pretty much > anything, and can point to a tag object, a commit, a tree or a blob. > So then you actually have to look up the type based on the address. > > End result: these days people don't even think of git objects as > "tagged pointers". Even internally in git, lots of code just passes > the "object name" along without any tag/type, just the raw SHA1 / OID. > > So that originally "everything is a tagged pointer" is much less true > than it used to be, and now, instead of having tagged pointers, you > mostly end up with just "bare pointers" and look up the type > dynamically from there. 
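The contrast between the historical and the modern calling conventions can be seen side by side. A sketch, runnable in any repository with at least one commit:

```shell
# Historical form: the caller supplies the "tag" (expected type) along
# with the "address"; git fails if the object is not of that type.
git cat-file commit HEAD

# Modern forms: the type is looked up dynamically from the object.
git cat-file -p HEAD   # pretty-print whatever HEAD turns out to be
git cat-file -t HEAD   # just report the type (prints "commit")
```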
> I feel that if the type had not been included in objects from the start, people would need to specify both "tag" and "address" together to interpret an object. As it is, this "tag" can only be used for checking the type, which is unnecessary in most cases. > And that "look up the type in the object" is possible because even > originally, I did *not* want any kind of "object type aliasing". > > So even when looking up the object with the full "tag:pointer", the > encoding of the object itself then also contains that object type, so > that you can cross-check that you used the right tag. > > That said, you *can* see some of the effects of this "tagged pointers" > in how the internals do things like > > struct commit *commit = lookup_commit(repo, &oid); > > which conceptually very much is about tagged pointers. And the fact > that two objects cannot alias is actually somewhat encoded in that: a > "struct commit" contains a "struct object" as a member. But so does > "struct blob" - and the two "struct object" cases are never the same > "object". > > So there's never any worry about "could blob.object be the same object > as commit.object"? > Yes, if an object could be interpreted as multiple types, it would certainly be very difficult for Git's higher-level logic to handle. > That is actually inherent in the code, in how "lookup_commit()" > actually does lookup_object() and then does object_as_type(OBJ_COMMIT) > on the result. > > > Oh, you are right, this could be to prevent conflicts between Git objects > > with identical content but different types. However, I always associate > > Git with the file system, where metadata such as file type and size is > > stored in the inode, while the file data is stored in separate chunks. > > See above: yes, git design was *also* influenced heavily by > filesystems, but that was mostly in the sense of "this is how to > encode these things without undue pain". 
> > The object database being immutable was partly a security and safety > measure, but it was also very much partly a "rewriting files is going > to be a major pain from a filesystem consistency standpoint - don't do > it". > You're right. Git objects are immutable, while data in a file system is mutable, so Git's design doesn't need to follow the file system completely. > But even more than a filesystem design, it's an "computer > architecture" design. Think of the git object store as a very abstract > computer architecture that has tagged pointers, stable storage, and no > aliasing - and where the tag is actually verified at each lookup. > > The "no aliasing" means that no two distinct pointers can point to the > same data. So a tagged pointer of type "commit" can not point to the > same object as a tagged pointer of type "blob". They are distinct > pointers, even if (maybe) the commit object encoding ends up then > being identical to a blob object. > > And as mentioned, that "verified at each lookup" has mostly gone away, > and "each lookup" has become more of a "can be verified by fsck", but > it's probably still a good thing to think that way. > > You still have "lookup_object_by_type()" internally in git that takes > the full tagged pointer, but almost nobody uses it any more. The > closest you get is those "lookup_commit()" things (which are fairly > common, still). > Well, now I understand: everything in Git's architecture is "tag:pointer". Tags are used to verify object types (although it's not necessary now), and pointers are used for addressing. This is also one of the reasons why Git initially included the type in its objects. > Linus Thank you for your wonderful answer regarding the design concept of "tagged pointers"! This deepens my understanding of Git's design. :-) ZheNing Hu ^ permalink raw reply [flat|nested] 23+ messages in thread
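The "no aliasing" property is easy to observe directly: identical bytes hashed under different type tags yield different object names. A sketch using `git hash-object` (the `--literally` flag skips the format checks that would otherwise reject "hello" as a commit):

```shell
# Same data, different type tag => different object ID, so a commit
# pointer and a blob pointer can never refer to the same object.
echo hello | git hash-object -t blob --stdin --literally
echo hello | git hash-object -t commit --stdin --literally
```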
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-07 16:30 ` Junio C Hamano 2023-04-08 6:27 ` ZheNing Hu @ 2023-04-09 1:26 ` Taylor Blau 1 sibling, 0 replies; 23+ messages in thread From: Taylor Blau @ 2023-04-09 1:26 UTC (permalink / raw) To: Junio C Hamano; +Cc: ZheNing Hu, Git List, johncai86 On Fri, Apr 07, 2023 at 09:30:18AM -0700, Junio C Hamano wrote: > ZheNing Hu <adlternative@gmail.com> writes: > > > all blobs, and then use `git cat-file --batch` to retrieve them. This > > is not very elegant, or in other words, it might be better to have an > > internal implementation of filtering within `git cat-file > > --batch-all-objects`. > > It does sound prominently elegant to have each tool does one task > and does it well, and being able to flexibly combine them to achieve > a larger task. Yeah, agreed. It may be *convenient* to have an easy-to-reach option in cat-file like '--exclude-type=tree,commit,tag' or something. But the argument falls on a pretty slippery slope, as I think you note below. > Is the object type the only thing that people often would want to > base their filtering decision on? Will we then see somebody else > request a "--size-filter", and then somebody else realizes that the > filtering criteria based on size need to be different between blobs > (most likely counted in bytes) and trees (it may be more convenient > to count the tree entries, not byes)? It sounds rather messy and > we may be better off having such an extensible logic in one place. > > Like rev-list's object list filtering, that is. Yes, exactly. This definitely feels like a "do one thing and do it well". `rev-list` is the tool we have for listing revisions and objects, and it can produce output that is compatible with the kind of input that other tools (like `cat-file`) can interpret. Thanks, Taylor ^ permalink raw reply [flat|nested] 23+ messages in thread
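For what it's worth, the size-based case hypothesized above already exists on the rev-list side as the `blob:limit` filter, which supports the argument for keeping the filtering logic in one place. A sketch (these filters arrived with partial clone, around Git 2.17; `object:type` came later, in 2.32):

```shell
# Omit blobs larger than 1 KiB from the object listing; trees,
# commits, tags, and smaller blobs are still reported.
git rev-list --all --objects --filter='blob:limit=1k'
```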
* Re: [Question] Can git cat-file have a type filtering option? 2023-04-07 14:24 [Question] Can git cat-file have a type filtering option? ZheNing Hu 2023-04-07 16:30 ` Junio C Hamano @ 2023-04-09 1:23 ` Taylor Blau 1 sibling, 0 replies; 23+ messages in thread From: Taylor Blau @ 2023-04-09 1:23 UTC (permalink / raw) To: ZheNing Hu; +Cc: Git List, Junio C Hamano, johncai86 On Fri, Apr 07, 2023 at 10:24:22PM +0800, ZheNing Hu wrote: > However, `git cat-file` already has a `--filters` option, which is > used to "show content as transformed by filters". I'm not sure if > there is a better word to implement the functionality of filtering by > type? For example, `--type-filter`? There is the `--filter='object:type=blob'` that should do what you're looking for. In other words, if you wanted to dump the contents of all blobs in your repository, this should do the trick: $ git rev-list --all --objects --filter='object:type=blob' | git cat-file --batch[=<format>] Thanks, Taylor ^ permalink raw reply [flat|nested] 23+ messages in thread
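If the goal really is to stay within `--batch-all-objects` (which also reports unreachable objects that rev-list will not walk), one workable pattern today is a cheap metadata-only pass filtered by type, fed back into a second cat-file. A sketch using standard `--batch-check` format atoms:

```shell
# Pass 1: list type and OID only (metadata, no content retrieval);
# pass 2: fetch full contents for just the blobs.
git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
awk '$1 == "blob" { print $2 }' |
git cat-file --batch
```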
end of thread, other threads:[~2023-04-16 12:43 UTC | newest] Thread overview: 23+ messages -- 2023-04-07 14:24 [Question] Can git cat-file have a type filtering option? ZheNing Hu 2023-04-07 16:30 ` Junio C Hamano 2023-04-08 6:27 ` ZheNing Hu 2023-04-09 1:28 ` Taylor Blau 2023-04-09 2:19 ` Taylor Blau 2023-04-09 2:26 ` Taylor Blau 2023-04-09 6:51 ` ZheNing Hu 2023-04-10 20:01 ` Jeff King 2023-04-10 23:20 ` Taylor Blau 2023-04-09 6:47 ` ZheNing Hu 2023-04-10 20:14 ` Jeff King 2023-04-11 14:09 ` ZheNing Hu 2023-04-12 7:43 ` Jeff King 2023-04-12 9:57 ` ZheNing Hu 2023-04-14 7:30 ` Jeff King 2023-04-14 12:17 ` ZheNing Hu 2023-04-14 15:58 ` Junio C Hamano 2023-04-16 11:15 ` ZheNing Hu 2023-04-14 17:04 ` Linus Torvalds 2023-04-16 12:06 ` Felipe Contreras 2023-04-16 12:43 ` ZheNing Hu 2023-04-09 1:26 ` Taylor Blau 2023-04-09 1:23 ` Taylor Blau