From: Jeff Hostetler <git@jeffhostetler.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, gitster@pobox.com, peff@peff.net,
jeffhost@microsoft.com
Subject: Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list
Date: Wed, 27 Sep 2017 13:04:42 -0400 [thread overview]
Message-ID: <d5de0eda-18aa-a796-e7d0-d536d2e59605@jeffhostetler.com> (raw)
In-Reply-To: <20170926153141.6a8d7e5024eeed5bbda838c8@google.com>
On 9/26/2017 6:31 PM, Jonathan Tan wrote:
> On Fri, 22 Sep 2017 20:26:22 +0000
> Jeff Hostetler <git@jeffhostetler.com> wrote:
>
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Create traverse_commit_list_filtered() and add filtering
>
> You mention _filtered() here, but this patch contains _worker().
thanks.
>> list-objects.h | 30 ++++++++++++++++++++++++++
>> 2 files changed, 80 insertions(+), 16 deletions(-)
>>
>> diff --git a/list-objects.c b/list-objects.c
>> index b3931fa..3e86008 100644
>> --- a/list-objects.c
>> +++ b/list-objects.c
>> @@ -13,10 +13,13 @@ static void process_blob(struct rev_info *revs,
>> show_object_fn show,
>> struct strbuf *path,
>> const char *name,
>> - void *cb_data)
>> + void *cb_data,
>> + filter_object_fn filter,
>> + void *filter_data)
>
> I had some thoughts about modifying "show" (instead of adding "filter",
> as you have done in this patch) to indicate to the caller that the
> corresponding object should not be marked "seen". That does have the
> advantage that we don't have so many callbacks flying around, but it
> also makes things more complicated, so I don't know which is better.
I wanted the filtering function to be independent of the showing function.
As you can see in pack-objects.c and rev-list.c: All I needed to change
there was to call the new version of the traverse code and not alter the
show function. The filtering is completely isolated inside the traverse
code.
It also means that the filtering code can do higher-level filtering.
Currently, the show-object callback gets called with each terminal node
(blob) and it has already lost any chance for tree-level optimizations.
The filtering code participates in the treewalk on the commit and can do
subtree elimination.
>
>> + if (filter) {
>> + r = filter(LOFT_END_TREE, obj, base->buf, &base->buf[baselen], filter_data);
>> + if (r & LOFR_MARK_SEEN)
>> + obj->flags |= SEEN;
>> + if (r & LOFR_SHOW)
>> + show(obj, base->buf, cb_data);
>> + }
>
> This LOFT_END_TREE was added to support the code in
> list-objects-filter-sparse - would it be OK to completely remove the
> optimization involving the "provisional" omission of blobs? (I don't
> think the exact same tree object will occur multiple times often.) If
> yes, then I think this block can be removed too and will simplify the
> code.
The sparse filter is looking at pathnames and using the same rules
as sparse-checkout to decide which to *include* in the result. This
is essentially backwards from the other filters which are looking for
reasons to *exclude* a blob. If I see a {pathname, sha} pair and the
pathname is not wanted (by the sparse-checkout rules), I still don't
know anything about the object -- since the same SHA may appear later
in the treewalk but with a different pathname that *does* match the
patterns and *is* wanted.
The net-net is that I have to mark the blob as "provisionally omitted"
until I've seen all the pathnames associated with it. Only then can I
say it should be omitted.
Likewise, there are things about the tree object that we cannot
decide until we've seen all possible directory paths that reference it.
For example, if you rename a tree/directory between 2 commits, but make no
other changes within the directory, it will/should have the same SHA in the
second commit. If one of those paths is included in the sparse-checkout
and one is not, then you need include those blobs (and the tree object)
in the result. If the treewalk visits the excluded case first, you don't
want to discard the tree object (and shortcut future treewalks) because
the filter won't get a chance to see the included directory path case.
Also, the current code does not attempt to omit tree objects, but a
future version may. And having the _BEGIN_ and _END_ events means the
filter can keep track of the nesting without the expense of having to
discover it by parsing the pathname looking for slashes as we do elsewhere.
>
>> +/* See object.h and revision.h */
>> +#define FILTER_REVISIT (1<<25)
>
> If you do add this, also update object.h. But I don't think this is the
> right place to do it - it's probably better in
> list-objects-filter-sparse.
>
>> +typedef enum list_objects_filter_result list_objects_filter_result;
>> +typedef enum list_objects_filter_type list_objects_filter_type;
>
> I don't think Git style is to typedef enums.
>
>> +void traverse_commit_list_worker(
>> + struct rev_info *,
>> + show_commit_fn, show_object_fn, void *show_data,
>> + filter_object_fn filter, void *filter_data);
>
> Here (and throughout), as far as I can tell, the style is to indent to
> the character after the opening parenthesis.
>
right. thanks.
next prev parent reply other threads:[~2017-09-27 17:04 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-09-22 20:26 [PATCH 00/13] RFC object filtering for parital clone Jeff Hostetler
2017-09-22 20:26 ` [PATCH 01/13] dir: refactor add_excludes() Jeff Hostetler
2017-09-22 20:26 ` [PATCH 02/13] oidset2: create oidset subclass with object length and pathname Jeff Hostetler
2017-09-22 20:42 ` Brandon Williams
2017-09-26 22:20 ` Jonathan Tan
2017-09-27 14:47 ` Jeff Hostetler
2017-09-22 20:26 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
2017-09-26 22:31 ` Jonathan Tan
2017-09-27 17:04 ` Jeff Hostetler [this message]
2017-09-27 18:00 ` Jonathan Tan
2017-09-27 19:09 ` Jeff Hostetler
2017-09-27 20:49 ` Jonathan Tan
2017-09-22 20:26 ` [PATCH 04/13] list-objects-filter-all: add filter to omit all blobs Jeff Hostetler
2017-09-23 0:39 ` [PATCH 00/13] RFC object filtering for parital clone Jonathan Tan
2017-09-26 14:55 ` Jeff Hostetler
2017-09-26 19:23 ` Jeff Hostetler
-- strict thread matches above, loose matches on Subject: below --
2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
2017-10-24 18:53 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
2017-10-25 4:05 ` Jonathan Tan
2017-10-25 19:25 ` Jeff Hostetler
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d5de0eda-18aa-a796-e7d0-d536d2e59605@jeffhostetler.com \
--to=git@jeffhostetler.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jeffhost@microsoft.com \
--cc=jonathantanmy@google.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).