git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Is --filter-print-omitted correct/used/needed?
@ 2019-06-06 19:02 Emily Shaffer
  2019-06-07  6:38 ` Christian Couder
  0 siblings, 1 reply; 4+ messages in thread
From: Emily Shaffer @ 2019-06-06 19:02 UTC (permalink / raw)
  To: Git List

Yesterday while down a rabbit hole, I discovered an interesting thing
about the 'omitted' object in many filters in
list-objects-filter-options.h.  It appears that when we call
list-objects.h:traverse_commit_list_filtered() with a non-NULL
'omitted' argument, we still perform a walk of all objects - that is,
the filtered walk is no more efficient than the unfiltered walk from
the same place via 'traverse_commit_list()'. I verified by calling
each and counting the objects:

161         if (0) {
162                 /* Unfiltered: */
163                 printf(_("Unfiltered object walk.\n"));
164                 traverse_commit_list(rev, walken_show_commit,
165                                 walken_show_object, NULL);
166         } else {
167                 printf(_("Filtered object walk with filterspec
'tree:1'.\n"));
168                 /*
169                  * We can parse a tree depth of 1 to demonstrate
the kind of
170                  * filtering that could occur eg during shallow
cloning.
171                  */
172                 parse_list_objects_filter(&filter_options,
"tree:1");
173
174                 traverse_commit_list_filtered(&filter_options,
rev,
175                         walken_show_commit, walken_show_object,
NULL, &omitted);
176         }
177
178         /* Count the omitted objects. */
179         oidset_iter_init(&omitted, &oit);
180
181         while ((oid = oidset_iter_next(&oit)))
182                 omitted_count++;
183
184         printf(_("Object walk completed. Found %d commits, %d
blobs, %d tags, "
185                "and %d trees; %d omitted objects.\n"),
commit_count,
186                blob_count, tag_count, tree_count, omitted_count);

I found that omitted_count was always equal to the difference between
sum(blob_count, tag_count, tree_count, commit_count) in the unfiltered
and filtered walks. I also found that the length of time required to
perform the unfiltered walk and the filtered-with-non-NULL-omitted
walk was the same, while the time required to perform the filtered
walk with NULL omitted was significantly shorter. (The walk in
question was over the latest release of Git master, plus the ten or so
commits in my feature branch.)

I was surprised! I figured that with filter "tree:1" that "omitted"
would contain only the objects on the "border" of the filter - that
is, I assumed it would contain the blobs and trees in the root git
dir, but none of those trees' blobs and trees. After talking with
jrnieder at length, it sounds like neither of us were clear on why
this "omitted" list would be needed beyond the initial development
stage of a new filter... Jonathan's impression was also that if we do
need the "omitted" list, it may be a bug that we're still traversing
objects which are only reachable from objects already omitted.

I grepped the Git source and found that we only provide a non-NULL
"omitted" when someone calls "git rev-list --filter-print-omitted",
which we verify with a simple test case for "blobs:none", in which
case the "border" objects which were omitted must be the same as all
objects which were omitted (since blobs aren't pointing to anything
else). I think if we had written a similar test case with some trees
we expect to omit we might have noticed sooner.

Since I was already in the rabbit hole, out of curiosity I did a
search on Github and only found one project referencing
--filter-print-omitted which wasn't a mirror of Git:
https://github.com/search?l=Python&q=%22--filter-print-omitted%22&type=Code

So, what do we use --filter-print-omitted for? Is anybody needing it?
Or do we just use it to verify this one test case? Should we fix it,
or get rid of it, or neither?

 - Emily

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Is --filter-print-omitted correct/used/needed?
  2019-06-06 19:02 Is --filter-print-omitted correct/used/needed? Emily Shaffer
@ 2019-06-07  6:38 ` Christian Couder
  2019-06-07 18:57   ` Jeff Hostetler
  0 siblings, 1 reply; 4+ messages in thread
From: Christian Couder @ 2019-06-07  6:38 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jeff Hostetler

On Thu, Jun 6, 2019 at 10:18 PM Emily Shaffer <emilyshaffer@google.com> wrote:



> I grepped the Git source and found that we only provide a non-NULL
> "omitted" when someone calls "git rev-list --filter-print-omitted",
> which we verify with a simple test case for "blobs:none", in which
> case the "border" objects which were omitted must be the same as all
> objects which were omitted (since blobs aren't pointing to anything
> else). I think if we had written a similar test case with some trees
> we expect to omit we might have noticed sooner.

It seems that --filter-print-omitted was introduced in caf3827e2f
(rev-list: add list-objects filtering support, 2017-11-21) so I cc'ed
Jeff.

[...]

> So, what do we use --filter-print-omitted for? Is anybody needing it?
> Or do we just use it to verify this one test case? Should we fix it,
> or get rid of it, or neither?

In caf3827e2f there is:

    This patch introduces handling of missing objects to help
    debugging and development of the "partial clone" mechanism,
    and once the mechanism is implemented, for a power user to
    perform operations that are missing-object aware without
    incurring the cost of checking if a missing link is expected.

So I would say that if you think that --filter-print-omitted doesn't
help in debugging or development, and can even be confusing, and that
it also doesn't help performance for power users or anyone else, then
it would make sense to remove it, unless you find a way to make it
fulfill its original goals, or maybe other worthwhile goals.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Is --filter-print-omitted correct/used/needed?
  2019-06-07  6:38 ` Christian Couder
@ 2019-06-07 18:57   ` Jeff Hostetler
  2019-06-07 21:10     ` Emily Shaffer
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff Hostetler @ 2019-06-07 18:57 UTC (permalink / raw)
  To: Christian Couder, Emily Shaffer; +Cc: Git List



On 6/7/2019 2:38 AM, Christian Couder wrote:
> On Thu, Jun 6, 2019 at 10:18 PM Emily Shaffer <emilyshaffer@google.com> wrote:
> 
> 
> 
>> I grepped the Git source and found that we only provide a non-NULL
>> "omitted" when someone calls "git rev-list --filter-print-omitted",
>> which we verify with a simple test case for "blobs:none", in which
>> case the "border" objects which were omitted must be the same as all
>> objects which were omitted (since blobs aren't pointing to anything
>> else). I think if we had written a similar test case with some trees
>> we expect to omit we might have noticed sooner.
> 
> It seems that --filter-print-omitted was introduced in caf3827e2f
> (rev-list: add list-objects filtering support, 2017-11-21) so I cc'ed
> Jeff.
> 
> [...]

The --filter-print-omitted was intended to print the complete list
of omitted objects.  For example, a packfile built from a filtered
command and a packfile build from the unfiltered command would differ
by exactly that set of objects.

So the discrepancy reported by the tree:1 example is incorrect.
The omitted set is the full set, not the frontier.  So when
--filter-print-omitted is used, we still have to do the full tree walk.
When not specified, we do get the perf boost because we can terminate
the tree walk early.


>> So, what do we use --filter-print-omitted for? Is anybody needing it?
>> Or do we just use it to verify this one test case? Should we fix it,
>> or get rid of it, or neither?
>
> In caf3827e2f there is:
> 
>      This patch introduces handling of missing objects to help
>      debugging and development of the "partial clone" mechanism,
>      and once the mechanism is implemented, for a power user to
>      perform operations that are missing-object aware without
>      incurring the cost of checking if a missing link is expected.
> 
> So I would say that if you think that --filter-print-omitted doesn't
> help in debugging or development, and can even be confusing, and that
> it also doesn't help performance for power users or anyone else, then
> it would make sense to remove it, unless you find a way to make it
> fulfill its original goals, or maybe other worthwhile goals.

I don't currently have a use for that (other than the existing test
cases), but we could use that in the future as a guide for the server
to put the omitted objects on a CDN, for example.

So I'd say let's leave it as is for now.


Jeff




^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Is --filter-print-omitted correct/used/needed?
  2019-06-07 18:57   ` Jeff Hostetler
@ 2019-06-07 21:10     ` Emily Shaffer
  0 siblings, 0 replies; 4+ messages in thread
From: Emily Shaffer @ 2019-06-07 21:10 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Christian Couder, Git List

On Fri, Jun 07, 2019 at 02:57:01PM -0400, Jeff Hostetler wrote:
> 
> 
> On 6/7/2019 2:38 AM, Christian Couder wrote:
> > On Thu, Jun 6, 2019 at 10:18 PM Emily Shaffer <emilyshaffer@google.com> wrote:
> > 
> > 
> > 
> > > I grepped the Git source and found that we only provide a non-NULL
> > > "omitted" when someone calls "git rev-list --filter-print-omitted",
> > > which we verify with a simple test case for "blobs:none", in which
> > > case the "border" objects which were omitted must be the same as all
> > > objects which were omitted (since blobs aren't pointing to anything
> > > else). I think if we had written a similar test case with some trees
> > > we expect to omit we might have noticed sooner.
> > 
> > It seems that --filter-print-omitted was introduced in caf3827e2f
> > (rev-list: add list-objects filtering support, 2017-11-21) so I cc'ed
> > Jeff.
> > 
> > [...]
> 
> The --filter-print-omitted was intended to print the complete list
> of omitted objects.  For example, a packfile built from a filtered
> command and a packfile build from the unfiltered command would differ
> by exactly that set of objects.
> 
> So the discrepancy reported by the tree:1 example is incorrect.
> The omitted set is the full set, not the frontier.  So when
> --filter-print-omitted is used, we still have to do the full tree walk.
> When not specified, we do get the perf boost because we can terminate
> the tree walk early.
> 
> 
> > > So, what do we use --filter-print-omitted for? Is anybody needing it?
> > > Or do we just use it to verify this one test case? Should we fix it,
> > > or get rid of it, or neither?
> > 
> > In caf3827e2f there is:
> > 
> >      This patch introduces handling of missing objects to help
> >      debugging and development of the "partial clone" mechanism,
> >      and once the mechanism is implemented, for a power user to
> >      perform operations that are missing-object aware without
> >      incurring the cost of checking if a missing link is expected.
> > 
> > So I would say that if you think that --filter-print-omitted doesn't
> > help in debugging or development, and can even be confusing, and that
> > it also doesn't help performance for power users or anyone else, then
> > it would make sense to remove it, unless you find a way to make it
> > fulfill its original goals, or maybe other worthwhile goals.
> 
> I don't currently have a use for that (other than the existing test
> cases), but we could use that in the future as a guide for the server
> to put the omitted objects on a CDN, for example.
> 
> So I'd say let's leave it as is for now.

Thanks for the input, Jeff. I wasn't sure from looking at it whether it
was intended behavior with a plan for use; looks like it is. I'll leave
it alone.

> 
> 
> Jeff
> 
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-06-07 21:10 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-06-06 19:02 Is --filter-print-omitted correct/used/needed? Emily Shaffer
2019-06-07  6:38 ` Christian Couder
2019-06-07 18:57   ` Jeff Hostetler
2019-06-07 21:10     ` Emily Shaffer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).