* Corner case involving null sha1, alternates, cache misses, and submodule config API
@ 2016-12-25 2:29 Mike Hommey
2016-12-26 20:32 ` Stefan Beller
0 siblings, 1 reply; 2+ messages in thread
From: Mike Hommey @ 2016-12-25 2:29 UTC (permalink / raw)
To: hvoigt, sbeller, gitster; +Cc: git
Hi,
As you might be aware, I'm working on a mercurial remote helper for git.
The way it stores metadata for mercurial manifests abuses "commit"
references in trees, which are normally used for submodules.
Some operations in the helper use git diff-tree on those trees to find
files faster than just using ls-tree on every commit would.
Anyways, long story short, it turns out that a combination of
everything mentioned in the subject of this email causes running git
diff-tree -r --stdin with a list of 300k+ pairs of commits to take 10
minutes, when (after investigation) adding --ignore-submodules=dirty
made it take 1 minute instead, for the exact same 3GB output.
It turns out, this all starts in is_submodule_ignored(), which contains:
if (!DIFF_OPT_TST(options, OVERRIDE_SUBMODULE_CONFIG))
set_diffopt_flags_from_submodule_config(options, path);
And set_diffopt_flags_from_submodule_config calls:
submodule_from_path(null_sha1, path);
And because there is no actual submodule involved, at some point that
null_sha1 ends up in the call to read_sha1_file from
submodule-config.c's config_from, which then proceeds to try to open the
null sha1 as a loose object in every alternate, doing multiple system
calls in each directory for something that is bound to fail. And to add
pain to injury, it repeats that for each and every line of input to git
diff-tree because the object cache doesn't care about storing negatives
(which makes perfect sense for most cases).
Even worse, when read_object returns NULL because the object doesn't
exist, read_sha1_file_extended calls has_loose_object which does
another set of system calls.
Now, while I realize my use case is very atypical, and that I should
just use --ignore-submodule=dirty, the fact that using the null sha1 can
trigger such behavior strikes me as a footgun that would be better
avoided. Especially when you factor the fact that
read_sha1_file_extended calls lookup_replace_object_extended, which
suggests one might interfere by creating a replace object for the null
sha1. (BTW, it's not entirely clear to me, in the context of actual
submodules, what the various --ignore-submodule options are supposed to
mean for trees that are not the current HEAD ; also, the manual page say
"all" is the default, but that doesn't appear to be true)
From a cursory look at the output of `git grep \\bnull_sha1` it doesn't
look like the null sha1 is used anywhere else in a similar fashion where
it can be attempted to be read as an object. So, one could consider this
is something the submodule config code should handle on its own by
treating the null_sha1 argument to submodule_from_path (really
config_from) specially. After all, gitmodule_sha1_from_commit already
avoids a get_sha1() call when it's given the null sha1.
OTOH, it seems submodule_from_path and submodule_from_name, the only two
public functions that end up in config_from(), are *always* called with
either the null sha1 or a literal null pointer. The *only* calls to
these functions that doesn't involve a null sha1 or a null pointer is
from test code. So all in all, I'm not entirely sure what this sha1
argument is all about in the first place.
However, an argument could be made that null_sha1 should be treated
specially at a lower level (read_sha1_file, I guess).
What would be sensible to do here?
Mike
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: Corner case involving null sha1, alternates, cache misses, and submodule config API
2016-12-25 2:29 Corner case involving null sha1, alternates, cache misses, and submodule config API Mike Hommey
@ 2016-12-26 20:32 ` Stefan Beller
0 siblings, 0 replies; 2+ messages in thread
From: Stefan Beller @ 2016-12-26 20:32 UTC (permalink / raw)
To: Mike Hommey, Josh Triplett, Brandon Williams
Cc: Heiko Voigt, Junio C Hamano, git@vger.kernel.org
On Sat, Dec 24, 2016 at 6:29 PM, Mike Hommey <mh@glandium.org> wrote:
> Hi,
>
> As you might be aware, I'm working on a mercurial remote helper for git.
> The way it stores metadata for mercurial manifests abuses "commit"
> references in trees, which are normally used for submodules.
>
> Some operations in the helper use git diff-tree on those trees to find
> files faster than just using ls-tree on every commit would.
>
> Anyways, long story short, it turns out that a combination of
> everything mentioned in the subject of this email causes running git
> diff-tree -r --stdin with a list of 300k+ pairs of commits to take 10
> minutes, when (after investigation) adding --ignore-submodules=dirty
> made it take 1 minute instead, for the exact same 3GB output.
>
> It turns out, this all starts in is_submodule_ignored(), which contains:
>
> if (!DIFF_OPT_TST(options, OVERRIDE_SUBMODULE_CONFIG))
> set_diffopt_flags_from_submodule_config(options, path);
>
> And set_diffopt_flags_from_submodule_config calls:
>
> submodule_from_path(null_sha1, path);
>
> And because there is no actual submodule involved, at some point that
> null_sha1 ends up in the call to read_sha1_file from
> submodule-config.c's config_from, which then proceeds to try to open the
> null sha1 as a loose object in every alternate, doing multiple system
> calls in each directory for something that is bound to fail. And to add
> pain to injury, it repeats that for each and every line of input to git
> diff-tree because the object cache doesn't care about storing negatives
> (which makes perfect sense for most cases).
>
> Even worse, when read_object returns NULL because the object doesn't
> exist, read_sha1_file_extended calls has_loose_object which does
> another set of system calls.
>
> Now, while I realize my use case is very atypical, and that I should
> just use --ignore-submodule=dirty, the fact that using the null sha1 can
> trigger such behavior strikes me as a footgun that would be better
> avoided. Especially when you factor the fact that
> read_sha1_file_extended calls lookup_replace_object_extended, which
> suggests one might interfere by creating a replace object for the null
> sha1. (BTW, it's not entirely clear to me, in the context of actual
> submodules, what the various --ignore-submodule options are supposed to
> mean for trees that are not the current HEAD ; also, the manual page say
> "all" is the default, but that doesn't appear to be true)
I think at this high level we should start optimizing submodules, i.e.
not HEAD / working tree -> select less aggressive submodule config.
>
> From a cursory look at the output of `git grep \\bnull_sha1` it doesn't
> look like the null sha1 is used anywhere else in a similar fashion where
> it can be attempted to be read as an object. So, one could consider this
> is something the submodule config code should handle on its own by
> treating the null_sha1 argument to submodule_from_path (really
> config_from) specially. After all, gitmodule_sha1_from_commit already
> avoids a get_sha1() call when it's given the null sha1.
>
> OTOH, it seems submodule_from_path and submodule_from_name, the only two
> public functions that end up in config_from(), are *always* called with
> either the null sha1 or a literal null pointer.
When Heiko introduced the submodule config, the consensus
was to have a flexible API to lookup submodule configuration.
The sha1 argument defines where to lookup the submodule
config (e.g. you could have moved a submodule such that the
submodule_from_path returns a different submodule for HEAD^
than HEAD.
The null_sha1 is used for current working tree configuration, i.e.
load .gitmodules and then overload with .git/config.
origin/bw/grep-submodules changes the behavior as it preloads the
null_sha1 to be an actual commit-ish (HEAD^').
I may make use of it in the checkout --recurse-submodules series.
> The *only* calls to
> these functions that doesn't involve a null sha1 or a null pointer is
> from test code. So all in all, I'm not entirely sure what this sha1
> argument is all about in the first place.
See Documentation/technical/api-submodule-config.txt (IIRC)
in next(?).
>
> However, an argument could be made that null_sha1 should be treated
> specially at a lower level (read_sha1_file, I guess).
>
> What would be sensible to do here?
As said above, the nul_sha1 is (ab-)used to mean
"use the preloaded config" which as of now is .git/config on top
of .gitmodules in the working tree.
I would assume we want another earlier and cheaper switch
for all these functions. This may also come in handy for the
git-series, that uses gitlinks as non-submodules.
Something like
int treat_gitlinks_as_submodule()
which can be configured to be "auto"[1] and then
"on", "off" that can be set for tools such as mercurial
remote helper or git-series.
[1] the default that
requires some slightly more expensive guess work, e.g.
the existence of the .gitmodules file. c.f.
https://github.com/git/git/commit/1863e05af5be73d9f7b9a1d22a33ca9849726623
static void preset_submodule_default(void)
would be a starting point to cheaply estimate
if gitlinks are submodules.
>
> Mike
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2016-12-26 20:32 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-12-25 2:29 Corner case involving null sha1, alternates, cache misses, and submodule config API Mike Hommey
2016-12-26 20:32 ` Stefan Beller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).