From: Kjetil Barvik <barvik@broadpark.no>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH/RFC 1/4] Optimised, faster, more effective symlink/directory detection
Date: Tue, 06 Jan 2009 10:36:59 +0100 [thread overview]
Message-ID: <86fxjwlqro.fsf@broadpark.no> (raw)
In-Reply-To: <7vprj0j181.fsf@gitster.siamese.dyndns.org>
Junio C Hamano <gitster@pobox.com> writes:
> Kjetil Barvik <barvik@broadpark.no> writes:
>
>> +static inline void
>> +update_path_cache(unsigned int ret_flags, unsigned int track_flags,
>> + int last_slash)
>> +{
>> + /* Max 3 different path types can be cached for the moment! */
>> + unsigned int save_flags =
>> + ret_flags & track_flags & (LSTAT_DIR|
>> + LSTAT_NOTDIR|
>> + LSTAT_SYMLINK);
>> + if (save_flags && last_slash > 0 && last_slash < PATH_MAX) {
>> + cache_flags = save_flags;
>> + cache_len = last_slash;
>> + } else {
>> + cache_flags = 0;
>> + cache_len = 0;
>> + }
>> +}
>
> I personally found this inline function with a single call site
> distracting in following the logic. It does not make the indentation
> level shallower, either. Also, the else part should probably call
> clear_lstat_cache() to protect it from possible future enhancements to add
> more state variables.
Ok, I will remove the function and update the else-part.
>> +
>> +/*
>> + * Check if name 'name' of length 'len' has a symlink leading
>> + * component, or if the directory exists and is real, or not.
>> + *
>> + * To speed up the check, some information is allowed to be cached.
>> + * This is indicated by the 'track_flags' argument.
>> + */
>> +unsigned int
>> +check_lstat_cache(int len, const char *name, unsigned int track_flags)
>> +{
>> + int match_len, last_slash, max_len;
>> + unsigned int match_flags, ret_flags;
>> + struct stat st;
>> +
>> + /* Check if match from the cache for 2 "excluding" path types.
>> + */
>> + match_len = last_slash =
>> + greatest_common_path_cache_prefix(len, name);
>> + match_flags =
>> + cache_flags & track_flags & (LSTAT_NOTDIR|
>> + LSTAT_SYMLINK);
>> + if (match_flags && match_len == cache_len)
>> + return match_flags;
>
> Let me see if I understand the logic behind this caching. When you have
> checked A/B/C earlier and you already know B is a symlink, you remember
> that A/B was a symlink.. You can fill a request to check A/B/$whatever
> (assuming A/B does not change --- otherwise the caller should clear the
> cache) from the cached data, because no matter what $whatever is, it will
> result in the same "has-leading-symlink".
>
> Similarly, if you know A/B is not a directory from an earlier test, you
> know that a request to check A/B/$whatever will result in the same ENOTDIR
> no matter what $whatever is, so you can return early.
>
> The above "return match_flags" will not trigger if the cached path does
> not have any leading symlink. So we know the matched part are all good
> directories when we start lstat() loop.
>
> Am I following you so far?
Yes you do!
>> + /* Okay, no match from the cache so far, so now we have to
>> + * check the rest of the path components and update the cache.
>> + */
>> + ret_flags = LSTAT_DIR;
>> + max_len = len < PATH_MAX ? len : PATH_MAX;
>> + while (match_len < max_len) {
>> + do {
>> + cache_path[match_len] = name[match_len];
>> + match_len++;
>> + } while (match_len < max_len && name[match_len] != '/');
>
> You take one component from the input, and append it to the part that is
> already known to be true directory (i.e. cached part and the part earlier
> iteration of the loop checked so far), to be tested by lstat()...
>
>> + if (match_len >= max_len)
>> + break;
>
> ... but you are not interested in the full input.
If the lengt of name is larger than PATH_MAX all lstat() calls would
fail with an ENAMETOOLONG error, at least on my Linux box, so I
thought that it was not nessarry to test further. But maybe we should
emdiatly return an error if name is too long?
> We are only checking the leading path (e.g. check for "A/B/C" may
> lstat() "A", "A/B" but not "A/B/C").
That is correct, and it is the same logic as in the
has_symlink_leading_path() function.
>> + last_slash = match_len;
>> + cache_path[last_slash] = '\0';
>> +
>> + if (lstat(cache_path, &st)) {
>> + ret_flags = LSTAT_LSTATERR;
>> + if (errno == ENOENT || errno == ENOTDIR)
>> + ret_flags |= LSTAT_NOTDIR;
>
> If you tested "A/B" here and got ENOENT back, you know "A/B" does not
> exist; you cache this knowledge as "A/B is not a directory"
Correct.
> (I also think you could use it as a cached knowledge that "A exists
> and is a directory". I am not sure if you are taking advantage of
> that).
It does take adavantage of this fact. It will also do simmilar things
with a symlink cached path.
> What I do not understand about this code is the ENOTDIR case. If you
> tested "A/B" and got ENOTDIR back, what you know is that "A" is not a
> directory (if the path tested this round were deeper like "X/Y/A/B", you
> know "X/Y/A" is not a directory, and you know "X" and "X/Y" are true
> directories; otherwise the loop would have exited before this round when
> you tested "X" or "X/Y" in the earlier rounds).
Since the cache is supoosed to start from a known existing directory,
and is testing each path component when the lstat("A/B") calls returns
ENOTDIR, we should know the fact that the directory "A" exists, and
that "A/B/" does not exists.
> So as far as I can think of, ENOENT case and ENOTDIR case would give you
> different information (ENOENT would say "A is a dir, A/B is not"; ENOTDIR
> would say "A is not a dir"). I am confused how you can cache the same
> path and same flag between these two cases here.
I admit that I used copy-and-paste from other similar test's from the
sourcecode:
kjetil@localhost ~/git/git $ grep -Hn ENOENT.*ENOTDIR *.c
builtin-apply.c:2364: else if ((errno != ENOENT) && (errno != ENOTDIR))
builtin-update-index.c:74: * - missing file (ENOENT or ENOTDIR). That's ok if we're
builtin-update-index.c:81: if (err == ENOENT || err == ENOTDIR)
diff-lib.c:30: if (errno != ENOENT && errno != ENOTDIR)
lstat_cache.c:81: if (errno == ENOENT || errno == ENOTDIR)
setup.c:191: if (errno != ENOENT && errno != ENOTDIR)
From the 'man lstat' page I read the following for this 2 error codes:
ENOENT A component of the path _path_ does not exist, or the path is an empty string.
ENOTDIR A component of the path is not a directory.
I would have guessed that for what we is looking for, it would be
correct to threat both these error codes as the same. Is this wrong?
-- kjetil
next prev parent reply other threads:[~2009-01-06 9:38 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-05 13:09 [PATCH/RFC 0/4] git checkout: optimise away lots of lstat() calls Kjetil Barvik
2009-01-05 13:09 ` [PATCH/RFC 1/4] Optimised, faster, more effective symlink/directory detection Kjetil Barvik
2009-01-06 8:19 ` Junio C Hamano
2009-01-06 8:56 ` Junio C Hamano
2009-01-06 9:56 ` Kjetil Barvik
2009-01-06 13:45 ` Kjetil Barvik
2009-01-06 9:36 ` Kjetil Barvik [this message]
2009-01-05 13:09 ` [PATCH/RFC 2/4] Use 'lstat_cache()' instead of 'has_symlink_leading_path()' Kjetil Barvik
2009-01-06 8:19 ` Junio C Hamano
2009-01-06 12:50 ` Kjetil Barvik
2009-01-05 13:10 ` [PATCH/RFC 3/4] create_directories() inside entry.c: only check each directory once! Kjetil Barvik
2009-01-05 13:10 ` [PATCH/RFC 4/4] remove the old 'has_symlink_leading_path()' function Kjetil Barvik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=86fxjwlqro.fsf@broadpark.no \
--to=barvik@broadpark.no \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).