From: Kjetil Barvik <barvik@broadpark.no>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH/RFC 1/4] Optimised, faster, more effective symlink/directory detection
Date: Tue, 06 Jan 2009 10:36:59 +0100 [thread overview]
Message-ID: <86fxjwlqro.fsf@broadpark.no> (raw)
In-Reply-To: <7vprj0j181.fsf@gitster.siamese.dyndns.org>
Junio C Hamano <gitster@pobox.com> writes:
> Kjetil Barvik <barvik@broadpark.no> writes:
>
>> +static inline void
>> +update_path_cache(unsigned int ret_flags, unsigned int track_flags,
>> + int last_slash)
>> +{
>> + /* Max 3 different path types can be cached for the moment! */
>> + unsigned int save_flags =
>> + ret_flags & track_flags & (LSTAT_DIR|
>> + LSTAT_NOTDIR|
>> + LSTAT_SYMLINK);
>> + if (save_flags && last_slash > 0 && last_slash < PATH_MAX) {
>> + cache_flags = save_flags;
>> + cache_len = last_slash;
>> + } else {
>> + cache_flags = 0;
>> + cache_len = 0;
>> + }
>> +}
>
> I personally found this inline function with a single call site
> distracting in following the logic. It does not make the indentation
> level shallower, either. Also, the else part should probably call
> clear_lstat_cache() to protect it from possible future enhancements to add
> more state variables.
Ok, I will remove the function and update the else-part.
>> +
>> +/*
>> + * Check if name 'name' of length 'len' has a symlink leading
>> + * component, or if the directory exists and is real, or not.
>> + *
>> + * To speed up the check, some information is allowed to be cached.
>> + * This is indicated by the 'track_flags' argument.
>> + */
>> +unsigned int
>> +check_lstat_cache(int len, const char *name, unsigned int track_flags)
>> +{
>> + int match_len, last_slash, max_len;
>> + unsigned int match_flags, ret_flags;
>> + struct stat st;
>> +
>> + /* Check if match from the cache for 2 "excluding" path types.
>> + */
>> + match_len = last_slash =
>> + greatest_common_path_cache_prefix(len, name);
>> + match_flags =
>> + cache_flags & track_flags & (LSTAT_NOTDIR|
>> + LSTAT_SYMLINK);
>> + if (match_flags && match_len == cache_len)
>> + return match_flags;
>
> Let me see if I understand the logic behind this caching. When you have
> checked A/B/C earlier and you already know B is a symlink, you remember
> that A/B was a symlink.. You can fill a request to check A/B/$whatever
> (assuming A/B does not change --- otherwise the caller should clear the
> cache) from the cached data, because no matter what $whatever is, it will
> result in the same "has-leading-symlink".
>
> Similarly, if you know A/B is not a directory from an earlier test, you
> know that a request to check A/B/$whatever will result in the same ENOTDIR
> no matter what $whatever is, so you can return early.
>
> The above "return match_flags" will not trigger if the cached path does
> not have any leading symlink. So we know the matched part are all good
> directories when we start lstat() loop.
>
> Am I following you so far?
Yes you do!
>> + /* Okay, no match from the cache so far, so now we have to
>> + * check the rest of the path components and update the cache.
>> + */
>> + ret_flags = LSTAT_DIR;
>> + max_len = len < PATH_MAX ? len : PATH_MAX;
>> + while (match_len < max_len) {
>> + do {
>> + cache_path[match_len] = name[match_len];
>> + match_len++;
>> + } while (match_len < max_len && name[match_len] != '/');
>
> You take one component from the input, and append it to the part that is
> already known to be true directory (i.e. cached part and the part earlier
> iteration of the loop checked so far), to be tested by lstat()...
>
>> + if (match_len >= max_len)
>> + break;
>
> ... but you are not interested in the full input.
If the lengt of name is larger than PATH_MAX all lstat() calls would
fail with an ENAMETOOLONG error, at least on my Linux box, so I
thought that it was not nessarry to test further. But maybe we should
emdiatly return an error if name is too long?
> We are only checking the leading path (e.g. check for "A/B/C" may
> lstat() "A", "A/B" but not "A/B/C").
That is correct, and it is the same logic as in the
has_symlink_leading_path() function.
>> + last_slash = match_len;
>> + cache_path[last_slash] = '\0';
>> +
>> + if (lstat(cache_path, &st)) {
>> + ret_flags = LSTAT_LSTATERR;
>> + if (errno == ENOENT || errno == ENOTDIR)
>> + ret_flags |= LSTAT_NOTDIR;
>
> If you tested "A/B" here and got ENOENT back, you know "A/B" does not
> exist; you cache this knowledge as "A/B is not a directory"
Correct.
> (I also think you could use it as a cached knowledge that "A exists
> and is a directory". I am not sure if you are taking advantage of
> that).
It does take adavantage of this fact. It will also do simmilar things
with a symlink cached path.
> What I do not understand about this code is the ENOTDIR case. If you
> tested "A/B" and got ENOTDIR back, what you know is that "A" is not a
> directory (if the path tested this round were deeper like "X/Y/A/B", you
> know "X/Y/A" is not a directory, and you know "X" and "X/Y" are true
> directories; otherwise the loop would have exited before this round when
> you tested "X" or "X/Y" in the earlier rounds).
Since the cache is supoosed to start from a known existing directory,
and is testing each path component when the lstat("A/B") calls returns
ENOTDIR, we should know the fact that the directory "A" exists, and
that "A/B/" does not exists.
> So as far as I can think of, ENOENT case and ENOTDIR case would give you
> different information (ENOENT would say "A is a dir, A/B is not"; ENOTDIR
> would say "A is not a dir"). I am confused how you can cache the same
> path and same flag between these two cases here.
I admit that I used copy-and-paste from other similar test's from the
sourcecode:
kjetil@localhost ~/git/git $ grep -Hn ENOENT.*ENOTDIR *.c
builtin-apply.c:2364: else if ((errno != ENOENT) && (errno != ENOTDIR))
builtin-update-index.c:74: * - missing file (ENOENT or ENOTDIR). That's ok if we're
builtin-update-index.c:81: if (err == ENOENT || err == ENOTDIR)
diff-lib.c:30: if (errno != ENOENT && errno != ENOTDIR)
lstat_cache.c:81: if (errno == ENOENT || errno == ENOTDIR)
setup.c:191: if (errno != ENOENT && errno != ENOTDIR)
From the 'man lstat' page I read the following for this 2 error codes:
ENOENT A component of the path _path_ does not exist, or the path is an empty string.
ENOTDIR A component of the path is not a directory.
I would have guessed that for what we is looking for, it would be
correct to threat both these error codes as the same. Is this wrong?
-- kjetil
next prev parent reply other threads:[~2009-01-06 9:38 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-05 13:09 [PATCH/RFC 0/4] git checkout: optimise away lots of lstat() calls Kjetil Barvik
2009-01-05 13:09 ` [PATCH/RFC 1/4] Optimised, faster, more effective symlink/directory detection Kjetil Barvik
2009-01-06 8:19 ` Junio C Hamano
2009-01-06 8:56 ` Junio C Hamano
2009-01-06 9:56 ` Kjetil Barvik
2009-01-06 13:45 ` Kjetil Barvik
2009-01-06 9:36 ` Kjetil Barvik [this message]
2009-01-05 13:09 ` [PATCH/RFC 2/4] Use 'lstat_cache()' instead of 'has_symlink_leading_path()' Kjetil Barvik
2009-01-06 8:19 ` Junio C Hamano
2009-01-06 12:50 ` Kjetil Barvik
2009-01-05 13:10 ` [PATCH/RFC 3/4] create_directories() inside entry.c: only check each directory once! Kjetil Barvik
2009-01-05 13:10 ` [PATCH/RFC 4/4] remove the old 'has_symlink_leading_path()' function Kjetil Barvik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=86fxjwlqro.fsf@broadpark.no \
--to=barvik@broadpark.no \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.