git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Kjetil Barvik <barvik@broadpark.no>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH/RFC 1/4] Optimised, faster, more effective symlink/directory detection
Date: Tue, 06 Jan 2009 10:36:59 +0100	[thread overview]
Message-ID: <86fxjwlqro.fsf@broadpark.no> (raw)
In-Reply-To: <7vprj0j181.fsf@gitster.siamese.dyndns.org>

Junio C Hamano <gitster@pobox.com> writes:

> Kjetil Barvik <barvik@broadpark.no> writes:
>
>> +static inline void
>> +update_path_cache(unsigned int ret_flags, unsigned int track_flags,
>> +		  int last_slash)
>> +{
>> +	/* Max 3 different path types can be cached for the moment! */
>> +	unsigned int save_flags =
>> +		ret_flags & track_flags & (LSTAT_DIR|
>> +					   LSTAT_NOTDIR|
>> +					   LSTAT_SYMLINK);
>> +	if (save_flags && last_slash > 0 && last_slash < PATH_MAX) {
>> +		cache_flags = save_flags;
>> +		cache_len   = last_slash;
>> +	} else {
>> +		cache_flags = 0;
>> +		cache_len   = 0;
>> +	}
>> +}
>
> I personally found this inline function with a single call site
> distracting in following the logic.  It does not make the indentation
> level shallower, either.  Also, the else part should probably call
> clear_lstat_cache() to protect it from possible future enhancements to add
> more state variables.

  Ok, I will remove the function and update the else-part.

>> +
>> +/*
>> + * Check if name 'name' of length 'len' has a symlink leading
>> + * component, or if the directory exists and is real, or not.
>> + *
>> + * To speed up the check, some information is allowed to be cached.
>> + * This is indicated by the 'track_flags' argument.
>> + */
>> +unsigned int
>> +check_lstat_cache(int len, const char *name, unsigned int track_flags)
>> +{
>> +	int match_len, last_slash, max_len;
>> +	unsigned int match_flags, ret_flags;
>> +	struct stat st;
>> +
>> +	/* Check if match from the cache for 2 "excluding" path types.
>> +	 */
>> +	match_len = last_slash =
>> +		greatest_common_path_cache_prefix(len, name);
>> +	match_flags =
>> +		cache_flags & track_flags & (LSTAT_NOTDIR|
>> +					     LSTAT_SYMLINK);
>> +	if (match_flags && match_len == cache_len)
>> +		return match_flags;
>
> Let me see if I understand the logic behind this caching.  When you have
> checked A/B/C earlier and you already know B is a symlink, you remember
> that A/B was a symlink..  You can fill a request to check A/B/$whatever
> (assuming A/B does not change --- otherwise the caller should clear the
> cache) from the cached data, because no matter what $whatever is, it will
> result in the same "has-leading-symlink".
>
> Similarly, if you know A/B is not a directory from an earlier test, you
> know that a request to check A/B/$whatever will result in the same ENOTDIR
> no matter what $whatever is, so you can return early.
>
> The above "return match_flags" will not trigger if the cached path does
> not have any leading symlink.  So we know the matched part are all good
> directories when we start lstat() loop.
>
> Am I following you so far?

  Yes you do!

>> +	/* Okay, no match from the cache so far, so now we have to
>> +	 * check the rest of the path components and update the cache.
>> +	 */
>> +	ret_flags = LSTAT_DIR;
>> +	max_len = len < PATH_MAX ? len : PATH_MAX;
>> +	while (match_len < max_len) {
>> +		do {
>> +			cache_path[match_len] = name[match_len];
>> +			match_len++;
>> +		} while (match_len < max_len && name[match_len] != '/');
>
> You take one component from the input, and append it to the part that is
> already known to be true directory (i.e. cached part and the part earlier
> iteration of the loop checked so far), to be tested by lstat()...
>
>> +		if (match_len >= max_len)
>> +			break;
>
> ... but you are not interested in the full input.  

  If the lengt of name is larger than PATH_MAX all lstat() calls would
  fail with an ENAMETOOLONG error, at least on my Linux box, so I
  thought that it was not nessarry to test further.  But maybe we should
  emdiatly return an error if name is too long?

> We are only checking the leading path (e.g. check for "A/B/C" may
> lstat() "A", "A/B" but not "A/B/C").

  That is correct, and it is the same logic as in the
  has_symlink_leading_path() function.

>> +		last_slash = match_len;
>> +		cache_path[last_slash] = '\0';
>> +
>> +		if (lstat(cache_path, &st)) {
>> +			ret_flags = LSTAT_LSTATERR;
>> +			if (errno == ENOENT || errno == ENOTDIR)
>> +				ret_flags |= LSTAT_NOTDIR;
>
> If you tested "A/B" here and got ENOENT back, you know "A/B" does not
> exist; you cache this knowledge as "A/B is not a directory" 

  Correct.

> (I also think you could use it as a cached knowledge that "A exists
> and is a directory".  I am not sure if you are taking advantage of
> that).

  It does take adavantage of this fact.  It will also do simmilar things
  with a symlink cached path.

> What I do not understand about this code is the ENOTDIR case.  If you
> tested "A/B" and got ENOTDIR back, what you know is that "A" is not a
> directory (if the path tested this round were deeper like "X/Y/A/B", you
> know "X/Y/A" is not a directory, and you know "X" and "X/Y" are true
> directories; otherwise the loop would have exited before this round when
> you tested "X" or "X/Y" in the earlier rounds).

  Since the cache is supoosed to start from a known existing directory,
  and is testing each path component when the lstat("A/B") calls returns
  ENOTDIR, we should know the fact that the directory "A" exists, and
  that "A/B/" does not exists.
  
> So as far as I can think of, ENOENT case and ENOTDIR case would give you
> different information (ENOENT would say "A is a dir, A/B is not"; ENOTDIR
> would say "A is not a dir").  I am confused how you can cache the same
> path and same flag between these two cases here.

  I admit that I used copy-and-paste from other similar test's from the
  sourcecode:

kjetil@localhost ~/git/git $ grep -Hn ENOENT.*ENOTDIR *.c
builtin-apply.c:2364:	else if ((errno != ENOENT) && (errno != ENOTDIR))
builtin-update-index.c:74: *  - missing file (ENOENT or ENOTDIR). That's ok if we're
builtin-update-index.c:81:	if (err == ENOENT || err == ENOTDIR)
diff-lib.c:30:		if (errno != ENOENT && errno != ENOTDIR)
lstat_cache.c:81:			if (errno == ENOENT || errno == ENOTDIR)
setup.c:191:	if (errno != ENOENT && errno != ENOTDIR)

  From the 'man lstat' page I read the following for this 2 error codes:

    ENOENT    A component of the path _path_ does not exist, or the path is an empty string.
    ENOTDIR   A component of the path is not a directory.

  I would have guessed that for what we is looking for, it would be
  correct to threat both these error codes as the same.  Is this wrong?

  -- kjetil

  parent reply	other threads:[~2009-01-06  9:38 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-01-05 13:09 [PATCH/RFC 0/4] git checkout: optimise away lots of lstat() calls Kjetil Barvik
2009-01-05 13:09 ` [PATCH/RFC 1/4] Optimised, faster, more effective symlink/directory detection Kjetil Barvik
2009-01-06  8:19   ` Junio C Hamano
2009-01-06  8:56     ` Junio C Hamano
2009-01-06  9:56       ` Kjetil Barvik
2009-01-06 13:45       ` Kjetil Barvik
2009-01-06  9:36     ` Kjetil Barvik [this message]
2009-01-05 13:09 ` [PATCH/RFC 2/4] Use 'lstat_cache()' instead of 'has_symlink_leading_path()' Kjetil Barvik
2009-01-06  8:19   ` Junio C Hamano
2009-01-06 12:50     ` Kjetil Barvik
2009-01-05 13:10 ` [PATCH/RFC 3/4] create_directories() inside entry.c: only check each directory once! Kjetil Barvik
2009-01-05 13:10 ` [PATCH/RFC 4/4] remove the old 'has_symlink_leading_path()' function Kjetil Barvik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86fxjwlqro.fsf@broadpark.no \
    --to=barvik@broadpark.no \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).