From: Karsten Blees <karsten.blees@gmail.com>
To: Duy Nguyen <pclouds@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: [RFC] Speed up "git status" by caching untracked file info
Date: Tue, 22 Apr 2014 20:56:27 +0200 [thread overview]
Message-ID: <5356BB5B.90903@gmail.com> (raw)
In-Reply-To: <CACsJy8AFDquHEajJQa_LWhtAMO-vHQbMCFuPV6UkGNZXTbANZw@mail.gmail.com>
Am 22.04.2014 12:35, schrieb Duy Nguyen:
> On Tue, Apr 22, 2014 at 5:13 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>>> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?
>>
>> No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.
>
> Well name-hash is only used when core.ignorecase is set. So it's
> optional.
This is only true for the case-insensitive directory hash. The file hash ('cache_file_exists') is always used to skip expensive excluded checks for tracked files.
'cache_file_exists' basically treats faster lookups for higher setup costs, which makes perfect sense when scanning the entire work tree. However, if most of the directory info is cached and just a few directories need refresh (and core.ignorecase=false), binary search ('cache_name_pos') may be better. The difficulty is to decide when to choose one over the other :-)
> Maybe we could save it in a separate index extension, but we
> need to verify that the reader uses the same hash function as the
> writer.
>
>>> Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...
>>
>> I missed this. We could ignore check_only if caching is enabled, but
>> that does not sound really good. Let me think about it more..
>
> We could save "check_only" to the cache as well. This way we don't
> have to disable the check_only trick completely.
>
> So we process a directory with check_only set, find one untracked
> entry and stop short. We store check_only value and the status ("found
> something") in addition to dir mtime. Next time we check the dir's
> mtime. If it matches and is called with check_only set, we know there
> is at least one untracked entry, that's enough to stop r_d_r and
> return early. If dir mtime does not match, or r_d_r is called without
> check_only, we ignore the cached data and fall back to opendir.
>
> Sounds good?
>
What about untracked files in sub-directories? E.g. you have untracked dirs a/b with untracked file a/b/c, so normal 'git status' would list 'a/' as untracked.
Now, 'rm a/b/c' would update mtime of b, but not of a, so you'd still list 'a/' as untracked. Same thing for 'echo "c" >a/b/.gitignore'.
Your solution could work if you additionally cache the directories that had to be scanned to find that first untracked file (but you probably had that in mind anyway).
If the cache is only used for certain dir_struct.flags combinations, you can probably get around saving the check_only flag (which can only ever be true if both DIR_SHOW_OTHER_DIRECTORIES and DIR_HIDE_EMPTY_DIRECTORIES are set (which is the default for 'git status')).
next prev parent reply other threads:[~2014-04-22 18:56 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-04-17 5:51 [RFC] Speed up "git status" by caching untracked file info Nguyễn Thái Ngọc Duy
2014-04-17 19:40 ` Junio C Hamano
2014-04-17 23:27 ` Duy Nguyen
2014-04-22 9:56 ` Karsten Blees
2014-04-22 10:13 ` Duy Nguyen
2014-04-22 10:35 ` Duy Nguyen
2014-04-22 18:56 ` Karsten Blees [this message]
2014-04-23 0:52 ` Duy Nguyen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5356BB5B.90903@gmail.com \
--to=karsten.blees@gmail.com \
--cc=git@vger.kernel.org \
--cc=pclouds@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).