From: Karsten Blees <karsten.blees@gmail.com>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>, git@vger.kernel.org
Subject: Re: [RFC] Speed up "git status" by caching untracked file info
Date: Tue, 22 Apr 2014 11:56:02 +0200 [thread overview]
Message-ID: <53563CB2.5030603@gmail.com> (raw)
In-Reply-To: <1397713918-22829-1-git-send-email-pclouds@gmail.com>
Am 17.04.2014 07:51, schrieb Nguyễn Thái Ngọc Duy:
> This patch serves as a heads up about a feature I'm working on. I hope
> that by posting it early, people could double check if I have made
> some fundamental mistakes that completely ruin the idea. It's about
> speeding up "git status" by caching untracked file info in the index
> _if_ your file system supports it (more below).
>
> The whole WIP series is at
>
> https://github.com/pclouds/git/commits/untracked-cache
>
> I only post the real meat here. I'm aware of a few incomplete details
> in this patch, but nothing fundamentally wrong. So far the numbers are
> promising. ls-files is updated to run fill_directory() twice in a
> row and "ls-files -o --directory --no-empty-directory --exclude-standard"
> (with gcc -O0) gives me:
>
> first run second (cached) run
> gentoo-x86 500 ms 71.6 ms
> wine 140 ms 9.72 ms
> webkit 125 ms 6.88 ms
IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?
> linux-2.6 106 ms 16.2 ms
>
> Basically untracked time is cut to one tenth in the best case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from index yet. Real commit message follows..
>
>
> read_directory() plays a bit part in the slowness of "git status"
> because it has to read every directory and check for excluded entries,
> which is really expensive. This patch adds an option to cache the
> results so that after the first slow read_directory(), the following
> calls should be cheap and fast.
>
> The following inputs are sufficient to determine what files in a
> directory are excluded:
>
> - The list of files and directories of the direction in question
> - The $GIT_DIR/index
> - The content of $GIT_DIR/info/exclude
> - The content of core.excludesfile
> - The content (or the lack) of .gitignore of all parent directories
> from $GIT_WORK_TREE
>
The dir_struct.flags also play a big role in evaluation of read_directory.
E.g. it seems untracked files are not properly recorded if the cache is filled with '--ignored' option:
> @@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
> break;
>
> case path_untracked:
> - if (!(dir->flags & DIR_SHOW_IGNORED))
> - dir_add_name(dir, path.buf, path.len);
> + if (dir->flags & DIR_SHOW_IGNORED)
> + break;
> + dir_add_name(dir, path.buf, path.len);
> + if (cdir.fdir)
> + add_untracked(untracked, path.buf + baselen);
> break;
Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...
I wonder if it makes sense to separate cache recording logic from read_directory_recursive and friends, which are mainly concerned with flags processing.
> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
>
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
>
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*). If you do not use such a file system, this
> feature is not for you.
>
> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added or removed from a directory, we invalidate that
> directory. This will be done in a later patch.
>
> The remaining inputs are easy, their SHA-1 could be used to verify
> their contents. We do need to read .gitignore files and digest
> them. But they are usually few and small, so the overhead should not
> be much.
>
> At the implementation level, the whole directory structure is saved,
> each directory corresponds to one struct untracked_dir.
With the usual options (e.g. standard 'git status'), untracked directories are mostly skipped, so the cache would mostly store tracked directories. Naming it 'struct untracked_dir' is a bit confusing, IMO.
> Each directory
> holds SHA-1 of the .gitignore underneath (or null if it does not
> exist) and the list of untracked "files" and subdirs that need to
> recurse into if all is well. Untracked subdirectories are saved in the
> file queue and are the reason of quoting "files" in the previous
> sentence.
>
> On the first run, no untracked_dir is valid, the default code path is
> run. prep_exclude() is updated to record SHA-1 of .gitignore along the
> way. read_directory_recursive() is updated to record untracked files.
>
> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies if files/dirs have been added or
> removed. With the help of prep_exclude() to verify .gitignore chain,
> it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in fast path, because a directory mtime does not
> cover all subdirs recursively.
>
> So if all is really well, read_directory() becomes a series of
> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
> stat(<dir>) _without_ heavyweight exclude filtering. There should be
> no overhead if this feature is disabled.
>
Wouldn't mtime of .gitignore files suffice here (so you don't need to open and parse them every time)?
next prev parent reply other threads:[~2014-04-22 9:56 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-04-17 5:51 [RFC] Speed up "git status" by caching untracked file info Nguyễn Thái Ngọc Duy
2014-04-17 19:40 ` Junio C Hamano
2014-04-17 23:27 ` Duy Nguyen
2014-04-22 9:56 ` Karsten Blees [this message]
2014-04-22 10:13 ` Duy Nguyen
2014-04-22 10:35 ` Duy Nguyen
2014-04-22 18:56 ` Karsten Blees
2014-04-23 0:52 ` Duy Nguyen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53563CB2.5030603@gmail.com \
--to=karsten.blees@gmail.com \
--cc=git@vger.kernel.org \
--cc=pclouds@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).