Re: [RFC] Speed up "git status" by caching untracked file info

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Karsten Blees <karsten.blees@gmail.com>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>, git@vger.kernel.org
Subject: Re: [RFC] Speed up "git status" by caching untracked file info
Date: Tue, 22 Apr 2014 11:56:02 +0200	[thread overview]
Message-ID: <53563CB2.5030603@gmail.com> (raw)
In-Reply-To: <1397713918-22829-1-git-send-email-pclouds@gmail.com>

Am 17.04.2014 07:51, schrieb Nguyễn Thái Ngọc Duy:
> This patch serves as a heads up about a feature I'm working on. I hope
> that by posting it early, people could double check if I have made
> some fundamental mistakes that completely ruin the idea. It's about
> speeding up "git status" by caching untracked file info in the index
> _if_ your file system supports it (more below).
> 
> The whole WIP series is at
> 
> https://github.com/pclouds/git/commits/untracked-cache
> 
> I only post the real meat here. I'm aware of a few incomplete details
> in this patch, but nothing fundamentally wrong. So far the numbers are
> promising.  ls-files is updated to run fill_directory() twice in a
> row and "ls-files -o --directory --no-empty-directory --exclude-standard"
> (with gcc -O0) gives me:
> 
>            first run  second (cached) run
> gentoo-x86    500 ms             71.6  ms
> wine          140 ms              9.72 ms
> webkit        125 ms              6.88 ms

IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?

> linux-2.6     106 ms             16.2  ms
> 
> Basically untracked time is cut to one tenth in the best case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from index yet. Real commit message follows..
> 
> 
> read_directory() plays a bit part in the slowness of "git status"
> because it has to read every directory and check for excluded entries,
> which is really expensive. This patch adds an option to cache the
> results so that after the first slow read_directory(), the following
> calls should be cheap and fast.
> 
> The following inputs are sufficient to determine what files in a
> directory are excluded:
> 
>  - The list of files and directories of the direction in question
>  - The $GIT_DIR/index
>  - The content of $GIT_DIR/info/exclude
>  - The content of core.excludesfile
>  - The content (or the lack) of .gitignore of all parent directories
>    from $GIT_WORK_TREE
> 

The dir_struct.flags also play a big role in evaluation of read_directory.

E.g. it seems untracked files are not properly recorded if the cache is filled with '--ignored' option:

> @@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
>  			break;
>  
>  		case path_untracked:
> -			if (!(dir->flags & DIR_SHOW_IGNORED))
> -				dir_add_name(dir, path.buf, path.len);
> +			if (dir->flags & DIR_SHOW_IGNORED)
> +				break;
> +			dir_add_name(dir, path.buf, path.len);
> +			if (cdir.fdir)
> +				add_untracked(untracked, path.buf + baselen);
>  			break;

Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...

I wonder if it makes sense to separate cache recording logic from read_directory_recursive and friends, which are mainly concerned with flags processing.

> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
> 
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
> 
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*). If you do not use such a file system, this
> feature is not for you.
> 
> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added or removed from a directory, we invalidate that
> directory. This will be done in a later patch.
> 
> The remaining inputs are easy, their SHA-1 could be used to verify
> their contents. We do need to read .gitignore files and digest
> them. But they are usually few and small, so the overhead should not
> be much.
> 
> At the implementation level, the whole directory structure is saved,
> each directory corresponds to one struct untracked_dir.

With the usual options (e.g. standard 'git status'), untracked directories are mostly skipped, so the cache would mostly store tracked directories. Naming it 'struct untracked_dir' is a bit confusing, IMO.

> Each directory
> holds SHA-1 of the .gitignore underneath (or null if it does not
> exist) and the list of untracked "files" and subdirs that need to
> recurse into if all is well. Untracked subdirectories are saved in the
> file queue and are the reason of quoting "files" in the previous
> sentence.
> 
> On the first run, no untracked_dir is valid, the default code path is
> run. prep_exclude() is updated to record SHA-1 of .gitignore along the
> way. read_directory_recursive() is updated to record untracked files.
> 
> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies if files/dirs have been added or
> removed. With the help of prep_exclude() to verify .gitignore chain,
> it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in fast path, because a directory mtime does not
> cover all subdirs recursively.
> 
> So if all is really well, read_directory() becomes a series of
> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
> stat(<dir>) _without_ heavyweight exclude filtering. There should be
> no overhead if this feature is disabled.
> 

Wouldn't mtime of .gitignore files suffice here (so you don't need to open and parse them every time)?

next prev parent reply	other threads:[~2014-04-22  9:56 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-17  5:51 [RFC] Speed up "git status" by caching untracked file info Nguyễn Thái Ngọc Duy
2014-04-17 19:40 ` Junio C Hamano
2014-04-17 23:27   ` Duy Nguyen
2014-04-22  9:56 ` Karsten Blees [this message]
2014-04-22 10:13   ` Duy Nguyen
2014-04-22 10:35     ` Duy Nguyen
2014-04-22 18:56       ` Karsten Blees
2014-04-23  0:52         ` Duy Nguyen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53563CB2.5030603@gmail.com \
    --to=karsten.blees@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).