Re: [RFC] Speed up "git status" by caching untracked file info

From: Karsten Blees <karsten.blees@gmail.com>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>, git@vger.kernel.org
Subject: Re: [RFC] Speed up "git status" by caching untracked file info
Date: Tue, 22 Apr 2014 11:56:02 +0200
Message-ID: <53563CB2.5030603@gmail.com>
In-Reply-To: <1397713918-22829-1-git-send-email-pclouds@gmail.com>

On 17.04.2014 07:51, Nguyễn Thái Ngọc Duy wrote:
> This patch serves as a heads up about a feature I'm working on. I hope
> that by posting it early, people could double check if I have made
> some fundamental mistakes that completely ruin the idea. It's about
> speeding up "git status" by caching untracked file info in the index
> _if_ your file system supports it (more below).
> 
> The whole WIP series is at
> 
> https://github.com/pclouds/git/commits/untracked-cache
> 
> I only post the real meat here. I'm aware of a few incomplete details
> in this patch, but nothing fundamentally wrong. So far the numbers are
> promising.  ls-files is updated to run fill_directory() twice in a
> row and "ls-files -o --directory --no-empty-directory --exclude-standard"
> (with gcc -O0) gives me:
> 
>            first run  second (cached) run
> gentoo-x86    500 ms             71.6  ms
> wine          140 ms              9.72 ms
> webkit        125 ms              6.88 ms

IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?

> linux-2.6     106 ms             16.2  ms
> 
> Basically untracked time is cut to one tenth in the best case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from index yet. Real commit message follows..
> 
> 
> read_directory() plays a big part in the slowness of "git status"
> because it has to read every directory and check for excluded entries,
> which is really expensive. This patch adds an option to cache the
> results so that after the first slow read_directory(), the following
> calls should be cheap and fast.
> 
> The following inputs are sufficient to determine what files in a
> directory are excluded:
> 
>  - The list of files and directories of the directory in question
>  - The $GIT_DIR/index
>  - The content of $GIT_DIR/info/exclude
>  - The content of core.excludesfile
>  - The content (or the lack) of .gitignore of all parent directories
>    from $GIT_WORK_TREE
> 

The dir_struct.flags also play a big role in how read_directory is evaluated.

E.g. it seems untracked files are not properly recorded if the cache is filled with the '--ignored' option:

> @@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
>  			break;
>  
>  		case path_untracked:
> -			if (!(dir->flags & DIR_SHOW_IGNORED))
> -				dir_add_name(dir, path.buf, path.len);
> +			if (dir->flags & DIR_SHOW_IGNORED)
> +				break;
> +			dir_add_name(dir, path.buf, path.len);
> +			if (cdir.fdir)
> +				add_untracked(untracked, path.buf + baselen);
>  			break;

Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...

I wonder if it makes sense to separate cache recording logic from read_directory_recursive and friends, which are mainly concerned with flags processing.
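
To make the flag dependency concrete, here is a minimal sketch of one possible approach: store the flags a scan ran with next to the cached result, and refuse to reuse the result when the current flags differ. All names (scan_cache, the SCAN_* bits, scan_cache_usable) are hypothetical and not taken from the patch; the bits only stand in for the relevant dir_struct.flags.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the dir_struct.flags bits that change scan results. */
enum {
	SCAN_SHOW_IGNORED    = 1 << 0, /* corresponds to '--ignored' */
	SCAN_SHOW_OTHER_DIRS = 1 << 1, /* corresponds to '--directory' */
	SCAN_HIDE_EMPTY_DIRS = 1 << 2  /* corresponds to '--no-empty-directory' */
};

/* Hypothetical cache header: remembers which flags produced the cached data. */
struct scan_cache {
	bool valid;              /* has a scan been recorded at all? */
	unsigned recorded_flags; /* flags in effect when it was recorded */
};

/* Mark the cache as filled by a scan that ran with 'flags'. */
static void scan_cache_record(struct scan_cache *c, unsigned flags)
{
	c->valid = true;
	c->recorded_flags = flags;
}

/* The cached result is only trusted for a scan with exactly the same flags. */
static bool scan_cache_usable(const struct scan_cache *c, unsigned flags)
{
	return c->valid && c->recorded_flags == flags;
}
```

An exact match is the most conservative policy. A cache recorded without '--directory' could probably serve more kinds of requests, but that would need the recording logic to be independent of the flags, as suggested above.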

> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
> 
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
> 
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*). If you do not use such a file system, this
> feature is not for you.
> 
> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added or removed from a directory, we invalidate that
> directory. This will be done in a later patch.
> 
> The remaining inputs are easy: their SHA-1 could be used to verify
> their contents. We do need to read .gitignore files and digest
> them. But they are usually few and small, so the overhead should not
> be much.
> 
> At the implementation level, the whole directory structure is saved,
> each directory corresponds to one struct untracked_dir.

With the usual options (e.g. standard 'git status'), untracked directories are mostly skipped, so the cache would mostly store tracked directories. Naming it 'struct untracked_dir' is a bit confusing, IMO.

> Each directory
> holds SHA-1 of the .gitignore underneath (or null if it does not
> exist) and the list of untracked "files" and subdirs that need to
> recurse into if all is well. Untracked subdirectories are saved in the
> file queue, which is why "files" is quoted in the previous
> sentence.
> 
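
For readers following along, the per-directory record described above could look roughly like the sketch below. This is an illustration under assumptions, not the struct from the patch: the name dir_cache_entry, its fields, and the helper functions are all hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define SHA1_RAWSZ 20 /* size of a binary SHA-1 */

/* Hypothetical per-directory cache entry; names are illustrative only. */
struct dir_cache_entry {
	unsigned char ignore_sha1[SHA1_RAWSZ]; /* all zero: no .gitignore here */
	char **untracked;                      /* untracked "files", may include dirs */
	size_t untracked_nr, untracked_alloc;
	struct dir_cache_entry **subdirs;      /* directories to recurse into */
	size_t subdirs_nr;
	char *name;                            /* directory name */
};

/* Duplicate a string with malloc(); returns NULL on allocation failure. */
static char *copy_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *p = malloc(len);
	if (p)
		memcpy(p, s, len);
	return p;
}

/* Allocate an empty entry; calloc() leaves ignore_sha1 zeroed (no .gitignore). */
static struct dir_cache_entry *dir_cache_entry_new(const char *name)
{
	struct dir_cache_entry *e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	e->name = copy_string(name);
	if (!e->name) {
		free(e);
		return NULL;
	}
	return e;
}

/* Append one untracked name; returns 0 on success, -1 on allocation failure. */
static int dir_cache_add_untracked(struct dir_cache_entry *e, const char *name)
{
	char *copy;

	if (e->untracked_nr == e->untracked_alloc) {
		size_t n = e->untracked_alloc ? 2 * e->untracked_alloc : 8;
		char **p = realloc(e->untracked, n * sizeof(*p));
		if (!p)
			return -1;
		e->untracked = p;
		e->untracked_alloc = n;
	}
	copy = copy_string(name);
	if (!copy)
		return -1;
	e->untracked[e->untracked_nr++] = copy;
	return 0;
}
```
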
> On the first run, no untracked_dir is valid, the default code path is
> run. prep_exclude() is updated to record SHA-1 of .gitignore along the
> way. read_directory_recursive() is updated to record untracked files.
> 
> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies if files/dirs have been added or
> removed. With the help of prep_exclude() to verify .gitignore chain,
> it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in fast path, because a directory mtime does not
> cover all subdirs recursively.
> 
> So if all is really well, read_directory() becomes a series of
> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
> stat(<dir>) _without_ heavyweight exclude filtering. There should be
> no overhead if this feature is disabled.
> 

Wouldn't mtime of .gitignore files suffice here (so you don't need to open and parse them every time)?
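
As a rough sketch of what I mean (hypothetical names throughout, nothing here is from the patch): keep a small stat snapshot per .gitignore and compare it on the next run, falling back to reading and hashing only on a mismatch.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical stat snapshot of one .gitignore file. */
struct ignore_stamp {
	bool exists;
	time_t mtime;
	off_t size;
	ino_t ino;
};

/* Record the current stat data of 'path'. */
static void ignore_stamp_take(struct ignore_stamp *s, const char *path)
{
	struct stat st;

	memset(s, 0, sizeof(*s));
	if (stat(path, &st) < 0)
		return; /* any stat failure is treated as "no .gitignore here" */
	s->exists = true;
	s->mtime = st.st_mtime;
	s->size = st.st_size;
	s->ino = st.st_ino;
}

/* True if 'path' still looks the way it did when 'old' was taken. */
static bool ignore_stamp_matches(const struct ignore_stamp *old, const char *path)
{
	struct ignore_stamp now;

	ignore_stamp_take(&now, path);
	if (old->exists != now.exists)
		return false;
	if (!old->exists)
		return true; /* still missing */
	return old->mtime == now.mtime &&
	       old->size == now.size &&
	       old->ino == now.ino;
}
```

The caveat is the same as for index entries: with one-second mtime resolution, an edit that keeps the size and lands in the same second as the snapshot would go unnoticed, so a real implementation would likely need some racy-timestamp handling on top of this.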
