From: Junio C Hamano <gitster@pobox.com>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC] Speed up "git status" by caching untracked file info
Date: Thu, 17 Apr 2014 12:40:11 -0700
Message-ID: <xmqqy4z3d9t0.fsf@gitster.dls.corp.google.com>
In-Reply-To: <1397713918-22829-1-git-send-email-pclouds@gmail.com> ("Nguyễn Thái Ngọc Duy"'s message of "Thu, 17 Apr 2014 12:51:58 +0700")

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

>            first run  second (cached) run
> gentoo-x86    500 ms             71.6  ms
> wine          140 ms              9.72 ms
> webkit        125 ms              6.88 ms
> linux-2.6     106 ms             16.2  ms
>
> Basically untracked time is cut to one tenth in the best case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from index yet. Real commit message follows.

As you allude to later with "if you recompile a single file, the
whole hierarchy in that directory is lost", two back-to-back runs of
"git status" are not very interesting.

>  - The list of files and directories of the directory in question
>  - The $GIT_DIR/index
>  - The content of $GIT_DIR/info/exclude
>  - The content of core.excludesfile
>  - The content (or the lack) of .gitignore of all parent directories
>    from $GIT_WORK_TREE
>
> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
>
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
>
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*).

The important thing is that the mechanism can detect the creation of
new cruft or the deletion of existing cruft without any false
negatives, and the mtime of the directory would be a good way to
check that.
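
To make the directory mtime check concrete, here is a rough sketch
(not code from the patch).  The names dir_stamp, dir_stamp_record(),
dir_stamp_matches() and dir_needs_rescan() are made up for
illustration, and comparing st_mtime at one-second granularity is an
assumption of the sketch: cruft created within the same second as the
scan would be missed, so a real implementation would have to deal
with that racy case to avoid false negatives.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical cache record: what we knew about a directory at scan time. */
struct dir_stamp {
	time_t mtime;	/* directory mtime seen at the last full scan */
	int valid;	/* 0 means "no usable cached information" */
};

/* Does the cached stamp still match the stat data we just read? */
int dir_stamp_matches(const struct dir_stamp *cached, const struct stat *st)
{
	assert(cached && st);
	return cached->valid &&
	       S_ISDIR(st->st_mode) &&
	       cached->mtime == st->st_mtime;
}

/* Record the current mtime of "path"; returns 0 on success, -1 on error. */
int dir_stamp_record(const char *path, struct dir_stamp *out)
{
	struct stat st;

	assert(path && out);
	out->valid = 0;
	if (stat(path, &st) < 0 || !S_ISDIR(st.st_mode))
		return -1;
	out->mtime = st.st_mtime;
	out->valid = 1;
	return 0;
}

/*
 * Returns 1 when the directory has to be read again with readdir(),
 * 0 when the cached list of untracked entries may be reused.
 * Any doubt (stat failure, stale stamp) is treated as "rescan".
 */
int dir_needs_rescan(const char *path, const struct dir_stamp *cached)
{
	struct stat st;

	assert(path && cached);
	if (stat(path, &st) < 0)
		return 1;
	return !dir_stamp_matches(cached, &st);
}
```

The point of the sketch is only that every uncertain case falls on
the "rescan" side, so a stale cache costs time but never hides cruft.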

> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added or removed from a directory, we invalidate that
> directory. This will be done in a later patch.

I would imagine that it would be done at the same places as we
invalidate cache-trees, with the same "invalidation percolates up"
logic.
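
By "invalidation percolates up" I mean something along these lines.
This is a sketch only; struct cached_dir and the two functions are
hypothetical names, not the existing cache-tree API, and whether the
untracked cache really needs to invalidate every ancestor is an
assumption carried over from how cache-trees behave.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-directory cache node with a link to its parent. */
struct cached_dir {
	struct cached_dir *parent;	/* NULL for the top of the work tree */
	int valid;			/* cached untracked info is usable */
};

/* Attach a freshly scanned directory under "parent" and mark it valid. */
void cached_dir_init(struct cached_dir *dir, struct cached_dir *parent)
{
	assert(dir);
	dir->parent = parent;
	dir->valid = 1;
}

/*
 * Called when an index entry is added to or removed from "dir":
 * mark the directory and every directory above it as invalid.
 */
void cached_dir_invalidate(struct cached_dir *dir)
{
	for (; dir; dir = dir->parent)
		dir->valid = 0;
}
```

Sibling directories that were not touched keep their valid bit, which
is what makes the cache worth having in a large work tree.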

> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies if files/dirs have been added or
> removed.

Hmph.  If you have a two-level hierarchy D1/D2 and you change the
list of cruft in D2 but not in D1, the mtime of D1/D2 changes but
not the mtime of D1, as you observed below.

> With the help of prep_exclude() to verify .gitignore chain,
> it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in fast path, because a directory mtime does not
> cover all subdirs recursively.

I wonder if you can avoid recursing into D1 when no cached mtime
(and .gitignore) information has changed in any subdirectory of it
(e.g. both D1 and D1/D2 match the cache).
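
What I have in mind is roughly the following; treat it as a sketch
under assumptions, not a design.  It walks only the cached tree and
does one stat() per cached directory, without any readdir().  The
struct layout and the names current_dir_mtime() and
subtree_unchanged() are invented here, and the .gitignore validation
that would also be needed is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical cached node: one per directory seen in the last scan. */
struct cached_dir {
	const char *path;		/* path of the directory */
	time_t cached_mtime;		/* its mtime at the last scan */
	struct cached_dir **subdirs;	/* cached subdirectories */
	size_t nr_subdirs;
};

/* Current mtime of a directory, or (time_t)-1 if it cannot be used. */
time_t current_dir_mtime(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0 || !S_ISDIR(st.st_mode))
		return (time_t)-1;
	return st.st_mtime;
}

/*
 * Returns 1 only if "dir" and every cached directory below it still
 * have the mtime recorded in the cache.  In that case the caller
 * could reuse the cached untracked list for the whole subtree
 * instead of recursing with readdir().
 */
int subtree_unchanged(const struct cached_dir *dir)
{
	time_t now;
	size_t i;

	assert(dir);
	now = current_dir_mtime(dir->path);
	if (now == (time_t)-1 || now != dir->cached_mtime)
		return 0;
	for (i = 0; i < dir->nr_subdirs; i++)
		if (!subtree_unchanged(dir->subdirs[i]))
			return 0;
	return 1;
}
```

This still costs one stat() per cached directory, so the saving, if
any, is in skipping the readdir() and the exclude processing for D1
and everything below it.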
