All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Andreas Krey <a.krey@gmx.de>
Cc: Michael Haggerty <mhagger@alum.mit.edu>,
	Junio C Hamano <gitster@pobox.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH] refs.c: get_ref_cache: use a bucket hash
Date: Mon, 16 Mar 2015 22:40:05 -0400	[thread overview]
Message-ID: <20150317024005.GA26313@peff.net> (raw)
In-Reply-To: <20150316184040.GA8902@inner.h.apk.li>

[+cc Michael for get_ref_cache wisdom]

On Mon, Mar 16, 2015 at 07:40:40PM +0100, Andreas Krey wrote:

> >I am guessing that the repository has tons
> > of submodules?
> 
> Not a single one. Thats's thie interesting thing that
> makes me think I'm not actually solving the right problem.
> 
> This repo has about 100k subdirectories that are ignored
> (I don't know whether directly or within ignored dirs),
> and strace said that git looks for '.git/HEAD' and one
> other file in each of these. Apparently it trieds to
> find out if any of these dirs happen to be a git repo
> which git clean treats specially, but it seems it also
> calls get_ref_cache for each of these dires even though
> the turn out not to be a sub-repo.
> 
> In other words: I suspect that get_ref_cache shouldn't
> be called that often, or that the cache entries should
> be removed once a directory is found not to be a sub repo.
> Then the linear list wouldn't really hurt.

Yeah, I'd agree.

The get_ref_cache code was designed to scale to the actual number of
submodules. I do not mind seeing it become a hash if people really do
have a large number of submodules, but that is not what is happening
here.

Bisecting, it looks like things got slow for your case starting in
f538a91 (git-clean: Display more accurate delete messages, 2013-01-11).
I reproduced with basically:

  git init
  for i in $(seq 30000); do mkdir $i; done
  time git clean -nd >/dev/null

It jumps in that commit from ~50ms to ~3000ms.

A backtrace from get_ref_cache shows:

  #0  get_ref_cache (submodule=0xa6a4f0 "1") at refs.c:1070
  #1  0x0000000000516469 in resolve_gitlink_ref (path=0xa6a4d0 "1/", refname=0x584822 "HEAD", 
      sha1=0x7fffffffde90 "\002") at refs.c:1429
  #2  0x0000000000423584 in remove_dirs (path=0x7fffffffe2f0, prefix=0x0, force_flag=2, dry_run=1, quiet=0, 
      dir_gone=0x7fffffffe314) at builtin/clean.c:164
  #3  0x00000000004255a9 in cmd_clean (argc=0, argv=0x7fffffffe5e0, prefix=0x0) at builtin/clean.c:981
  #4  0x0000000000405554 in run_builtin (p=0x7f7b18 <commands+408>, argc=2, argv=0x7fffffffe5e0) at git.c:348
  #5  0x0000000000405761 in handle_builtin (argc=2, argv=0x7fffffffe5e0) at git.c:530
  #6  0x000000000040587d in run_argv (argcp=0x7fffffffe4cc, argv=0x7fffffffe4d8) at git.c:576
  #7  0x0000000000405a6e in main (argc=2, av=0x7fffffffe5d8) at git.c:685

So git-clean speculatively asks "what is HEAD in this maybe-submodule?". The
right solution is probably one of:

  1. In remove_dirs, find out if we have an actual submodule before calling
     resolve_gitlink_ref.

  2. Teach get_ref_cache a "read-only" mode that will not auto-vivify the cache
     if it does not already exist.

Of the two, I think (1) is probably cleaner (I think the way the ref
code is structured, we have to create the submodule ref_cache in order
to start looking things up in it).

It looks like we don't even really care about the value of HEAD. We just
want to know "is it a git directory?". I think in other places (like
"git add"), we just do an existence check for "$dir/.git". That would
not catch a bare repository, but I do not think the current check does
either (it is looking for submodules, which always have a .git).

Maybe something like (largely untested):

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..e2cc47b 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -148,6 +148,32 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset)
 	return 0;
 }
 
+static int dir_is_repo(struct strbuf *path)
+{
+	size_t orig = path->len;
+	int ret;
+
+	strbuf_addstr(path, "/.git");
+	if (!access(path->buf, F_OK))
+		ret = 1; /* definitely */
+	else if (errno == ENOENT)
+		ret = 0; /* definitely not */
+	else {
+		/*
+		 * We couldn't tell. It would probably be safer to err
+		 * on the side of saying "yes" here, because we are
+		 * deciding what to delete, and are more likely to keep
+		 * a sub-repo. But it would probably also create annoying
+		 * false positives, where a directory we do not have
+		 * permission to read would say something misleading
+		 * like "not deleting sub-repo foo..."
+		 */
+		ret = 0;
+	}
+	strbuf_setlen(path, orig);
+	return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
 		int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +181,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
 	struct strbuf quoted = STRBUF_INIT;
 	struct dirent *e;
 	int res = 0, ret = 0, gone = 1, original_len = path->len, len;
-	unsigned char submodule_head[20];
 	struct string_list dels = STRING_LIST_INIT_DUP;
 
 	*dir_gone = 1;
 
-	if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
-			!resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) {
+	if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) && dir_is_repo(path)) {
 		if (!quiet) {
 			quote_path_relative(path->buf, prefix, &quoted);
 			printf(dry_run ?  _(msg_would_skip_git_dir) : _(msg_skip_git_dir),

  reply	other threads:[~2015-03-17  2:40 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-16 14:20 [PATCH] refs.c: get_ref_cache: use a bucket hash Andreas Krey
2015-03-16 17:19 ` Thomas Gummerer
2015-03-16 17:23 ` Junio C Hamano
2015-03-16 18:40   ` Andreas Krey
2015-03-17  2:40     ` Jeff King [this message]
2015-03-17  5:35       ` Junio C Hamano
2015-03-17  5:48         ` Jeff King
2015-11-13 15:29           ` Andreas Krey
2015-11-14  0:01             ` Jeff King
2015-11-14 13:22               ` Andreas Krey
2015-11-14 13:35               ` Andreas Krey
2015-11-16 16:31                 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150317024005.GA26313@peff.net \
    --to=peff@peff.net \
    --cc=a.krey@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.