From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff King Subject: Re: inotify to minimize stat() calls Date: Wed, 13 Feb 2013 13:18:51 -0500 Message-ID: <20130213181851.GA5603@sigill.intra.peff.net> References: <20130210111732.GA24377@lanh> <511AAA92.4030508@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Cc: blees@dcon.de, kusmabite@gmail.com, Ramkumar Ramachandra , Robert Zeh , Junio C Hamano , Git List , finnag@pvv.org To: Duy Nguyen X-From: git-owner@vger.kernel.org Wed Feb 13 19:19:34 2013 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1U5gvd-0003S5-EG for gcvg-git-2@plane.gmane.org; Wed, 13 Feb 2013 19:19:33 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934533Ab3BMSS7 (ORCPT ); Wed, 13 Feb 2013 13:18:59 -0500 Received: from 75-15-5-89.uvs.iplsin.sbcglobal.net ([75.15.5.89]:48160 "EHLO peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759463Ab3BMSS4 (ORCPT ); Wed, 13 Feb 2013 13:18:56 -0500 Received: (qmail 13730 invoked by uid 107); 13 Feb 2013 18:20:22 -0000 Received: from sigill.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.7) (smtp-auth username relayok, mechanism cram-md5) by peff.net (qpsmtpd/0.84) with ESMTPA; Wed, 13 Feb 2013 13:20:22 -0500 Received: by sigill.intra.peff.net (sSMTP sendmail emulation); Wed, 13 Feb 2013 13:18:51 -0500 Content-Disposition: inline In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Wed, Feb 13, 2013 at 07:15:47PM +0700, Nguyen Thai Ngoc Duy wrote: > On Wed, Feb 13, 2013 at 3:48 AM, Karsten Blees wrote: > > 2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories, reindexing the same directories over and over again. In the end, the hashtable contains 939k directory entries, even though the WebKit test repo only has 7k directories. Checking if a directory entry already exists could reduce that, i.e.: > > This function is only used when core.ignorecase = true. I probably > won't be able to test this, so I'll leave this to other people who > care about ignorecase. > > This function used to have lookup_hash, but it was removed by Jeff in > 2548183 (fix phantom untracked files when core.ignorecase is set - > 2011-10-06). There's a looong commit message which I'm too lazy to > read. Anybody who works on this should though. Yeah, the problem that commit tried to solve is that linking to a single cache entry through the hash is not enough, because we may remove cache items. Imagine you have "dir/one" and "dir/two", and you add them to the in-memory index in that order. The original code hashed "dir/" and inserted a link to the "dir/one" cache entry. When it came time to put in the "dir/two" entry, we noticed that there was already a "dir/" entry and did nothing. Then later, if we remove "dir/one", we do so by marking it with CE_UNHASHED. So a later query for "dir/" will see "nope, nothing here that wasn't CE_UNHASHED", which is wrong. We never recorded that "dir/two" existed under the hash for "dir/", so we can't know about it. My patch just stores the cache_entry for both under the "dir/" hash. As Karsten noticed, that can lead to a large number of hash entries, because adding "some/deep/hierarchy/with/files" will add 4 directory entries for just that single file. Moreover, looking at it again, I don't think my patch produces the right behavior: we have a single dir_next pointer, even though the same ce_entry may appear under many directory hashes. So the cache_entries that has to "dir/foo/" and those that hash to "dir/bar/" may get confused, because they will also both be found under "dir/", and both try to create a linked list from the dir_next pointer. Looking at Karsten's patch, it seems like it will not add a cache entry if there is one of the same name. But I'm not sure if that is right, as the old one might be CE_UNHASHED (or it might get removed later). You actually want to be able to find each cache_entry that has a file under the directory at the hash of that directory, so you can make sure it is still valid. And of course that still leaves the existing correctness problem I mentioned above. I think the best way forward is to actually create a separate hash table for the directory lookups. I note that we only care about these entries in directory_exists_in_index_icase, which is really about whether something is there, versus what exactly is there. So could we maybe get by with a separate hash table that stores a count of entries at each directory, and increment/decrement the count when we add/remove entries? The biggest problem I see with that is that we do indeed care a little bit what is at the directory: we check the mode to see if it is a gitdir or not. But I think we can maybe sneak around that: gitdirs have actual entries in the index, whereas the directories do not. So we would find them via index_name_exists; anything that is not there, but _is_ in the special directory hash would therefore be a directory. I realize it got pretty esoteric there in the middle. I'll see if I can work up a patch that expresses what I'm thinking. -Peff