From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff King Subject: Re: Commit cache to speed up rev-list and merge Date: Thu, 27 Sep 2012 14:32:44 -0400 Message-ID: <20120927183244.GB2519@sigill.intra.peff.net> References: <20120927173932.GE1547@sigill.intra.peff.net> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Cc: Nguyen Thai Ngoc Duy , git , Colby Ranger To: Shawn Pearce X-From: git-owner@vger.kernel.org Thu Sep 27 20:32:59 2012 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1THItM-00073k-GC for gcvg-git-2@plane.gmane.org; Thu, 27 Sep 2012 20:32:56 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754124Ab2I0Scr (ORCPT ); Thu, 27 Sep 2012 14:32:47 -0400 Received: from 75-15-5-89.uvs.iplsin.sbcglobal.net ([75.15.5.89]:32985 "EHLO peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753011Ab2I0Scq (ORCPT ); Thu, 27 Sep 2012 14:32:46 -0400 Received: (qmail 4445 invoked by uid 107); 27 Sep 2012 18:33:15 -0000 Received: from sigill.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.7) (smtp-auth username relayok, mechanism cram-md5) by peff.net (qpsmtpd/0.84) with ESMTPA; Thu, 27 Sep 2012 14:33:15 -0400 Received: by sigill.intra.peff.net (sSMTP sendmail emulation); Thu, 27 Sep 2012 14:32:44 -0400 Content-Disposition: inline In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Thu, Sep 27, 2012 at 10:45:32AM -0700, Shawn O. Pearce wrote: > On 2012-08-12 Nguyen Thai Ngoc Duy wrote: > > Long term we might gain slight lookup speedup if we know object type > > as search region is made smaller. But for that to happen, we need to > > propagate object type hint down to find_pack_entry_one() and friends. > > Possible thing to do, I think. > > I'm not sure reclustering the index by object type is going to make a > worthwhile difference. Of 2.2m objects in the Linux tree, 320k are > commits. The difference between doing the binary search through all > objects vs. just commits is only 2 iterations more of binary search if > we assume the per-type ranges have their own fan-out tables. To me the big win would be implicit indexing for items that are present for every instance of a particular object type. So if we wanted to keep the timestamp for every commit, you could have a "pack-*.timestamps" that is literally just a packed list of uint32's, one per commit, where the position of a commit's timestamp in the list is the same as its position in the index of sha1s in the pack index. That's simple to do if your index is just commits. But if it includes all objects, then your list is sparse. So either you waste space by making an empty slot for the non-commit objects, or you have an extra level of indirection mapping the commit into the packed list, which is going to double the storage in this case (though you could reuse that extra mapping for the parent, generation number, etc, so it at least gets amortized as you store more data). Or is there some clever solution I'm missing? For your extension, I don't think it matters. You're sparse even in the commit-object space, so you have to store the mapping anyway. And your data is big enough that the overhead isn't too painful. -Peff