Date: Tue, 9 Jan 2018 02:12:08 -0500
From: Jeff King
To: Derrick Stolee
Cc: Ævar Arnfjörð Bjarmason, git@vger.kernel.org, dstolee@microsoft.com,
    git@jeffhostetler.com, gitster@pobox.com, johannes.schindelin@gmx.de,
    jrnieder@gmail.com
Subject: Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Message-ID: <20180109071208.GB32257@sigill.intra.peff.net>
References: <20180107181459.222909-1-dstolee@microsoft.com>
    <87k1wtb8a4.fsf@evledraar.gmail.com>
    <20180108102029.GA21232@sigill.intra.peff.net>
X-Mailing-List: git@vger.kernel.org

On Mon, Jan 08, 2018 at 08:43:44AM -0500, Derrick Stolee wrote:

> > Just to make sure I'm parsing this correctly: normal lookups do get faster
> > when you have a single index, given the right setup?
> >
> > I'm curious what that setup looked like. Is it just tons and tons of
> > packs? Is it ones where the packs do not follow the MRU patterns very
> > well?
>
> The way I repacked the Linux repo creates an artificially good set of packs
> for the MRU cache. When the packfiles are partitioned instead by the time
> the objects were pushed to a remote, the MRU cache performs poorly.
> Improving these object lookups is a primary reason for the MIDX feature,
> and almost all commands improve because of it. 'git log' is just the
> simplest to use for demonstration.

Interesting. The idea of the pack MRU (and the "last pack looked in"
before it) is that there would be good locality for time-segmented packs
(like those generated by pushes), since most operations also tend to
visit history in chronological-ish order (e.g., log). Tree-wide
operations would be an exception there, though, since files would have
many ages across the tree (though in practice, one would hope that
pretty-ancient history would eventually be replaced in a modern tree).

I've often wondered, though, if the MRU (and "last pack") work mainly
because the "normal" distribution for a repository is to have one big
pack with most of history, and then a couple of smaller ones (what
hasn't been packed yet). Even something as naive as "look in the last
pack" works there, because it turns into "look in the biggest pack". If
it has most of the objects, it's the most likely place for the previous
and the next objects to be.
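
To make the "look in the last pack" intuition concrete, here is a minimal
Python sketch of a move-to-front pack list. It is an illustration only, not
git's implementation: `Pack`, `MruPackList`, and `count_probes` are
hypothetical names, object ids are plain integers, and a "probe" stands in
for one per-pack index lookup. The round-robin access pattern is an
artificial worst case chosen for contrast, not a model of any real
repository.

```python
class Pack:
    """Hypothetical stand-in for a packfile: a named set of object ids."""

    def __init__(self, name, oids):
        self.name = name
        self.oids = frozenset(oids)

    def contains(self, oid):
        # Stand-in for a lookup in this pack's own index.
        return oid in self.oids


class MruPackList:
    """Probe packs in most-recently-used order (move-to-front on a hit)."""

    def __init__(self, packs):
        self.packs = list(packs)
        self.probes = 0  # number of per-pack lookups performed so far

    def find(self, oid):
        for i, pack in enumerate(self.packs):
            self.probes += 1
            if pack.contains(oid):
                if i > 0:
                    # Move the pack that hit to the front of the list.
                    self.packs.insert(0, self.packs.pop(i))
                return pack
        return None


def count_probes(packs, lookups):
    """Return how many per-pack probes a sequence of lookups costs."""
    mru = MruPackList(packs)
    for oid in lookups:
        mru.find(oid)
    return mru.probes


# One big pack plus two small ones; the big pack starts at the back.
skewed = [Pack("small-1", range(90, 95)), Pack("small-2", range(95, 100)),
          Pack("big", range(0, 90))]
skewed_probes = count_probes(skewed, range(0, 90))

# Ten equal-sized packs visited round-robin (a poor fit for move-to-front).
equal = [Pack(f"pack-{n}", range(n * 10, n * 10 + 10)) for n in range(10)]
round_robin = [n * 10 + k for k in range(10) for n in range(10)]
equal_probes = count_probes(equal, round_robin)

print(skewed_probes, equal_probes)
```

In the skewed case the first lookup pulls the big pack to the front and
every later lookup hits on the first probe. In the round-robin case each
lookup ends up probing nearly every pack, which is the cost a single
combined index would avoid.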

But from my understanding of how you handle the Windows repository, you
have tons of equal-sized packs that are never coalesced. Which is quite
a different pattern.

I would expect your "--max-pack-size" thing to end up with something
roughly segmented like pushes, though, just because we do order the
write phase in reverse chronological order. Well, mostly. We do each
object type in its own chunks, and there's some reordering based on
deltas. So given the sizes, it's likely that most of your trees all end
up in a few packs.

> > There may be other reasons to want MIDX or something like it, but I just
> > wonder if we can do this much simpler thing to cover the abbreviation
> > case. I guess the question is whether somebody is going to be annoyed in
> > the off chance that they hit a collision.
>
> Not only are users going to be annoyed when they hit collisions after
> copy-pasting an abbreviated hash, there are also a large number of tools
> that people build that use abbreviated hashes (either for presenting to
> users or because they didn't turn off abbreviations).
>
> Abbreviations cause performance issues in other commands, too (like
> 'fetch'!), so whatever short-circuit you put in, it would need to be global.
> A flag on one builtin would not suffice.

Yeah, I agree it's potentially problematic for a lot of reasons. I was
just hoping we could bound the probabilities in a way that is
acceptable. Maybe it's a fool's errand.

-Peff