Date: Mon, 9 Jan 2017 16:01:19 +0900
From: Mike Hommey
To: Jeff King
Cc: git@vger.kernel.org
Subject: Re: Preserve/Prune Old Pack Files
Message-ID: <20170109070119.lite2o7k3t2wuvtt@glandium.org>
References: <24abd0ed58c25ce832014f9bd5bb2090@codeaurora.org>
 <5172470.bsscxDU4yv@mfick1-lnx>
 <20170109062137.zghmurndlbts5x44@sigill.intra.peff.net>
In-Reply-To: <20170109062137.zghmurndlbts5x44@sigill.intra.peff.net>

On Mon, Jan 09, 2017 at 01:21:37AM -0500, Jeff King wrote:
> On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick wrote:
>
> > I am replying to this email across lists because I wanted to
> > highlight to the git community this jgit change to repacking
> > that we have up for review
> >
> >  https://git.eclipse.org/r/#/c/87969/
> >
> > This change introduces a new convention for how to preserve
> > old pack files in a staging area
> > (.git/objects/packs/preserved) before deleting them.  I
> > wanted to ensure that the new proposed convention would be
> > done in a way that would be satisfactory to the git
> > community as a whole so that it would be more easy to
> > provide the same behavior in git eventually.  The preserved
> > pack files (and accompanying index and bitmap files), are not
> > only moved, but they are also renamed so that they no longer
> > will match recursive finds looking for pack files.
>
> It looks like objects/pack/pack-123.pack becomes
> objects/pack/preserved/pack-123.old-pack, and so forth.
> Which seems reasonable, and I'm happy that:
>
>   find objects/pack -name '*.pack'
>
> would not find it. :)
>
> I suspect the name-change will break a few tools that you might want to
> use to look at a preserved pack (like verify-pack).  I know that's not
> your primary use case, but it seems plausible that somebody may one day
> want to use a preserved pack to try to recover from corruption. I think
> "git index-pack --stdin could always be a last-resort for re-admitting
> the objects to the repository.
>
> I notice this doesn't do anything for loose objects. I think they
> technically suffer the same issue, though the race window is much
> shorter (we mmap them and zlib inflate immediately, whereas packfiles
> may stay mapped across many object requests).
>
> I have one other thought that's tangentially related.
>
> I've wondered if we could make object pruning more atomic by
> speculatively moving items to be deleted into some kind of "outgoing"
> object area. Right now you can have a case like:
>
>   0. We have a pack that has commit X, which is reachable, and commit Y,
>      which is not.
>
>   1. Process A is repacking. It walks the object graph and finds that X
>      is reachable. It begins creating a new pack with X and its
>      dependent objects.
>
>   2. Meanwhile, process B pushes up a merge of X and Y, and updates a
>      ref to point to it.
>
>   3. Process A finishes writing the new pack, and deletes the old one,
>      removing Y. The repository is now corrupt.
>
> I don't have a solution here. I don't think we want to solve it by
> locking the repository for updates during a repack. I have a vague sense
> that a solution could be crafted around moving the old pack into a
> holding area instead of deleting (during which time nobody else would
> see the objects, and thus not reference them), while the repacking
> process checks to see if the actual deletion would break any references
> (and rolls back the deletion if it would).
>
> That's _way_ more complicated than your problem, and as I said, I do not
> have a finished solution. But it seems like they touch on a similar
> concept (a post-delete holding area for objects). So I thought I'd
> mention it in case it spurs any brilliance.

Something that is kind-of in the same family of problems is the
"loosening" of objects on repacks, before they can be pruned.

When you have a large repository and do large rewrite operations
(extreme case, a filter-branch over several hundred thousand commits),
and you gc for the first time, git will possibly create a *lot* of
loose objects, each of which will consume an inode and a file system
block. In the worst case, you can end up with git gc filling up
multiple extra gigabytes on your disk.

Mike
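
To put numbers on the inode and block cost of loose objects, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the function name loose_object_usage is made up for this
example, it treats every regular file under the two-hex-digit fan-out
directories of objects/ as a loose object, and it relies on st_blocks
being reported in 512-byte units, which holds on POSIX systems but is
not available on Windows. "git count-objects -v" reports similar
figures and is probably the better tool in practice.

```python
import os
import re
import sys

# Loose objects live in objects/xx/, where xx is two lowercase hex digits.
_FANOUT = re.compile(r"^[0-9a-f]{2}$")


def loose_object_usage(objects_dir):
    """Return (count, apparent_bytes, allocated_bytes) for loose objects.

    count doubles as the number of inodes consumed, since each loose
    object is its own file.
    """
    count = apparent = allocated = 0
    for name in sorted(os.listdir(objects_dir)):
        sub = os.path.join(objects_dir, name)
        if not (_FANOUT.match(name) and os.path.isdir(sub)):
            continue  # skips pack/, info/ and anything else
        for entry in os.scandir(sub):
            if not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)
            count += 1
            apparent += st.st_size
            # Assumption: st_blocks counts 512-byte units (POSIX).
            # Falls back to 0 where the field does not exist.
            allocated += getattr(st, "st_blocks", 0) * 512
    return count, apparent, allocated


if __name__ == "__main__":
    objects = sys.argv[1] if len(sys.argv) > 1 else ".git/objects"
    n, size, disk = loose_object_usage(objects)
    print(f"{n} loose objects, {size} bytes of data, {disk} bytes on disk")
```

Running it before and after the first gc following a big rewrite shows
how far the on-disk figure drifts from the actual data size, because
each small object still occupies at least one file system block.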