From: "Dana How" <danahow@gmail.com>
Subject: Re: [PATCH] Prevent megablobs from gunking up git packs
Date: Wed, 23 May 2007 16:55:27 -0700
To: "Junio C Hamano"
Cc: "Git Mailing List" <git@vger.kernel.org>, danahow@gmail.com
Message-ID: <56b7f5510705231655o589de801w88adc1aa6c18162b@mail.gmail.com>
In-Reply-To: <7v7iqz19d2.fsf@assigned-by-dhcp.cox.net>

On 5/23/07, Junio C Hamano wrote:
> Dana How writes:
> > This patch implements the following:
> > 1. git pack-objects takes a new --max-blob-size=N flag,
> > with the effect that only blobs less than N KB are written
> > to the packfile(s). If a blob was in a pack but violates
> > this limit (perhaps the packs were created by fast-import
> > or max-blob-size was reduced), then a new loose object
> > is written out if needed so the data is not lost.
>
> Why?
>
> I really do not like that "write a new loose object" part
> without proper justification. From your description, I thought
> the most natural way to do this is to pretend you did not hear
> about large objects at all, by rejecting them early, perhaps
> inside add_object_entry() or inside get_object_details() --
> either case you would do sha1_object_info() early instead of
> doing it in check_object().
>
> By the way, is there a fundamental reason that this needs to be
> a "blob size" limit? Wouldn't "max-object-size" be cleaner in
> theory, and work the same way in practice?
I agree with your sentiments. Some nasty details pushed me to implement
it this way. Let me explain them, and perhaps you can come up with a
different combination of solutions.

Each object to be packed can be in one of four states:
{loose or packed} x {small or too big}. Regardless of the
{loose or packed} half, a "small" object can always be placed in the
pack. A loose object which is too big does not require any special
action either -- if we drop it from the pack, we lose nothing, since it
already exists in objects/xx. The {packed x too big} combination is the
problem. As the commit message says, this can happen if the packs came
from fast-import, or if max-blob-size was reduced. We have three
options in this case:

(1) Drop the object (do not put it in the new pack(s)).
(2) Pass the object into the new pack(s).
(3) Write out the object as a new loose object.

Option (1) is unacceptable: when you call git-repack -a, it blindly
deletes all the non-kept packs at the end, so the megablobs would be
lost.

Now suppose we always used Option (2) and Option (3) were never
available. Then once a large object got into a pack, the only way to
get it out would be to explode the pack with git-unpack-objects -- but
that doesn't work, because the objects all already exist packed in this
repository. So we would have to make a temporary repository, explode
the pack there, copy over the loose objects, delete the pack, and then
repack with the corrected max-blob-size. I would claim this is even
uglier and, more importantly, more error-prone than supporting
Option (3) in at least some cases. (Large objects get into a pack
precisely because you realize only after the fact that you need a
max-blob-size, or that you need to decrease it once your pack files
have become unwieldy.)

So I think I've shown we need Option (3) in at least some cases. To
better address _which_ cases, let's move on to your second point: why
did I implement --max-blob-size instead of --max-object-size? I take
the latter to mean using the blob size if an object is undeltified, and
the delta size if it was previously deltified. That is actually what I
do internally, but the total effect over many repackings is what one
would expect from --max-blob-size.

In normal use on a personal repository starting from scratch, all blobs
are first packed from loose objects. When I am deciding whether to
include a loose object, I want to use the object's direct size, not its
deltified size. If I used the latter, I would have to deltify all the
megablobs just to see whether their deltas are small enough; that takes
a long time and is one of the things this patch aims to eliminate. So I
take the list of loose objects, throw out the megablobs, try to deltify
the rest, and write out the pack. When I repack, any deltas are smaller
than the original objects. Hence in any sequence of packings from loose
objects and repackings of packs, in which all packs were created with a
max-blob-size limit that stayed the same or increased over time, there
is no difference between max-blob-size and max-object-size: checking
the loose object size or the in-pack size gives the same pack contents.
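To make the case analysis above concrete, here is a minimal sketch in C
of the per-object decision I have been describing. The type, field, and
enum names are made up for illustration; this is not the actual code in
the patch:

  #include <stdint.h>

  /* Illustrative stand-in for the real per-object bookkeeping. */
  struct candidate {
          uint64_t size;     /* raw size if loose, in-pack size if packed */
          int in_pack;       /* nonzero if currently stored in a pack */
          int has_loose;     /* nonzero if a loose copy already exists */
  };

  enum action {
          ADD_TO_PACK,       /* small enough: always packable */
          SKIP,              /* too big, but a loose copy survives */
          WRITE_LOOSE        /* too big and packed only: Option (3) */
  };

  /* max_blob_size is in bytes here; the flag takes N in KB. */
  static enum action classify(const struct candidate *c,
                              uint64_t max_blob_size)
  {
          if (c->size < max_blob_size)
                  return ADD_TO_PACK;
          if (!c->in_pack || c->has_loose)
                  return SKIP;       /* data survives in objects/xx */
          return WRITE_LOOSE;        /* save it before git-repack -a
                                        deletes the old packs */
  }

In these terms, Option (1) is returning SKIP even when no loose copy
exists (losing the data once git-repack -a deletes the old packs), and
Option (2) is returning ADD_TO_PACK in that same case regardless of
size.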
Now let's say you decrease max-blob-size and repack, or repack from
packs that had no max-blob-size at all (e.g. from fast-import). Here a
clean definition of Option (3) gets harder. You might have a base
object B in an old pack, and a deltified object D based on it, where B
is now larger than the newly decreased max-blob-size while D's
deltified size is still smaller. So you write B out loose, but D can no
longer be stored as a delta and may become bigger as well. In this case
I propose the following: since you decrease max-blob-size when you
decide your repository packs have become "gunked up" with megablobs,
use git-repack -a -f (git-pack-objects --no-object-reuse). Then all
deltas are recomputed from the new, smaller universe of blobs, which
makes sense, and since no deltas are reused, the size limit can only be
applied to actual blob sizes, by the earlier arguments. (Note: I made
sure the current patch loses no data if you decrease max-blob-size and
forget to specify -f -- you just get a few large packed objects
resulting from "lost" deltifications.)

In actuality, the object_entry.size field is checked, so objects _are_
filtered by delta size if deltified. But I went through the (suggested)
usage scenarios above to show that the *result* is a pack limited by
raw object size. --max-object-size is what I implemented, because it is
easier, but in use it acts like --max-blob-size, so that is what I
named it.

Now, concerning Option (2) vs Option (3): I need to write out loose
objects in *some* cases. Right now I always write out a large packed
object as a loose one. It would be reasonable to change this so that it
happens only with --no-object-reuse (i.e. git-repack -f), which is what
you should specify when decreasing max-blob-size anyway. Shall I change
it so that large objects are passed through to the result pack without
--no-object-reuse, and written out as loose objects with
--no-object-reuse? I did not do this because it is a strange
interaction -- just look how long it took me to build up to explaining
it!

> > 2. git repack inspects repack.maxblobsize . If set, its
> > value is passed to git pack-objects on the command line.
> > The user should change repack.maxblobsize , NOT specify
> > --max-blob-size=N .
>
> Why not?

Well, indeed. That's why [PATCH *v2*] lets you specify --max-blob-size
to git-repack, and gives an example of why you might want to do that
[I don't].

> > This patch is on top of the earlier max-pack-size patch,
> > because I thought I needed some behavior it supplied,
> > but could be rebased on master if desired.
>
> Your earlier "split according to max-pack-size" will hopefully be
> on master shortly.

Thanks!
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell