From: Avery Pennarun
Subject: Re: Multiblobs
Date: Wed, 28 Apr 2010 17:27:32 -0400
To: Sergio Callegari
Cc: git@vger.kernel.org

On Wed, Apr 28, 2010 at 3:13 PM, Sergio Callegari wrote:
> Avery Pennarun gmail.com> writes:
>> I'm not sure it would help very much for these sorts of files.  The
>> problem is that compressed files tend to change a lot even if only a
>> few bytes of the original data have changed.
>
> Probably I have not provided enough elements... My idea is the following:
>
> If you store a structured file as a multiblob, you can use a blob for each
> uncompressed element of content.  For instance, when storing an opendocument
> file you could use a blob for manifest.xml, one for content.xml, etc...
> (try unzip -l on an odt or odp file to get an idea).  When you edit your
> file, only a few of these change.  For instance, if we talk about a
> presentation, each slide has its own content.xml, so changing one slide
> changes only that blob.

But why not use a .gitattributes filter to recompress the zip/odp file
with no compression, as I suggested?  Then you can just dump the whole
thing into git directly.  When you change the file, only the changes
need to be stored, thanks to delta compression.  Unless your
presentation is hundreds of megs in size, git should be able to handle
that just fine already.
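
(In case it helps, a "clean" filter along these lines is all I mean.
This is only a rough sketch -- the script and filter names are made up,
and you should check that your documents survive the round trip -- but
the idea is just to rewrite the zip container with every entry stored
uncompressed, so git's delta machinery can see the real content:)

    # rezip-stored.py (hypothetical): read a zip (eg. an .odt/.odp) on
    # stdin, write it back to stdout with every entry stored uncompressed.
    import io, sys, zipfile

    data = sys.stdin.buffer.read()
    src = zipfile.ZipFile(io.BytesIO(data))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            # writestr() with a plain filename uses the archive's default
            # compression, i.e. ZIP_STORED here.
            dst.writestr(info.filename, src.read(info.filename))
    sys.stdout.buffer.write(out.getvalue())

Then wire it up in .gitattributes with something like

    *.odt filter=storezip
    *.odp filter=storezip

and

    git config filter.storezip.clean "python rezip-stored.py"

Git feeds the file to the clean filter on stdin and stores whatever
comes out on stdout, so nothing else is required.  An uncompressed zip
should still open fine as an opendocument file, but test that against
your own documents before trusting it.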
> The same for PDF files: if you split them using a blob for each uncompressed
> stream, little variations of the pdf file will touch only a blob.

But then you're digging around inside the pdf file by hand, which is a
lot of pdf-specific work that probably doesn't belong inside git.
Worse, because compression programs don't always produce the same
output, this operation would most likely actually *change* the hash of
your pdf file as you do it.  (That's also true for openoffice files,
but at least those are just plain zip files, and zip files are somewhat
less of a special case.)

>> For things like opendocument, or uncompressed tars, you'd be better
>> off to decompress them (or recompress with zip -0) using
>> .gitattributes.  Generally these files aren't *so* large that they
>> really need to be chunked; what you want to do is improve the deltas,
>> which decompressing will do.
>
> This is what I currently do.  But using multiblobs would be a definite
> improvement over this.

In what way?  I doubt you'd get more efficient storage, at least.
Git's deltas are awfully hard to beat.

>> That sounds complicated and error prone, and is suspiciously like
>> Apple's "resource forks," which even Apple has mostly realized were a
>> bad idea.
>
> I did not mean the Apple way... Suppose that you need to store images with
> exif tags.  In order to diff them you would typically set a textconv
> attribute, to see only the tags.  However, this kind of filter needs to
> read the whole file (expensive).  BTW this is why a caching mechanism
> involving notes has recently been proposed.  Now suppose that you can set
> up a rule so that image files with tags are stored as a multiblob.  You can
> use 3 blobs... one as a header, one for the raw image data and one for the
> tags.  Now your textconv filter only needs to look at the content of the
> tags blob.

A resource fork by any other name is still a resource fork, and it's
still ugly.  If you really need something like this, just cache the
attributes in a file alongside the big file, and store both files in
the git repo.

> Similar... Right now to do package management with git, you need to use
> pristine tar.  This is because when you check in the upstream tar you only
> check in its elements, not the whole tar.gz.  So you need pristine tar to
> recreate the upstream tar.gz whenever needed.  But with multiblob you could
> store both the content /and/ the upstream tar and there would be minimal
> overlap since the blobs would be the same.

I guess.  For something like that, though, Debian's pristine-tarball
tool seems to already solve the problem, and it works with any VCS, not
just git.

>> Sharing the blobs of a tarball with a checked-out tree would require a
>> tar-specific chunking algorithm.  Not impossible, but a pain, and you
>> might have a hard time getting it accepted into git since it's
>> obviously not something you really need for a normal "source code"
>> tracking system.
>
> I agree... but there could be just a mere couple of gitattributes,
> multiblobsplit and multiblobcompose, so that one could provide his own
> splitting and composing methods for the types of files he is interested
> in (and maybe contribute them to the community).

I guess this would be mostly harmless; the implementation could mirror
the filter stuff.
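
(Going back to the exif example for a second: the textconv setup Sergio
is describing is only a couple of lines.  The driver name here is
arbitrary, and exiftool is just one example of a program that dumps the
tags as text:)

    # .gitattributes
    *.jpg diff=exif

    # tell git how to turn the image into text before diffing
    git config diff.exif.textconv exiftool

Git runs the textconv program on a temporary copy of each version of
the file and diffs the text it prints, which is exactly why it has to
read the whole file every time -- hence the proposal to cache the
results in notes.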
> I am not really thinking that much about large binary files (that would
> anyway come as a bonus - and many people often talk about them on the
> list), but of structured files that currently do not pack well.  My
> personal issue is with opendocument files, since I need to check in lots
> of documentation and presentation material.

In that case, I'd like to see some comparisons of real numbers (memory,
disk usage, CPU usage) when storing your openoffice documents (using
the .gitattributes filter, of course).  I can't really imagine how
splitting the files into more pieces would really improve disk space
usage, at least.

Having done some tests while writing bup, my experience has been that
chunking-without-deltas is great for these situations:

1) you have the same data shared across *multiple* files (eg. the same
   images in lots of openoffice documents with different filenames);

2) you have the same data *repeated* in the same file at large
   distances, so that gzip compression doesn't catch it (eg. VMware
   images);

3) your file is too big to work with the delta compressor (eg. VMware
   images).

However, in my experience #1 is pretty rare, and #2 and #3 aren't in
your use case.  And deltas-between-chunks is not very easy to do, since
it's hard to guess which chunks might be "similar" to which other
chunks.

Personally, I think it would be great if git could natively handle
large numbers of large binary files efficiently, because there are a
few use cases I would have for it.  But whenever I start investigating
my use cases, it always turns out that "supporting large files" is just
the tip of the iceberg, and there's a huge submerged mass of iceberg
that becomes obvious as soon as you start crashing into it.  The bup
use case (write-once, read-almost-never, incremental backups) is a rare
exception in which fixing *only* the file size problem has produced
useful results.

Have fun,

Avery
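
P.S. To make "chunking" a bit more concrete: bup-style splitting is
basically a rolling checksum over the byte stream, with a chunk
boundary wherever the low bits of the checksum hit a fixed pattern, so
identical runs of bytes end up in identical chunks no matter where they
sit in a file.  A toy version (not bup's actual code; the window size
and mask here are picked arbitrarily) looks something like:

    # toy content-defined chunker: cut a chunk wherever a rolling sum
    # over the last WINDOW bytes has its low 13 bits all set, giving
    # roughly 8KB average chunks.  `data` is a Python bytes object.
    WINDOW = 64
    MASK = 0x1FFF

    def chunks(data):
        start = 0
        rollsum = 0
        for i, b in enumerate(data):
            rollsum += b
            if i >= WINDOW:
                rollsum -= data[i - WINDOW]
            if (rollsum & MASK) == MASK and i + 1 - start >= WINDOW:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

That's the whole trick; as above, the hard part is everything that has
to happen around it.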