From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael J Gruber Subject: Re: Tracking OpenOffice files/other compressed files with Git Date: Tue, 09 Sep 2008 12:28:05 +0200 Message-ID: <48C64FB5.6080007@fastmail.fm> References: <48C61F94.3060400@viscovery.net> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: git@vger.kernel.org To: Sergio X-From: git-owner@vger.kernel.org Tue Sep 09 12:29:20 2008 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1Kd0TG-0002lm-0x for gcvg-git-2@gmane.org; Tue, 09 Sep 2008 12:29:18 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752096AbYIIK2L (ORCPT ); Tue, 9 Sep 2008 06:28:11 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751493AbYIIK2K (ORCPT ); Tue, 9 Sep 2008 06:28:10 -0400 Received: from out3.smtp.messagingengine.com ([66.111.4.27]:53689 "EHLO out3.smtp.messagingengine.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751824AbYIIK2J (ORCPT ); Tue, 9 Sep 2008 06:28:09 -0400 Received: from compute2.internal (compute2.internal [10.202.2.42]) by out1.messagingengine.com (Postfix) with ESMTP id 191F515E593; Tue, 9 Sep 2008 06:28:08 -0400 (EDT) Received: from heartbeat1.messagingengine.com ([10.202.2.160]) by compute2.internal (MEProxy); Tue, 09 Sep 2008 06:28:08 -0400 X-Sasl-enc: WW7Ax1/jbCglwKdOA0A6jxJHVbD+LgvxFOV9aeVVkbBP 1220956086 Received: from [139.174.44.12] (whitehead.math.tu-clausthal.de [139.174.44.12]) by mail.messagingengine.com (Postfix) with ESMTPSA id A9098D083; Tue, 9 Sep 2008 06:28:06 -0400 (EDT) User-Agent: Thunderbird 2.0.0.16 (X11/20080707) In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: Sergio venit, vidit, dixit 09.09.2008 11:02: > Johannes Sixt viscovery.net> writes: > >> Peter Krefting schrieb: >>> Since OpenOffice doucuments are just zipped xml files, I wondered how >>> difficult it would be to create some hooks/hack git to track the files >>> inside the archives instead? >> You could write a "clean" filter that "recompresses" the archive with >> level 0 upon git-add. >> > > > A couple of notes: > > 1) For Openoffice documents whose size is dominated by embed images and other > large objects, the git delta mechanism already performs reasonably well, since > OO files are Zip archives where each file is compressed separately. If you do > not change an image, then that image remains stored in the same way and the > delta can be done. > > 2) For OO documents whose size is dominated by plain content, the git delta > mechanism cannot work, since the zip compression introduces "mixing" and a small > change in the document is converted into a very large change in the zip file. > > It could be possible to write a clean filter to uncompress before commit. > However there is a trick with the complementary smudge filter to be used at > checkout. If you do not smudge properly, git always shows the file as changed > wrt the index. Smudging correctly would mean using the very same compression > ratio and compress method that OO uses, which can be a little tricky. I have > tried using the zip binary both in the clean and the smudge phases and it does > not work nicely. The smudged file is always different from the original one. One > should probably work at a lower level to have a finer control on what is > happening (libzip) and prepend to the uncompressed file the compression > parameters to be restored on smudging. > > The bigger issue is however that the clean/smudge thing can be really slow when > dealing with large OO files. I made similar observations when I experimented with tracking pdf and sqlite (FF profile) files. Problems occurred so far: PDF: on compressing/uncompressing with pdftk there seems to be a random order of objects. We need something bijective. sqlite files for FF profiles: uncompressing (i.e. dumping) and recompressing gives something different than what FF writes. FF seems to write out "holes" in the db to be filled out later. I know, you and I will be told that git is not meant to track OO, PDF, sql. Anyways, I think it's all up to finding a strictly bijective and reasonably efficient compress/uncompress pair. It turns out that when I have a choice between tracking larger or smaller formats, such as ps/dvi vs pdf, it's often better to track the larger one if it's mostly clear text. On a side note, gc'ing helps a lot with binary files. Michael