From: Michael J Gruber <michaeljgruber+gmane@fastmail.fm>
To: Sergio <sergio.callegari@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Tracking OpenOffice files/other compressed files with Git
Date: Tue, 09 Sep 2008 12:28:05 +0200
Message-ID: <48C64FB5.6080007@fastmail.fm>
In-Reply-To: <loom.20080909T085002-376@post.gmane.org>
Sergio venit, vidit, dixit 09.09.2008 11:02:
> Johannes Sixt <j.sixt <at> viscovery.net> writes:
>
>> Peter Krefting schrieb:
>>> Since OpenOffice documents are just zipped xml files, I wondered how
>>> difficult it would be to create some hooks/hack git to track the files
>>> inside the archives instead?
>> You could write a "clean" filter that "recompresses" the archive with
>> level 0 upon git-add.
>>
>
>
> A couple of notes:
>
> 1) For OpenOffice documents whose size is dominated by embedded images and other
> large objects, the git delta mechanism already performs reasonably well, since
> OO files are Zip archives where each file is compressed separately. If you do
> not change an image, then that image remains stored in the same way and the
> delta can be done.
>
> 2) For OO documents whose size is dominated by plain content, the git delta
> mechanism cannot work, since the zip compression introduces "mixing" and a small
> change in the document is converted into a very large change in the zip file.
>
> It could be possible to write a clean filter to uncompress before commit.
> However there is a trick with the complementary smudge filter to be used at
> checkout. If you do not smudge properly, git always shows the file as changed
> wrt the index. Smudging correctly would mean using the very same compression
> level and compression method that OO uses, which can be a little tricky. I have
> tried using the zip binary both in the clean and the smudge phases and it does
> not work nicely. The smudged file is always different from the original one. One
> should probably work at a lower level to have a finer control on what is
> happening (libzip) and prepend to the uncompressed file the compression
> parameters to be restored on smudging.
>
> The bigger issue is however that the clean/smudge thing can be really slow when
> dealing with large OO files.
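For reference, the level-0 "clean" filter suggested above could be sketched
roughly as follows in Python. This is only a sketch under assumptions: the
names store_uncompressed and run_filter are made up for illustration, only
the clean direction is covered, and the whole archive is held in memory, so
it does nothing about the speed concern for large files.

```python
import io
import zipfile


def store_uncompressed(data: bytes) -> bytes:
    """Rewrite a zip archive so that every member is stored without compression."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            # Copy name, timestamp and attributes so that the same input
            # always produces the same output; member order is preserved.
            stored = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            stored.compress_type = zipfile.ZIP_STORED
            stored.external_attr = info.external_attr
            dst.writestr(stored, src.read(info))
    return out.getvalue()


def run_filter(src, dst):
    """Read an archive from the binary stream src, write the stored version to dst."""
    dst.write(store_uncompressed(src.read()))
```

As a filter script it would end with a call like
run_filter(sys.stdin.buffer, sys.stdout.buffer), and it would be hooked up
through a filter.<name>.clean setting in the git config plus a matching
"*.odt filter=<name>" line in .gitattributes, with a filter name of your
choosing. The result is still a valid zip archive, but it is not the byte
stream OO wrote, which is exactly where the smudge problem starts.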
I made similar observations when I experimented with tracking PDF and
sqlite (Firefox, FF, profile) files. The problems I have run into so far:

PDF: compressing/uncompressing with pdftk seems to produce the objects
in a random order. We need something bijective.

sqlite files for FF profiles: uncompressing (i.e. dumping) and
recompressing gives something different from what FF writes. FF seems to
write out "holes" in the db to be filled in later.

I know, you and I will be told that git is not meant to track OO, PDF,
sql. Anyway, I think it all comes down to finding a strictly bijective and
reasonably efficient compress/uncompress pair.
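To make the "bijective" requirement concrete, here is a small round-trip
check in Python. It is a toy sketch, not a real filter: plain zlib stands in
for the actual tools (zip, pdftk, sqlite dump), and the names round_trips,
clean, smudge_same_level and smudge_other_level are made up for
illustration.

```python
import zlib


def round_trips(original: bytes, clean, smudge) -> bool:
    """True if smudge(clean(x)) gives back exactly the bytes we started with."""
    return smudge(clean(original)) == original


def clean(blob: bytes) -> bytes:
    # "clean": what would go into the repository, i.e. the uncompressed data.
    return zlib.decompress(blob)


def smudge_same_level(text: bytes) -> bytes:
    # "smudge" using the same level the file was originally written with.
    return zlib.compress(text, 6)


def smudge_other_level(text: bytes) -> bytes:
    # "smudge" using a different level: valid output, but different bytes.
    return zlib.compress(text, 9)


# Stand-in for a file as the application wrote it (assumed: level 6).
original = zlib.compress(b"<text>hello</text>" * 100, 6)

same_ok = round_trips(original, clean, smudge_same_level)
other_ok = round_trips(original, clean, smudge_other_level)
```

Only the pair that reproduces the original bytes is usable. The second one
decompresses to the same content but yields a different file, which is the
kind of mismatch that makes git report the file as changed.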

It turns out that when I have a choice between tracking a larger or a
smaller format, such as ps/dvi vs pdf, it's often better to track the
larger one if it's mostly clear text.

On a side note, gc'ing (git gc) helps a lot with binary files.

Michael