From: Michael J Gruber <michaeljgruber+gmane@fastmail.fm>
To: Sergio <sergio.callegari@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Tracking OpenOffice files/other compressed files with Git
Date: Tue, 09 Sep 2008 12:28:05 +0200
Message-ID: <48C64FB5.6080007@fastmail.fm>
In-Reply-To: <loom.20080909T085002-376@post.gmane.org>
Sergio venit, vidit, dixit 09.09.2008 11:02:
> Johannes Sixt <j.sixt <at> viscovery.net> writes:
>
>> Peter Krefting schrieb:
>>> Since OpenOffice documents are just zipped xml files, I wondered how
>>> difficult it would be to create some hooks/hack git to track the files
>>> inside the archives instead?
>> You could write a "clean" filter that "recompresses" the archive with
>> level 0 upon git-add.
>>
>
>
> A couple of notes:
>
> 1) For OpenOffice documents whose size is dominated by embedded images and other
> large objects, the git delta mechanism already performs reasonably well, since
> OO files are Zip archives where each file is compressed separately. If you do
> not change an image, then that image remains stored in the same way and the
> delta can be done.
>
> 2) For OO documents whose size is dominated by plain content, the git delta
> mechanism cannot work, since the zip compression introduces "mixing" and a small
> change in the document is converted into a very large change in the zip file.
>
> It could be possible to write a clean filter to uncompress before commit.
> However there is a trick with the complementary smudge filter to be used at
> checkout. If you do not smudge properly, git always shows the file as changed
> wrt the index. Smudging correctly would mean using the very same compression
> ratio and compression method that OO uses, which can be a little tricky. I have
> tried using the zip binary both in the clean and the smudge phases and it does
> not work nicely. The smudged file is always different from the original one. One
> should probably work at a lower level to have finer control over what is
> happening (libzip) and prepend to the uncompressed file the compression
> parameters to be restored on smudging.
>
> The bigger issue is however that the clean/smudge thing can be really slow when
> dealing with large OO files.
I made similar observations when I experimented with tracking PDF and
sqlite (Firefox profile) files. The problems I have run into so far:

PDF: uncompressing and recompressing with pdftk seems to write the
objects in a random order. We need something bijective.

sqlite files for Firefox profiles: uncompressing (i.e. dumping) and
recompressing gives something different from what Firefox writes.
Firefox seems to write out "holes" in the db, to be filled in later.

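To make the sqlite case concrete, here is a minimal sketch, assuming Python's standard sqlite3 module. The names dump_sql and restore_sql are hypothetical helpers for illustration, not anything Firefox or git provides. It dumps a database to SQL text (the "uncompress" step) and rebuilds a database from that text (the "recompress" step). The logical content survives the round trip, but nothing guarantees that the rebuilt file matches the original byte for byte, for example when the original still contains free pages left behind by deleted rows.

```python
import sqlite3


def dump_sql(db_path: str) -> str:
    """Return the database at db_path as SQL text (the "uncompress" step)."""
    con = sqlite3.connect(db_path)
    try:
        # iterdump() yields the statements needed to recreate the database.
        return "\n".join(con.iterdump()) + "\n"
    finally:
        con.close()


def restore_sql(sql_text: str, db_path: str) -> None:
    """Rebuild a database file from SQL text (the "recompress" step).

    Assumption: db_path does not exist yet or is an empty database.
    """
    con = sqlite3.connect(db_path)
    try:
        con.executescript(sql_text)
    finally:
        con.close()
```

Comparing dump_sql() of the original and of the rebuilt file checks logical equality only. A smudge filter would need the files themselves to be identical, which is the harder requirement.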
I know, you and I will be told that git is not meant to track OO, PDF
or sqlite files. Anyway, I think it all comes down to finding a strictly
bijective and reasonably efficient compress/uncompress pair.

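As a rough illustration of what "bijective" has to mean here, the following is a minimal sketch, assuming Python's standard zipfile module. The helpers rewrite and same_members are hypothetical names, and this is not the filter anyone in this thread actually used. rewrite() copies an archive with a chosen compression method: ZIP_STORED plays the role of a level-0 "clean" step, ZIP_DEFLATED that of a naive "smudge". same_members() checks only that names and contents agree.

```python
import io
import zipfile


def rewrite(zip_bytes: bytes, method: int) -> bytes:
    """Copy a zip archive, storing every member with the given method."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", method) as dst:
        for info in src.infolist():
            # Keep name, timestamp and attributes; change only the method.
            copy = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            copy.compress_type = method
            copy.external_attr = info.external_attr
            dst.writestr(copy, src.read(info))
    return out.getvalue()


def same_members(a: bytes, b: bytes) -> bool:
    """True if both archives hold the same names with the same contents."""
    with zipfile.ZipFile(io.BytesIO(a)) as za, \
            zipfile.ZipFile(io.BytesIO(b)) as zb:
        if za.namelist() != zb.namelist():
            return False
        return all(za.read(n) == zb.read(n) for n in za.namelist())
```

For an archive `original`, `same_members(original, rewrite(rewrite(original, zipfile.ZIP_STORED), zipfile.ZIP_DEFLATED))` should hold, but `original == ...` for the raw bytes generally will not unless the writer reproduces exactly the parameters the application used. Only byte equality keeps git from reporting the file as modified.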
It turns out that when I have a choice between tracking a larger or a
smaller format, such as ps/dvi vs. pdf, it's often better to track the
larger one if it's mostly clear text.

On a side note, running git gc helps a lot with binary files.

Michael
Thread overview: 9+ messages
2008-09-09 6:19 Tracking OpenOffice files/other compressed files with Git Peter Krefting
2008-09-09 7:02 ` Johannes Sixt
2008-09-09 9:02 ` Sergio
2008-09-09 10:28 ` Michael J Gruber [this message]
2008-09-09 10:57 ` Johannes Sixt
2008-09-09 11:07 ` Sergio Callegari
2008-09-09 11:22 ` Johannes Sixt
2008-09-09 8:18 ` Mike Hommey
2008-09-09 8:34 ` Matthieu Moy