git.vger.kernel.org archive mirror
* Tracking OpenOffice files/other compressed files with Git
@ 2008-09-09  6:19 Peter Krefting
  2008-09-09  7:02 ` Johannes Sixt
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Peter Krefting @ 2008-09-09  6:19 UTC
  To: Git Mailing List

Hi!

I find myself tracking OpenOffice files every now and then, mostly to
synchronise them so that I can edit documents in multiple locations, less
for the actual history.

I notice, however, that the Git history tends to grow quite a bit,
especially for larger documents (I have a 175 kilobyte spreadsheet that
has a git database of about 8 megabytes).

Since OpenOffice documents are just zipped XML files, I wondered how
difficult it would be to create some hooks/hack git to track the files
inside the archives instead?


Alternatively, does anyone know if it is possible to set OpenOffice not
to use compression for a saved document? If it was uncompressed text, I
would think Git's delta compression would fare better.

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: Tracking OpenOffice files/other compressed files with Git
  2008-09-09  6:19 Tracking OpenOffice files/other compressed files with Git Peter Krefting
@ 2008-09-09  7:02 ` Johannes Sixt
  2008-09-09  9:02   ` Sergio
  2008-09-09  8:18 ` Mike Hommey
  2008-09-09  8:34 ` Matthieu Moy
  2 siblings, 1 reply; 9+ messages in thread
From: Johannes Sixt @ 2008-09-09  7:02 UTC
  To: Peter Krefting; +Cc: Git Mailing List

Peter Krefting wrote:
> Since OpenOffice documents are just zipped XML files, I wondered how
> difficult it would be to create some hooks/hack git to track the files
> inside the archives instead?

You could write a "clean" filter that "recompresses" the archive with
level 0 upon git-add.
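
A minimal sketch of such a filter in Python, using only the standard
zipfile module. The function name store_only is made up for illustration.
The script reads an archive on stdin and writes the same members, in the
same order, with the "stored" method (no compression) to stdout:

```python
import io
import sys
import zipfile


def store_only(data: bytes) -> bytes:
    """Return a copy of the zip archive in `data` with every member stored uncompressed."""
    src = zipfile.ZipFile(io.BytesIO(data))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        # Keep the member order: OpenDocument packages put "mimetype" first.
        for info in src.infolist():
            member = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            member.compress_type = zipfile.ZIP_STORED
            member.external_attr = info.external_attr
            dst.writestr(member, src.read(info))
    return out.getvalue()


if __name__ == "__main__":
    raw = sys.stdin.buffer.read()
    # Pass empty input through untouched; anything else is assumed to be a zip file.
    sys.stdout.buffer.write(store_only(raw) if raw else raw)
```

It would be hooked up through a filter.<name>.clean entry in the git
configuration plus a matching filter=<name> attribute for the document
extensions in .gitattributes. Whether OOo accepts the resulting file is not
checked here.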

-- Hannes


* Re: Tracking OpenOffice files/other compressed files with Git
  2008-09-09  6:19 Tracking OpenOffice files/other compressed files with Git Peter Krefting
  2008-09-09  7:02 ` Johannes Sixt
@ 2008-09-09  8:18 ` Mike Hommey
  2008-09-09  8:34 ` Matthieu Moy
  2 siblings, 0 replies; 9+ messages in thread
From: Mike Hommey @ 2008-09-09  8:18 UTC
  To: Peter Krefting; +Cc: Git Mailing List

On Tue, Sep 09, 2008 at 07:19:55AM +0100, Peter Krefting <peter@softwolves.pp.se> wrote:
> Hi!
> 
> I find myself tracking OpenOffice files every now and then. Mostly to
> synchronise to be able to edit documents in multiple locations, less
> for the actual history.
> 
> I notice, however, that the Git history tends to grow quite a bit,
> especially for larger documents (I have a 175 kilobyte spreadsheet that
> has a git database of about 8 megabytes).
> 
> Since OpenOffice documents are just zipped XML files, I wondered how
> difficult it would be to create some hooks/hack git to track the files
> inside the archives instead?

It could be worth having a generic tool that would do similar things
to what pristine-tar[1] does.

Mike

1. http://joey.kitenet.net/code/pristine-tar/


* Re: Tracking OpenOffice files/other compressed files with Git
  2008-09-09  6:19 Tracking OpenOffice files/other compressed files with Git Peter Krefting
  2008-09-09  7:02 ` Johannes Sixt
  2008-09-09  8:18 ` Mike Hommey
@ 2008-09-09  8:34 ` Matthieu Moy
  2 siblings, 0 replies; 9+ messages in thread
From: Matthieu Moy @ 2008-09-09  8:34 UTC
  To: Peter Krefting; +Cc: Git Mailing List

Peter Krefting <peter@softwolves.pp.se> writes:

> I find myself tracking OpenOffice files every now and then.

[...]

> I notice, however, that the Git history tends to grow quite a bit,

<shameless ad>
Slightly off-topic answer, since it doesn't solve the disk space
problem, but you can have a look at

  http://www-verimag.imag.fr/~moy/opendocument/

(which lets you use "git diff" on OpenOffice files)
</shameless ad>

-- 
Matthieu


* Re: Tracking OpenOffice files/other compressed files with Git
  2008-09-09  7:02 ` Johannes Sixt
@ 2008-09-09  9:02   ` Sergio
  2008-09-09 10:28     ` Michael J Gruber
  2008-09-09 10:57     ` Johannes Sixt
  0 siblings, 2 replies; 9+ messages in thread
From: Sergio @ 2008-09-09  9:02 UTC
  To: git

Johannes Sixt <j.sixt <at> viscovery.net> writes:

> 
> Peter Krefting wrote:
> > Since OpenOffice documents are just zipped XML files, I wondered how
> > difficult it would be to create some hooks/hack git to track the files
> > inside the archives instead?
> 
> You could write a "clean" filter that "recompresses" the archive with
> level 0 upon git-add.
> 


A couple of notes:

1) For OpenOffice documents whose size is dominated by embedded images and other
large objects, the git delta mechanism already performs reasonably well, since
OO files are Zip archives where each file is compressed separately.  If you do
not change an image, then that image remains stored in the same way and the
delta can be done.

2) For OO documents whose size is dominated by plain content, the git delta
mechanism cannot work, since the zip compression introduces "mixing" and a small
change in the document is converted into a very large change in the zip file.
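
The "mixing" can be seen with plain deflate, independent of OO. A small
sketch in Python: the two "documents" are made up, and differing_bytes is
just an illustrative helper that compares the two byte strings position by
position.

```python
import zlib


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where the two byte strings differ, plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Two made-up "documents" that differ in a single character near the start.
old = b"".join(b"<row n='%d'>cell %d</row>\n" % (i, i * 7) for i in range(2000))
new = old.replace(b"cell 0", b"cell X", 1)

print("plain:   ", differing_bytes(old, new), "of", len(old))
print("deflated:", differing_bytes(zlib.compress(old), zlib.compress(new)),
      "of", len(zlib.compress(old)))
```

The uncompressed texts differ in one byte; how far the compressed streams
diverge depends on the data and the zlib version, so no figure is claimed
here.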

It could be possible to write a clean filter to uncompress before commit.
However there is a trick with the complementary smudge filter to be used at
checkout. If you do not smudge properly, git always shows the file as changed
wrt the index.  Smudging correctly would mean using the very same compression
level and compression method that OO uses, which can be a little tricky. I have
tried using the zip binary both in the clean and the smudge phases and it does
not work nicely. The smudged file is always different from the original one. One
should probably work at a lower level to have finer control over what is
happening (libzip) and prepend to the uncompressed file the compression
parameters to be restored on smudging.
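
As a rough illustration of what "compression parameters" could be recorded,
Python's zipfile module exposes the method and the general-purpose flag bits
of each member. The name compression_manifest is invented here. Note that
the zip format only stores the method and a coarse option hint in the flag
bits, not the exact deflate settings, so this alone may not be enough to
reproduce OO's bytes.

```python
import io
import zipfile


def compression_manifest(data: bytes) -> list[tuple[str, int, int]]:
    """List (member name, compression method, general-purpose flag bits) in archive order."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [(i.filename, i.compress_type, i.flag_bits) for i in archive.infolist()]
```

A clean filter could prepend such a list to the uncompressed data, and the
smudge filter could read it back to decide how to recompress each member.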

The bigger issue is however that the clean/smudge thing can be really slow when
dealing with large OO files.


* Re: Tracking OpenOffice files/other compressed files with Git
  2008-09-09  9:02   ` Sergio
@ 2008-09-09 10:28     ` Michael J Gruber
  2008-09-09 10:57     ` Johannes Sixt
  1 sibling, 0 replies; 9+ messages in thread
From: Michael J Gruber @ 2008-09-09 10:28 UTC
  To: Sergio; +Cc: git

Sergio came, saw, and said on 09.09.2008 11:02:
> Johannes Sixt <j.sixt <at> viscovery.net> writes:
> 
>> Peter Krefting wrote:
>>> Since OpenOffice documents are just zipped XML files, I wondered how
>>> difficult it would be to create some hooks/hack git to track the files
>>> inside the archives instead?
>> You could write a "clean" filter that "recompresses" the archive with
>> level 0 upon git-add.
>>
> 
> 
> A couple of notes:
> 
> 1) For OpenOffice documents whose size is dominated by embedded images and other
> large objects, the git delta mechanism already performs reasonably well, since
> OO files are Zip archives where each file is compressed separately.  If you do
> not change an image, then that image remains stored in the same way and the
> delta can be done.
> 
> 2) For OO documents whose size is dominated by plain content, the git delta
> mechanism cannot work, since the zip compression introduces "mixing" and a small
> change in the document is converted into a very large change in the zip file.
> 
> It could be possible to write a clean filter to uncompress before commit.
> However there is a trick with the complementary smudge filter to be used at
> checkout. If you do not smudge properly, git always shows the file as changed
> wrt the index.  Smudging correctly would mean using the very same compression
> level and compression method that OO uses, which can be a little tricky. I have
> tried using the zip binary both in the clean and the smudge phases and it does
> not work nicely. The smudged file is always different from the original one. One
> should probably work at a lower level to have finer control over what is
> happening (libzip) and prepend to the uncompressed file the compression
> parameters to be restored on smudging.
> 
> The bigger issue is however that the clean/smudge thing can be really slow when
> dealing with large OO files.

I made similar observations when I experimented with tracking pdf and
sqlite (FF profile) files. Problems encountered so far:

PDF: on compressing/uncompressing with pdftk there seems to be a random
order of objects. We need something bijective.

sqlite files for FF profiles: uncompressing (i.e. dumping) and
recompressing gives something different than what FF writes. FF seems to
write out "holes" in the db to be filled out later.

I know, you and I will be told that git is not meant to track OO, PDF,
sql. Anyways, I think it's all up to finding a strictly bijective and
reasonably efficient compress/uncompress pair.
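
Whether a given pair is bijective on a particular file is easy to test
before trusting it. A small sketch in Python, with zlib standing in for the
real format; round_trips and the sample payload are made up for
illustration:

```python
import zlib
from typing import Callable

Filter = Callable[[bytes], bytes]


def round_trips(clean: Filter, smudge: Filter, original: bytes) -> bool:
    """True if smudge(clean(x)) reproduces the original file byte for byte."""
    return smudge(clean(original)) == original


payload = b"some mostly clear text\n" * 500
as_written = zlib.compress(payload, 1)  # what the application wrote

# Same level on the way back: identical bytes.
print(round_trips(zlib.decompress, lambda d: zlib.compress(d, 1), as_written))
# Different level on the way back: same content, different bytes.
print(round_trips(zlib.decompress, lambda d: zlib.compress(d, 9), as_written))
```

The second case fails even though the decompressed content is identical,
which is the situation described above for pdftk and the sqlite dumps.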

It turns out that when I have a choice between tracking larger or
smaller formats, such as ps/dvi vs pdf, it's often better to track the
larger one if it's mostly clear text.

On a side note, gc'ing helps a lot with binary files.

Michael


* Re: Tracking OpenOffice files/other compressed files with Git
  2008-09-09  9:02   ` Sergio
  2008-09-09 10:28     ` Michael J Gruber
@ 2008-09-09 10:57     ` Johannes Sixt
  2008-09-09 11:07       ` Sergio Callegari
  1 sibling, 1 reply; 9+ messages in thread
From: Johannes Sixt @ 2008-09-09 10:57 UTC
  To: Sergio; +Cc: git

[please reply to all and keep the Cc list.]

Sergio wrote:
> Johannes Sixt <j.sixt <at> viscovery.net> writes:
> 
>> Peter Krefting wrote:
>>> Since OpenOffice documents are just zipped XML files, I wondered how
>>> difficult it would be to create some hooks/hack git to track the files
>>> inside the archives instead?
>> You could write a "clean" filter that "recompresses" the archive with
>> level 0 upon git-add.
...
> It could be possible to write a clean filter to uncompress before commit.
> However there is a trick with the complementary smudge filter to be used at
> checkout. If you do not smudge properly, git always shows the file as changed
> wrt the index.  Smudging correctly would mean using the very same compression
> level and compression method that OO uses, which can be a little tricky. I have
> tried using the zip binary both in the clean and the smudge phases and it does
> not work nicely. The smudged file is always different from the original one. One
> should probably work at a lower level to have finer control over what is
> happening (libzip) and prepend to the uncompressed file the compression
> parameters to be restored on smudging.

You don't need to smudge the OOo file on checkout iff OOo can read a file
that is "compressed" at level 0.

A file that you have just 'git add'ed must not show up as dirty even if it
was processed by a "clean" filter. If it does, then this indicates a bug
in git, and not that a corresponding "smudge" filter is missing or
misbehaves. Yes, I have observed this with my own "clean" filter some time
ago, but I have not yet tried hard enough to find a reproducible test case.

> The bigger issue is however that the clean/smudge thing can be really slow when
> dealing with large OO files.

True indeed.

-- Hannes


* Re: Tracking OpenOffice files/other compressed files with Git
  2008-09-09 10:57     ` Johannes Sixt
@ 2008-09-09 11:07       ` Sergio Callegari
  2008-09-09 11:22         ` Johannes Sixt
  0 siblings, 1 reply; 9+ messages in thread
From: Sergio Callegari @ 2008-09-09 11:07 UTC
  To: Johannes Sixt; +Cc: git


> You don't need to smudge the OOo file on checkout iff OOo can read a file
> that is "compressed" at level 0.
>
>   
I will check again then, since my first attempt at this was not very
successful and I think I was getting dirty files even when only cleaning.
But it was a long time ago, right after clean was introduced.

But in any case it would be preferable to smudge on checkout since 
uncompressed OO files can be quite huge.
Also to have uncompressed OO files in the worktree means that if you 
ever need to send one as an attachment to somebody you need to reopen 
and resave it before making the attachment, which is a bit uncomfortable!

> A file that you have just 'git add'ed must not show up as dirty even if it
> was processed by a "clean" filter. If it does, then this indicates a bug
> in git, and not that a corresponding "smudge" filter is missing or
> misbehaves. Yes, I have observed this with my own "clean" filter some time
> ago, but I have not yet tried hard enough to find a reproducible test case.
>
>   
But am I correct in saying that it will show dirty if you clean and then
smudge in a non-symmetric way?

Sergio


* Re: Tracking OpenOffice files/other compressed files with Git
  2008-09-09 11:07       ` Sergio Callegari
@ 2008-09-09 11:22         ` Johannes Sixt
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Sixt @ 2008-09-09 11:22 UTC
  To: Sergio Callegari; +Cc: git

Sergio Callegari wrote:
> But in any case it would be preferable to smudge on checkout since
> uncompressed OO files can be quite huge.
> Also to have uncompressed OO files in the worktree means that if you
> ever need to send one as an attachment to somebody you need to reopen
> and resave it before making the attachment, which is a bit uncomfortable!

True. Choose your poison.

>> A file that you have just 'git add'ed must not show up as dirty even if it
>> was processed by a "clean" filter. If it does, then this indicates a bug
>> in git, and not that a corresponding "smudge" filter is missing or
>> misbehaves. Yes, I have observed this with my own "clean" filter some time
>> ago, but I have not yet tried hard enough to find a reproducible test case.
>>
> But am I correct in saying that it will show dirty if you clean and then
> smudge in a non-symmetric way?

No.

The "smudge" filter kicks in only if the file in the worktree must be
replaced, for example, due to 'git checkout'. After the filter has
completed, the stat information of the smudged version is stored in the
index, and so the file does not appear as dirty. (Again, if you observe
something else, then git must be fixed, IMO.)
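
To make that concrete, here is a toy model in Python. It is not git's actual
code: IndexEntry, blob_id and looks_dirty are invented names, only size and
mtime stand in for the full stat data, and the fallback of re-running the
"clean" filter once the stat data no longer matches is an assumption about
git's behaviour.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class IndexEntry:
    blob_id: str   # hash of the cleaned content stored in the repository
    size: int      # size of the worktree file when the entry was written
    mtime_ns: int  # modification time of the worktree file at that moment


def blob_id(data: bytes) -> str:
    """Toy stand-in for git's object hashing."""
    return hashlib.sha1(data).hexdigest()


def looks_dirty(entry: IndexEntry, size: int, mtime_ns: int,
                worktree_data: bytes, clean: Callable[[bytes], bytes]) -> bool:
    # Matching stat data: the file is taken as unchanged, no filter is run.
    if (size, mtime_ns) == (entry.size, entry.mtime_ns):
        return False
    # Otherwise (assumed) the content is cleaned again and compared by hash.
    return blob_id(clean(worktree_data)) != entry.blob_id


stored = b"uncompressed form"
smudged = b"worktree form"
entry = IndexEntry(blob_id(stored), size=len(smudged), mtime_ns=1000)
# Stat data recorded after smudging still matches, so the filter is irrelevant.
print(looks_dirty(entry, len(smudged), 1000, smudged, clean=lambda d: b"something else"))
```

In this model a non-symmetric pair goes unnoticed for as long as the stat
data recorded at checkout still matches the file.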

-- Hannes

