git.vger.kernel.org archive mirror
From: Michael J Gruber <git@drmicha.warpmail.net>
To: SLONIK.AZ@gmail.com
Cc: Pierre Habouzit <madcoder@debian.org>, git@vger.kernel.org
Subject: Re: clean/smudge filters for pdf files
Date: Fri, 24 Oct 2008 10:10:13 +0200
Message-ID: <490182E5.4030804@drmicha.warpmail.net>
In-Reply-To: <ee2a733e0810231840u1aed8455w7e4c461e2565ad08@mail.gmail.com>

Leo Razoumov venit, vidit, dixit 24.10.2008 03:40:
> On 10/23/08, Pierre Habouzit <madcoder@debian.org> wrote:
>> On Thu, Oct 23, 2008 at 07:44:39PM +0000, Leo Razoumov wrote:
>>  > I am trying to improve storage efficiency for PDF files in a git repo.
>>  > Following earlier discussions in this list I am trying to set up
>>  > proper clean/smudge filters. What follows is my current setup
>>  >
>>  > # in ~/.gitconfig
>>  > [filter "pdf"]
>>  >       clean  = "pdftk - output - uncompress"
>>  >       smudge = "pdftk - output - compress"
>>  >
>>  > # in .gitattributes
>>  > *.pdf filter=pdf
>>  >
>>  > Unfortunately, it seems as though that pdftk uncompress followed by
>>  > pdftk compress do not leave the file invariant. I tried several
>>  > uncompress+compress iterations and the file still keep changing (the
>>  > size though stays the same).
>>  > Is there any other alternative way to store PDF files in git repo more
>>  > efficiently?
>>  > Any alternative to pdftk on Linux?
>>
>>
>> actually it uses some kind of zlib algorithm so that's pretty normal you
>>  don't have the same result with a packer. Maybe one could write a tool
>>  like pristine-tar for that purpose.
>>
> 
> With zlib you get the same deterministic result as long as you use the
> same zlib packer and unpacker. With pdftk compress/uncompress seem not
> to form a bijection pair. This issue was briefly discussed on this
> list back in April 2008 but no resolution emerged.

For a different file format I use the pair "gzip -c, gunzip -c" without
any problems, so zlib is not a problem. I do see the effect that
checkouts on different machines may have different compressed files
(same gzip version), but this is a non-issue.
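
To illustrate why differing compressed files are a non-issue, here is a
minimal Python sketch. It assumes the filter pair is set up as
clean = "gunzip -c", smudge = "gzip -c" (my reading of the pair above),
and it uses Python's gzip module as a stand-in for the command-line
tools. The function names and the varying mtime are illustrative only:

```python
import gzip


def smudge(blob: bytes, mtime: int) -> bytes:
    """Stand-in for `gzip -c`: produce the compressed working-tree file."""
    # The gzip header stores a timestamp, so the output bytes can vary
    # between checkouts even though the payload is the same.
    return gzip.compress(blob, mtime=mtime)


def clean(worktree: bytes) -> bytes:
    """Stand-in for `gunzip -c`: recover what git actually stores."""
    return gzip.decompress(worktree)


blob = b"example payload\n" * 100

# Two "checkouts" that differ only in the header timestamp.
a = smudge(blob, mtime=1)
b = smudge(blob, mtime=2)

print("compressed files identical:", a == b)
print("cleaned content identical: ", clean(a) == clean(b) == blob)
```

Git only ever hashes the output of the clean filter, so as long as that
output is stable, differences in the compressed working-tree file do not
show up as changes.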

Your experience with pdftk confirms mine. It shuffles things around
because it parses the file into objects and then writes them out again,
possibly in a different order. For PDF itself this is no problem, because
objects refer to each other through "pointers" (so it's a bijection up to
reordering), but it's a weird design, and it complicates things for us.
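
What a filter pair needs is that cleaning a smudged blob gives back the
same blob. A rough Python sketch of that check follows. The helper name
clean_is_stable and the toy rotating_clean filter are hypothetical; the
latter only mimics a tool that rewrites its input in a different order
on each pass and is not a model of what pdftk really does:

```python
import zlib
from typing import Callable

Filter = Callable[[bytes], bytes]


def clean_is_stable(clean: Filter, smudge: Filter, worktree: bytes) -> bool:
    """Return True if clean(smudge(clean(x))) equals clean(x)."""
    stored = clean(worktree)
    return clean(smudge(stored)) == stored


def rotating_clean(data: bytes) -> bytes:
    """Toy filter: moves the first line to the end on every pass."""
    lines = data.splitlines(keepends=True)
    return b"".join(lines[1:] + lines[:1])


def identity(data: bytes) -> bytes:
    """Toy smudge filter that leaves the data untouched."""
    return data


# A zlib-based pair: decompress on clean, compress on smudge.
zlib_ok = clean_is_stable(zlib.decompress, zlib.compress,
                          zlib.compress(b"hello" * 50))

# A reordering filter: same content each time, but never the same bytes.
reorder_ok = clean_is_stable(rotating_clean, identity,
                             b"obj1\nobj2\nobj3\n")

print("zlib pair stable:     ", zlib_ok)
print("reordering pair stable:", reorder_ok)
```

The same check could be run against real filter commands by wrapping
them with subprocess, which I have left out here to keep the sketch
self-contained.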

I'm still looking for something viable. I'll let the list know when I've
found something...

Michael
