git.vger.kernel.org archive mirror
* make git ignore the timestamp embedded in PDFs
From: Andreas Leha @ 2013-05-14 13:17 UTC
  To: git

Hi all,

How can I make git ignore the time stamp(s) in a PDF?  Two PDFs that
differ only in these time stamps should be considered identical.

Here is an example:
,----
| > pdfinfo some.pdf
| Title:          R Graphics Output
| Creator:        R
| Producer:       R 2.15.1
| CreationDate:   Thu Jan 24 13:43:31 2013 <==  these entries
| ModDate:        Thu Jan 24 13:43:31 2013 <==  should be ignored
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      504 x 504 pts
| File size:      54138 bytes
| Optimized:      no
| PDF version:    1.4
`----


What I tried is a filter:
,----[ ~/.gitconfig ]
| [filter "pdfresetdate"]
|         clean = pdfresetdate
`----

With this filter script:
,----[ pdfresetdate ]
| #!/bin/bash
|
| FILEASARG=true
| if [ "$#" == 0 ]; then
|     FILEASARG=false
| fi
|
| if $FILEASARG ; then
|     FILENAME="$1"
| else
|     FILENAME=`mktemp`
|     cat /dev/stdin > "${FILENAME}"
| fi
|
| TMPFILE=`mktemp`
| TMPFILE2=`mktemp`
|
| ## dump the pdf metadata to a file and replace the dates
| pdftk "$FILENAME" dump_data | sed -e '{N;s/Date\nInfoValue: D:.*/Date\nInfoValue: D:19790101072619/}' > "$TMPFILE"
|
| ## update the pdf metadata
| pdftk "$FILENAME" update_info "$TMPFILE" output "$TMPFILE2"
|
| ## overwrite the original pdf
| mv -f "$TMPFILE2" "$FILENAME"
|
| ## clean up
| rm -f "$TMPFILE"
| rm -f "$TMPFILE2"
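| ## when reading from stdin, write the result to stdout and remove the temporary copy
| if ! $FILEASARG ; then
|     cat "$FILENAME"
|     rm -f "$FILENAME"
| fi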
`----
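
One detail in the sed call above may be worth noting: `N` joins lines in fixed pairs (1+2, 3+4, ...), so a `...Date` key that happens to fall on an even-numbered line of the dump would not be matched. Below is a minimal awk-based sketch that does not depend on line pairing. The function name `reset_info_dates` is made up for illustration, and the sketch assumes the `InfoKey:`/`InfoValue:` layout implied by the sed expression. It is not claimed to make the filter idempotent.

```shell
# Hypothetical helper: reads pdftk dump_data text on stdin and rewrites the
# value that directly follows any "InfoKey: ...Date" line to a fixed date.
reset_info_dates() {
    awk -v fixed="D:19790101072619" '
        # An InfoKey line: remember whether it names a *Date entry.
        /^InfoKey: / { is_date = ($0 ~ /Date$/); print; next }
        # The date value right after such a key: replace it.
        /^InfoValue: D:/ && is_date { print "InfoValue: " fixed; is_date = 0; next }
        # Everything else is passed through unchanged.
        { is_date = 0; print }
    '
}
```

In the script it would stand in for the sed stage, e.g. `pdftk "$FILENAME" dump_data | reset_info_dates > "$TMPFILE"`.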


This 'works' insofar as the committed PDF indeed has the date reset to my
default value.

However, when I check the files out again, git marks them as modified.

So, my question is:  How can I make git *completely* ignore the embedded
date in the PDF?

Many thanks in advance for any help!

Regards,
Andreas


PS:
I had posted this question (without much success) here:
http://stackoverflow.com/questions/16058187/make-git-ignore-the-date-in-pdf-files
and with no answer on the git-users mailing list:
https://groups.google.com/forum/#!topic/git-users/KqtecNa3cOc


* Re: make git ignore the timestamp embedded in PDFs
From: Johannes Sixt @ 2013-05-14 19:26 UTC
  To: Andreas Leha; +Cc: git

On 14.05.2013 15:17, Andreas Leha wrote:
> Hi all,
> 
> how can I make git ignore the time stamp(s) in a PDF.  Two PDFs that
> differ only in these time stamps should be considered identical.
> ...
> What I tried is a filter:
> ,----[ ~/.gitconfig ]
> | [filter "pdfresetdate"]
> |         clean = pdfresetdate
> `----
> 
> This 'works' as far as the committed pdf indeed has the date reset to my
> default value.
> 
> However, when I re-checkout the files, they are marked modified by git.

I use cleaned files every now and then, but not on Linux. I have not
observed this behavior recently.

If you 'git add' the file, does it keep its modified state? Does 'git
diff' tell a difference?

-- Hannes


* Re: make git ignore the timestamp embedded in PDFs
From: Andreas Leha @ 2013-05-18  7:42 UTC
  To: git

Hi Hannes,

thanks for taking this up and sorry for the long delay in my answer.

Johannes Sixt <j6t@kdbg.org> writes:

> On 14.05.2013 15:17, Andreas Leha wrote:
>> Hi all,
>> 
>> how can I make git ignore the time stamp(s) in a PDF.  Two PDFs that
>> differ only in these time stamps should be considered identical.
>> ...
>> What I tried is a filter:
>> ,----[ ~/.gitconfig ]
>> | [filter "pdfresetdate"]
>> |         clean = pdfresetdate
>> `----
>> 
>> This 'works' as far as the committed pdf indeed has the date reset to my
>> default value.
>> 
>> However, when I re-checkout the files, they are marked modified by git.
>
> I'm using cleaned files every now and then, but not on Linux. I have
> never observed this behavior recently.
>
> If you 'git add' the file, does it keep its modified state? Does 'git

yes.

> diff' tell a difference?

no.

Here is a complete 'session':
,----
| > mkdir test
| > cd test
| > git init
| > echo '*.pdf filter=pdfresetdate' > .gitattributes
| > cp ~/PDF/score_table.pdf .
| > pdfinfo score_table.pdf
| Title:          (score_table)
| Author:         (andreas)
| Creator:        GPL Ghostscript 905 (ps2write)
| Producer:       GPL Ghostscript 9.05
| CreationDate:   Fri Feb  8 15:44:47 2013
| ModDate:        Fri Feb  8 15:44:47 2013
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      595 x 842 pts (A4)
| File size:      36989 bytes
| Optimized:      no
| PDF version:    1.4
| > git add score_table.pdf
| > pdfinfo score_table.pdf
| Title:          (score_table)
| Author:         (andreas)
| Creator:        GPL Ghostscript 905 (ps2write)
| Producer:       GPL Ghostscript 9.05
| CreationDate:   Fri Feb  8 15:44:47 2013
| ModDate:        Fri Feb  8 15:44:47 2013
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      595 x 842 pts (A4)
| File size:      36989 bytes
| Optimized:      no
| PDF version:    1.4
| > git commit -m "test"
| > pdfinfo score_table.pdf
| Title:          (score_table)
| Author:         (andreas)
| Creator:        GPL Ghostscript 905 (ps2write)
| Producer:       GPL Ghostscript 9.05
| CreationDate:   Fri Feb  8 15:44:47 2013
| ModDate:        Fri Feb  8 15:44:47 2013
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      595 x 842 pts (A4)
| File size:      36989 bytes
| Optimized:      no
| PDF version:    1.4
| > rm score_table.pdf
| > git checkout  score_table.pdf  
| > git status
| # On branch master
| # Changes not staged for commit:
| #   (use "git add <file>..." to update what will be committed)
| #   (use "git checkout -- <file>..." to discard changes in working directory)
| #
| #       modified:   score_table.pdf
| #
| # Untracked files:
| #   (use "git add <file>..." to include in what will be committed)
| #
| #       .gitattributes
| no changes added to commit (use "git add" and/or "git commit -a")
| > pdfinfo score_table.pdf 
| Title:          (score_table)
| Author:         (andreas)
| Creator:        GPL Ghostscript 905 (ps2write)
| Producer:       GPL Ghostscript 9.05
| CreationDate:   Mon Jan  1 07:26:19 1979
| ModDate:        Mon Jan  1 07:26:19 1979
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      595 x 842 pts (A4)
| File size:      37126 bytes
| Optimized:      no
| PDF version:    1.4
| > git add score_table.pdf
| > git status
| # On branch master
| # Changes to be committed:
| #   (use "git reset HEAD <file>..." to unstage)
| #
| #       modified:   score_table.pdf
| #
| # Untracked files:
| #   (use "git add <file>..." to include in what will be committed)
| #
| #       .gitattributes
| > git diff score_table.pdf
| > 
`----

Regards,
Andreas


* Re: make git ignore the timestamp embedded in PDFs
From: Johannes Sixt @ 2013-05-18 16:32 UTC
  To: Andreas Leha; +Cc: git

On 18.05.2013 09:42, Andreas Leha wrote:
>> On 14.05.2013 15:17, Andreas Leha wrote:
>>> Hi all,
>>>
>>> how can I make git ignore the time stamp(s) in a PDF.  Two PDFs that
>>> differ only in these time stamps should be considered identical.
>>> ...
>>> What I tried is a filter:
>>> ,----[ ~/.gitconfig ]
>>> | [filter "pdfresetdate"]
>>> |         clean = pdfresetdate
>>> `----
>>>
>>> This 'works' as far as the committed pdf indeed has the date reset to my
>>> default value.
>>>
>>> However, when I re-checkout the files, they are marked modified by git.
>>
>> I'm using cleaned files every now and then, but not on Linux. I have
>> never observed this behavior recently.
>>
>> If you 'git add' the file, does it keep its modified state? Does 'git
> 
> yes.
> 
>> diff' tell a difference?
> 
> no.

I do not believe you. I'm sure that "Binary files differ" was reported.
The reason is that your pdfresetdate script is not idempotent. Look:

$ pdfresetdate < x.pdf > y.pdf
$ pdfresetdate < y.pdf > z.pdf
$ md5sum x.pdf y.pdf z.pdf
c46a7097574a035e89d1a46d93c83528  x.pdf
8e6d942b4cc7d8a4dfe6898867573617  y.pdf
e6333bc0f8ab9781d3e1d811a392d516  z.pdf

A file that was already cleaned by the clean filter must not be modified
when the filter is applied to it again, i.e., y.pdf and z.pdf should be
identical. But they are not.

Fix your clean filter.
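
The check above can be wrapped into a small helper, sketched here under the assumption that the filter reads stdin and writes stdout, as git's clean filters do. The function name `check_idempotent` is made up for illustration.

```shell
# check_idempotent FILE CMD [ARGS...]
# Runs CMD twice (stdin -> stdout): once on FILE, then on its own output.
# Returns 0 only if both runs succeed and produce byte-identical results.
check_idempotent() {
    local input=$1 once twice status
    shift
    once=$(mktemp) || return 2
    twice=$(mktemp) || { rm -f "$once"; return 2; }
    "$@" < "$input" > "$once" &&
        "$@" < "$once" > "$twice" &&
        cmp -s "$once" "$twice"
    status=$?
    rm -f "$once" "$twice"
    return $status
}
```

For example: `check_idempotent x.pdf pdfresetdate && echo idempotent || echo "NOT idempotent"`.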

-- Hannes


* Re: make git ignore the timestamp embedded in PDFs
From: Andreas Leha @ 2013-05-18 18:09 UTC
  To: git

Johannes Sixt <j6t@kdbg.org> writes:

> On 18.05.2013 09:42, Andreas Leha wrote:
>>> On 14.05.2013 15:17, Andreas Leha wrote:
>>>> Hi all,
>>>>
>>>> how can I make git ignore the time stamp(s) in a PDF.  Two PDFs that
>>>> differ only in these time stamps should be considered identical.
>>>> ...
>>>> What I tried is a filter:
>>>> ,----[ ~/.gitconfig ]
>>>> | [filter "pdfresetdate"]
>>>> |         clean = pdfresetdate
>>>> `----
>>>>
>>>> This 'works' as far as the committed pdf indeed has the date reset to my
>>>> default value.
>>>>
>>>> However, when I re-checkout the files, they are marked modified by git.
>>>
>>> I'm using cleaned files every now and then, but not on Linux. I have
>>> never observed this behavior recently.
>>>
>>> If you 'git add' the file, does it keep its modified state? Does 'git
>> 
>> yes.
>> 
>>> diff' tell a difference?
>> 
>> no.
>
> I do not believe you. I'm sure that "Binary files differ" was
> reported.

You are correct, of course.  I had forgotten that I had also enabled a
special diff for PDF files that reports the difference in the pdfinfo
output.
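
A textconv diff driver of that kind might be configured roughly as in the sketch below. The driver name "pdf" is an assumption, and the actual setup used here was not posted. With such a driver, 'git diff' compares only the text that pdfinfo prints, so byte-level differences that do not show up in that text go unreported; 'git diff --no-textconv' bypasses the driver.

```
# ~/.gitconfig (sketch; driver name "pdf" is hypothetical)
[diff "pdf"]
        textconv = pdfinfo

# .gitattributes
*.pdf filter=pdfresetdate diff=pdf
```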

> The reason is that your pdfresetdate script is not idempotent. Look:
>
> $ pdfresetdate < x.pdf > y.pdf
> $ pdfresetdate < y.pdf > z.pdf
> $ md5sum x.pdf y.pdf z.pdf
> c46a7097574a035e89d1a46d93c83528  x.pdf
> 8e6d942b4cc7d8a4dfe6898867573617  y.pdf
> e6333bc0f8ab9781d3e1d811a392d516  z.pdf
>

Thanks for that.  I had not noticed due to the non-binary diff I had
enabled.

> A file that was already cleaned by the clean filter must not be
> modified, i.e., the y.pdf and z.pdf should be identical. But they are not.
>
> Fix your clean filter.

I will (try to) do so.  Anyway, git does not seem to be responsible for
my issue.


Thanks for that clear analysis!

Regards,
Andreas
