* make git ignore the timestamp embedded in PDFs
From: Andreas Leha @ 2013-05-14 13:17 UTC
To: git
Hi all,
how can I make git ignore the time stamp(s) in a PDF? Two PDFs that
differ only in these time stamps should be considered identical.
Here is an example:
,----
| > pdfinfo some.pdf
| Title: R Graphics Output
| Creator: R
| Producer: R 2.15.1
| CreationDate: Thu Jan 24 13:43:31 2013 <== these entries
| ModDate: Thu Jan 24 13:43:31 2013 <== should be ignored
| Tagged: no
| Pages: 1
| Encrypted: no
| Page size: 504 x 504 pts
| File size: 54138 bytes
| Optimized: no
| PDF version: 1.4
`----
What I tried is a filter:
,----[ ~/.gitconfig ]
| [filter "pdfresetdate"]
| clean = pdfresetdate
`----
With this filter script:
,----[ pdfresetdate ]
| #!/bin/bash
|
| FILEASARG=true
| if [ "$#" == 0 ]; then
| FILEASARG=false
| fi
|
| if $FILEASARG ; then
| FILENAME="$1"
| else
| FILENAME=`mktemp`
| cat /dev/stdin > "${FILENAME}"
| fi
|
| TMPFILE=`mktemp`
| TMPFILE2=`mktemp`
|
| ## dump the pdf metadata to a file and replace the dates
| pdftk "$FILENAME" dump_data | sed -e '{N;s/Date\nInfoValue: D:.*/Date\nInfoValue: D:19790101072619/}' > "$TMPFILE"
|
| ## update the pdf metadata
| pdftk "$FILENAME" update_info "$TMPFILE" output "$TMPFILE2"
|
| ## overwrite the original pdf
| mv -f "$TMPFILE2" "$FILENAME"
|
| ## clean up
| rm -f "$TMPFILE"
| rm -f "$TMPFILE2"
| ## in filter mode (input came from stdin) write the result to stdout
| if ! $FILEASARG ; then
| cat "$FILENAME"
| rm -f "$FILENAME"
| fi
`----
This 'works' insofar as the committed PDF does have its dates reset to my
default value.
However, when I check the files out again, git marks them as modified.
So, my question is: How can I make git *completely* ignore the embedded
date in the PDF?
Many thanks in advance for any help!
Regards,
Andreas
PS:
I had posted this question (without much success) here:
http://stackoverflow.com/questions/16058187/make-git-ignore-the-date-in-pdf-files
and with no answer on the git-users mailing list:
https://groups.google.com/forum/#!topic/git-users/KqtecNa3cOc
* Re: make git ignore the timestamp embedded in PDFs
From: Johannes Sixt @ 2013-05-14 19:26 UTC
To: Andreas Leha; +Cc: git
Am 14.05.2013 15:17, schrieb Andreas Leha:
> Hi all,
>
> how can I make git ignore the time stamp(s) in a PDF. Two PDFs that
> differ only in these time stamps should be considered identical.
> ...
> What I tried is a filter:
> ,----[ ~/.gitconfig ]
> | [filter "pdfresetdate"]
> | clean = pdfresetdate
> `----
>
> This 'works' as far as the committed pdf indeed has the date reset to my
> default value.
>
> However, when I re-checkout the files, they are marked modified by git.
I use clean filters every now and then, though not on Linux, and I have
not observed this behavior recently.
If you 'git add' the file, does it stay marked as modified? Does 'git
diff' report a difference?
-- Hannes
* Re: make git ignore the timestamp embedded in PDFs
From: Andreas Leha @ 2013-05-18 7:42 UTC
To: git
Hi Hannes,
thanks for taking this up and sorry for the long delay in my answer.
Johannes Sixt <j6t@kdbg.org> writes:
> Am 14.05.2013 15:17, schrieb Andreas Leha:
>> Hi all,
>>
>> how can I make git ignore the time stamp(s) in a PDF. Two PDFs that
>> differ only in these time stamps should be considered identical.
>> ...
>> What I tried is a filter:
>> ,----[ ~/.gitconfig ]
>> | [filter "pdfresetdate"]
>> | clean = pdfresetdate
>> `----
>>
>> This 'works' as far as the committed pdf indeed has the date reset to my
>> default value.
>>
>> However, when I re-checkout the files, they are marked modified by git.
>
> I'm using cleaned files every now and then, but not on Linux. I have
> never observed this behavior recently.
>
> If you 'git add' the file, does it keep its modified state? Does 'git
yes.
> diff' tell a difference?
no.
Here is a complete 'session':
,----
| > mkdir test
| > cd test
| > git init
| > echo '*.pdf filter=pdfresetdate' > .gitattributes
| > cp ~/PDF/score_table.pdf .
| > pdfinfo score_table.pdf
| Title: (score_table)
| Author: (andreas)
| Creator: GPL Ghostscript 905 (ps2write)
| Producer: GPL Ghostscript 9.05
| CreationDate: Fri Feb 8 15:44:47 2013
| ModDate: Fri Feb 8 15:44:47 2013
| Tagged: no
| Pages: 1
| Encrypted: no
| Page size: 595 x 842 pts (A4)
| File size: 36989 bytes
| Optimized: no
| PDF version: 1.4
| > git add score_table.pdf
| > pdfinfo score_table.pdf
| Title: (score_table)
| Author: (andreas)
| Creator: GPL Ghostscript 905 (ps2write)
| Producer: GPL Ghostscript 9.05
| CreationDate: Fri Feb 8 15:44:47 2013
| ModDate: Fri Feb 8 15:44:47 2013
| Tagged: no
| Pages: 1
| Encrypted: no
| Page size: 595 x 842 pts (A4)
| File size: 36989 bytes
| Optimized: no
| PDF version: 1.4
| > git commit -m "test"
| > pdfinfo score_table.pdf
| Title: (score_table)
| Author: (andreas)
| Creator: GPL Ghostscript 905 (ps2write)
| Producer: GPL Ghostscript 9.05
| CreationDate: Fri Feb 8 15:44:47 2013
| ModDate: Fri Feb 8 15:44:47 2013
| Tagged: no
| Pages: 1
| Encrypted: no
| Page size: 595 x 842 pts (A4)
| File size: 36989 bytes
| Optimized: no
| PDF version: 1.4
| > rm score_table.pdf
| > git checkout score_table.pdf
| > git status
| # On branch master
| # Changes not staged for commit:
| # (use "git add <file>..." to update what will be committed)
| # (use "git checkout -- <file>..." to discard changes in working directory)
| #
| # modified: score_table.pdf
| #
| # Untracked files:
| # (use "git add <file>..." to include in what will be committed)
| #
| # .gitattributes
| no changes added to commit (use "git add" and/or "git commit -a")
| > pdfinfo score_table.pdf
| Title: (score_table)
| Author: (andreas)
| Creator: GPL Ghostscript 905 (ps2write)
| Producer: GPL Ghostscript 9.05
| CreationDate: Mon Jan 1 07:26:19 1979
| ModDate: Mon Jan 1 07:26:19 1979
| Tagged: no
| Pages: 1
| Encrypted: no
| Page size: 595 x 842 pts (A4)
| File size: 37126 bytes
| Optimized: no
| PDF version: 1.4
| > git add score_table.pdf
| > git status
| # On branch master
| # Changes to be committed:
| # (use "git reset HEAD <file>..." to unstage)
| #
| # modified: score_table.pdf
| #
| # Untracked files:
| # (use "git add <file>..." to include in what will be committed)
| #
| # .gitattributes
| > git diff score_table.pdf
| >
`----
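The session above is easier to read with the clean filter's behaviour in mind: 'git add' stores the filter's output in the index but leaves the working copy alone, which is why pdfinfo shows the old dates until the file is checked out again. Here is a minimal sketch of that round trip in a throwaway repository. The filter name "upper", the file f.txt and the use of tr are made up for the demo and have nothing to do with PDFs; tr is simply a filter that gives the same result when run on its own output.

```shell
#!/bin/sh
# Sketch: what a clean filter does on 'git add'.
# "upper" and f.txt are hypothetical names used only for this demo.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Register a deterministic (and idempotent) clean filter for this repo only.
git config filter.upper.clean 'tr a-z A-Z'
echo '*.txt filter=upper' > .gitattributes

echo hello > f.txt
git add f.txt

worktree=$(cat f.txt)                   # the working copy is not rewritten
staged=$(git cat-file -p :f.txt)        # the blob the clean filter produced
status=$(git status --porcelain f.txt)  # index vs. working copy

echo "worktree: $worktree"
echo "staged:   $staged"
echo "status:   $status"
```

With a filter like this one, the working copy and the staged blob differ in content, yet git does not report the file as modified, because cleaning the working copy reproduces the staged blob.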
Regards,
Andreas
* Re: make git ignore the timestamp embedded in PDFs
From: Johannes Sixt @ 2013-05-18 16:32 UTC
To: Andreas Leha; +Cc: git
Am 18.05.2013 09:42, schrieb Andreas Leha:
>> Am 14.05.2013 15:17, schrieb Andreas Leha:
>>> Hi all,
>>>
>>> how can I make git ignore the time stamp(s) in a PDF. Two PDFs that
>>> differ only in these time stamps should be considered identical.
>>> ...
>>> What I tried is a filter:
>>> ,----[ ~/.gitconfig ]
>>> | [filter "pdfresetdate"]
>>> | clean = pdfresetdate
>>> `----
>>>
>>> This 'works' as far as the committed pdf indeed has the date reset to my
>>> default value.
>>>
>>> However, when I re-checkout the files, they are marked modified by git.
>>
>> I'm using cleaned files every now and then, but not on Linux. I have
>> never observed this behavior recently.
>>
>> If you 'git add' the file, does it keep its modified state? Does 'git
>
> yes.
>
>> diff' tell a difference?
>
> no.
I do not believe you. I'm sure that "Binary files differ" was reported.
The reason is that your pdfresetdate script is not idempotent. Look:
$ pdfresetdate < x.pdf > y.pdf
$ pdfresetdate < y.pdf > z.pdf
$ md5sum x.pdf y.pdf z.pdf
c46a7097574a035e89d1a46d93c83528 x.pdf
8e6d942b4cc7d8a4dfe6898867573617 y.pdf
e6333bc0f8ab9781d3e1d811a392d516 z.pdf
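A file that was already cleaned by the clean filter must not be
modified when it is cleaned again, i.e., y.pdf and z.pdf should be
identical. But they are not, so git sees a difference whenever it re-runs
the filter on the checked-out (already cleaned) file.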
Fix your clean filter.
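The md5sum comparison above can be wrapped in a small helper for testing any candidate filter. This is only a sketch: the function name is_idempotent is made up here, and it assumes the filter reads stdin and writes stdout, as a git clean filter does.

```shell
#!/bin/sh
# is_idempotent FILTER_COMMAND INPUT_FILE
# Runs the filter once, then again on its own output, and compares the
# two results byte for byte (the y.pdf vs. z.pdf comparison).
# Returns 0 if they are identical, 1 otherwise.
is_idempotent() {
    filter=$1
    input=$2
    once=$(mktemp)
    twice=$(mktemp)
    sh -c "$filter" < "$input" > "$once"
    sh -c "$filter" < "$once" > "$twice"
    if cmp -s "$once" "$twice"; then rc=0; else rc=1; fi
    rm -f "$once" "$twice"
    return "$rc"
}

# Usage with the filter from this thread (assumes pdfresetdate is on PATH):
#   is_idempotent pdfresetdate x.pdf && echo stable || echo "not idempotent"
```

Comparing the first and second pass, rather than the input and the first pass, is deliberate: the first pass is expected to change the file.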
-- Hannes
* Re: make git ignore the timestamp embedded in PDFs
From: Andreas Leha @ 2013-05-18 18:09 UTC
To: git
Johannes Sixt <j6t@kdbg.org> writes:
> Am 18.05.2013 09:42, schrieb Andreas Leha:
>>> Am 14.05.2013 15:17, schrieb Andreas Leha:
>>>> Hi all,
>>>>
>>>> how can I make git ignore the time stamp(s) in a PDF. Two PDFs that
>>>> differ only in these time stamps should be considered identical.
>>>> ...
>>>> What I tried is a filter:
>>>> ,----[ ~/.gitconfig ]
>>>> | [filter "pdfresetdate"]
>>>> | clean = pdfresetdate
>>>> `----
>>>>
>>>> This 'works' as far as the committed pdf indeed has the date reset to my
>>>> default value.
>>>>
>>>> However, when I re-checkout the files, they are marked modified by git.
>>>
>>> I'm using cleaned files every now and then, but not on Linux. I have
>>> never observed this behavior recently.
>>>
>>> If you 'git add' the file, does it keep its modified state? Does 'git
>>
>> yes.
>>
>>> diff' tell a difference?
>>
>> no.
>
> I do not believe you. I'm sure that "Binary files differ" was
> reported.
You are correct, of course. I had forgotten that I had also enabled a
special diff for PDF files that reports the difference in the pdfinfo
output.
> The reason is that your pdfresetdate script is not idempotent. Look:
>
> $ pdfresetdate < x.pdf > y.pdf
> $ pdfresetdate < y.pdf > z.pdf
> $ md5sum x.pdf y.pdf z.pdf
> c46a7097574a035e89d1a46d93c83528 x.pdf
> 8e6d942b4cc7d8a4dfe6898867573617 y.pdf
> e6333bc0f8ab9781d3e1d811a392d516 z.pdf
>
Thanks for that. I had not noticed due to the non-binary diff I had
enabled.
> A file that was already cleaned by the clean filter must not be
> modified, i.e., the y.pdf and z.pdf should be identical. But they are not.
>
> Fix your clean filter.
I will (try to) do so. Anyway, git does not seem to be responsible for my issue.
Thanks for that clear analysis!
Regards,
Andreas
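One possible direction for an idempotent filter, offered only as a sketch and not as something anyone in this thread used: do not re-serialize the PDF at all, and overwrite the date digits in place. The function name pdf_reset_dates is made up. It assumes the dates sit in an uncompressed, unencrypted Info dictionary in the usual "/CreationDate (D:YYYYMMDDHHmmSS...)" form, and that the local sed passes binary data through unchanged (an assumption worth checking on your platform). It leaves any XMP metadata and the time zone suffix untouched.

```shell
#!/bin/sh
# pdf_reset_dates: hypothetical clean filter body, stdin to stdout.
# Overwrites only the 14 date digits after "D:" for /CreationDate and
# /ModDate, so the file length and every byte offset stay the same.
# In the replacement, "\1" is the captured prefix and the digits that
# follow are literal (sed back-references are a single digit).
pdf_reset_dates() {
    LC_ALL=C sed -E \
        's#(/(CreationDate|ModDate)[[:space:]]*\(D:)[0-9]{14}#\119790101072619#g'
}

# As a filter script, the last line would simply be:
#   pdf_reset_dates
```

Because the replacement is a fixed string of the same length, running the function on its own output changes nothing, which is the property the clean filter needs.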