* Question about --tree-filter
From: Sergio Callegari @ 2009-02-04 16:08 UTC
To: git
Hi,
while working with the "rezip" filter for efficient git management of
OpenOffice, zip and docx files, I have run into the following problem.
Suppose that you have an existing repository and that you want to convert it
into a repository using the rezip filters: git filter-branch should be the tool
to do the conversion.
Initially I believed that, once the appropriate .git/config filter entries
and a .git/info/attributes file tying the filter to the appropriate file types
were set up, it would be enough to run
    git filter-branch --tree-filter true --tag-name-filter cat
to do the conversion.
This is also what I suggested in my original post about the rezip script.
Unfortunately, this does not seem to work as expected: not all files get
rewritten as filtered blobs. The only way to do the job properly seems to be a
tree-filter that touches every single file in the project.
Any idea why this is so?
This is also not very nice, because it makes filter-branch do a huge
amount of work. In other words, with this technique the rezip blob rewriting
gets called many more times than needed.
Does anybody have a suggestion for a tree-filter that would be both "safe" and
"efficient"?
Thanks
Sergio
* Re: Question about --tree-filter
From: Johannes Sixt @ 2009-02-04 16:37 UTC
To: Sergio Callegari; +Cc: git
Sergio Callegari wrote:
> in working with the "rezip" filter for the efficient git management of
> openoffice, zip and docx files, I am encountering the following problem.
>
> Suppose that you have an existing repository and that you want to convert it
> into a repository using the rezip filters: git filter-branch should be the tool
> to do the conversion.
>
> Initially I believed that once set up the appropriate .git/config filter entries
> and a .git/info/attributes file tying the filter to the appropriate file types,
> it would have been enough to
>
>     git filter-branch --tree-filter true --tag-name-filter cat
>
> to do the conversion.
> This is also what I suggested in my original post about the rezip script.
>
> Unfortunately, this does not seem to work as expected. Not all files get
> rewritten as filtered blobs.
Before the tree-filter runs, the files are checked out (and smudged by
rezip). But they are marked as unchanged (because they were checked out
moments ago). Since your tree-filter doesn't do anything, no new blobs are
added to the index, and none of your files are cleaned by rezip.
I think your brute-force tree-filter should be
rm -f "$GIT_INDEX_FILE"
assuming that a .gitattributes file is already in all revisions.
-- Hannes
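
The suggestion above can be tried end to end on a throwaway repository. The following is only a sketch under stated assumptions: the "upcase" filter (tr a-z A-Z) is a hypothetical stand-in for rezip, the file and directory names are made up, and the rewrite is done in a bare clone so that the state of an outer work tree does not interfere.

```shell
#!/bin/sh
# Sketch under stated assumptions: a throwaway repository and a hypothetical
# stand-in clean filter named "upcase" (tr a-z A-Z) in place of rezip.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip the deprecation pause in newer git

demo=$(mktemp -d)
cd "$demo"

# History created before any filter was configured: the blob is stored as-is.
git init -q src
cd src
echo "hello" > doc.txt
git add doc.txt
git -c user.name=Demo -c user.email=demo@example.invalid \
    commit -q -m "unfiltered content"
cd ..

# Rewrite in a bare clone so an outer work tree does not get in the way.
git clone -q --bare src rewritten.git
cd rewritten.git
git config filter.upcase.clean "tr a-z A-Z"
mkdir -p info
echo "*.txt filter=upcase" > info/attributes

# Deleting the index in the tree-filter makes filter-branch re-add every
# file, so every file passes through the clean filter.
git filter-branch --tree-filter 'rm -f "$GIT_INDEX_FILE"' \
    --tag-name-filter cat -- --all >/dev/null 2>&1

result=$(git cat-file -p HEAD:doc.txt)
echo "$result"
```

If the rewrite behaves as described, the blob stored for doc.txt in the rewritten history is the filtered content rather than the original one.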
* Re: Question about --tree-filter
From: Sergio Callegari @ 2009-02-04 20:42 UTC
To: git
Johannes Sixt wrote:
> Sergio Callegari wrote:
>
>> in working with the "rezip" filter for the efficient git management of
>> openoffice, zip and docx files, I am encountering the following problem.
>>
>> Suppose that you have an existing repository and that you want to convert it
>> into a repository using the rezip filters: git filter-branch should be the tool
>> to do the conversion.
>>
>> Initially I believed that once set up the appropriate .git/config filter entries
>> and a .git/info/attributes file tying the filter to the appropriate file types,
>> it would have been enough to
>>
>>     git filter-branch --tree-filter true --tag-name-filter cat
>>
>> to do the conversion.
>> This is also what I suggested in my original post about the rezip script.
>>
>> Unfortunately, this does not seem to work as expected. Not all files get
>> rewritten as filtered blobs.
>>
>
> Before the tree-filter runs, the files are checked out (and smudged by
> rezip). But they are marked as unchanged (because they were checked out
> moments ago). Since your tree-filter doesn't do anything, no new blobs are
> added to the index, and none of your files are cleaned by rezip.
>
> I think your brute-force tree-filter should be
>
> rm -f "$GIT_INDEX_FILE"
>
> assuming that a .gitattributes file is already in all revisions.
>
> -- Hannes
>
Sorry, this is still not completely clear to me... I would be very glad if
you could explain in more detail what happens when I run a tree-filter. From
what you say I get the impression that no file should get a new blob. As a
matter of fact, most do (which is why at the very beginning I thought
that --tree-filter true would be sufficient)... only a few do not
get the new blob.
And if I run filter-branch again with exactly the same
parameters, some of the files that did not get a new blob
the first time apparently do... which looks completely weird.
The attributes are in the info subdir of .git, so the brute-force
approach should be fine. I guess that it does not make any difference
compared to a
    find ./ -type f -exec touch \{\} \;
apart from being slightly more aggressive toward the index (and faster), or
does it?
Sergio
* Re: Question about --tree-filter
From: Johannes Sixt @ 2009-02-05 8:32 UTC
To: Sergio Callegari; +Cc: git
[ Don't cull Cc list, please! ]
Sergio Callegari wrote:
> And if I experiment filter-branch again, with exactly the same
> parameters, apparently some of the files that did not get the new blob
> in the beginning do... which looks completely weird.
I think I know what's going on. filter-branch has this code where the
tree-filter is applied:
    git checkout-index -f -u -a ||
        die "Could not checkout the index"
This command may take a while to complete, and at the end it writes the
index file. At this point:
(=) Some files may have the same timestamp as the index file.
(<) Others have an earlier timestamp.
Later we have this code:
    (
        git diff-index -r --name-only $commit
        git ls-files --others
    ) |
    git update-index --add --replace --remove --stdin
The files (=) are racily-clean, and are added to the database; they pass
through the clean filter (rezip). The files (<) are regarded as unchanged,
and are not added again, and are not rezipped.
> find ./ -type f -exec touch \{\} \;
This could help, too, because now all files are regarded as either
racily-clean (same timestamp as index file) or as changed (newer timestamp).
-- Hannes
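
The (=) versus (<) distinction can be illustrated outside git with plain timestamps. This is a sketch only: the file named "index" is a stand-in for the real index file, the file names are hypothetical, and the comparison is a simplified model of the rule described above, not git's actual implementation.

```shell
#!/bin/sh
# Sketch: classify files by mtime relative to a stand-in "index" file,
# following the (=) / (<) cases described in the message above.
set -e
dir=$(mktemp -d)
cd "$dir"

touch -t 200902050832.00 early.odt   # written before the index: case (<)
touch -t 200902050832.05 late.odt    # same second as the index: case (=)
touch -t 200902050832.05 index       # stand-in for the index file

classify() {
    # -ot is true only when $1 is strictly older than the index
    if [ "$1" -ot index ]; then
        echo "$1: unchanged, skipped"
    else
        echo "$1: racily clean, re-added and filtered"
    fi
}

classify early.odt
classify late.odt
```

Which files land in which group depends on how long the checkout takes, which would explain why repeated runs give different results.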
* Re: Question about --tree-filter
From: Sergio Callegari @ 2009-02-05 13:13 UTC
To: git
Johannes Sixt <j.sixt <at> viscovery.net> writes:
>
> [ Don't cull Cc list, please! ]
Sorry... I first replied without the list and then tried to
fix it!
> I think I know what's going on. filter-branch has this code where the
> tree-filter is applied:
>
> git checkout-index -f -u -a ||
> die "Could not checkout the index"
>
> This command may take a while to complete, and at the end it writes the
> index file. At this point:
>
> (=) Some files may have the same timestamp as the index file.
>
> (<) Others have an earlier timestamp.
>
> Later we have this code:
>
> (
> git diff-index -r --name-only $commit
> git ls-files --others
> ) |
> git update-index --add --replace --remove --stdin
>
> The files (=) are racily-clean, and are added to the database; they pass
> through the clean filter (rezip). The files (<) are regarded as unchanged,
> and are not added again, and are not rezipped.
OK, so it is because of a race... now I am starting to understand the inconsistent
behaviour between different runs! Thanks a lot for the explanation.
When you say "at the end it updates the index file", do you mean the effect of
the -u switch?
And when you say "Some files have the same timestamp as the index file", do you
mean that diff-index trusts the stat info inside the index only if a file is older ("<")
than the index, and otherwise directly assumes that the file has changed with respect to the
index content? If so, would it make sense to re-touch the index after the
checkout-index -u, so that the index is always newer (">") than every file it
contains and one always starts from a non-racy situation? With this, one could
explicitly touch only those files that need to get (re)filtered and gain
efficiency... or am I still missing something?
Sergio
* Re: Question about --tree-filter
From: Johannes Sixt @ 2009-02-05 13:57 UTC
To: Sergio Callegari; +Cc: git
Sergio Callegari wrote:
> Johannes Sixt <j.sixt <at> viscovery.net> writes:
>> [ Don't cull Cc list, please! ]
By this I actually meant that you should "Reply to All", not just the
mailing list.
> When you say "at the end it updates the index file" do you mean the effect of
> the -u switch?
I think so.
> And when you say "Some files have the same timestamp as the index file" do you
> mean that diff-index uses the stat info inside the index only if a file is "<"
> than the index otherwise it is directly assumed that the file is changed wrt the
> index content? If so, would it make sense to re-touch the index after the
> checkout -u so that after the checkout the index is always > than every file it
> contains and one always starts at a non-racy situation?
No. The "racy" situation is not something that is bad. It's merely a
situation where git cannot decide from stat information alone whether a
file was changed or not. So it plays safe, and looks also at the content.
But if you lie about the index's timestamp, then git will think that all
files are up-to-date.
> With this, one could
> only explicitly touch those files that need to get (re)filtered and gain in
> efficiency... or am I still missing something?
No, you cannot if you are on a "fast" machine, where the touch happens in
the same second that the index file was written. But you can wait one
second before you touch the files. Depending on the volume of your total
data, this might actually be faster as long as you touch only selected files.
-- Hannes
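
The "wait one second, then touch only selected files" idea could look roughly like the following tree-filter body. It is a sketch, not a tested recipe: the file patterns are assumptions based on the file types named in this thread, and the dry run uses a file called "ref" as a stand-in for the freshly written index.

```shell
#!/bin/sh
# Sketch: a tree-filter body that re-touches only selected files.
# The patterns are assumptions; adjust them to the types handled by the filter.
set -e
tree_filter='sleep 1; find . -type f \( -name "*.odt" -o -name "*.zip" -o -name "*.docx" \) -exec touch {} +'

# Intended use (not run here):
#   git filter-branch --tree-filter "$tree_filter" --tag-name-filter cat -- --all

# Dry run outside git: only the matching file should end up newer than "ref",
# which stands in for the index written at the end of the checkout.
dir=$(mktemp -d)
cd "$dir"
touch -t 200902050832.00 report.odt notes.txt
touch ref
eval "$tree_filter"

for f in report.odt notes.txt; do
    if [ "$f" -nt ref ]; then
        echo "$f: will be re-added"
    else
        echo "$f: left alone"
    fi
done
```

The sleep costs one second per rewritten commit, so whether this beats touching everything depends on the number of commits and the volume of data per commit.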