git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Anatoly Borodin <anatoly.borodin@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: A different bug in git-filter-branch (v2.7.0)
Date: Mon, 22 Feb 2016 16:13:28 -0500	[thread overview]
Message-ID: <20160222211328.GA16512@sigill.intra.peff.net> (raw)
In-Reply-To: <n9b9tp$gbr$1@ger.gmane.org>

On Mon, Feb 08, 2016 at 11:55:37PM +0000, Anatoly Borodin wrote:

> unfortunately, `--tree-filter true` doesn't solve the problem with the
> repo at my work: not all old blobs are replaced with the new ones. I've
> made a test repository to demonstrate it; it's a huge one (115M), but I
> couldn't make it much smaller, because the bug fails to reproduce if the
> repo is not big enough:
> 
> https://github.com/anatolyborodin/git-filter-branch-bug
> 
> There are some description and instructions in `README.md`. I hope you
> will be able to reproduce it on your machine, if not - just add more
> files :)
> 
> I've debugged the test case and found the place where `git diff-index`
> behaves differently regarding `aa/bb.dat`:
> 
> read-cache.c +351	ie_match_stat():
> ...
> 	if (!changed && is_racy_timestamp(istate, ce)) {
> 		if (assume_racy_is_modified)
> 			changed |= DATA_CHANGED;
> 		else
> 			changed |= ce_modified_check_fs(ce, st);
> 	}
> ...
> 
> After `git-checkout-index` the blob hash for `aa/bb.dat` in the index is
> 88a0f09b9b2e4ccf2faec89ab37d416fba4ee79d (the huge binary), but the file
> on disk is a text file "This file was to big, and it has been removed."
> with the hash 16e0939430610600301680d5bf8a24a22ff8b6c4.

Right, that makes sense. The index doesn't know anything about the
replace mechanism. So it assumes that what it wrote matches what is in
the stat cache (i.e., some sha1 and the matching stat info). Later, when
git wants to know "should I bother reading this file back in and
computing its sha1", the stat cache says no.

And then as you noticed, it sometimes "works" because if the file and
index timestamps are the same, we err on the side of re-reading. So more
files means more likelihood of crossing the one-second boundary.

> I don't know if it should be considered to be a bug in in the logic of
> `git checkout-index`, or `git diff-index` / `git update-index`.

I'd say if any, it is the fault of checkout-index for writing out
content that does not match the sha1 in the index, but writing out the
stat information as if it is clean. For your case, you'd obviously
prefer that the file be marked dirty and re-read later.

But I'm not sure whether other users of replace refs would want the same
behavior. They may want to silently pretend as if the data is
unmodified. I'm not sure if anyone is doing that in practice, though; as
you noted, the results are not deterministic, so it probably ends up
just being a huge pain. So perhaps it would make sense to make the
checkout operation aware of replaced blobs.

As a workaround for your filter-branch, I suspect you could do
`--tree-filter 'git ls-files | xargs touch'` or something, but that is
going to be rather inefficient.

By now you've probably realized that replaced blobs are not widely used,
and there are a lot of corner cases around checking them out. So let's
take a step back for a moment. What are you trying to accomplish with
your filter-branch? If you just want to replace blob A with blob B, I
think it might be easier to skip refs/replace entirely and just do so
explicitly in an index-filter, like:

  --index-filter '
    git ls-files -s |
    sed "s/$old_sha1/$new_sha1/" |
    git update-index --index-info
  '

-Peff

      reply	other threads:[~2016-02-22 21:13 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-28 14:46 Bugs in git filter-branch (git replace related) Anatoly Borodin
2016-01-29  6:18 ` Jeff King
2016-01-29 18:24   ` Anatoly Borodin
2016-01-29 23:11     ` Jeff King
2016-02-08 23:55       ` A different bug in git-filter-branch (v2.7.0) Anatoly Borodin
2016-02-22 21:13         ` Jeff King [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160222211328.GA16512@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=anatoly.borodin@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).