From: Junio C Hamano <gitster@pobox.com>
To: "Kache Hit" <kache.hit@gmail.com>
Cc: "Chris Torek" <chris.torek@gmail.com>,
	 "Johannes Sixt" <j6t@kdbg.org>, <git@vger.kernel.org>
Subject: Re: Filter smudge for secret restoration: no disk access?
Date: Mon, 24 Nov 2025 11:35:08 -0800
Message-ID: <xmqqms4bw7f7.fsf@gitster.g>
In-Reply-To: <DEH58DEF5MGO.2CFIKCM2CAQY2@gmail.com> (Kache Hit's message of "Mon, 24 Nov 2025 10:40:49 -0800")

"Kache Hit" <kache.hit@gmail.com> writes:

> On Mon Nov 24, 2025 at 1:01 AM PST, Johannes Sixt wrote:
>> A smudge filter must read its stdin and write the result to stdout. The
>> presence of %f in the configuration does not change this.
>>
>> The filter can inspect the file name it receives via the %f token (note:
>> the *name* of the file, not the file itself) to draw additional hints
>> how to process the data, but it still has to read stdin and write to stdout.
>
> Yes, I understand. I'm asking why it's necessary that smudge not read
> from disk, even as it properly satisfies that stdin/stdout operation, as
> in my Python implementation of `smudge()`

I do not think it is a dogmatic, total prohibition; it is a
practical piece of advice: be prepared for a situation where the
file %f does not exist on disk in the working tree.  Also, even
when the file %f does exist, its contents may not match what is in
the tree of the commit you are switching out of (because it was
smudged when it was checked out, and the user may have further
modified it).

Suppose you added paths F and G with the same smudge/clean filter
pair to the history at commit X.  You check out a commit before that
happened:

	$ git checkout -b practice X~1

and then try to come back to commit X:

	$ git checkout X

Git would read the cleaned contents of blobs X:F and X:G, invoke
your smudge filter once for each of these blobs, and feed the blob
contents to it.  In one of the two invocations, your smudge filter
learns via %f that it is being handed the clean contents and is
expected to smudge them for path F; in the other invocation, the
same smudge filter is told that it is now being asked to smudge for
path G.
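
As a rough illustration of that contract, here is a minimal Python
sketch of two such invocations.  Everything named in it is an
assumption made for the example: the "@@SECRET@@" placeholder, the
secrets table (imagined to live outside the working tree), and a
driver configured along the lines of
"filter.<driver>.smudge = python3 smudge.py %f".  The point is only
that %f is used as a lookup key and the file it names is never
opened.

```python
import io

PLACEHOLDER = b"@@SECRET@@"  # hypothetical marker left behind by the clean filter


def smudge(clean: bytes, path: str, secrets: dict) -> bytes:
    """Turn clean blob data into working-tree data for `path`.

    `path` (the %f value) is used only as a lookup key; the file it
    names is never opened.
    """
    secret = secrets.get(path)
    if secret is None:
        # No secret known for this path: pass the content through unchanged.
        return clean
    return clean.replace(PLACEHOLDER, secret)


def run_filter(argv, stdin, stdout, secrets):
    """One filter invocation: read all of stdin, write the result to stdout."""
    path = argv[1]  # the value substituted for %f
    stdout.write(smudge(stdin.read(), path, secrets))


# Simulate the two invocations for blobs X:F and X:G with in-memory streams.
secrets = {"F": b"hunter2", "G": b"swordfish"}  # assumed out-of-tree store
results = {}
for path in ("F", "G"):
    out = io.BytesIO()
    clean_blob = io.BytesIO(b"password=" + PLACEHOLDER + b"\n")
    run_filter(["smudge", path], clean_blob, out, secrets)
    results[path] = out.getvalue()
```

A real filter script would call run_filter() with sys.argv,
sys.stdin.buffer and sys.stdout.buffer; the in-memory streams are
used here only to keep the sketch self-contained.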

If F or G exists on the disk, surely, the smudge filter can read it,
but in this situation, because you are coming from X~1 before F and
G appeared in the history, these files are not on disk in your
working tree.

The smudge filter needs to be careful about a similar situation
where commit Y, a descendant of X, modifies F and/or G.  When Y is
checked out and you want to switch to X, the working tree may have
the smudged versions of F and G from Y when your smudge filter is
called.  Or, while checking out F or G, one of the things the
checkout needs to do may be to remove the existing file from the
working tree, then create the file anew (probably as a temporary
file) and move it to its final place, in which case your smudge
filter may be called during the "create the file anew" phase, when
the old file F or G is missing from the working tree.  Even if F
and G are there, they may be from commit Y, and their contents may
have nothing to do with the version of the files your smudge filter
is trying to produce from the clean blob data taken from commit X.
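
To make the two failure modes concrete, here is a small simulation
in Python.  It does not run Git at all; a temporary directory stands
in for the working tree, and naive_smudge() is a made-up example of
a filter that ignores its input and trusts the file on disk.  The
file contents are likewise invented for the illustration.

```python
import os
import tempfile


def naive_smudge(clean: bytes, path: str) -> bytes:
    """Anti-pattern: ignore the stdin data and reuse whatever is on disk."""
    with open(path, "rb") as f:
        return f.read()


outcomes = {}
with tempfile.TemporaryDirectory() as worktree:
    f_path = os.path.join(worktree, "F")
    clean_from_x = b"version=X\npassword=@@SECRET@@\n"

    # Case 1: coming from X~1, F is not in the working tree yet.
    try:
        naive_smudge(clean_from_x, f_path)
        outcomes["missing"] = "ok"
    except FileNotFoundError:
        outcomes["missing"] = "FileNotFoundError"

    # Case 2: coming from Y, F holds Y's smudged (maybe user-edited) content.
    with open(f_path, "wb") as f:
        f.write(b"version=Y\npassword=hunter2\n")
    outcomes["stale"] = naive_smudge(clean_from_x, f_path)
```

In the first case the filter simply fails; in the second it succeeds
but hands back content derived from Y, not from the clean blob of X
that it was asked to smudge.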

The note from "git help attributes" that you cited summarizes the
advice concisely.

    Note that "%f" is the name of the path that is being worked on. Depending
    on the version that is being filtered, the corresponding file on disk may
    not exist, or may have different contents. So, smudge and clean commands
    should not try to access the file on disk, but only act as filters on the
    content provided to them on standard input.

The smudge filter needs to be prepared to work in such scenarios.

Perhaps "Depending on ..." talks too much without giving readers
enough benefit.  A shorter description like this one ...

    Note that the purpose of %f is to tell the filter for which output
    path it is asked to smudge the clean blob data; it should not be
    used for anything else.

... may be less confusing, perhaps?

