git.vger.kernel.org archive mirror
* Filter smudge for secret restoration: no disk access?
@ 2025-11-24  7:39 Kache Hit
  2025-11-24  9:01 ` Johannes Sixt
  0 siblings, 1 reply; 7+ messages in thread
From: Kache Hit @ 2025-11-24  7:39 UTC (permalink / raw)
  To: git

I was working on a git redaction script that restores working copy
secrets when applied via `.gitattributes` clean/smudge filters, but
encountered `smudge` not having access to the "working file" on disk.

I see it's documented as intended in
https://git-scm.com/docs/gitattributes:

> Note that "%f" is the name of the path that is being worked on.
> Depending on the version that is being filtered, the corresponding
> file on disk may not exist, or may have different contents. So, smudge
> and clean commands should not try to access the file on disk, but only
> act as filters on the content provided to them on standard input.

Any chance there's a way around this or some alternative? Python
implementation below for reference.

And also for my understanding, why _shouldn't_ smudge access disk?

```py
#!/usr/bin/env python3
"""
Git clean/smudge filter for redactions that retains working secrets

If the following is in the repo as `bar/foo_secrets.yml`:
```
    foo_token: ##REDACTED##
    other: "not secret"
```

The local token won't be overwritten on checkout/restore:
```
    foo_token: secret_value
    other: "not secret"
```

Setup & example usage:

Save this file in repo root as `git_redact_filter.py`

`.gitattributes`:
```
bar/foo_secrets.yml filter=foo_token
```

`.gitconfig`:
```
[filter "foo_token"]
  clean = ./git_redact_filter.py --prefix foo_token:
  smudge = ./git_redact_filter.py --prefix foo_token: --smudge %f
```
"""
import inspect
import re
import sys
from argparse import ArgumentParser
from pathlib import Path
from typing import TextIO

REDACTED = '##REDACTED##'


def clean(workfile: TextIO, prefixes: list[str], out=None):
    pat = prefix_secret_rgx(prefixes)

    for line in workfile.readlines():
        if match := pat.match(line):
            print(match['prefix'] + REDACTED, file=out)
        else:
            print(line, end='', file=out)


def smudge(repofile: TextIO, prefixes: list[str], path: Path, out=None):
    pat = prefix_secret_rgx(prefixes)

    with path.open() as workfile:  # fails: FileNotFoundError
        secrets = {
            str(match['prefix']): match
            for match in map(pat.match, workfile.readlines())
            if match
        }

    for line in repofile.readlines():
        match = pat.match(line)
        secret = match and secrets.get(match['prefix'])

        if match and secret and match['secret'] == REDACTED:
            print(match['prefix'] + secret['secret'], file=out)
        else:
            print(line, end='', file=out)


def prefix_secret_rgx(prefixes_unsafe: list[str]):
    keys = '|'.join(map(re.escape, prefixes_unsafe))
    pat = rf"(?P<prefix>\s*({keys})\s*)(?P<secret>.*)"
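    # (?!) never matches; the former r'$^' matched empty and blank lines,
    # which then failed on match['prefix'] because the group is undefined.
    return re.compile(pat if keys else r'(?!)')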


def heredoc(s: str):
    return inspect.cleandoc(s) + '\n'


def main():
    desc = "Git clean/smudge filter for redactions"
    list_arg = {'action': 'append', 'default': []}
    parser = ArgumentParser(description=desc)
    parser.add_argument('-p', '--prefix', **list_arg, metavar='PREFIX')
    parser.add_argument('--smudge', type=Path, metavar='PATH')
    args = parser.parse_args()

    if args.smudge:
        return smudge(sys.stdin, args.prefix, args.smudge)
    else:
        return clean(sys.stdin, args.prefix)


if __name__ == '__main__':
    sys.exit(main())


import io
from unittest.mock import Mock

import pytest
from pytest import CaptureFixture


work_file = io.StringIO(heredoc("""
    foo_token: secret_value
    other: "not secret"
"""))
clean_file = io.StringIO(heredoc("""
    foo_token: ##REDACTED##
    other: "not secret"
"""))

empty_file = io.StringIO()
work_file_secret_removed = io.StringIO(heredoc("""
    other: "not secret"
"""))
work_file_lines_added = io.StringIO(heredoc("""
    new_other: 123
    foo_token: secret_value
    other: "not secret"
"""))


def test_clean(capsys: CaptureFixture):
    clean(work_file, ['foo_token:'])
    captured = capsys.readouterr()
    assert captured.out == clean_file.getvalue(), "should be redacted"


def test_clean_idempotent():
    out, out2 = io.StringIO(), io.StringIO()
    clean(work_file, ['foo_token:'], out)
    clean(io.StringIO(out.getvalue()), ['foo_token:'], out2)
    assert out2.getvalue() == clean_file.getvalue()


@pytest.mark.parametrize(['workfile', 'expected', 'msg'], [
    (work_file,                work_file,  "secrets should be kept"),
    (work_file_lines_added,    work_file,  "should retain secret"),
    (work_file_secret_removed, clean_file, "should restore redacted"),
])
def test_smudge_goal(capsys: CaptureFixture, workfile, expected, msg):
    path = Mock()
    path.open.side_effect = lambda: io.StringIO(workfile.getvalue())

    smudge(clean_file, ['foo_token:'], path)
    captured = capsys.readouterr()
    assert captured.out == expected.getvalue(), msg


def test_smudge_idempotent():
    path = Mock()
    path.open.side_effect = lambda: io.StringIO(work_file.getvalue())
    cleaned, cleaned2 = io.StringIO(), io.StringIO()

    smudge(clean_file, ['foo_token:'], path, cleaned)
    cleaned.seek(0)
    smudge(cleaned, ['foo_token:'], path, cleaned2)
    assert cleaned.getvalue() == cleaned2.getvalue()


git_doc_url = "https://git-scm.com/docs/gitattributes"
@pytest.mark.xfail(reason=f"should access file on disk: {git_doc_url}")
@pytest.mark.parametrize(['workfile', 'expected', 'msg'], [
    (work_file,                work_file,  "secrets should be kept"),
    (work_file_lines_added,    work_file,  "should retain secret"),
    (work_file_secret_removed, clean_file, "should restore redacted"),
])
def test_smudge_actual(capsys: CaptureFixture, workfile, expected, msg):
    # Separate name so the parametrized `msg` is still used by the assert.
    err_msg = "[Errno 2] No such file or directory: 'bar/foo_secrets.yml'"
    err = FileNotFoundError(err_msg)
    mock_workfile_path = Mock()
    mock_workfile_path.open.side_effect = err

    smudge(clean_file, ['foo_token:'], mock_workfile_path)
    captured = capsys.readouterr()
    assert captured.out == expected.getvalue(), msg


@pytest.fixture(autouse=True)
def reset_files():
    for file in [work_file, clean_file]:
        file.seek(0)
```

Thanks,

Kache


* Re: Filter smudge for secret restoration: no disk access?
  2025-11-24  7:39 Filter smudge for secret restoration: no disk access? Kache Hit
@ 2025-11-24  9:01 ` Johannes Sixt
  2025-11-24  9:49   ` Chris Torek
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Sixt @ 2025-11-24  9:01 UTC (permalink / raw)
  To: Kache Hit; +Cc: git

Am 24.11.25 um 08:39 schrieb Kache Hit:
> I was working on a git redaction script that restores working copy
> secrets when applied via `.gitattributes` clean/smudge filters, but
> encountered `smudge` not having access to the "working file" on disk.
> 
> I see it's documented as intended in
> https://git-scm.com/docs/gitattributes:
> 
>> Note that "%f" is the name of the path that is being worked on.
>> Depending on the version that is being filtered, the corresponding
>> file on disk may not exist, or may have different contents. So, smudge
>> and clean commands should not try to access the file on disk, but only
>> act as filters on the content provided to them on standard input.
> 
> Any chance there's a way around this or some alternative? Python
> implementation below for reference.
> 
> And also for my understanding, why _shouldn't_ smudge access disk?

A smudge filter must read its stdin and write the result to stdout. The
presence of %f in the configuration does not change this.

The filter can inspect the file name it receives via the %f token (note:
the *name* of the file, not the file itself) to draw additional hints
how to process the data, but it still has to read stdin and write to stdout.
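
A minimal sketch of that contract, assuming a hypothetical filter whose only
use of the name is to decide whether a path looks like YAML. The function name
`filter_stream` and the whitespace-stripping rule are illustrative and not
something Git prescribes; the point is that the name is inspected as a string
and never opened.

```python
import io
from pathlib import PurePosixPath
from typing import TextIO


def filter_stream(src: TextIO, dst: TextIO, name: str) -> None:
    """Copy src to dst; `name` is only a hint and is never opened."""
    # Hypothetical rule: strip trailing whitespace for YAML paths only.
    is_yaml = PurePosixPath(name).suffix in {".yml", ".yaml"}
    for line in src:
        if is_yaml:
            body = line.rstrip("\r\n")
            # Keep whatever line ending the input had.
            line = body.rstrip() + line[len(body):]
        dst.write(line)


if __name__ == "__main__":
    # In a real filter this would be something like:
    #   filter_stream(sys.stdin, sys.stdout, sys.argv[1])
    demo = io.StringIO()
    filter_stream(io.StringIO("a: 1   \nb: 2\n"), demo, "bar/foo_secrets.yml")
    print(demo.getvalue(), end="")
```

With `%f` passed as the last argument in the filter configuration, the name
would arrive as `sys.argv[1]` while the blob content still arrives on stdin.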

-- Hannes



* Re: Filter smudge for secret restoration: no disk access?
  2025-11-24  9:01 ` Johannes Sixt
@ 2025-11-24  9:49   ` Chris Torek
  2025-11-24 18:40     ` Kache Hit
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Torek @ 2025-11-24  9:49 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Kache Hit, git

On Mon, Nov 24, 2025 at 1:01 AM Johannes Sixt <j6t@kdbg.org> wrote:
> The filter can inspect the file name it receives via the %f token (note:
> the *name* of the file, not the file itself) to draw additional hints
> how to process the data, but it still has to read stdin and write to stdout.

It can, of course, also read and/or write anything else on disk.

When and how this is actually useful is another matter entirely.

For sanity purposes, if for no other reason, it might be wise to store a
"file with secrets" under a file with a name such that it is **never**
controlled by Git (i.e., always listed in a .gitignore or equivalent,
or outside the working tree entirely), and to store instead, in Git, a
"template file with secrets that are replaced". That way, the secrets
either exist on disk (and are secret because Git is blind to them), or
do not exist at all (and are therefore secret to Git). The template
file controls the template and nothing else; the secret-data file has
both secrets and, perhaps, data that are extracted from the
Git-controlled file as well.

In this manner, a "to-be-smudged" file named foo.template might
control some external-to-Git manipulation of an invisible-to-Git file
named foo.secret, and no clean filter would be required at all, though
one could inspect and strip secrets accidentally copied into a
foo.template.
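
One way to picture the template/secret split, as a rough sketch only. It
assumes both files use simple `key: value` lines and that the template marks
secrets with the `##REDACTED##` placeholder from the original script; the
helper names `load_secrets` and `render` are made up for illustration.

```python
import tempfile
from pathlib import Path

REDACTED = "##REDACTED##"  # placeholder from the thread's example


def load_secrets(secret_path: Path) -> dict[str, str]:
    """Read 'key: value' lines from the untracked file; no file, no secrets."""
    if not secret_path.exists():
        return {}
    secrets = {}
    for line in secret_path.read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            secrets[key.strip()] = value.strip()
    return secrets


def render(template: str, secrets: dict[str, str]) -> str:
    """Fill redacted values in the template with the matching secrets."""
    out = []
    for line in template.splitlines(keepends=True):
        key, sep, value = line.partition(":")
        if sep and value.strip() == REDACTED and key.strip() in secrets:
            ending = line[len(line.rstrip("\r\n")):]
            line = f"{key}: {secrets[key.strip()]}{ending}"
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        secret_file = Path(tmp) / "foo.secret"  # would be ignored by Git
        secret_file.write_text("foo_token: secret_value\n")
        template = 'foo_token: ##REDACTED##\nother: "not secret"\n'
        print(render(template, load_secrets(secret_file)), end="")
```

Because the rendered output goes to a file Git never tracks, nothing here
depends on what is or is not in the working tree during a checkout.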

Chris


* Re: Filter smudge for secret restoration: no disk access?
  2025-11-24  9:49   ` Chris Torek
@ 2025-11-24 18:40     ` Kache Hit
  2025-11-24 19:35       ` Junio C Hamano
  2025-11-25  8:55       ` Chris Torek
  0 siblings, 2 replies; 7+ messages in thread
From: Kache Hit @ 2025-11-24 18:40 UTC (permalink / raw)
  To: Chris Torek, Johannes Sixt; +Cc: Kache Hit, git

On Mon Nov 24, 2025 at 1:01 AM PST, Johannes Sixt wrote:
> A smudge filter must read its stdin and write the result to stdout. The
> presence of %f in the configuration does not change this.
>
> The filter can inspect the file name it receives via the %f token (note:
> the *name* of the file, not the file itself) to draw additional hints
> how to process the data, but it still has to read stdin and write to stdout.

Yes, I understand. I'm asking why it's necessary that smudge not read
from disk, even as it properly satisfies that stdin/stdout operation, as
in my Python implementation of `smudge()`.

On Mon Nov 24, 2025 at 1:49 AM PST, Chris Torek wrote:
> For sanity purposes, if for no other reason, it might be wise to store a
> "file with secrets" under a file with a name such that it is **never**
> controlled by Git (i.e., always listed in a .gitignore or equivalent,
> or outside the working tree entirely), and to store instead, in Git, a
> "template file with secrets that are replaced". That way, the secrets
> either exist on disk (and are secret because Git is blind to them), or
> do not exist at all (and are therefore secret to Git). The template
> file controls the template and nothing else; the secret-data file has
> both secrets and, perhaps, data that are extracted from the
> Git-controlled file as well.
>
> In this manner, a "to-be-smudged" file named foo.template might
> control some external-to-Git manipulation of an invisible-to-Git file
> named foo.secret, and no clean filter would be required at all, though
> one could inspect and strip secrets accidentally copied into a
> foo.template.

I'm familiar with this practice, e.g. committing an `.env.template`
which is used to create an `.env` file with secrets within.

However, this is my dotfiles repo that includes `~/.config`. There are
config files that store credentials right next to configuration, managed
by software that I don't control.

Although I could still apply that pattern by ignoring `foo.yml` and
committing a redacted `foo.template.yml`, I'd have to manually upstream
changes back to the template as the config file changes.

Another use case is to ignore changes to a specific line without losing
the working copy. Some software saves a volatile "last_updated_at" or
"last_opened" field into config that doesn't need to be committed. This
could also be useful for https://stackoverflow.com/questions/16244969
and https://stackoverflow.com/questions/61091219

- Kache


* Re: Filter smudge for secret restoration: no disk access?
  2025-11-24 18:40     ` Kache Hit
@ 2025-11-24 19:35       ` Junio C Hamano
  2025-11-25  7:28         ` Kache Hit
  2025-11-25  8:55       ` Chris Torek
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2025-11-24 19:35 UTC (permalink / raw)
  To: Kache Hit; +Cc: Chris Torek, Johannes Sixt, git

"Kache Hit" <kache.hit@gmail.com> writes:

> On Mon Nov 24, 2025 at 1:01 AM PST, Johannes Sixt wrote:
>> A smudge filter must read its stdin and write the result to stdout. The
>> presence of %f in the configuration does not change this.
>>
>> The filter can inspect the file name it receives via the %f token (note:
>> the *name* of the file, not the file itself) to draw additional hints
>> how to process the data, but it still has to read stdin and write to stdout.
>
> Yes, I understand. I'm asking why it's necessary that smudge not read
> from disk, even as it properly satisfies that stdin/stdout operation, as
> in my Python implementation of `smudge()`.

I do not think it is a total dogmatic prohibition, but is a
practical piece of advice to be prepared for a situation where the
file %f does not exist on the disk in the working tree.  Also even
when the file %f does exist, its contents would not match (because
it was smudged when it was checked out, and the user may have
further modified it) what is in the tree of the commit you are
switching out of.

Suppose you added paths F and G with the SAME smudge/clean filter
pair to the history at commit X.  You check out a commit before that
happened:

	$ git checkout -b practice X~1

and then try to come back to commit X afterwards:

	$ git checkout X

Git would read the cleaned contents of blobs X:F and X:G, invoke
your smudge filter once for each of these blobs, and feed the blob
contents to it.  In one of its two invocations, your smudge filter
learns via %f that it is being handed the clean contents and is
expected to smudge them for path F; the other invocation of the
same smudge filter is told that it is now being asked to smudge
for path G.

If F or G exists on the disk, surely, the smudge filter can read it,
but in this situation, because you are coming from X~1 before F and
G appeared in the history, these files are not on disk in your
working tree.

The smudge filter needs to be careful about a similar situation
where commit Y that is a descendant of X modifies F and/or G.  When
Y is checked out and you want to switch to X, the working tree may
have smudged versions of F and G from Y when your smudge filter is
called.  Or it may happen during a checkout of F or G, and one of
the things the checkout needs to do may be to remove the existing
file from the working tree, and then create a file anew (probably in
a temporary file) and move it to the final place, in which case,
your smudge filter may be called during "create a file anew" phase,
where the old file F or G may be missing from the working tree.
Even if F and G are there, they may be from commit Y and their
contents may have nothing to do with the version of the files your
smudge filter is trying to produce from the clean blob data taken
from commit X.

The note from the "git help attributes" you cited summarizes the
advice concisely.

    Note that "%f" is the name of the path that is being worked on. Depending
    on the version that is being filtered, the corresponding file on disk may
    not exist, or may have different contents. So, smudge and clean commands
    should not try to access the file on disk, but only act as filters on the
    content provided to them on standard input.

The smudge filter needs to be prepared to work in such scenarios.
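
As a rough illustration of "being prepared", here is a sketch of a smudge that
treats the working file as an optional, best-effort source. It is a variation
on the `smudge()` from the original post, not that script itself: the single
hard-coded `foo_token:` prefix and the helper `working_secrets` are
assumptions made to keep the example short.

```python
import io
import re
import tempfile
from pathlib import Path
from typing import TextIO

REDACTED = "##REDACTED##"
# Single hard-coded prefix for brevity; the original script builds this
# pattern from its --prefix arguments.
PAT = re.compile(r"(?P<prefix>\s*foo_token:\s*)(?P<secret>.*)")


def working_secrets(path: Path) -> dict[str, str]:
    """Best-effort read of secrets from the working file; {} if unreadable."""
    try:
        text = path.read_text()
    except OSError:  # missing file, directory in the way, permissions, ...
        return {}
    found = {}
    for line in text.splitlines():
        m = PAT.match(line)
        if m and m["secret"] != REDACTED:
            found[m["prefix"]] = m["secret"]
    return found


def smudge(repofile: TextIO, path: Path, out: TextIO) -> None:
    """Restore a secret when one is available; otherwise pass through."""
    secrets = working_secrets(path)
    for line in repofile:
        m = PAT.match(line)
        if m and m["secret"] == REDACTED and m["prefix"] in secrets:
            # line[m.end():] is the original line ending, if any.
            out.write(m["prefix"] + secrets[m["prefix"]] + line[m.end():])
        else:
            out.write(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        missing = Path(tmp) / "foo_secrets.yml"  # does not exist yet
        out = io.StringIO()
        smudge(io.StringIO("foo_token: ##REDACTED##\n"), missing, out)
        print(out.getvalue(), end="")  # placeholder passes through unchanged
```

This only covers the "file is missing" half of the advice. When the file is
present but came from a different commit, the sketch would still copy whatever
secret it finds there, which may or may not be what is wanted.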

Perhaps "Depending on ..." talks too much without giving readers
enough benefit.  A shorter description like this one ...

    Note that the purpose of %f is to tell the filter for what output
    path it is asked to smudge the clean blob data; %f should not be
    used for anything else.

... may be less confusing, perhaps?



* Re: Filter smudge for secret restoration: no disk access?
  2025-11-24 19:35       ` Junio C Hamano
@ 2025-11-25  7:28         ` Kache Hit
  0 siblings, 0 replies; 7+ messages in thread
From: Kache Hit @ 2025-11-25  7:28 UTC (permalink / raw)
  To: Junio C Hamano, Kache Hit; +Cc: Chris Torek, Johannes Sixt, git

On Mon Nov 24, 2025 at 11:35 AM PST, Junio C Hamano wrote:
> I do not think it is a total dogmatic prohibition, but is a
> practical piece of advice to be prepared for a situation where the
> file %f does not exist on the disk in the working tree.  Also even
> when the file %f does exist, its contents would not match (because
> it was smudged when it was checked out, and the user may have
> further modified it) what is in the tree of the commit you are
> switching out of.

You're right, it can be tricky as there are several cases to handle. I
try covering this and other cases in the script's tests.

However, isn't properly handling different scenarios a separate issue?
Simplifying my concept to "ignoring" instead of "redacting":

 * Clean: ignore certain lines, preventing them from being committed
 * Smudge: don't overwrite working copy of ignored lines on checkout

Then the functionality becomes line-wise analogous to gitignore working
on whole files. My local copy of gitignored `.env` isn't overwritten
when I checkout. I'm looking for the same, just line-wise.

On Mon Nov 24, 2025 at 11:35 AM PST, Junio C Hamano wrote:
> ... one of the things the checkout needs to do may be to remove the
> existing file from the working tree, and then create a file anew
> (probably in a temporary file) and move it to the final place, in
> which case, your smudge filter may be called during "create a file
> anew" phase, where the old file F or G may be missing from the working
> tree.

The old file being missing, being wholly removed right away, is exactly
what I'm running into. If the working copy was kept around for `smudge`,
I could achieve a basic implementation of line-wise ignore/redact.

As-is, git's clean -> smudge filters can:
 * idempotent op -> no-op, e.g. indenting or formatting
 * perfect mapping -> map back, e.g. git-lfs
 * add info -> remove info, e.g. expand RCS keyword -> unexpand

But not:
 * remove info -> restore info, e.g. ignoring lines, redacting
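
One possible shape for the "remove info -> restore info" case, sketched under
heavy assumptions: the clean filter saves what it strips into a side file
outside the working tree, and the smudge filter reads from that side file
rather than from the path named by %f. The cache layout, the `cache_file`
helper and the hard-coded `foo_token:` prefix are all hypothetical, and the
sketch assumes clean has run at least once before smudge needs the secret.

```python
import io
import json
import re
import tempfile
from pathlib import Path
from typing import TextIO

REDACTED = "##REDACTED##"
PAT = re.compile(r"(?P<prefix>\s*foo_token:\s*)(?P<secret>.*)")


def cache_file(cache_dir: Path, name: str) -> Path:
    # Hypothetical layout: one JSON file per tracked path.
    return cache_dir / (name.replace("/", "__") + ".json")


def clean(workfile: TextIO, name: str, cache_dir: Path, out: TextIO) -> None:
    """Redact secrets, remembering them in the side cache."""
    secrets = {}
    for line in workfile:
        m = PAT.match(line)
        if m and m["secret"] != REDACTED:
            secrets[m["prefix"]] = m["secret"]
            out.write(m["prefix"] + REDACTED + line[m.end():])
        else:
            out.write(line)
    if secrets:  # never overwrite the cache with nothing
        cache_dir.mkdir(parents=True, exist_ok=True)
        cache_file(cache_dir, name).write_text(json.dumps(secrets))


def smudge(repofile: TextIO, name: str, cache_dir: Path, out: TextIO) -> None:
    """Restore secrets from the side cache, if it has any for this path."""
    try:
        secrets = json.loads(cache_file(cache_dir, name).read_text())
    except FileNotFoundError:
        secrets = {}
    for line in repofile:
        m = PAT.match(line)
        if m and m["secret"] == REDACTED and m["prefix"] in secrets:
            out.write(m["prefix"] + secrets[m["prefix"]] + line[m.end():])
        else:
            out.write(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "cache"
        cleaned, restored = io.StringIO(), io.StringIO()
        clean(io.StringIO("foo_token: secret_value\n"), "bar/foo_secrets.yml",
              cache, cleaned)
        smudge(io.StringIO(cleaned.getvalue()), "bar/foo_secrets.yml",
               cache, restored)
        print(cleaned.getvalue(), restored.getvalue(), sep="", end="")
```

The trade-off is that the cache holds plaintext secrets, so it would need to
live somewhere Git never sees and with suitably restrictive permissions.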


- Kache

PS

I've just found a case I'm not yet handling: at the end of `smudge()`,
any unused secrets from the "previous working copy" that haven't been
restored into the template would be lost. It is analogous to having
local changes to a file at commit `X` and checking out `Y` where that
file has been deleted. Git avoids overwriting local changes by aborting
the checkout.
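
A small sketch of how that case could at least be detected, assuming the
previous working copy and the incoming blob are both available as text. The
function name `unused_secrets` and the hard-coded `foo_token:` prefix are
illustrative; what a filter should do with the result (warn, fail, stash the
value somewhere) is left open.

```python
import re

REDACTED = "##REDACTED##"
PAT = re.compile(r"(?P<prefix>\s*foo_token:\s*)(?P<secret>.*)")


def unused_secrets(working_text: str, incoming_text: str) -> dict[str, str]:
    """Secrets in the working copy that have no redacted slot to land in."""
    # Prefixes for which the incoming blob still has a placeholder.
    slots = set()
    for line in incoming_text.splitlines():
        m = PAT.match(line)
        if m and m["secret"] == REDACTED:
            slots.add(m["prefix"])
    # Working-copy secrets whose prefix has no such placeholder.
    lost = {}
    for line in working_text.splitlines():
        m = PAT.match(line)
        if m and m["secret"] != REDACTED and m["prefix"] not in slots:
            lost[m["prefix"]] = m["secret"]
    return lost


if __name__ == "__main__":
    working = 'foo_token: secret_value\nother: "not secret"\n'
    incoming = 'other: "not secret"\n'  # the token line was removed upstream
    print(unused_secrets(working, incoming))
```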


* Re: Filter smudge for secret restoration: no disk access?
  2025-11-24 18:40     ` Kache Hit
  2025-11-24 19:35       ` Junio C Hamano
@ 2025-11-25  8:55       ` Chris Torek
  1 sibling, 0 replies; 7+ messages in thread
From: Chris Torek @ 2025-11-25  8:55 UTC (permalink / raw)
  To: Kache Hit; +Cc: Johannes Sixt, git

On Mon, Nov 24, 2025 at 10:40 AM Kache Hit <kache.hit@gmail.com> wrote:
> I'm familiar with this practice, e.g. committing an `.env.template`
> which is used to create an `.env` file with secrets within.
>
> However, this is my dotfiles repo that includes `~/.config`. There are
> config files that store credentials right next to configuration, managed
> by software that I don't control.

My technique for this is that my dotfiles are in a repository where
they are named "profile", "bashrc", "gitconfig", and so on. These
get installed by my dotfiles-installer as $HOME/.profile, etc. The
installer (my own creation, tuned to my personal needs and not really
suitable for anyone else) builds the target files as needed.

(The thing probably needs a redesign and rewrite since newer
software messes with these files more dynamically at this point,
but I have not had to do that yet. So far I haven't needed to
do the "update repository from active files" part, which would
be harder.)

The reason for naming them without the leading dot is to
make it abundantly obvious during editing whether I'm on the
template or the actual config file.

As you've seen, there are more issues with going back in
history (to points where various files didn't exist yet). This
sidesteps most of these.

Chris

