[PATCH] git-disambiguate: disambiguate shorthand git ids

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] git-disambiguate: disambiguate shorthand git ids
@ 2024-12-26 22:05 Sasha Levin
  2024-12-26 22:32 ` Sasha Levin
  0 siblings, 1 reply; 8+ messages in thread
From: Sasha Levin @ 2024-12-26 22:05 UTC (permalink / raw)
  To: torvalds
  Cc: apw, conor, corbet, dwaipayanray1, geert+renesas, gitster, horms,
	joe, linux-doc, linux-kernel, linux, lukas.bulwahn,
	miguel.ojeda.sandonis, niklas.soderlund, willy, workflows, kees,
	Sasha Levin

Sometimes long commit hashes can be ambiguous even when providing
several digits from its abbreviation.

Add a script that resolves such ambiguity by also considering the
commit subject line.

This also allows users to use shorter commit ID prefixes than normally
required, since we can correctly identify the intended commit using the
subject line as additional context.

In force mode (--force), you can even omit a valid commit ID prefix
entirely - the script will try to find a commit matching just the
subject line.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 scripts/git-disambiguate.sh | 199 ++++++++++++++++++++++++++++++++++++
 1 file changed, 199 insertions(+)
 create mode 100755 scripts/git-disambiguate.sh

diff --git a/scripts/git-disambiguate.sh b/scripts/git-disambiguate.sh
new file mode 100755
index 000000000000..db73671d2665
--- /dev/null
+++ b/scripts/git-disambiguate.sh
@@ -0,0 +1,199 @@
+#!/bin/bash
+
+usage() {
+	echo "Usage: $(basename "$0") [--selftest] [--force] <commit-id> [commit-subject]"
+	echo "Disambiguates a short git commit ID to its full SHA-1 hash."
+	echo ""
+	echo "Arguments:"
+	echo "  --selftest	  Run self-tests"
+	echo "  --force		 Try to find commit by subject if ID lookup fails"
+	echo "  commit-id	   Short git commit ID to disambiguate"
+	echo "  commit-subject  Optional commit subject to help disambiguate between multiple matches"
+	exit 1
+}
+
+# Convert subject with ellipsis to grep pattern
+convert_to_grep_pattern() {
+	local subject="$1"
+	# First escape ALL regex special characters
+	local escaped_subject
+	escaped_subject=$(printf '%s\n' "$subject" | sed 's/[[\.*^$()+?{}|]/\\&/g')
+	# Also escape colons, parentheses, and hyphens as they are special in our context
+	escaped_subject=$(echo "$escaped_subject" | sed 's/[:-]/\\&/g')
+	# Then convert escaped ... sequence to .*?
+	escaped_subject=$(echo "$escaped_subject" | sed 's/\\\.\\\.\\\./.*?/g')
+	echo "^${escaped_subject}$"
+}
+
+git_full_id() {
+	local force=0
+	if [ "$1" = "--force" ]; then
+		force=1
+		shift
+	fi
+
+	# Split input into commit ID and subject
+	local input="$*"
+	local commit_id="${input%% *}"
+	local subject=""
+
+	# Extract subject if present (everything after the first space)
+	if [[ "$input" == *" "* ]]; then
+		subject="${input#* }"
+		# Strip the ("...") quotes if present
+		subject="${subject#*(\"}"
+		subject="${subject%\")*}"
+	fi
+
+	# Get all possible matching commit IDs
+	local matches
+	readarray -t matches < <(git rev-parse --disambiguate="$commit_id" 2>/dev/null)
+
+	# Return immediately if we have exactly one match
+	if [ ${#matches[@]} -eq 1 ]; then
+		echo "${matches[0]}"
+		return 0
+	fi
+
+	# If no matches and not in force mode, return failure
+	if [ ${#matches[@]} -eq 0 ] && [ $force -eq 0 ]; then
+		return 1
+	fi
+
+	# If we have a subject, try to find a match with that subject
+	if [ -n "$subject" ]; then
+		# Convert subject with possible ellipsis to grep pattern
+		local grep_pattern
+		grep_pattern=$(convert_to_grep_pattern "$subject")
+
+		# In force mode with no ID matches, use git log --grep directly
+		if [ ${#matches[@]} -eq 0 ] && [ $force -eq 1 ]; then
+			# Use git log to search, but filter to ensure subject matches exactly
+			local match
+			match=$(git log --format="%H %s" --grep="$grep_pattern" --perl-regexp -10 | \
+					while read -r hash subject; do
+						if echo "$subject" | grep -qP "$grep_pattern"; then
+							echo "$hash"
+							break
+						fi
+					done)
+			if [ -n "$match" ]; then
+				echo "$match"
+				return 0
+			fi
+		else
+			# Normal subject matching for existing matches
+			for match in "${matches[@]}"; do
+				if git log -1 --format="%s" "$match" | grep -qP "$grep_pattern"; then
+					echo "$match"
+					return 0
+				fi
+			done
+		fi
+	fi
+
+	# No match found
+	return 1
+}
+
+run_selftest() {
+	local test_cases=(
+		'00250b5 ("MAINTAINERS: add new Rockchip SoC list")'
+		'0037727 ("KVM: selftests: Convert xen_shinfo_test away from VCPU_ID")'
+		'ffef737 ("net/tls: Fix skb memory leak when running kTLS traffic")'
+		'd3d7 ("cifs: Improve guard for excluding $LXDEV xattr")'
+		'dbef ("Rename .data.once to .data..once to fix resetting WARN*_ONCE")'
+		'12345678'  # Non-existent commit
+		'12345 ("I'\''m a dummy commit")'  # Valid prefix but wrong subject
+		'--force 99999999 ("net/tls: Fix skb memory leak when running kTLS traffic")'  # Force mode with non-existent ID but valid subject
+		'83be ("firmware: ... auto-update: fix poll_complete() ... errors")'  # Wildcard test
+		'--force 999999999999 ("firmware: ... auto-update: fix poll_complete() ... errors")'  # Force mode wildcard test
+	)
+
+	local expected=(
+		"00250b529313d6262bb0ebbd6bdf0a88c809f6f0"
+		"0037727b3989c3fe1929c89a9a1dfe289ad86f58"
+		"ffef737fd0372ca462b5be3e7a592a8929a82752"
+		"d3d797e326533794c3f707ce1761da7a8895458c"
+		"dbefa1f31a91670c9e7dac9b559625336206466f"
+		""  # Expect empty output for non-existent commit
+		""  # Expect empty output for wrong subject
+		"ffef737fd0372ca462b5be3e7a592a8929a82752"  # Should find commit by subject in force mode
+		"83beece5aff75879bdfc6df8ba84ea88fd93050e"  # Wildcard test
+		"83beece5aff75879bdfc6df8ba84ea88fd93050e"  # Force mode wildcard test
+	)
+
+	local expected_exit_codes=(
+		0
+		0
+		0
+		0
+		0
+		1  # Expect failure for non-existent commit
+		1  # Expect failure for wrong subject
+		0  # Should succeed in force mode
+		0  # Should succeed with wildcard
+		0  # Should succeed with force mode and wildcard
+	)
+
+	local failed=0
+
+	echo "Running self-tests..."
+	for i in "${!test_cases[@]}"; do
+		# Capture both output and exit code
+		local result
+		result=$(git_full_id ${test_cases[$i]})  # Removed quotes to allow --force to be parsed
+		local exit_code=$?
+
+		# Check both output and exit code
+		if [ "$result" != "${expected[$i]}" ] || [ $exit_code != ${expected_exit_codes[$i]} ]; then
+			echo "Test case $((i+1)) FAILED"
+			echo "Input: ${test_cases[$i]}"
+			echo "Expected output: '${expected[$i]}'"
+			echo "Got output: '$result'"
+			echo "Expected exit code: ${expected_exit_codes[$i]}"
+			echo "Got exit code: $exit_code"
+			failed=1
+		else
+			echo "Test case $((i+1)) PASSED"
+		fi
+	done
+
+	if [ $failed -eq 0 ]; then
+		echo "All tests passed!"
+		exit 0
+	else
+		echo "Some tests failed!"
+		exit 1
+	fi
+}
+
+# Check for selftest
+if [ "$1" = "--selftest" ]; then
+	run_selftest
+	exit $?
+fi
+
+# Handle --force flag
+force=""
+if [ "$1" = "--force" ]; then
+	force="--force"
+	shift
+fi
+
+# Verify arguments
+if [ $# -eq 0 ]; then
+	usage
+fi
+
+# Skip validation in force mode
+if [ -z "$force" ]; then
+	# Validate that the first argument matches at least one git commit
+	if [ $(git rev-parse --disambiguate="$1" 2>/dev/null | wc -l) -eq 0 ]; then
+		echo "Error: '$1' does not match any git commit"
+		exit 1
+	fi
+fi
+
+git_full_id $force "$@"
+exit $?
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
  2024-12-26 22:05 [PATCH] git-disambiguate: disambiguate shorthand git ids Sasha Levin
@ 2024-12-26 22:32 ` Sasha Levin
  2024-12-26 23:33   ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Sasha Levin @ 2024-12-26 22:32 UTC (permalink / raw)
  To: torvalds
  Cc: apw, conor, corbet, dwaipayanray1, geert+renesas, gitster, horms,
	joe, linux-doc, linux-kernel, linux, lukas.bulwahn,
	miguel.ojeda.sandonis, niklas.soderlund, willy, workflows, kees

On Thu, Dec 26, 2024 at 05:05:55PM -0500, Sasha Levin wrote:
>This also allows users to use shorter commit ID prefixes than normally
>required, since we can correctly identify the intended commit using the
>subject line as additional context.

With this script, and a git alias:

	git config --global alias.klog '!sh -c '"'"'last=""; next_to_last=""; args=""; for arg in "$@"; do [ -n "$next_to_last" ] && args="$args $next_to_last"; next_to_last="$last"; last="$arg"; done; hash=$(git-disambiguate.sh "$next_to_last" "$last") && git log $args "$hash"'"'"' -'

We can have (git-log compatible) git-klog find the right commit for us:

	$ git klog --oneline -1 cff8 '("net: phy: avoid ... *_led_polarity_set()")'
	cff865c70071 net: phy: avoid undefined behavior in *_led_polarity_set()

Which means that folks should be able to use a fairly short abbreviated
commit IDs in messages, specially for commits with a long subject line.

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
  2024-12-26 22:32 ` Sasha Levin
@ 2024-12-26 23:33   ` Linus Torvalds
  2024-12-27  0:55     ` Sasha Levin
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2024-12-26 23:33 UTC (permalink / raw)
  To: Sasha Levin
  Cc: apw, conor, corbet, dwaipayanray1, geert+renesas, gitster, horms,
	joe, linux-doc, linux-kernel, linux, lukas.bulwahn,
	miguel.ojeda.sandonis, niklas.soderlund, willy, workflows, kees

On Thu, 26 Dec 2024 at 14:33, Sasha Levin <sashal@kernel.org> wrote:
>
> Which means that folks should be able to use a fairly short abbreviated
> commit IDs in messages, specially for commits with a long subject line.

So I don't think we should take this as a way to make *shorter* IDs,
just as a way to accept historical short IDs.

Also, I absolutely detest how you made this be all about "short IDs".

As mentioned in my very original email on this matter, the actual REAL
AND PRESENT issue isn't ambiguous IDs. We don't really have them.

What we *do* have is "wrong IDs". We have a ton of those.

Look here, you can get a list of suspiciously wrong SHA1s with
something like this:

    git log |
        egrep '[0-9a-f]{9,40} \(".*"\)' |
        sed 's/.*[^0-9a-f]\([0-9a-f]\{9,40\}\)[^0-9a-f].*/\1/' |
        sort -u > hexes

which generates a list of things that look like commit IDs (ok,
there's probably smarter ways) in our git logs.

Now, *some* of those won't actually be commit IDs at all, they'll just
be random hex numbers the above finds.

But most of them will indeed be references to other commits.

Then you try to find the bogus ones by doing something like

    cat hexes |
        while read i; do git show $i >& /dev/null || echo "No $i SHA"; done

and you will get a lot ot hits.

A *LOT*.

Look, I didn't check very many of them. Mainly because it gets *so*
many hits, and I get bored very easily.

But I did check a handful, just to get a feel for things.

And yes, some of them were random hex numbers unrelated to actual git
IDs, but most were really supposed to be git IDs. Except they weren't
- or at least not from the mainline tree.

For example, look at commit daa07cbc9ae3 ("KVM: x86: fix L1TF's MMIO
GFN calculation") which references one of those nonexistent commit
IDs:

    Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits
in non-present/reserved SPTEs")

and I have no idea where that bogus commit ID comes from. Maybe it's a
rebase. Maybe it's from a stable tree. But it sure doesn't exist in
mainline.

What *does* exist is commit 28a1f3ac1d0c ("kvm: x86: Set highest
physical address bits in non-present/reserved SPTEs"), which I found
by just doing that

    git log --grep='kvm: x86: Set highest physical address bits in
non-present/reserved SPTEs'

and my point is that this is really not about "disambiguating short
SHA1 IDs". Because those "ambiguous" SHA1's to a very close
approximation simply DO NOT EXIST.

But the completely wrong ones? They are plentiful.

For a completely different example, we have

    ec680c1990e7 ("ide: remove BUG_ON(in_interrupt() ||
irqs_disabled()) from ide_unregister()")

which has this in the log message:

    Both BUG_ON()s in ide-probe.c were introduced in commit
       4015c949fb465 ("[PATCH] update ide core")

and it turns out that that commit ID doesn't exist in the kernel tree.

It is actually a real commit ID, and it *does* actually exist in the
historical BK import by Thomas:

     https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=4015c949fb465

so that's an example of a cross-tree ID, and yeah, it might be really
cool to "disambiguate" *those* too.

But again, the problem wasn't actually that the SHA1 was _short_.

            Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
  2024-12-26 23:33   ` Linus Torvalds
@ 2024-12-27  0:55     ` Sasha Levin
  2024-12-27  1:50       ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Sasha Levin @ 2024-12-27  0:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: apw, conor, corbet, dwaipayanray1, geert+renesas, gitster, horms,
	joe, linux-doc, linux-kernel, linux, lukas.bulwahn,
	miguel.ojeda.sandonis, niklas.soderlund, willy, workflows, kees

On Thu, Dec 26, 2024 at 03:33:36PM -0800, Linus Torvalds wrote:
>On Thu, 26 Dec 2024 at 14:33, Sasha Levin <sashal@kernel.org> wrote:
>>
>> Which means that folks should be able to use a fairly short abbreviated
>> commit IDs in messages, specially for commits with a long subject line.
>
>So I don't think we should take this as a way to make *shorter* IDs,
>just as a way to accept historical short IDs.
>
>Also, I absolutely detest how you made this be all about "short IDs".
>
>As mentioned in my very original email on this matter, the actual REAL
>AND PRESENT issue isn't ambiguous IDs. We don't really have them.

What got me worried originally is the example Kees provided which
creates a collision of a 12-character abbreviated commit ID:

$ git log 1da177e4c3f4
error: short object ID 1da177e4c3 is ambiguous
hint: The candidates are:
hint:   1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint:   1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans

When I tested it locally, my scripts were broken, our CVE scripts were
broken, and it is quite the PITA to address after the fact (think of all
the "Fixes: 1da177e4c3f4 [...]" lines we have in the log).

So sure, we don't have a collision right now, but going from 0 to 1 is
going to be painful.

Are we going to be actively watching for someone trying to sneak in a
commit like that, or should we just handle that case properly?

>What we *do* have is "wrong IDs". We have a ton of those.
>
>Look here, you can get a list of suspiciously wrong SHA1s with
>something like this:
>
>    git log |
>        egrep '[0-9a-f]{9,40} \(".*"\)' |
>        sed 's/.*[^0-9a-f]\([0-9a-f]\{9,40\}\)[^0-9a-f].*/\1/' |
>        sort -u > hexes
>
>which generates a list of things that look like commit IDs (ok,
>there's probably smarter ways) in our git logs.
>
>Now, *some* of those won't actually be commit IDs at all, they'll just
>be random hex numbers the above finds.
>
>But most of them will indeed be references to other commits.
>
>Then you try to find the bogus ones by doing something like
>
>    cat hexes |
>        while read i; do git show $i >& /dev/null || echo "No $i SHA"; done
>
>and you will get a lot ot hits.
>
>A *LOT*.

I ended up with this fun thing:

git log --pretty=format:'%B' | grep -E '^Fixes:[[:space:]][0-9a-fA-F]{8,40}' | while read -r line; do
     [[ $line =~ ^Fixes:[[:space:]]([0-9a-fA-F]{8,40})(.*) ]] && \
     ! git rev-parse --quiet --verify "${BASH_REMATCH[1]}^{commit}" >/dev/null 2>&1 && \
     { r=$(./scripts/git-disambiguate.sh --force "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"); \
     [[ -n $r ]] && echo "Bad: $line" && echo "Good: Fixes: $(git ol "$r")"; }

Which looks for these wrong IDs in "Fixes:" tags and tries to run them
through git-disambiguate.sh:

Bad: Fixes: d71e629bed5b ("ARC: build: disallow invalid PAE40 + 4K page config")
Good: Fixes: 8871331b1769 ("ARC: build: disallow invalid PAE40 + 4K page config")
Bad: Fixes: 7e654ab7da03 ("cifs: during remount, make sure passwords are in sync")
Good: Fixes: 0f0e35790295 ("cifs: during remount, make sure passwords are in sync")
Bad: Fixes: ee650b3820f3 ("apparmor: properly handle cx/px lookup failure for complain")
Good: Fixes: db93ca15e5ae ("apparmor: properly handle cx/px lookup failure for complain")
[...]

>Look, I didn't check very many of them. Mainly because it gets *so*
>many hits, and I get bored very easily.
>
>But I did check a handful, just to get a feel for things.
>
>And yes, some of them were random hex numbers unrelated to actual git
>IDs, but most were really supposed to be git IDs. Except they weren't
>- or at least not from the mainline tree.
>
>For example, look at commit daa07cbc9ae3 ("KVM: x86: fix L1TF's MMIO
>GFN calculation") which references one of those nonexistent commit
>IDs:
>
>    Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits
>in non-present/reserved SPTEs")
>
>and I have no idea where that bogus commit ID comes from. Maybe it's a
>rebase. Maybe it's from a stable tree. But it sure doesn't exist in
>mainline.

This one is indeed in the 4.18 stable tree.

>What *does* exist is commit 28a1f3ac1d0c ("kvm: x86: Set highest
>physical address bits in non-present/reserved SPTEs"), which I found
>by just doing that
>
>    git log --grep='kvm: x86: Set highest physical address bits in
>non-present/reserved SPTEs'
>
>and my point is that this is really not about "disambiguating short
>SHA1 IDs". Because those "ambiguous" SHA1's to a very close
>approximation simply DO NOT EXIST.
>
>But the completely wrong ones? They are plentiful.

Is the concern mostly around the term "disambiguate"? Would
git-sanitize.sh work better?

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
  2024-12-27  0:55     ` Sasha Levin
@ 2024-12-27  1:50       ` Linus Torvalds
  2024-12-27 16:19         ` Sasha Levin
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2024-12-27  1:50 UTC (permalink / raw)
  To: Sasha Levin
  Cc: apw, conor, corbet, dwaipayanray1, geert+renesas, gitster, horms,
	joe, linux-doc, linux-kernel, linux, lukas.bulwahn,
	miguel.ojeda.sandonis, niklas.soderlund, willy, workflows, kees

On Thu, 26 Dec 2024 at 16:55, Sasha Levin <sashal@kernel.org> wrote:
>
> What got me worried originally is the example Kees provided which
> creates a collision of a 12-character abbreviated commit ID:

Note that you can always create short ambiguous cases.

And in fact, since the way you create them is to just put noise in the
right place and brute-force things, you can also always make sure that
the one-liner short commit name will match the target commit too.

In other words: just accept the fact that a short hash is simply not a
secure hash. That's very very fundamental. It's just inherent in the
whole "we have shortened things to be legible".

And once you just accept that it's fundamental to any short hash, you
can relax and just live with it as just a fact of life.

> So sure, we don't have a collision right now, but going from 0 to 1 is
> going to be painful.

I disagree. You need to just own up to how it is, and more importantly
that you WILL NEVER EVER BE ABLE TO FIX IT.

There is no way to disambiguate the active malicious case, because the
active malicious case will just be able to handle any disambiguation
you throw at it too.

So the fix is to not *rely* on disambiguation, and to just accept it.
Don't fight reality. You'll just lose.

> Are we going to be actively watching for someone trying to sneak in a
> commit like that, or should we just handle that case properly?

See above. There is no "properly". There is only reality.

Git internally uses the full hashes. And any reasonably usefully
shortened hashes *cannot* be disambiguated in the face of an active
malicious person.

If you have some tool that you rely on absolutely to give you the "One
Correct Answer", then you are already broken. If thats' the goal, then
all you can do is give up and pray to some random $deity.

Or, as my argument is, DON'T DO THAT THEN.

Don't rely on some absolute thing. Accept the fact that a short hash
will always possibly refer to multiple things, and that you will
*always* need to have a human that actually *THINKS* about it, not
mindless automation.

The best the tools can do is say "there are multiple options here"
(or, more commonly, "I found no options at all, but maybe it meant
XYZ").

Literally.

To summarize: If you are aiming for anything else, you are bound to be
disappointed.

So aim for that simple "let me know about multiple choices", with the
knowledge that they are going to be EXCEEDINGLY rare.

And then if we have somebody who actively acts badly and works to make
that "it's going to be EXCEEDINGLY rare" be much more common than it
is supposed to be, we deal with that *person* by just not working with
them any more.

See? The solution is not some kind of "this cannot happen". The
solution is "stop accepting shit from evil people".

Which, btw, is not a new thing. This is how open source works. It's
actually way more work to create collisions in short hashes, when you
can much more easily just send a bad patch that wastes peoples time
without any hash collision at all - just by virtue of writing bad
code.

So thinking that the short hash is some kind of special problem is wrong.

It's in fact a particularly *easy* problem to deal with, because at
least the whole "somebody did something malicious" is something git
will balk at, simply because the basic git tools will refuse to touch
the ambiguous hash.

So *your* tooling should concentrate on one thing, and one thing only:
making it easier for a *THINKING*HUMAN* to decide what a bad hash
means.

But you need to do it with the up-front understanding that it requires
that thinking human, and is not mindless automation.

And as mentioned, the most common case of bad hashes BY FAR is the
"oh, that doesn't match anything I know about at all", as opposed to
"oh, that matches two or more different commits".

> >and my point is that this is really not about "disambiguating short
> >SHA1 IDs". Because those "ambiguous" SHA1's to a very close
> >approximation simply DO NOT EXIST.
> >
> >But the completely wrong ones? They are plentiful.
>
> Is the concern mostly around the term "disambiguate"? Would
> git-sanitize.sh work better?

No, my main worry is that you talk about short hashes (and talked
about shortening our existing ones even more).

When I really think that the size isn't the main problem at all.

"disambiguate" is a fine word, but only if you use it in the context
of "I have a bad hash". Not just a short one. Because the hash being
too short is simply not the main case worth worrying about.

And hey, just to really hit this same argument home: I can absolutely
guarantee that you have a much more insidious problem of "wrong hash",
namely the "oh, it's actually a valid commit hash, but the Fixes: line
simply got the wrong commit:".

Because that's a mistake I see regularly: somebody simply blames the
wrong commit. I've most definitely done that very thing myself.
Sometimes it's people reading the history wrong, and not realizing
that the bug actually happened before the blamed change too. Or the
bug ends up being a combination of factors, and people didn't realize
that the commit they blamed was just one part of it.

So my complaint really ends up being that anybody who trusts those
hashes mindlessly is going to have mistakes, and the actual hash
collision case MUST NOT be the primary worry.

Because as long as that hash size is all you care about, you're
barking up the wrong tree.

               Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
  2024-12-27  1:50       ` Linus Torvalds
@ 2024-12-27 16:19         ` Sasha Levin
  2024-12-27 18:33           ` Geert Uytterhoeven
  0 siblings, 1 reply; 8+ messages in thread
From: Sasha Levin @ 2024-12-27 16:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: apw, conor, corbet, dwaipayanray1, geert+renesas, gitster, horms,
	joe, linux-doc, linux-kernel, linux, lukas.bulwahn,
	miguel.ojeda.sandonis, niklas.soderlund, willy, workflows, kees

On Thu, Dec 26, 2024 at 05:50:22PM -0800, Linus Torvalds wrote:
>On Thu, 26 Dec 2024 at 16:55, Sasha Levin <sashal@kernel.org> wrote:
>>
>> What got me worried originally is the example Kees provided which
>> creates a collision of a 12-character abbreviated commit ID:
>
>Note that you can always create short ambiguous cases.
>
>And in fact, since the way you create them is to just put noise in the
>right place and brute-force things, you can also always make sure that
>the one-liner short commit name will match the target commit too.
>
>In other words: just accept the fact that a short hash is simply not a
>secure hash. That's very very fundamental. It's just inherent in the
>whole "we have shortened things to be legible".
>
>And once you just accept that it's fundamental to any short hash, you
>can relax and just live with it as just a fact of life.
>
>> So sure, we don't have a collision right now, but going from 0 to 1 is
>> going to be painful.
>
>I disagree. You need to just own up to how it is, and more importantly
>that you WILL NEVER EVER BE ABLE TO FIX IT.
>
>There is no way to disambiguate the active malicious case, because the
>active malicious case will just be able to handle any disambiguation
>you throw at it too.
>
>So the fix is to not *rely* on disambiguation, and to just accept it.
>Don't fight reality. You'll just lose.
>
>> Are we going to be actively watching for someone trying to sneak in a
>> commit like that, or should we just handle that case properly?
>
>See above. There is no "properly". There is only reality.
>
>Git internally uses the full hashes. And any reasonably usefully
>shortened hashes *cannot* be disambiguated in the face of an active
>malicious person.
>
>If you have some tool that you rely on absolutely to give you the "One
>Correct Answer", then you are already broken. If thats' the goal, then
>all you can do is give up and pray to some random $deity.
>
>Or, as my argument is, DON'T DO THAT THEN.
>
>Don't rely on some absolute thing. Accept the fact that a short hash
>will always possibly refer to multiple things, and that you will
>*always* need to have a human that actually *THINKS* about it, not
>mindless automation.
>
>The best the tools can do is say "there are multiple options here"
>(or, more commonly, "I found no options at all, but maybe it meant
>XYZ").
>
>Literally.
>
>To summarize: If you are aiming for anything else, you are bound to be
>disappointed.

The script in the original mail will handle two important cases for me:

1. The "wrong ID" case, where the the provided commit ID doesn't exist
and we need to look at the subject line to figure out what the actual ID
is.

2. The case where we either get a too-short commit ID that has a
collision, or we just got unlucky and finally ended up with a 12-char
collision, but the subject line is enough to automatically resolve to
the correct ID.

We're disagreeing over the remaining <0.0001% of cases.

>So aim for that simple "let me know about multiple choices", with the
>knowledge that they are going to be EXCEEDINGLY rare.
>
>And then if we have somebody who actively acts badly and works to make
>that "it's going to be EXCEEDINGLY rare" be much more common than it
>is supposed to be, we deal with that *person* by just not working with
>them any more.
>
>See? The solution is not some kind of "this cannot happen". The
>solution is "stop accepting shit from evil people".

Not necessarily "evil", but consider something similar to the UMN saga
where we get "researchers" trying to sneak in commits that create a
collision. Once these make it in, it will be difficult walking them back
out.

>Which, btw, is not a new thing. This is how open source works. It's
>actually way more work to create collisions in short hashes, when you
>can much more easily just send a bad patch that wastes peoples time
>without any hash collision at all - just by virtue of writing bad
>code.

No, but it makes for a "good" undergrad research paper...

>So thinking that the short hash is some kind of special problem is wrong.
>
>It's in fact a particularly *easy* problem to deal with, because at
>least the whole "somebody did something malicious" is something git
>will balk at, simply because the basic git tools will refuse to touch
>the ambiguous hash.
>
>So *your* tooling should concentrate on one thing, and one thing only:
>making it easier for a *THINKING*HUMAN* to decide what a bad hash
>means.
>
>But you need to do it with the up-front understanding that it requires
>that thinking human, and is not mindless automation.
>
>And as mentioned, the most common case of bad hashes BY FAR is the
>"oh, that doesn't match anything I know about at all", as opposed to
>"oh, that matches two or more different commits".
>
>> >and my point is that this is really not about "disambiguating short
>> >SHA1 IDs". Because those "ambiguous" SHA1's to a very close
>> >approximation simply DO NOT EXIST.
>> >
>> >But the completely wrong ones? They are plentiful.
>>
>> Is the concern mostly around the term "disambiguate"? Would
>> git-sanitize.sh work better?
>
>No, my main worry is that you talk about short hashes (and talked
>about shortening our existing ones even more).
>
>When I really think that the size isn't the main problem at all.

No, it's not a problem. In my mind, I figured we could use shorter
hashes in mail message to make it easier to communicate.

It doesn't have to be formal, but for example if we exchange mails about
an issue, and I end up referring to '1d1a ("arm64/sme: ...")' it both
makes the mail more readable, but still someone who doesn't have context
can still easily get to the commit I was referring to.

To me, it's a nice (minor) improvement.

>"disambiguate" is a fine word, but only if you use it in the context
>of "I have a bad hash". Not just a short one. Because the hash being
>too short is simply not the main case worth worrying about.
>
>And hey, just to really hit this same argument home: I can absolutely
>guarantee that you have a much more insidious problem of "wrong hash",
>namely the "oh, it's actually a valid commit hash, but the Fixes: line
>simply got the wrong commit:".
>
>Because that's a mistake I see regularly: somebody simply blames the
>wrong commit. I've most definitely done that very thing myself.
>Sometimes it's people reading the history wrong, and not realizing
>that the bug actually happened before the blamed change too. Or the
>bug ends up being a combination of factors, and people didn't realize
>that the commit they blamed was just one part of it.
>
>So my complaint really ends up being that anybody who trusts those
>hashes mindlessly is going to have mistakes, and the actual hash
>collision case MUST NOT be the primary worry.
>
>Because as long as that hash size is all you care about, you're
>barking up the wrong tree.
>
>               Linus

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
  2024-12-27 16:19         ` Sasha Levin
@ 2024-12-27 18:33           ` Geert Uytterhoeven
  2024-12-27 19:07             ` Sasha Levin
  0 siblings, 1 reply; 8+ messages in thread
From: Geert Uytterhoeven @ 2024-12-27 18:33 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Linus Torvalds, apw, conor, corbet, dwaipayanray1, gitster, horms,
	joe, linux-doc, linux-kernel, linux, lukas.bulwahn,
	miguel.ojeda.sandonis, niklas.soderlund, willy, workflows, kees

Hi Sasha,

On Fri, Dec 27, 2024 at 5:19 PM Sasha Levin <sashal@kernel.org> wrote:
> No, it's not a problem. In my mind, I figured we could use shorter
> hashes in mail message to make it easier to communicate.
>
> It doesn't have to be formal, but for example if we exchange mails about
> an issue, and I end up referring to '1d1a ("arm64/sme: ...")' it both
> makes the mail more readable, but still someone who doesn't have context
> can still easily get to the commit I was referring to.

Is that 1d1a commit something in your local tree? I don't seem to have it.

A few other comments:
  1. Please add support for --help. It took me a while to find out I
     need to call the script without parameters to get the help.
  2. The passed commit-subject needs to be the full commit subject.
     It would be nice to support abbreviations, too.

BTW, I am a heavy user of looking up commits (recent and old ;-)
My .vimrc has:

    noremap ;gs "zyiw:exe "new \| setlocal buftype=nofile
bufhidden=hide noswapfile syntax=git \| 0r ! git show ".@z."" \|
:0<CR><CR>

So I can move the cursor to a git commit ID, and type ";gs" to open
the commit in a throw-away split window.
Adding full support for commit-subjects may be challenging, especially
if they are split across multiple lines (i.e. not Fixes: tags, but
mentioned in the description).

Thanks!

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
  2024-12-27 18:33           ` Geert Uytterhoeven
@ 2024-12-27 19:07             ` Sasha Levin
  0 siblings, 0 replies; 8+ messages in thread
From: Sasha Levin @ 2024-12-27 19:07 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Linus Torvalds, apw, conor, corbet, dwaipayanray1, gitster, horms,
	joe, linux-doc, linux-kernel, linux, lukas.bulwahn,
	miguel.ojeda.sandonis, niklas.soderlund, willy, workflows, kees

On Fri, Dec 27, 2024 at 07:33:30PM +0100, Geert Uytterhoeven wrote:
>Hi Sasha,
>
>On Fri, Dec 27, 2024 at 5:19 PM Sasha Levin <sashal@kernel.org> wrote:
>> No, it's not a problem. In my mind, I figured we could use shorter
>> hashes in mail message to make it easier to communicate.
>>
>> It doesn't have to be formal, but for example if we exchange mails about
>> an issue, and I end up referring to '1d1a ("arm64/sme: ...")' it both
>> makes the mail more readable, but still someone who doesn't have context
>> can still easily get to the commit I was referring to.
>
>Is that 1d1a commit something in your local tree? I don't seem to have it.

Heh, yeah, I wanted to pick a random commit with a longer subject line
and ended up landing on something local...

1d1a ("Merge tag 'mlx5-updates-2018-02-23' ...") works just as well :)

>A few other comments:
>  1. Please add support for --help. It took me a while to find out I
>     need to call the script without parameters to get the help.

Ack

>  2. The passed commit-subject needs to be the full commit subject.
>     It would be nice to support abbreviations, too.

It does it with ellipsis (I picked it up from sfr).

$ ./git-disambiguate.sh 1d1ab1ae6970 '("Merge tag 'mlx5-updates-2018-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next")'
1d1ab1ae69701d813993af40cf3ee39781ec4d6f

$ ./git-disambiguate.sh 1d1a '("Merge tag...")'
1d1ab1ae69701d813993af40cf3ee39781ec4d6f

Which was what I was trying to show off in the example above :)

>BTW, I am a heavy user of looking up commits (recent and old ;-)
>My .vimrc has:
>
>    noremap ;gs "zyiw:exe "new \| setlocal buftype=nofile
>bufhidden=hide noswapfile syntax=git \| 0r ! git show ".@z."" \|
>:0<CR><CR>
>
>So I can move the cursor to a git commit ID, and type ";gs" to open
>the commit in a throw-away split window.
>Adding full support for commit-subjects may be challenging, especially
>if they are split across multiple lines (i.e. not Fixes: tags, but
>mentioned in the description).

Hyperlink git commits, eh? :)

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-12-27 19:07 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-26 22:05 [PATCH] git-disambiguate: disambiguate shorthand git ids Sasha Levin
2024-12-26 22:32 ` Sasha Levin
2024-12-26 23:33   ` Linus Torvalds
2024-12-27  0:55     ` Sasha Levin
2024-12-27  1:50       ` Linus Torvalds
2024-12-27 16:19         ` Sasha Levin
2024-12-27 18:33           ` Geert Uytterhoeven
2024-12-27 19:07             ` Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).