Re: [PATCH] git-disambiguate: disambiguate shorthand git ids

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: apw@canonical.com, conor@kernel.org, corbet@lwn.net,
	dwaipayanray1@gmail.com, geert+renesas@glider.be,
	gitster@pobox.com, horms@kernel.org, joe@perches.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux@leemhuis.info, lukas.bulwahn@gmail.com,
	miguel.ojeda.sandonis@gmail.com, niklas.soderlund@corigine.com,
	willy@infradead.org, workflows@vger.kernel.org, kees@kernel.org
Subject: Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
Date: Thu, 26 Dec 2024 19:55:39 -0500	[thread overview]
Message-ID: <Z237CwC_YKhoZubs@lappy> (raw)
In-Reply-To: <CAHk-=wiPNpQ=zmGvPoZRMFZ7a7mm2pzSoOsuQPdDRaSF7gbw9w@mail.gmail.com>

On Thu, Dec 26, 2024 at 03:33:36PM -0800, Linus Torvalds wrote:
>On Thu, 26 Dec 2024 at 14:33, Sasha Levin <sashal@kernel.org> wrote:
>>
>> Which means that folks should be able to use a fairly short abbreviated
>> commit IDs in messages, specially for commits with a long subject line.
>
>So I don't think we should take this as a way to make *shorter* IDs,
>just as a way to accept historical short IDs.
>
>Also, I absolutely detest how you made this be all about "short IDs".
>
>As mentioned in my very original email on this matter, the actual REAL
>AND PRESENT issue isn't ambiguous IDs. We don't really have them.

What got me worried originally is the example Kees provided which
creates a collision of a 12-character abbreviated commit ID:

$ git log 1da177e4c3f4
error: short object ID 1da177e4c3 is ambiguous
hint: The candidates are:
hint:   1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint:   1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans

When I tested it locally, my scripts were broken, our CVE scripts were
broken, and it is quite the PITA to address after the fact (think of all
the "Fixes: 1da177e4c3f4 [...]" lines we have in the log).

So sure, we don't have a collision right now, but going from 0 to 1 is
going to be painful.

Are we going to be actively watching for someone trying to sneak in a
commit like that, or should we just handle that case properly?

>What we *do* have is "wrong IDs". We have a ton of those.
>
>Look here, you can get a list of suspiciously wrong SHA1s with
>something like this:
>
>    git log |
>        egrep '[0-9a-f]{9,40} \(".*"\)' |
>        sed 's/.*[^0-9a-f]\([0-9a-f]\{9,40\}\)[^0-9a-f].*/\1/' |
>        sort -u > hexes
>
>which generates a list of things that look like commit IDs (ok,
>there's probably smarter ways) in our git logs.
>
>Now, *some* of those won't actually be commit IDs at all, they'll just
>be random hex numbers the above finds.
>
>But most of them will indeed be references to other commits.
>
>Then you try to find the bogus ones by doing something like
>
>    cat hexes |
>        while read i; do git show $i >& /dev/null || echo "No $i SHA"; done
>
>and you will get a lot ot hits.
>
>A *LOT*.

I ended up with this fun thing:

git log --pretty=format:'%B' | grep -E '^Fixes:[[:space:]][0-9a-fA-F]{8,40}' | while read -r line; do
     [[ $line =~ ^Fixes:[[:space:]]([0-9a-fA-F]{8,40})(.*) ]] && \
     ! git rev-parse --quiet --verify "${BASH_REMATCH[1]}^{commit}" >/dev/null 2>&1 && \
     { r=$(./scripts/git-disambiguate.sh --force "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"); \
     [[ -n $r ]] && echo "Bad: $line" && echo "Good: Fixes: $(git ol "$r")"; }

Which looks for these wrong IDs in "Fixes:" tags and tries to run them
through git-disambiguate.sh:

Bad: Fixes: d71e629bed5b ("ARC: build: disallow invalid PAE40 + 4K page config")
Good: Fixes: 8871331b1769 ("ARC: build: disallow invalid PAE40 + 4K page config")
Bad: Fixes: 7e654ab7da03 ("cifs: during remount, make sure passwords are in sync")
Good: Fixes: 0f0e35790295 ("cifs: during remount, make sure passwords are in sync")
Bad: Fixes: ee650b3820f3 ("apparmor: properly handle cx/px lookup failure for complain")
Good: Fixes: db93ca15e5ae ("apparmor: properly handle cx/px lookup failure for complain")
[...]

>Look, I didn't check very many of them. Mainly because it gets *so*
>many hits, and I get bored very easily.
>
>But I did check a handful, just to get a feel for things.
>
>And yes, some of them were random hex numbers unrelated to actual git
>IDs, but most were really supposed to be git IDs. Except they weren't
>- or at least not from the mainline tree.
>
>For example, look at commit daa07cbc9ae3 ("KVM: x86: fix L1TF's MMIO
>GFN calculation") which references one of those nonexistent commit
>IDs:
>
>    Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits
>in non-present/reserved SPTEs")
>
>and I have no idea where that bogus commit ID comes from. Maybe it's a
>rebase. Maybe it's from a stable tree. But it sure doesn't exist in
>mainline.

This one is indeed in the 4.18 stable tree.

>What *does* exist is commit 28a1f3ac1d0c ("kvm: x86: Set highest
>physical address bits in non-present/reserved SPTEs"), which I found
>by just doing that
>
>    git log --grep='kvm: x86: Set highest physical address bits in
>non-present/reserved SPTEs'
>
>and my point is that this is really not about "disambiguating short
>SHA1 IDs". Because those "ambiguous" SHA1's to a very close
>approximation simply DO NOT EXIST.
>
>But the completely wrong ones? They are plentiful.

Is the concern mostly around the term "disambiguate"? Would
git-sanitize.sh work better?

-- 
Thanks,
Sasha

next prev parent reply	other threads:[~2024-12-27  0:55 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-26 22:05 [PATCH] git-disambiguate: disambiguate shorthand git ids Sasha Levin
2024-12-26 22:32 ` Sasha Levin
2024-12-26 23:33   ` Linus Torvalds
2024-12-27  0:55     ` Sasha Levin [this message]
2024-12-27  1:50       ` Linus Torvalds
2024-12-27 16:19         ` Sasha Levin
2024-12-27 18:33           ` Geert Uytterhoeven
2024-12-27 19:07             ` Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z237CwC_YKhoZubs@lappy \
    --to=sashal@kernel.org \
    --cc=apw@canonical.com \
    --cc=conor@kernel.org \
    --cc=corbet@lwn.net \
    --cc=dwaipayanray1@gmail.com \
    --cc=geert+renesas@glider.be \
    --cc=gitster@pobox.com \
    --cc=horms@kernel.org \
    --cc=joe@perches.com \
    --cc=kees@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@leemhuis.info \
    --cc=lukas.bulwahn@gmail.com \
    --cc=miguel.ojeda.sandonis@gmail.com \
    --cc=niklas.soderlund@corigine.com \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    --cc=workflows@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).