All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: apw@canonical.com, conor@kernel.org, corbet@lwn.net,
	dwaipayanray1@gmail.com, geert+renesas@glider.be,
	gitster@pobox.com, horms@kernel.org, joe@perches.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux@leemhuis.info, lukas.bulwahn@gmail.com,
	miguel.ojeda.sandonis@gmail.com, niklas.soderlund@corigine.com,
	willy@infradead.org, workflows@vger.kernel.org, kees@kernel.org
Subject: Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
Date: Thu, 26 Dec 2024 19:55:39 -0500	[thread overview]
Message-ID: <Z237CwC_YKhoZubs@lappy> (raw)
In-Reply-To: <CAHk-=wiPNpQ=zmGvPoZRMFZ7a7mm2pzSoOsuQPdDRaSF7gbw9w@mail.gmail.com>

On Thu, Dec 26, 2024 at 03:33:36PM -0800, Linus Torvalds wrote:
>On Thu, 26 Dec 2024 at 14:33, Sasha Levin <sashal@kernel.org> wrote:
>>
>> Which means that folks should be able to use a fairly short abbreviated
>> commit IDs in messages, specially for commits with a long subject line.
>
>So I don't think we should take this as a way to make *shorter* IDs,
>just as a way to accept historical short IDs.
>
>Also, I absolutely detest how you made this be all about "short IDs".
>
>As mentioned in my very original email on this matter, the actual REAL
>AND PRESENT issue isn't ambiguous IDs. We don't really have them.

What got me worried originally is the example Kees provided which
creates a collision of a 12-character abbreviated commit ID:

$ git log 1da177e4c3f4
error: short object ID 1da177e4c3 is ambiguous
hint: The candidates are:
hint:   1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint:   1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans

When I tested it locally, my scripts were broken, our CVE scripts were
broken, and it is quite the PITA to address after the fact (think of all
the "Fixes: 1da177e4c3f4 [...]" lines we have in the log).

So sure, we don't have a collision right now, but going from 0 to 1 is
going to be painful.

Are we going to be actively watching for someone trying to sneak in a
commit like that, or should we just handle that case properly?

>What we *do* have is "wrong IDs". We have a ton of those.
>
>Look here, you can get a list of suspiciously wrong SHA1s with
>something like this:
>
>    git log |
>        egrep '[0-9a-f]{9,40} \(".*"\)' |
>        sed 's/.*[^0-9a-f]\([0-9a-f]\{9,40\}\)[^0-9a-f].*/\1/' |
>        sort -u > hexes
>
>which generates a list of things that look like commit IDs (ok,
>there's probably smarter ways) in our git logs.
>
>Now, *some* of those won't actually be commit IDs at all, they'll just
>be random hex numbers the above finds.
>
>But most of them will indeed be references to other commits.
>
>Then you try to find the bogus ones by doing something like
>
>    cat hexes |
>        while read i; do git show $i >& /dev/null || echo "No $i SHA"; done
>
>and you will get a lot ot hits.
>
>A *LOT*.

I ended up with this fun thing:

git log --pretty=format:'%B' | grep -E '^Fixes:[[:space:]][0-9a-fA-F]{8,40}' | while read -r line; do
     [[ $line =~ ^Fixes:[[:space:]]([0-9a-fA-F]{8,40})(.*) ]] && \
     ! git rev-parse --quiet --verify "${BASH_REMATCH[1]}^{commit}" >/dev/null 2>&1 && \
     { r=$(./scripts/git-disambiguate.sh --force "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"); \
     [[ -n $r ]] && echo "Bad: $line" && echo "Good: Fixes: $(git ol "$r")"; }

Which looks for these wrong IDs in "Fixes:" tags and tries to run them
through git-disambiguate.sh:

Bad: Fixes: d71e629bed5b ("ARC: build: disallow invalid PAE40 + 4K page config")
Good: Fixes: 8871331b1769 ("ARC: build: disallow invalid PAE40 + 4K page config")
Bad: Fixes: 7e654ab7da03 ("cifs: during remount, make sure passwords are in sync")
Good: Fixes: 0f0e35790295 ("cifs: during remount, make sure passwords are in sync")
Bad: Fixes: ee650b3820f3 ("apparmor: properly handle cx/px lookup failure for complain")
Good: Fixes: db93ca15e5ae ("apparmor: properly handle cx/px lookup failure for complain")
[...]

>Look, I didn't check very many of them. Mainly because it gets *so*
>many hits, and I get bored very easily.
>
>But I did check a handful, just to get a feel for things.
>
>And yes, some of them were random hex numbers unrelated to actual git
>IDs, but most were really supposed to be git IDs. Except they weren't
>- or at least not from the mainline tree.
>
>For example, look at commit daa07cbc9ae3 ("KVM: x86: fix L1TF's MMIO
>GFN calculation") which references one of those nonexistent commit
>IDs:
>
>    Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits
>in non-present/reserved SPTEs")
>
>and I have no idea where that bogus commit ID comes from. Maybe it's a
>rebase. Maybe it's from a stable tree. But it sure doesn't exist in
>mainline.

This one is indeed in the 4.18 stable tree.

>What *does* exist is commit 28a1f3ac1d0c ("kvm: x86: Set highest
>physical address bits in non-present/reserved SPTEs"), which I found
>by just doing that
>
>    git log --grep='kvm: x86: Set highest physical address bits in
>non-present/reserved SPTEs'
>
>and my point is that this is really not about "disambiguating short
>SHA1 IDs". Because those "ambiguous" SHA1's to a very close
>approximation simply DO NOT EXIST.
>
>But the completely wrong ones? They are plentiful.

Is the concern mostly around the term "disambiguate"? Would
git-sanitize.sh work better?

-- 
Thanks,
Sasha

  reply	other threads:[~2024-12-27  0:55 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-26 22:05 [PATCH] git-disambiguate: disambiguate shorthand git ids Sasha Levin
2024-12-26 22:32 ` Sasha Levin
2024-12-26 23:33   ` Linus Torvalds
2024-12-27  0:55     ` Sasha Levin [this message]
2024-12-27  1:50       ` Linus Torvalds
2024-12-27 16:19         ` Sasha Levin
2024-12-27 18:33           ` Geert Uytterhoeven
2024-12-27 19:07             ` Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z237CwC_YKhoZubs@lappy \
    --to=sashal@kernel.org \
    --cc=apw@canonical.com \
    --cc=conor@kernel.org \
    --cc=corbet@lwn.net \
    --cc=dwaipayanray1@gmail.com \
    --cc=geert+renesas@glider.be \
    --cc=gitster@pobox.com \
    --cc=horms@kernel.org \
    --cc=joe@perches.com \
    --cc=kees@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@leemhuis.info \
    --cc=lukas.bulwahn@gmail.com \
    --cc=miguel.ojeda.sandonis@gmail.com \
    --cc=niklas.soderlund@corigine.com \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    --cc=workflows@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.