From: "Michal Suchánek" <msuchanek@suse.de>
To: Jonathan McDowell <noodles@earth.li>
Cc: linux-integrity@vger.kernel.org,
Jarkko Sakkinen <jarkko@kernel.org>,
Lino Sanfilippo <l.sanfilippo@kunbus.com>
Subject: Re: TPM operation times out (very rarely)
Date: Mon, 24 Feb 2025 13:21:48 +0100
Message-ID: <Z7xkXAMy4vrUrQce@kitsune.suse.cz>
In-Reply-To: <Z7h1PYOcqK2lHvLq@earth.li>

On Fri, Feb 21, 2025 at 12:44:45PM +0000, Jonathan McDowell wrote:
> On Thu, Feb 20, 2025 at 09:42:28AM +0100, Michal Suchánek wrote:
> > On Wed, Feb 19, 2025 at 10:29:45PM +0000, Jonathan McDowell wrote:
> > > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > > Hello,
> > > >
> > > > there is a problem report that booting a specific type of system about
> > > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > > to unlock because of TPM operation timeout.
> > > >
> > > > Minimizing the test case failed so far.
> > > >
> > > > For example, booting into text mode as opposed to graphical desktop
> > > > makes the problem unreproducible.
> > > >
> > > > The test is done with a frankenkernel that has TPM drivers about on par
> > > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > > reproducible, either.
> > > >
> > > > However, given the problem takes up to a day to reproduce I do not have
> > > > much confidence in the negative results.
> > >
> > > Michal, can you possibly try the below and see if it helps out? There
> > > seems to be a timing bug introduced in 6.4+ that I think might be
> > > related, and matches up with some of our internal metrics that showed an
> > > increase in timeouts in 6.4 onwards.
> >
> > Thanks for looking into this
>
> No problem. It's something we've seen in our fleet and I've been trying
> to get to the bottom of, so having some additional data from someone
> else is really helpful.
>
> > > commit 79041fba797d0fe907e227012767f56dd93fac32
> > > Author: Jonathan McDowell <noodles@meta.com>
> > > Date: Wed Feb 19 16:20:44 2025 -0600
> > >
> > > tpm, tpm_tis: Fix timeout handling when waiting for TPM status
> > >
> > > The change to only use interrupts to handle supported status changes,
> > > then switch to polling for the rest, inverted the status test and sleep
> > > such that we can end up sleeping beyond our timeout and not actually
> > > checking the status. This can result in spurious TPM timeouts,
> > > especially on a more loaded system. Fix by switching the order back so
> > > we sleep *then* check. We've done an up-front check when we enter the
> > > function so this won't cause an additional delay when the status is
> > > already what we're looking for.
> > >
> > > Cc: stable@vger.kernel.org # v6.4+
> > > Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
> > > Signed-off-by: Jonathan McDowell <noodles@meta.com>
> > >
> > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> > > index fdef214b9f6b..167d71747666 100644
> > > --- a/drivers/char/tpm/tpm_tis_core.c
> > > +++ b/drivers/char/tpm/tpm_tis_core.c
> > > @@ -114,11 +114,11 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
> > > return 0;
> > > /* process status changes without irq support */
> > > do {
> > > + usleep_range(priv->timeout_min,
> > > + priv->timeout_max);
> >
> > What would be the priv->timeout_min and priv->timeout_max here?
> >
> > Note that there are timeouts that are 200ms, and are overblown by 2s.
> >
> > If the 200ms timeout relies on the sleep during the wait for the timeout
> > being much longer than the timeout itself then the timeout is arguably
> > bogus regardless of this change helping.
>
> Ah, I thought your major issue was the 2s timeout that was only slightly
> exceeded.
>
> However in my initial tracing I've seen wait_for_tpm_stat take much
> longer than the timeout that's passed in, which is what caused me to go
> and investigate this code path and note it had been changed in 6.4. It
> seems like a bug either way, but I've been at the TCG meeting this week
> and not had time to do further instrumentation and confirmation. Given
> you seem to have a more reliable reproducer I thought it might be easy
> enough for you to see if it made any difference.

The problem is no longer reproducible, probably due to some other change
in the test environment. So much for a reliable reproducer.

Yes, I think this is a bug either way and should be addressed, although
the effect on this problem is minor at best.

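As a rough illustration of the ordering issue the quoted commit message
describes, here is a minimal user-space sketch in C. It is not the driver
code: struct sim, status_ready(), sim_sleep() and both wait functions are
hypothetical names, and a simulated clock stands in for jiffies and
usleep_range(). It only shows why checking before sleeping can report a
timeout without ever looking at the status again after an overlong sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simulation state; none of these names exist in the kernel. */
struct sim {
	long now;      /* simulated current time, arbitrary units */
	long ready_at; /* time at which the simulated status becomes ready */
};

static bool status_ready(const struct sim *s)
{
	return s->now >= s->ready_at;
}

/* Advance the simulated clock; models a sleep that may overshoot. */
static void sim_sleep(struct sim *s, long units)
{
	assert(units > 0);
	s->now += units;
}

/*
 * Check, then sleep: if the sleep carries us past the deadline, the loop
 * exits without rechecking the status and a timeout is reported.
 */
static int wait_check_then_sleep(struct sim *s, long timeout, long sleep_units)
{
	long stop = s->now + timeout;

	if (status_ready(s))	/* up-front check on entry */
		return 0;
	do {
		if (status_ready(s))
			return 0;
		sim_sleep(s, sleep_units);
	} while (s->now < stop);
	return -1;		/* stands in for -ETIME */
}

/*
 * Sleep, then check: the status is always examined after the last sleep,
 * so a late wakeup still sees a status that became ready in the meantime.
 */
static int wait_sleep_then_check(struct sim *s, long timeout, long sleep_units)
{
	long stop = s->now + timeout;

	if (status_ready(s))	/* up-front check on entry */
		return 0;
	do {
		sim_sleep(s, sleep_units);
		if (status_ready(s))
			return 0;
	} while (s->now < stop);
	return -1;		/* stands in for -ETIME */
}
```

With a timeout of 200 units, a status that becomes ready at 150, and a
single sleep that overshoots to 300 (as might happen on a loaded system),
wait_check_then_sleep() returns -1 while wait_sleep_then_check() returns 0.
With short sleeps of 50 units both variants succeed.
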
Thanks
Michal