TPM operation times out (very rarely)

linux-integrity.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* TPM operation times out (very rarely)
@ 2025-01-29 15:27 Michal Suchánek
  2025-01-29 16:02 ` Jonathan McDowell
  2025-02-19 22:29 ` Jonathan McDowell
  0 siblings, 2 replies; 38+ messages in thread
From: Michal Suchánek @ 2025-01-29 15:27 UTC (permalink / raw)
  To: linux-integrity

Hello,

there is a problem report that booting a specific type of system about
0.1% of the time encrypted volume (using a PCR to release the key) fails
to unlock because of TPM operation timeout.

Minimizing the test case failed so far.

For example, booting into text mode as opposed to graphical desktop
makes the problem unreproducible.

The test is done with a frankenkernel that has TPM drivers about on par
with Linux 6.4 but using actual Linux 6.4 the problem is not
reproducible, either.

However, given the problem takes up to a day to reproduce I do not have
much confidence in the negative results.

With some instrumentation it was determined that the problem happens
here:

/*
 * If interrupts are used (signaled by an irq set in the vendor structure)
 * tpm.c can skip polling for the data to be available as the interrupt is
 * waited for here
 */
static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
{
	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
	int rc, status, burstcnt;
	size_t count = 0;
	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);

	status = tpm_tis_status(chip);
	if ((status & TPM_STS_COMMAND_READY) == 0) {
		tpm_tis_ready(chip);
		if (wait_for_tpm_stat
		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
		     &priv->int_queue, false) < 0) {
>>>			rc = -ETIME;
			goto out_err;
		}
	}

	while (count < len - 1) {
		burstcnt = get_burstcount(chip);
		if (burstcnt < 0) {
			dev_err(&chip->dev, "Unable to read burstcount\n");
			rc = burstcnt;
			goto out_err;
		}
		burstcnt = min_t(int, burstcnt, len - count - 1);
		rc = tpm_tis_write_bytes(priv, TPM_DATA_FIFO(priv->locality),
					 burstcnt, buf + count);
		if (rc < 0)
			goto out_err;

		count += burstcnt;

		if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
					&priv->int_queue, false) < 0) {
			rc = -ETIME;
			goto out_err;
		}
		status = tpm_tis_status(chip);
		if (!itpm && (status & TPM_STS_DATA_EXPECT) == 0) {
			rc = -EIO;
			goto out_err;
		}
	}

	/* write last byte */
	rc = tpm_tis_write8(priv, TPM_DATA_FIFO(priv->locality), buf[count]);
	if (rc < 0)
		goto out_err;

	if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
				&priv->int_queue, false) < 0) {
		rc = -ETIME;
		goto out_err;
	}
	status = tpm_tis_status(chip);
	if (!itpm && (status & TPM_STS_DATA_EXPECT) != 0) {
		rc = -EIO;
		goto out_err;
	}

	rc = tpm_tis_verify_crc(priv, len, buf);
	if (rc < 0) {
		dev_err(&chip->dev, "CRC mismatch for command.\n");
		goto out_err;
	}

	return 0;

out_err:
	tpm_tis_ready(chip);
	return rc;
}

localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2236 of 2000 ms)
localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2236 of 2000 ms)

200ms is a very long time for kernel space yet minor changes to the test
setup avoid the problem.

I am wondering if this timeout is insufficient if eg. the previous
operation can take up to TPM_LONG_LONG or perhaps if some of that time
should have been consumed by the previous operation but because of
insufficient locking somewhere the kernel got to waiting on the chip to
become ready too early.

Note that these two volume unlocks are done by systemd services, and
happen in parallel as much as possible.

The chip type:

tpm_tis IFX1522:00: 2.0 TPM (device-id 0x1D, rev-id 54)

Handle 0x0001, DMI type 43, 31 bytes
TPM Device
        Vendor ID: IFX
        Specification Version: 2.0
        Firmware Revision: 15.22
        Description: SLB9672
        Characteristics:
                Family configurable via firmware update
                Family configurable via platform software support
        OEM-specific Information: 0x00000000

Any idea what could be the cause of this problem?

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-29 15:27 TPM operation times out (very rarely) Michal Suchánek
@ 2025-01-29 16:02 ` Jonathan McDowell
  2025-01-29 16:20   ` Michal Suchánek
  2025-01-30 23:31   ` Jarkko Sakkinen
  2025-02-19 22:29 ` Jonathan McDowell
  1 sibling, 2 replies; 38+ messages in thread
From: Jonathan McDowell @ 2025-01-29 16:02 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: linux-integrity

On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> there is a problem report that booting a specific type of system about
> 0.1% of the time encrypted volume (using a PCR to release the key) fails
> to unlock because of TPM operation timeout.
> 
> Minimizing the test case failed so far.
> 
> For example, booting into text mode as opposed to graphical desktop
> makes the problem unreproducible.
> 
> The test is done with a frankenkernel that has TPM drivers about on par
> with Linux 6.4 but using actual Linux 6.4 the problem is not
> reproducible, either.
> 
> However, given the problem takes up to a day to reproduce I do not have
> much confidence in the negative results.

So. We see what look like similar timeouts in our fleet, but I haven't
managed to produce a reliable test case that gives me any confidence
about what the cause is.

https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@earth.li/

for my previous post about this.

> With some instrumentation it was determined that the problem happens
> here:

> static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
> {
> 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
> 	int rc, status, burstcnt;
> 	size_t count = 0;
> 	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);
> 
> 	status = tpm_tis_status(chip);
> 	if ((status & TPM_STS_COMMAND_READY) == 0) {
> 		tpm_tis_ready(chip);
> 		if (wait_for_tpm_stat
> 		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
> 		     &priv->int_queue, false) < 0) {
> >>>			rc = -ETIME;
> 			goto out_err;
> 		}
> 	}

> localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2236 of 2000 ms)
> localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2236 of 2000 ms)

Can you track down the actual command that's taking the time? Though I
guess that's the previous command rather than the one that hits the
timeout.

> 200ms is a very long time for kernel space yet minor changes to the test
> setup avoid the problem.
> 
> I am wondering if this timeout is insufficient if eg. the previous
> operation can take up to TPM_LONG_LONG or perhaps if some of that time
> should have been consumed by the previous operation but because of
> insufficient locking somewhere the kernel got to waiting on the chip to
> become ready too early.
> 
> Note that these two volume unlocks are done by systemd services, and
> happen in parallel as much as possible.
> 
> The chip type:
> 
> tpm_tis IFX1522:00: 2.0 TPM (device-id 0x1D, rev-id 54)
> 
> Handle 0x0001, DMI type 43, 31 bytes
> TPM Device
>         Vendor ID: IFX
>         Specification Version: 2.0
>         Firmware Revision: 15.22
>         Description: SLB9672
>         Characteristics:
>                 Family configurable via firmware update
>                 Family configurable via platform software support
>         OEM-specific Information: 0x00000000
> 
> Any idea what could be the cause of this problem?

We're using SLB9670 devices, so older than yours, but at least the same
family.

J.

-- 
        The end is nearer.         |  .''`.  Debian GNU/Linux Developer
                                   | : :' :  Happy to accept PGP signed
                                   | `. `'   or encrypted mail - RSA
                                   |   `-    key on the keyservers.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-29 16:02 ` Jonathan McDowell
@ 2025-01-29 16:20   ` Michal Suchánek
  2025-01-29 17:14     ` Jonathan McDowell
  2025-01-30 23:31   ` Jarkko Sakkinen
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-01-29 16:20 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: linux-integrity

On Wed, Jan 29, 2025 at 04:02:49PM +0000, Jonathan McDowell wrote:
> On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > there is a problem report that booting a specific type of system about
> > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > to unlock because of TPM operation timeout.
> > 
> > Minimizing the test case failed so far.
> > 
> > For example, booting into text mode as opposed to graphical desktop
> > makes the problem unreproducible.
> > 
> > The test is done with a frankenkernel that has TPM drivers about on par
> > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > reproducible, either.
> > 
> > However, given the problem takes up to a day to reproduce I do not have
> > much confidence in the negative results.
> 
> So. We see what look like similar timeouts in our fleet, but I haven't
> managed to produce a reliable test case that gives me any confidence
> about what the cause is.
> 
> https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@earth.li/
> 
> for my previous post about this.

I see that's basically the same as the test patch I used:


The time it takes for the TPM to become ready can exceed timeout_b

Jan 28 07:09:21 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2236 of 2000 ms)
Jan 28 07:09:21 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2236 of 2000 ms)
---
 drivers/char/tpm/tpm_tis_core.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index fdef214b9f6b..c7a794a448af 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -432,19 +432,29 @@ static int tpm_tis_recv(struct tpm_chip *chip, u8 *buf, size_t count)
 static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
 {
 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
+	u32 ordinal = be32_to_cpu(*((__be32 *) (buf + 6)));
 	int rc, status, burstcnt;
 	size_t count = 0;
 	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);
+	unsigned long start, timed_out;
 
 	status = tpm_tis_status(chip);
 	if ((status & TPM_STS_COMMAND_READY) == 0) {
 		tpm_tis_ready(chip);
+		timed_out = 0; start = jiffies;
+retry_ready:
 		if (wait_for_tpm_stat
 		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
 		     &priv->int_queue, false) < 0) {
+			if (timed_out++ < 5) {
+				dev_err(&chip->dev, "%s: %u: ready: Timed out (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_b));
+				goto retry_ready;
+			}
 			rc = -ETIME;
 			goto out_err;
 		}
+		if (timed_out)
+			dev_err(&chip->dev, "%s: %u: ready: Took (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_b));
 	}
 
 	while (count < len - 1) {


> 
> > With some instrumentation it was determined that the problem happens
> > here:
> 
> > static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
> > {
> > 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
> > 	int rc, status, burstcnt;
> > 	size_t count = 0;
> > 	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);
> > 
> > 	status = tpm_tis_status(chip);
> > 	if ((status & TPM_STS_COMMAND_READY) == 0) {
> > 		tpm_tis_ready(chip);
> > 		if (wait_for_tpm_stat
> > 		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
> > 		     &priv->int_queue, false) < 0) {
> > >>>			rc = -ETIME;
> > 			goto out_err;
> > 		}
> > 	}
> 
> > localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2236 of 2000 ms)
> > localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2236 of 2000 ms)
> 
> Can you track down the actual command that's taking the time? Though I
> guess that's the previous command rather than the one that hits the
> timeout.

Yes, 353 is supposed to be the command but it's likely the previous one
that is causing the problem.

I suppose this could be expanded to use a static variable to also save
the last command.

Thanks

Michal

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-29 16:20   ` Michal Suchánek
@ 2025-01-29 17:14     ` Jonathan McDowell
  2025-01-29 17:25       ` Michal Suchánek
  0 siblings, 1 reply; 38+ messages in thread
From: Jonathan McDowell @ 2025-01-29 17:14 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: linux-integrity

On Wed, Jan 29, 2025 at 05:20:33PM +0100, Michal Suchánek wrote:
> On Wed, Jan 29, 2025 at 04:02:49PM +0000, Jonathan McDowell wrote:
> > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > there is a problem report that booting a specific type of system about
> > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > to unlock because of TPM operation timeout.
> > > 
> > > Minimizing the test case failed so far.
> > > 
> > > For example, booting into text mode as opposed to graphical desktop
> > > makes the problem unreproducible.
> > > 
> > > The test is done with a frankenkernel that has TPM drivers about on par
> > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > reproducible, either.
> > > 
> > > However, given the problem takes up to a day to reproduce I do not have
> > > much confidence in the negative results.
> > 
> > So. We see what look like similar timeouts in our fleet, but I haven't
> > managed to produce a reliable test case that gives me any confidence
> > about what the cause is.
> > 
> > https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@earth.li/
> > 
> > for my previous post about this.
> 
> I see that's basically the same as the test patch I used:
> 
> The time it takes for the TPM to become ready can exceed timeout_b
> 
> Jan 28 07:09:21 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2236 of 2000 ms)
> Jan 28 07:09:21 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2236 of 2000 ms)

Does it always complete immediately after the retry? Have you tried a
longer value (eg timeout_b * 2) and confirmed that makes the issue go
away?

> ---
>  drivers/char/tpm/tpm_tis_core.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index fdef214b9f6b..c7a794a448af 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -432,19 +432,29 @@ static int tpm_tis_recv(struct tpm_chip *chip, u8 *buf, size_t count)
>  static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
>  {
>  	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
> +	u32 ordinal = be32_to_cpu(*((__be32 *) (buf + 6)));
>  	int rc, status, burstcnt;
>  	size_t count = 0;
>  	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);
> +	unsigned long start, timed_out;
>  
>  	status = tpm_tis_status(chip);
>  	if ((status & TPM_STS_COMMAND_READY) == 0) {
>  		tpm_tis_ready(chip);
> +		timed_out = 0; start = jiffies;
> +retry_ready:
>  		if (wait_for_tpm_stat
>  		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
>  		     &priv->int_queue, false) < 0) {
> +			if (timed_out++ < 5) {
> +				dev_err(&chip->dev, "%s: %u: ready: Timed out (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_b));
> +				goto retry_ready;
> +			}
>  			rc = -ETIME;
>  			goto out_err;
>  		}
> +		if (timed_out)
> +			dev_err(&chip->dev, "%s: %u: ready: Took (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_b));
>  	}
>  
>  	while (count < len - 1) {
> 
> 
> > 
> > > With some instrumentation it was determined that the problem happens
> > > here:
> > 
> > > static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
> > > {
> > > 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
> > > 	int rc, status, burstcnt;
> > > 	size_t count = 0;
> > > 	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);
> > > 
> > > 	status = tpm_tis_status(chip);
> > > 	if ((status & TPM_STS_COMMAND_READY) == 0) {
> > > 		tpm_tis_ready(chip);
> > > 		if (wait_for_tpm_stat
> > > 		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
> > > 		     &priv->int_queue, false) < 0) {
> > > >>>			rc = -ETIME;
> > > 			goto out_err;
> > > 		}
> > > 	}
> > 
> > > localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2236 of 2000 ms)
> > > localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2236 of 2000 ms)
> > 
> > Can you track down the actual command that's taking the time? Though I
> > guess that's the previous command rather than the one that hits the
> > timeout.
> 
> Yes, 353 is supposed to be the command but it's likely the previous one
> that is causing the problem.

So TPM2_CC_CONTEXT_LOAD. I'm assuming you're using /dev/tpmrm0, so
that's the start of a fresh "load context, execute command, save
context" cycle. I'd expect the previous command to be the
TPM2_CC_CONTEXT_SAVE from the previous cycle.

> I suppose this could be expanded to use a static variable to also save
> the last command.

J.

-- 
] https://www.earth.li/~noodles/ [] Design a system any fool can use,  [
]  PGP/GPG Key @ the.earth.li    []  and only a fool will want to use  [
] via keyserver, web or email.   []                it.                 [
] RSA: 4096/0x94FA372B2DA8B985   []                                    [

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-29 17:14     ` Jonathan McDowell
@ 2025-01-29 17:25       ` Michal Suchánek
  0 siblings, 0 replies; 38+ messages in thread
From: Michal Suchánek @ 2025-01-29 17:25 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: linux-integrity

On Wed, Jan 29, 2025 at 05:14:34PM +0000, Jonathan McDowell wrote:
> On Wed, Jan 29, 2025 at 05:20:33PM +0100, Michal Suchánek wrote:
> > On Wed, Jan 29, 2025 at 04:02:49PM +0000, Jonathan McDowell wrote:
> > > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > > there is a problem report that booting a specific type of system about
> > > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > > to unlock because of TPM operation timeout.
> > > > 
> > > > Minimizing the test case failed so far.
> > > > 
> > > > For example, booting into text mode as opposed to graphical desktop
> > > > makes the problem unreproducible.
> > > > 
> > > > The test is done with a frankenkernel that has TPM drivers about on par
> > > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > > reproducible, either.
> > > > 
> > > > However, given the problem takes up to a day to reproduce I do not have
> > > > much confidence in the negative results.
> > > 
> > > So. We see what look like similar timeouts in our fleet, but I haven't
> > > managed to produce a reliable test case that gives me any confidence
> > > about what the cause is.
> > > 
> > > https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@earth.li/
> > > 
> > > for my previous post about this.
> > 
> > I see that's basically the same as the test patch I used:
> > 
> > The time it takes for the TPM to become ready can exceed timeout_b
> > 
> > Jan 28 07:09:21 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2236 of 2000 ms)
> > Jan 28 07:09:21 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2236 of 2000 ms)
> 
> Does it always complete immediately after the retry? Have you tried a
> longer value (eg timeout_b * 2) and confirmed that makes the issue go
> away?

No idea. I Initially put 10000 retries there and given it succeeded in
120% of the original timeout I cut the retries to 5 and promoted it to
'production ready' workaround.

I have only this one result with the patch applied available so far.

> 
> > ---
> >  drivers/char/tpm/tpm_tis_core.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> > index fdef214b9f6b..c7a794a448af 100644
> > --- a/drivers/char/tpm/tpm_tis_core.c
> > +++ b/drivers/char/tpm/tpm_tis_core.c
> > @@ -432,19 +432,29 @@ static int tpm_tis_recv(struct tpm_chip *chip, u8 *buf, size_t count)
> >  static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
> >  {
> >  	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
> > +	u32 ordinal = be32_to_cpu(*((__be32 *) (buf + 6)));
> >  	int rc, status, burstcnt;
> >  	size_t count = 0;
> >  	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);
> > +	unsigned long start, timed_out;
> >  
> >  	status = tpm_tis_status(chip);
> >  	if ((status & TPM_STS_COMMAND_READY) == 0) {
> >  		tpm_tis_ready(chip);
> > +		timed_out = 0; start = jiffies;
> > +retry_ready:
> >  		if (wait_for_tpm_stat
> >  		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
> >  		     &priv->int_queue, false) < 0) {
> > +			if (timed_out++ < 5) {
> > +				dev_err(&chip->dev, "%s: %u: ready: Timed out (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_b));
> > +				goto retry_ready;
> > +			}
> >  			rc = -ETIME;
> >  			goto out_err;
> >  		}
> > +		if (timed_out)
> > +			dev_err(&chip->dev, "%s: %u: ready: Took (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_b));
> >  	}
> >  
> >  	while (count < len - 1) {
> > 
> > 
> > > 
> > > > With some instrumentation it was determined that the problem happens
> > > > here:
> > > 
> > > > static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
> > > > {
> > > > 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
> > > > 	int rc, status, burstcnt;
> > > > 	size_t count = 0;
> > > > 	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);
> > > > 
> > > > 	status = tpm_tis_status(chip);
> > > > 	if ((status & TPM_STS_COMMAND_READY) == 0) {
> > > > 		tpm_tis_ready(chip);
> > > > 		if (wait_for_tpm_stat
> > > > 		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
> > > > 		     &priv->int_queue, false) < 0) {
> > > > >>>			rc = -ETIME;
> > > > 			goto out_err;
> > > > 		}
> > > > 	}
> > > 
> > > > localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2236 of 2000 ms)
> > > > localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2236 of 2000 ms)
> > > 
> > > Can you track down the actual command that's taking the time? Though I
> > > guess that's the previous command rather than the one that hits the
> > > timeout.
> > 
> > Yes, 353 is supposed to be the command but it's likely the previous one
> > that is causing the problem.
> 
> So TPM2_CC_CONTEXT_LOAD. I'm assuming you're using /dev/tpmrm0, so
> that's the start of a fresh "load context, execute command, save
> context" cycle. I'd expect the previous command to be the
> TPM2_CC_CONTEXT_SAVE from the previous cycle.

That would agree with your previous result of not getting the problem
when using /dev/tpm0 instead.

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-29 16:02 ` Jonathan McDowell
  2025-01-29 16:20   ` Michal Suchánek
@ 2025-01-30 23:31   ` Jarkko Sakkinen
  2025-01-31  8:35     ` Michal Suchánek
  1 sibling, 1 reply; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-01-30 23:31 UTC (permalink / raw)
  To: Jonathan McDowell, Michal Suchánek; +Cc: linux-integrity

On Wed Jan 29, 2025 at 6:02 PM EET, Jonathan McDowell wrote:
> On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > there is a problem report that booting a specific type of system about
> > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > to unlock because of TPM operation timeout.
> > 
> > Minimizing the test case failed so far.
> > 
> > For example, booting into text mode as opposed to graphical desktop
> > makes the problem unreproducible.
> > 
> > The test is done with a frankenkernel that has TPM drivers about on par
> > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > reproducible, either.
> > 
> > However, given the problem takes up to a day to reproduce I do not have
> > much confidence in the negative results.
>
> So. We see what look like similar timeouts in our fleet, but I haven't
> managed to produce a reliable test case that gives me any confidence
> about what the cause is.
>
> https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@earth.li/
>
> for my previous post about this.

Ugh, this was my first week at new job, sorry.

2000 ms is like a spec value, which can be a bad idea. Please look at
Table 18.

My guess is that GUI makes more stuff happening in the system, which
could make latencies more shaky.

The most trivial candidate would be:

	status = tpm_tis_status(chip);
	if ((status & TPM_STS_COMMAND_READY) == 0) {
		tpm_tis_ready(chip);
		if (wait_for_tpm_stat
		    (chip, TPM_STS_COMMAND_READY, TPM_TIS_TIMEOUT_MAX /* e.g. 2250 ms */,
		     &priv->int_queue, false) < 0) {
		     	rc = -ETIME;
			goto out_err;
		}
	}

On the other hand, for me tpm_tis_send_main() looked initially weird:

	for (try = 0; try < TPM_RETRY; try++) {
		rc = tpm_tis_send_data(chip, buf, len);
		if (rc >= 0)
			/* Data transfer done successfully */
			break;
		else if (rc != -EIO)
			/* Data transfer failed, not recoverable */
			return rc;
	}

I.e. no retry on -ETIME.

But I'd fixup instead tpm_common_write():

out:
	mutex_unlock(&priv->buffer_mutex);

	if (ret == -ETIME)
		return -ERESTARTSYS;

	return ret;
}

It still can be interrupted by a signal this way. Retry loop would
block too much.

Not sure if only the increase in timeout value would be enough or
should the both sites be fixed up.

[1] https://trustedcomputinggroup.org/wp-content/uploads/PC-Client-Specific-Platform-TPM-Profile-for-TPM-2p0-v1p05p_r14_pub.pdf

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-30 23:31   ` Jarkko Sakkinen
@ 2025-01-31  8:35     ` Michal Suchánek
  2025-01-31 10:25       ` Jarkko Sakkinen
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-01-31  8:35 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Jonathan McDowell, linux-integrity

Hello,

On Fri, Jan 31, 2025 at 01:31:01AM +0200, Jarkko Sakkinen wrote:
> On Wed Jan 29, 2025 at 6:02 PM EET, Jonathan McDowell wrote:
> > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > there is a problem report that booting a specific type of system about
> > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > to unlock because of TPM operation timeout.
> > > 
> > > Minimizing the test case failed so far.
> > > 
> > > For example, booting into text mode as opposed to graphical desktop
> > > makes the problem unreproducible.
> > > 
> > > The test is done with a frankenkernel that has TPM drivers about on par
> > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > reproducible, either.
> > > 
> > > However, given the problem takes up to a day to reproduce I do not have
> > > much confidence in the negative results.
> >
> > So. We see what look like similar timeouts in our fleet, but I haven't
> > managed to produce a reliable test case that gives me any confidence
> > about what the cause is.
> >
> > https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@earth.li/
> >
> > for my previous post about this.
> 
> Ugh, this was my first week at new job, sorry.
> 
> 2000 ms is like a spec value, which can be a bad idea. Please look at
> Table 18.
> 
> My guess is that GUI makes more stuff happening in the system, which
> could make latencies more shaky.
> 
> The most trivial candidate would be:
> 
> 	status = tpm_tis_status(chip);
> 	if ((status & TPM_STS_COMMAND_READY) == 0) {
> 		tpm_tis_ready(chip);
> 		if (wait_for_tpm_stat
> 		    (chip, TPM_STS_COMMAND_READY, TPM_TIS_TIMEOUT_MAX /* e.g. 2250 ms */,

2250 is more than the measured 2226 but I have no idea if that's random
or in some way deterministic.

> 		     &priv->int_queue, false) < 0) {
> 		     	rc = -ETIME;
> 			goto out_err;
> 		}
> 	}
> 
> On the other hand, for me tpm_tis_send_main() looked initially weird:
> 
> 	for (try = 0; try < TPM_RETRY; try++) {
> 		rc = tpm_tis_send_data(chip, buf, len);
> 		if (rc >= 0)
> 			/* Data transfer done successfully */
> 			break;
> 		else if (rc != -EIO)
> 			/* Data transfer failed, not recoverable */
> 			return rc;
> 	}
> 
> I.e. no retry on -ETIME.
> 
> But I'd fixup instead tpm_common_write():
> 
> out:
> 	mutex_unlock(&priv->buffer_mutex);
> 
> 	if (ret == -ETIME)
> 		return -ERESTARTSYS;
> 
> 	return ret;
> }
> 
> It still can be interrupted by a signal this way. Retry loop would
> block too much.

Not sure if this would help. As was noted in the discussion so far if
the value is consumed by the kernel it will likely not retry in the
upper layer code.

Also restarting the userspace cryptsetup service reportedly did not help
addressing the problem which suggests that the consumer is indeed the
kernel, and it marked something as defunct and gave up on getting the
key from the TPM entirely.

Also the loop can already block for up to 2s. If blocking in the loop is
a problem then it should be addressed in that loop.

Thanks

Michal

> 
> Not sure if only the increase in timeout value would be enough or
> should the both sites be fixed up.
> 
> [1] https://trustedcomputinggroup.org/wp-content/uploads/PC-Client-Specific-Platform-TPM-Profile-for-TPM-2p0-v1p05p_r14_pub.pdf
> 
> BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-31  8:35     ` Michal Suchánek
@ 2025-01-31 10:25       ` Jarkko Sakkinen
  2025-01-31 13:02         ` Michal Suchánek
  0 siblings, 1 reply; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-01-31 10:25 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jonathan McDowell, linux-integrity

On Fri Jan 31, 2025 at 10:35 AM EET, Michal Suchánek wrote:
> Hello,
>
> On Fri, Jan 31, 2025 at 01:31:01AM +0200, Jarkko Sakkinen wrote:
> > On Wed Jan 29, 2025 at 6:02 PM EET, Jonathan McDowell wrote:
> > > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > > there is a problem report that booting a specific type of system about
> > > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > > to unlock because of TPM operation timeout.
> > > > 
> > > > Minimizing the test case failed so far.
> > > > 
> > > > For example, booting into text mode as opposed to graphical desktop
> > > > makes the problem unreproducible.
> > > > 
> > > > The test is done with a frankenkernel that has TPM drivers about on par
> > > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > > reproducible, either.
> > > > 
> > > > However, given the problem takes up to a day to reproduce I do not have
> > > > much confidence in the negative results.
> > >
> > > So. We see what look like similar timeouts in our fleet, but I haven't
> > > managed to produce a reliable test case that gives me any confidence
> > > about what the cause is.
> > >
> > > https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@earth.li/
> > >
> > > for my previous post about this.
> > 
> > Ugh, this was my first week at new job, sorry.
> > 
> > 2000 ms is like a spec value, which can be a bad idea. Please look at
> > Table 18.
> > 
> > My guess is that GUI makes more stuff happening in the system, which
> > could make latencies more shaky.
> > 
> > The most trivial candidate would be:
> > 
> > 	status = tpm_tis_status(chip);
> > 	if ((status & TPM_STS_COMMAND_READY) == 0) {
> > 		tpm_tis_ready(chip);
> > 		if (wait_for_tpm_stat
> > 		    (chip, TPM_STS_COMMAND_READY, TPM_TIS_TIMEOUT_MAX /* e.g. 2250 ms */,
>
> 2250 is more than the measured 2226 but I have no idea if that's random
> or in some way deterministic.

Your text vs GUI at least gives evidence of stochasticity while not a
full-fledged proof. You can expect e.g. more IRQs happening when you
run a GUI. I did not engineer that number. You could e.g. double the
original number. The whole framework for timeout_b is ridiculous (if
it is because of me it does not change that fact).

Or perhaps we could consider even  wait_event_interruptible() inside
wait_for_tpm_stat(), since it is interruptible.

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-31 10:25       ` Jarkko Sakkinen
@ 2025-01-31 13:02         ` Michal Suchánek
  2025-01-31 17:12           ` Jarkko Sakkinen
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-01-31 13:02 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Jonathan McDowell, linux-integrity

On Fri, Jan 31, 2025 at 12:25:21PM +0200, Jarkko Sakkinen wrote:
> On Fri Jan 31, 2025 at 10:35 AM EET, Michal Suchánek wrote:
> > Hello,
> >
> > On Fri, Jan 31, 2025 at 01:31:01AM +0200, Jarkko Sakkinen wrote:
> > > On Wed Jan 29, 2025 at 6:02 PM EET, Jonathan McDowell wrote:
> > > > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > > > there is a problem report that booting a specific type of system about
> > > > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > > > to unlock because of TPM operation timeout.
> > > > > 
> > > > > Minimizing the test case failed so far.
> > > > > 
> > > > > For example, booting into text mode as opposed to graphical desktop
> > > > > makes the problem unreproducible.
> > > > > 
> > > > > The test is done with a frankenkernel that has TPM drivers about on par
> > > > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > > > reproducible, either.
> > > > > 
> > > > > However, given the problem takes up to a day to reproduce I do not have
> > > > > much confidence in the negative results.
> > > >
> > > > So. We see what look like similar timeouts in our fleet, but I haven't
> > > > managed to produce a reliable test case that gives me any confidence
> > > > about what the cause is.
> > > >
> > > > https://lore.kernel.org/linux-integrity/Zv1810ZfEBEhybmg@earth.li/
> > > >
> > > > for my previous post about this.
> > > 
> > > Ugh, this was my first week at new job, sorry.
> > > 
> > > 2000 ms is like a spec value, which can be a bad idea. Please look at
> > > Table 18.
> > > 
> > > My guess is that GUI makes more stuff happening in the system, which
> > > could make latencies more shaky.
> > > 
> > > The most trivial candidate would be:
> > > 
> > > 	status = tpm_tis_status(chip);
> > > 	if ((status & TPM_STS_COMMAND_READY) == 0) {
> > > 		tpm_tis_ready(chip);
> > > 		if (wait_for_tpm_stat
> > > 		    (chip, TPM_STS_COMMAND_READY, TPM_TIS_TIMEOUT_MAX /* e.g. 2250 ms */,
> >
> > 2250 is more than the measured 2226 but I have no idea if that's random
> > or in some way deterministic.
> 
> Your text vs GUI at least gives evidence of stochasticity while not a
> full-fledged proof. You can expect e.g. more IRQs happening when you
> run a GUI. I did not engineer that number. You could e.g. double the
> original number. The whole framework for timeout_b is ridiculous (if
> it is because of me it does not change that fact).

It looks like the timeout_b is used exclusively as the ready timeout *),
with various sources of the value depending on chip type.

Then increasing it should not cause any problem other than the kernel
waiting longer when the TPM chip is really stuck.

* There is one instance of use of timeout_b for TPM_STS_VALID in
st33zp24_pm_resume.

> Or perhaps we could consider even  wait_event_interruptible() inside
> wait_for_tpm_stat(), since it is interruptible.

It seems to be already interruptible, at least the implementation in
tpm_tis_core. There is another one in xenfront, and a few more
wait_for_stat() without _tpm_ in the middle.

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-31 13:02         ` Michal Suchánek
@ 2025-01-31 17:12           ` Jarkko Sakkinen
  2025-01-31 17:28             ` Michal Suchánek
  0 siblings, 1 reply; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-01-31 17:12 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jonathan McDowell, linux-integrity

On Fri Jan 31, 2025 at 3:02 PM EET, Michal Suchánek wrote:
> It looks like the timeout_b is used exclusively as the ready timeout *),
> with various sources of the value depending on chip type.
>
> Then increasing it should not cause any problem other than the kernel
> waiting longer when the TPM chip is really stuck.
>
> * There is one instance of use of timeout_b for TPM_STS_VALID in
> st33zp24_pm_resume.

Possible for you to give a shot for patch and try it out for a while?
I'm fine with 2x, or even 4x in this case.

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-31 17:12           ` Jarkko Sakkinen
@ 2025-01-31 17:28             ` Michal Suchánek
  2025-01-31 19:31               ` Jarkko Sakkinen
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-01-31 17:28 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Jonathan McDowell, linux-integrity

On Fri, Jan 31, 2025 at 07:12:06PM +0200, Jarkko Sakkinen wrote:
> On Fri Jan 31, 2025 at 3:02 PM EET, Michal Suchánek wrote:
> > It looks like the timeout_b is used exclusively as the ready timeout *),
> > with various sources of the value depending on chip type.
> >
> > Then increasing it should not cause any problem other than the kernel
> > waiting longer when the TPM chip is really stuck.
> >
> > * There is one instance of use of timeout_b for TPM_STS_VALID in
> > st33zp24_pm_resume.
> 
> Possible for you to give a shot for patch and try it out for a while?
> I'm fine with 2x, or even 4x in this case.

I will see what I can do. It will definitely take a while.

How would you like to multiply it?

At the sime the timeout_b is assigned, or at the time it's used?

Any specific patch that you have in mind?

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-31 17:28             ` Michal Suchánek
@ 2025-01-31 19:31               ` Jarkko Sakkinen
  2025-02-05 13:26                 ` Michal Suchánek
  0 siblings, 1 reply; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-01-31 19:31 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jonathan McDowell, linux-integrity

On Fri Jan 31, 2025 at 7:28 PM EET, Michal Suchánek wrote:
> On Fri, Jan 31, 2025 at 07:12:06PM +0200, Jarkko Sakkinen wrote:
> > On Fri Jan 31, 2025 at 3:02 PM EET, Michal Suchánek wrote:
> > > It looks like the timeout_b is used exclusively as the ready timeout *),
> > > with various sources of the value depending on chip type.
> > >
> > > Then increasing it should not cause any problem other than the kernel
> > > waiting longer when the TPM chip is really stuck.
> > >
> > > * There is one instance of use of timeout_b for TPM_STS_VALID in
> > > st33zp24_pm_resume.
> > 
> > Possible for you to give a shot for patch and try it out for a while?
> > I'm fine with 2x, or even 4x in this case.
>
> I will see what I can do. It will definitely take a while.
>
> How would you like to multiply it?
>
> At the sime the timeout_b is assigned, or at the time it's used?
>
> Any specific patch that you have in mind?

I'll think about this a bit and send a patch with RFC tag. Might take
to late next week.

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-31 19:31               ` Jarkko Sakkinen
@ 2025-02-05 13:26                 ` Michal Suchánek
  2025-02-05 13:45                   ` Michal Suchánek
  2025-02-05 14:29                   ` Jonathan McDowell
  0 siblings, 2 replies; 38+ messages in thread
From: Michal Suchánek @ 2025-02-05 13:26 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Jonathan McDowell, linux-integrity

On Fri, Jan 31, 2025 at 09:31:30PM +0200, Jarkko Sakkinen wrote:
> On Fri Jan 31, 2025 at 7:28 PM EET, Michal Suchánek wrote:
> > On Fri, Jan 31, 2025 at 07:12:06PM +0200, Jarkko Sakkinen wrote:
> > > On Fri Jan 31, 2025 at 3:02 PM EET, Michal Suchánek wrote:
> > > > It looks like the timeout_b is used exclusively as the ready timeout *),
> > > > with various sources of the value depending on chip type.
> > > >
> > > > Then increasing it should not cause any problem other than the kernel
> > > > waiting longer when the TPM chip is really stuck.
> > > >
> > > > * There is one instance of use of timeout_b for TPM_STS_VALID in
> > > > st33zp24_pm_resume.
> > > 
> > > Possible for you to give a shot for patch and try it out for a while?
> > > I'm fine with 2x, or even 4x in this case.
> >
> > I will see what I can do. It will definitely take a while.
> >
> > How would you like to multiply it?
> >
> > At the sime the timeout_b is assigned, or at the time it's used?
> >
> > Any specific patch that you have in mind?
> 
> I'll think about this a bit and send a patch with RFC tag. Might take
> to late next week.

The ready timeout is not the only one exceeded:

> Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2232 of 2000 ms)
> Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2232 of 2000 ms)
> Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2228 of 2000 ms)
> Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2228 of 2000 ms)
> Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (2224 of 200 ms)
> Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (2224 of 200 ms)
> Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2228 of 2000 ms)
> Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2228 of 2000 ms)
> Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (556 of 200 ms)
> Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (556 of 200 ms)
> Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Timed out (272 of 200 ms)
> Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Took (272 of 200 ms)
> Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2236 of 2000 ms)
> Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2236 of 2000 ms)
> Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (540 of 200 ms)
> Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (540 of 200 ms)

While the ready timeout is quite consistently exceeded by around 230ms
so far the stat timeout a few lines lower is less consistent.

Failure is observed with another chip type as well:

localhost kernel: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
22)

TPM Device
        Vendor ID: IFX
        Specification Version: 2.0
        Firmware Revision: 7.83
        Description: TPM 2.0, ManufacturerID: IFX , Firmware Version: 7.83.3358.0
        Characteristics:
                Family configurable via firmware update
                Family configurable via OEM proprietary mechanism
        OEM-specific Information: 0x00000000

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-05 13:26                 ` Michal Suchánek
@ 2025-02-05 13:45                   ` Michal Suchánek
  2025-02-05 14:29                   ` Jonathan McDowell
  1 sibling, 0 replies; 38+ messages in thread
From: Michal Suchánek @ 2025-02-05 13:45 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Jonathan McDowell, linux-integrity

On Wed, Feb 05, 2025 at 02:26:07PM +0100, Michal Suchánek wrote:
> On Fri, Jan 31, 2025 at 09:31:30PM +0200, Jarkko Sakkinen wrote:
> > On Fri Jan 31, 2025 at 7:28 PM EET, Michal Suchánek wrote:
> > > On Fri, Jan 31, 2025 at 07:12:06PM +0200, Jarkko Sakkinen wrote:
> > > > On Fri Jan 31, 2025 at 3:02 PM EET, Michal Suchánek wrote:
> > > > > It looks like the timeout_b is used exclusively as the ready timeout *),
> > > > > with various sources of the value depending on chip type.
> > > > >
> > > > > Then increasing it should not cause any problem other than the kernel
> > > > > waiting longer when the TPM chip is really stuck.
> > > > >
> > > > > * There is one instance of use of timeout_b for TPM_STS_VALID in
> > > > > st33zp24_pm_resume.
> > > > 
> > > > Possible for you to give a shot for patch and try it out for a while?
> > > > I'm fine with 2x, or even 4x in this case.
> > >
> > > I will see what I can do. It will definitely take a while.
> > >
> > > How would you like to multiply it?
> > >
> > > At the sime the timeout_b is assigned, or at the time it's used?
> > >
> > > Any specific patch that you have in mind?
> > 
> > I'll think about this a bit and send a patch with RFC tag. Might take
> > to late next week.
> 
> The ready timeout is not the only one exceeded:
> 
> > Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2232 of 2000 ms)
> > Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2232 of 2000 ms)
> > Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2228 of 2000 ms)
> > Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2228 of 2000 ms)
> > Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (2224 of 200 ms)
> > Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (2224 of 200 ms)
> > Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2228 of 2000 ms)
> > Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2228 of 2000 ms)
> > Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (556 of 200 ms)
> > Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (556 of 200 ms)
> > Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Timed out (272 of 200 ms)
> > Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Took (272 of 200 ms)
> > Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> > Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> > Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2236 of 2000 ms)
> > Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2236 of 2000 ms)
> > Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> > Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> > Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (540 of 200 ms)
> > Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (540 of 200 ms)
> 
> While the ready timeout is quite consistently exceeded by around 230ms
> so far the stat timeout a few lines lower is less consistent.
> 
> Failure is observed with another chip type as well:
> 
> localhost kernel: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
> 22)
> 
> TPM Device
>         Vendor ID: IFX
>         Specification Version: 2.0
>         Firmware Revision: 7.83
>         Description: TPM 2.0, ManufacturerID: IFX , Firmware Version: 7.83.3358.0
>         Characteristics:
>                 Family configurable via firmware update
>                 Family configurable via OEM proprietary mechanism
>         OEM-specific Information: 0x00000000

Also adding the patch for reference.

Thanks

Michal

diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index fdef214b9f6b..05ae815dd132 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -432,19 +432,29 @@ static int tpm_tis_recv(struct tpm_chip *chip, u8 *buf, size_t count)
 static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
 {
 	struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
+	u32 ordinal = be32_to_cpu(*((__be32 *) (buf + 6)));
 	int rc, status, burstcnt;
 	size_t count = 0;
 	bool itpm = test_bit(TPM_TIS_ITPM_WORKAROUND, &priv->flags);
+	unsigned long start, timed_out;
 
 	status = tpm_tis_status(chip);
 	if ((status & TPM_STS_COMMAND_READY) == 0) {
 		tpm_tis_ready(chip);
+		timed_out = 0; start = jiffies;
+retry_ready:
 		if (wait_for_tpm_stat
 		    (chip, TPM_STS_COMMAND_READY, chip->timeout_b,
 		     &priv->int_queue, false) < 0) {
+			if (timed_out++ < 2) {
+				dev_err(&chip->dev, "%s: %u: ready: Timed out (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_b));
+				goto retry_ready;
+			}
 			rc = -ETIME;
 			goto out_err;
 		}
+		if (timed_out)
+			dev_err(&chip->dev, "%s: %u: ready: Took (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_b));
 	}
 
 	while (count < len - 1) {
@@ -479,11 +489,19 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
 	if (rc < 0)
 		goto out_err;
 
+	timed_out = 0; start = jiffies;
+retry_stat:
 	if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
 				&priv->int_queue, false) < 0) {
+		if (timed_out++ < 5) {
+			dev_err(&chip->dev, "%s: %u: stat: Timed out (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_c));
+			goto retry_stat;
+		}
 		rc = -ETIME;
 		goto out_err;
 	}
+	if (timed_out)
+		dev_err(&chip->dev, "%s: %u: stat: Took (%u of %u ms)\n", __func__, ordinal, jiffies_to_msecs(jiffies - start), jiffies_to_msecs(chip->timeout_c));
 	status = tpm_tis_status(chip);
 	if (!itpm && (status & TPM_STS_DATA_EXPECT) != 0) {
 		rc = -EIO;
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-05 13:26                 ` Michal Suchánek
  2025-02-05 13:45                   ` Michal Suchánek
@ 2025-02-05 14:29                   ` Jonathan McDowell
  2025-02-05 15:29                     ` Michal Suchánek
                                       ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Jonathan McDowell @ 2025-02-05 14:29 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jarkko Sakkinen, linux-integrity

On Wed, Feb 05, 2025 at 02:26:05PM +0100, Michal Suchánek wrote:
> On Fri, Jan 31, 2025 at 09:31:30PM +0200, Jarkko Sakkinen wrote:
> > On Fri Jan 31, 2025 at 7:28 PM EET, Michal Suchánek wrote:
> > > On Fri, Jan 31, 2025 at 07:12:06PM +0200, Jarkko Sakkinen wrote:
> > > > On Fri Jan 31, 2025 at 3:02 PM EET, Michal Suchánek wrote:
> > > > > It looks like the timeout_b is used exclusively as the ready timeout *),
> > > > > with various sources of the value depending on chip type.
> > > > >
> > > > > Then increasing it should not cause any problem other than the kernel
> > > > > waiting longer when the TPM chip is really stuck.
> > > > >
> > > > > * There is one instance of use of timeout_b for TPM_STS_VALID in
> > > > > st33zp24_pm_resume.
> > > > 
> > > > Possible for you to give a shot for patch and try it out for a while?
> > > > I'm fine with 2x, or even 4x in this case.
> > >
> > > I will see what I can do. It will definitely take a while.
> > >
> > > How would you like to multiply it?
> > >
> > > At the sime the timeout_b is assigned, or at the time it's used?
> > >
> > > Any specific patch that you have in mind?
> > 
> > I'll think about this a bit and send a patch with RFC tag. Might take
> > to late next week.
> 
> The ready timeout is not the only one exceeded:
> 
> > Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2232 of 2000 ms)
> > Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2232 of 2000 ms)
> > Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2228 of 2000 ms)
> > Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2228 of 2000 ms)
> > Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (2224 of 200 ms)
> > Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (2224 of 200 ms)
> > Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2228 of 2000 ms)
> > Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2228 of 2000 ms)
> > Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (556 of 200 ms)
> > Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (556 of 200 ms)
> > Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Timed out (272 of 200 ms)
> > Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Took (272 of 200 ms)
> > Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> > Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> > Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2236 of 2000 ms)
> > Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2236 of 2000 ms)
> > Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> > Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> > Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (540 of 200 ms)
> > Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (540 of 200 ms)
> 
> While the ready timeout is quite consistently exceeded by around 230ms
> so far the stat timeout a few lines lower is less consistent.

Interesting. TPM2_CC_CONTEXT_LOAD (353) / TPM2_CC_FLUSH_CONTEXT (357) /
TPM2_CC_CONTEXT_SAVE (354) I kinda expect to maybe take a bit longer,
but TPM2_CC_GET_RANDOM (379) is a little surprising.

> Failure is observed with another chip type as well:
> 
> localhost kernel: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
> 22)
> 
> TPM Device
>         Vendor ID: IFX
>         Specification Version: 2.0
>         Firmware Revision: 7.83
>         Description: TPM 2.0, ManufacturerID: IFX , Firmware Version: 7.83.3358.0
>         Characteristics:
>                 Family configurable via firmware update
>                 Family configurable via OEM proprietary mechanism
>         OEM-specific Information: 0x00000000

That looks like an SLB9670, not running the latest firmware (7.85). I
think that might have the errata I've been trying to work around; my
current patch in testing (with logging to see how effective it is):

commit d8c680ec34e7f42f731e7d64605a670fb7b3b4d1
Author: Jonathan McDowell <noodles@meta.com>
Date:   Mon Aug 19 09:22:46 2024 -0700

    tpm: Workaround failed command reception on Infineon devices
    
    Some Infineon devices have a issue where the status register will get
    stuck with a quick REQUEST_USE / COMMAND_READY sequence. The work around
    is to retry the command submission. Add appropriate logic to do this in
    the send path.
    
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index fdef214b9f6b..561d6801e299 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -464,7 +464,12 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
 
 		if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
 					&priv->int_queue, false) < 0) {
-			rc = -ETIME;
+			if (test_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags)) {
+				dev_err(&chip->dev, "Timed out waiting for status valid in send, retrying\n");
+				rc = -EAGAIN;
+			} else {
+				rc = -ETIME;
+			}
 			goto out_err;
 		}
 		status = tpm_tis_status(chip);
@@ -481,7 +486,13 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
 
 	if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
 				&priv->int_queue, false) < 0) {
-		rc = -ETIME;
+
+		if (test_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags)) {
+			dev_err(&chip->dev, "Timed out waiting for status after send, retrying\n");
+			rc = -EAGAIN;
+		} else {
+			rc = -ETIME;
+		}
 		goto out_err;
 	}
 	status = tpm_tis_status(chip);
@@ -546,10 +557,13 @@ static int tpm_tis_send_main(struct tpm_chip *chip, const u8 *buf, size_t len)
 		if (rc >= 0)
 			/* Data transfer done successfully */
 			break;
-		else if (rc != -EIO)
+		else if (rc != -EAGAIN && rc != -EIO)
 			/* Data transfer failed, not recoverable */
 			return rc;
 	}
+	if (try)
+		dev_info(&chip->dev,
+			"Took %d attempts to send command\n", try + 1);
 
 	/* go and do it */
 	rc = tpm_tis_write8(priv, TPM_STS(priv->locality), TPM_STS_GO);
@@ -563,6 +577,8 @@ static int tpm_tis_send_main(struct tpm_chip *chip, const u8 *buf, size_t len)
 		if (wait_for_tpm_stat
 		    (chip, TPM_STS_DATA_AVAIL | TPM_STS_VALID, dur,
 		     &priv->read_queue, false) < 0) {
+			dev_err(&chip->dev,
+				"Timed out waiting for command to complete after send\n");
 			rc = -ETIME;
 			goto out_err;
 		}
@@ -1144,6 +1160,9 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
 		priv->timeout_max = TIS_TIMEOUT_MAX_ATML;
 	}
 
+	if (priv->manufacturer_id == TPM_VID_IFX)
+		set_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags);
+
 	if (is_bsw()) {
 		priv->ilb_base_addr = ioremap(INTEL_LEGACY_BLK_BASE_ADDR,
 					ILB_REMAP_SIZE);
diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h
index 690ad8e9b731..ce97b58dc005 100644
--- a/drivers/char/tpm/tpm_tis_core.h
+++ b/drivers/char/tpm/tpm_tis_core.h
@@ -89,6 +89,7 @@ enum tpm_tis_flags {
 	TPM_TIS_INVALID_STATUS		= 1,
 	TPM_TIS_DEFAULT_CANCELLATION	= 2,
 	TPM_TIS_IRQ_TESTED		= 3,
+	TPM_TIS_STATUS_WORKAROUND	= 4,
 };
 
 struct tpm_tis_data {
diff --git a/include/linux/tpm.h b/include/linux/tpm.h
index 20a40ade8030..6c3125300c00 100644
--- a/include/linux/tpm.h
+++ b/include/linux/tpm.h
@@ -335,6 +335,7 @@ enum tpm2_cc_attrs {
 #define TPM_VID_WINBOND  0x1050
 #define TPM_VID_STM      0x104A
 #define TPM_VID_ATML     0x1114
+#define TPM_VID_IFX      0x15D1
 
 enum tpm_chip_flags {
 	TPM_CHIP_FLAG_BOOTSTRAPPED		= BIT(0),


-- 
Web [  "A true friend knows who you are...but likes you anyway."   ]
site: https:// [                                          ]      Made by
www.earth.li/~noodles/  [                      ]         HuggieTag 0.0.24

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-05 14:29                   ` Jonathan McDowell
@ 2025-02-05 15:29                     ` Michal Suchánek
  2025-02-06 20:35                     ` Jarkko Sakkinen
  2025-03-27 12:57                     ` Michal Suchánek
  2 siblings, 0 replies; 38+ messages in thread
From: Michal Suchánek @ 2025-02-05 15:29 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: Jarkko Sakkinen, linux-integrity

On Wed, Feb 05, 2025 at 02:29:36PM +0000, Jonathan McDowell wrote:
> On Wed, Feb 05, 2025 at 02:26:05PM +0100, Michal Suchánek wrote:
> > On Fri, Jan 31, 2025 at 09:31:30PM +0200, Jarkko Sakkinen wrote:
> > > On Fri Jan 31, 2025 at 7:28 PM EET, Michal Suchánek wrote:
> > > > On Fri, Jan 31, 2025 at 07:12:06PM +0200, Jarkko Sakkinen wrote:
> > > > > On Fri Jan 31, 2025 at 3:02 PM EET, Michal Suchánek wrote:
> > > > > > It looks like the timeout_b is used exclusively as the ready timeout *),
> > > > > > with various sources of the value depending on chip type.
> > > > > >
> > > > > > Then increasing it should not cause any problem other than the kernel
> > > > > > waiting longer when the TPM chip is really stuck.
> > > > > >
> > > > > > * There is one instance of use of timeout_b for TPM_STS_VALID in
> > > > > > st33zp24_pm_resume.
> > > > > 
> > > > > Possible for you to give a shot for patch and try it out for a while?
> > > > > I'm fine with 2x, or even 4x in this case.
> > > >
> > > > I will see what I can do. It will definitely take a while.
> > > >
> > > > How would you like to multiply it?
> > > >
> > > > At the sime the timeout_b is assigned, or at the time it's used?
> > > >
> > > > Any specific patch that you have in mind?
> > > 
> > > I'll think about this a bit and send a patch with RFC tag. Might take
> > > to late next week.
> > 
> > The ready timeout is not the only one exceeded:
> > 
> > > Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2232 of 2000 ms)
> > > Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2232 of 2000 ms)
> > > Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2228 of 2000 ms)
> > > Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2228 of 2000 ms)
> > > Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > > Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > > Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (2224 of 200 ms)
> > > Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (2224 of 200 ms)
> > > Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2228 of 2000 ms)
> > > Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2228 of 2000 ms)
> > > Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (556 of 200 ms)
> > > Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (556 of 200 ms)
> > > Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > > Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > > Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Timed out (272 of 200 ms)
> > > Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Took (272 of 200 ms)
> > > Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> > > Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> > > Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2236 of 2000 ms)
> > > Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2236 of 2000 ms)
> > > Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > > Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > > Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> > > Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> > > Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (540 of 200 ms)
> > > Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (540 of 200 ms)
> > 
> > While the ready timeout is quite consistently exceeded by around 230ms
> > so far the stat timeout a few lines lower is less consistent.
> 
> Interesting. TPM2_CC_CONTEXT_LOAD (353) / TPM2_CC_FLUSH_CONTEXT (357) /
> TPM2_CC_CONTEXT_SAVE (354) I kinda expect to maybe take a bit longer,
> but TPM2_CC_GET_RANDOM (379) is a little surprising.

But it only takes extra 72ms, the context functions take about extra
350ms.

> 
> > Failure is observed with another chip type as well:
> > 
> > localhost kernel: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
> > 22)
> > 
> > TPM Device
> >         Vendor ID: IFX
> >         Specification Version: 2.0
> >         Firmware Revision: 7.83
> >         Description: TPM 2.0, ManufacturerID: IFX , Firmware Version: 7.83.3358.0
> >         Characteristics:
> >                 Family configurable via firmware update
> >                 Family configurable via OEM proprietary mechanism
> >         OEM-specific Information: 0x00000000
> 
> That looks like an SLB9670, not running the latest firmware (7.85). I
> think that might have the errata I've been trying to work around; my
> current patch in testing (with logging to see how effective it is):

I don't have any result with timeout waiting for the valid state but all
the results with the debug prints are from the newer chip revision.

Thanks

Michal

> 
> commit d8c680ec34e7f42f731e7d64605a670fb7b3b4d1
> Author: Jonathan McDowell <noodles@meta.com>
> Date:   Mon Aug 19 09:22:46 2024 -0700
> 
>     tpm: Workaround failed command reception on Infineon devices
>     
>     Some Infineon devices have a issue where the status register will get
>     stuck with a quick REQUEST_USE / COMMAND_READY sequence. The work around
>     is to retry the command submission. Add appropriate logic to do this in
>     the send path.
>     
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index fdef214b9f6b..561d6801e299 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -464,7 +464,12 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
>  
>  		if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
>  					&priv->int_queue, false) < 0) {
> -			rc = -ETIME;
> +			if (test_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags)) {
> +				dev_err(&chip->dev, "Timed out waiting for status valid in send, retrying\n");
> +				rc = -EAGAIN;
> +			} else {
> +				rc = -ETIME;
> +			}
>  			goto out_err;
>  		}
>  		status = tpm_tis_status(chip);
> @@ -481,7 +486,13 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
>  
>  	if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
>  				&priv->int_queue, false) < 0) {
> -		rc = -ETIME;
> +
> +		if (test_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags)) {
> +			dev_err(&chip->dev, "Timed out waiting for status after send, retrying\n");
> +			rc = -EAGAIN;
> +		} else {
> +			rc = -ETIME;
> +		}
>  		goto out_err;
>  	}
>  	status = tpm_tis_status(chip);
> @@ -546,10 +557,13 @@ static int tpm_tis_send_main(struct tpm_chip *chip, const u8 *buf, size_t len)
>  		if (rc >= 0)
>  			/* Data transfer done successfully */
>  			break;
> -		else if (rc != -EIO)
> +		else if (rc != -EAGAIN && rc != -EIO)
>  			/* Data transfer failed, not recoverable */
>  			return rc;
>  	}
> +	if (try)
> +		dev_info(&chip->dev,
> +			"Took %d attempts to send command\n", try + 1);
>  
>  	/* go and do it */
>  	rc = tpm_tis_write8(priv, TPM_STS(priv->locality), TPM_STS_GO);
> @@ -563,6 +577,8 @@ static int tpm_tis_send_main(struct tpm_chip *chip, const u8 *buf, size_t len)
>  		if (wait_for_tpm_stat
>  		    (chip, TPM_STS_DATA_AVAIL | TPM_STS_VALID, dur,
>  		     &priv->read_queue, false) < 0) {
> +			dev_err(&chip->dev,
> +				"Timed out waiting for command to complete after send\n");
>  			rc = -ETIME;
>  			goto out_err;
>  		}
> @@ -1144,6 +1160,9 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq,
>  		priv->timeout_max = TIS_TIMEOUT_MAX_ATML;
>  	}
>  
> +	if (priv->manufacturer_id == TPM_VID_IFX)
> +		set_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags);
> +
>  	if (is_bsw()) {
>  		priv->ilb_base_addr = ioremap(INTEL_LEGACY_BLK_BASE_ADDR,
>  					ILB_REMAP_SIZE);
> diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h
> index 690ad8e9b731..ce97b58dc005 100644
> --- a/drivers/char/tpm/tpm_tis_core.h
> +++ b/drivers/char/tpm/tpm_tis_core.h
> @@ -89,6 +89,7 @@ enum tpm_tis_flags {
>  	TPM_TIS_INVALID_STATUS		= 1,
>  	TPM_TIS_DEFAULT_CANCELLATION	= 2,
>  	TPM_TIS_IRQ_TESTED		= 3,
> +	TPM_TIS_STATUS_WORKAROUND	= 4,
>  };
>  
>  struct tpm_tis_data {
> diff --git a/include/linux/tpm.h b/include/linux/tpm.h
> index 20a40ade8030..6c3125300c00 100644
> --- a/include/linux/tpm.h
> +++ b/include/linux/tpm.h
> @@ -335,6 +335,7 @@ enum tpm2_cc_attrs {
>  #define TPM_VID_WINBOND  0x1050
>  #define TPM_VID_STM      0x104A
>  #define TPM_VID_ATML     0x1114
> +#define TPM_VID_IFX      0x15D1
>  
>  enum tpm_chip_flags {
>  	TPM_CHIP_FLAG_BOOTSTRAPPED		= BIT(0),
> 
> 
> -- 
> Web [  "A true friend knows who you are...but likes you anyway."   ]
> site: https:// [                                          ]      Made by
> www.earth.li/~noodles/  [                      ]         HuggieTag 0.0.24

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-05 14:29                   ` Jonathan McDowell
  2025-02-05 15:29                     ` Michal Suchánek
@ 2025-02-06 20:35                     ` Jarkko Sakkinen
  2025-02-07  9:26                       ` Jonathan McDowell
  2025-03-27 12:57                     ` Michal Suchánek
  2 siblings, 1 reply; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-02-06 20:35 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: Michal Suchánek, linux-integrity

On Wed, Feb 05, 2025 at 02:29:36PM +0000, Jonathan McDowell wrote:
> Interesting. TPM2_CC_CONTEXT_LOAD (353) / TPM2_CC_FLUSH_CONTEXT (357) /
> TPM2_CC_CONTEXT_SAVE (354) I kinda expect to maybe take a bit longer,
> but TPM2_CC_GET_RANDOM (379) is a little surprising.

The whole arithmetic with timeout_a/b/c is mostly gibberish and could
be replaced with a single "max" constant without issues (just set it
large enough).

They could be all be replaced with let's say 3s timeout in a constant.

> > Failure is observed with another chip type as well:
> > 
> > localhost kernel: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
> > 22)
> > 
> > TPM Device
> >         Vendor ID: IFX
> >         Specification Version: 2.0
> >         Firmware Revision: 7.83
> >         Description: TPM 2.0, ManufacturerID: IFX , Firmware Version: 7.83.3358.0
> >         Characteristics:
> >                 Family configurable via firmware update
> >                 Family configurable via OEM proprietary mechanism
> >         OEM-specific Information: 0x00000000
> 
> That looks like an SLB9670, not running the latest firmware (7.85). I
> think that might have the errata I've been trying to work around; my
> current patch in testing (with logging to see how effective it is):
> 
> commit d8c680ec34e7f42f731e7d64605a670fb7b3b4d1
> Author: Jonathan McDowell <noodles@meta.com>
> Date:   Mon Aug 19 09:22:46 2024 -0700
> 
>     tpm: Workaround failed command reception on Infineon devices
>     
>     Some Infineon devices have a issue where the status register will get
>     stuck with a quick REQUEST_USE / COMMAND_READY sequence. The work around
>     is to retry the command submission. Add appropriate logic to do this in
>     the send path.
>     
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index fdef214b9f6b..561d6801e299 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -464,7 +464,12 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
>  
>  		if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
>  					&priv->int_queue, false) < 0) {
> -			rc = -ETIME;
> +			if (test_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags)) {
> +				dev_err(&chip->dev, "Timed out waiting for status valid in send, retrying\n");
> +				rc = -EAGAIN;


I'm not sure why wait_for_tpm_stat() return value is ignored but it
should not be like that. E.g. it can return -ERESTARTSYS. Probably
would be better to check all the call sites for it that they do
same thing.

I.e. rc = wait_for_tpm_stat(...);

/* ... */

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-06 20:35                     ` Jarkko Sakkinen
@ 2025-02-07  9:26                       ` Jonathan McDowell
  2025-02-07  9:40                         ` Michal Suchánek
  2025-02-08 20:29                         ` Jarkko Sakkinen
  0 siblings, 2 replies; 38+ messages in thread
From: Jonathan McDowell @ 2025-02-07  9:26 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Michal Suchánek, linux-integrity

On Thu, Feb 06, 2025 at 10:35:32PM +0200, Jarkko Sakkinen wrote:
> On Wed, Feb 05, 2025 at 02:29:36PM +0000, Jonathan McDowell wrote:
> > Interesting. TPM2_CC_CONTEXT_LOAD (353) / TPM2_CC_FLUSH_CONTEXT (357) /
> > TPM2_CC_CONTEXT_SAVE (354) I kinda expect to maybe take a bit longer,
> > but TPM2_CC_GET_RANDOM (379) is a little surprising.
> 
> The whole arithmetic with timeout_a/b/c is mostly gibberish and could
> be replaced with a single "max" constant without issues (just set it
> large enough).
> 
> They could be all be replaced with let's say 3s timeout in a constant.

This appears to have come up before:

https://lore.kernel.org/linux-integrity/358e89ed2b766d51b5f57abf31ab7a925ac63379.1552348123.git.calvinowens@fb.com/

That patch was deemed overly complex and it was suggested to split it
up; I can't find any indication that was ever done which I guess is why
the discussion died off.

> > > Failure is observed with another chip type as well:
> > > 
> > > localhost kernel: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
> > > 22)
> > > 
> > > TPM Device
> > >         Vendor ID: IFX
> > >         Specification Version: 2.0
> > >         Firmware Revision: 7.83
> > >         Description: TPM 2.0, ManufacturerID: IFX , Firmware Version: 7.83.3358.0
> > >         Characteristics:
> > >                 Family configurable via firmware update
> > >                 Family configurable via OEM proprietary mechanism
> > >         OEM-specific Information: 0x00000000
> > 
> > That looks like an SLB9670, not running the latest firmware (7.85). I
> > think that might have the errata I've been trying to work around; my
> > current patch in testing (with logging to see how effective it is):
> > 
> > commit d8c680ec34e7f42f731e7d64605a670fb7b3b4d1
> > Author: Jonathan McDowell <noodles@meta.com>
> > Date:   Mon Aug 19 09:22:46 2024 -0700
> > 
> >     tpm: Workaround failed command reception on Infineon devices
> >     
> >     Some Infineon devices have a issue where the status register will get
> >     stuck with a quick REQUEST_USE / COMMAND_READY sequence. The work around
> >     is to retry the command submission. Add appropriate logic to do this in
> >     the send path.
> >     
> > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> > index fdef214b9f6b..561d6801e299 100644
> > --- a/drivers/char/tpm/tpm_tis_core.c
> > +++ b/drivers/char/tpm/tpm_tis_core.c
> > @@ -464,7 +464,12 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
> >  
> >  		if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
> >  					&priv->int_queue, false) < 0) {
> > -			rc = -ETIME;
> > +			if (test_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags)) {
> > +				dev_err(&chip->dev, "Timed out waiting for status valid in send, retrying\n");
> > +				rc = -EAGAIN;
> 
> 
> I'm not sure why wait_for_tpm_stat() return value is ignored but it
> should not be like that. E.g. it can return -ERESTARTSYS. Probably
> would be better to check all the call sites for it that they do
> same thing.
> 
> I.e. rc = wait_for_tpm_stat(...);
> 
> /* ... */

So just to clarify, this more recent patch is working around a situation
where the status register gets stuck and needs a complete retry of the
command send - it's an Infineon errata, not something that would be
fixed with a longer timeout.

We see what looks like Michal's issue with timeouts as well, I just
haven't made the step of instrumenting it all the way he has.

J.

-- 
  I have found the monster - the   |  .''`.  Debian GNU/Linux Developer
          monster is us.           | : :' :  Happy to accept PGP signed
                                   | `. `'   or encrypted mail - RSA
                                   |   `-    key on the keyservers.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-07  9:26                       ` Jonathan McDowell
@ 2025-02-07  9:40                         ` Michal Suchánek
  2025-02-07  9:47                           ` Jonathan McDowell
  2025-02-08 20:29                         ` Jarkko Sakkinen
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-02-07  9:40 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: Jarkko Sakkinen, linux-integrity

On Fri, Feb 07, 2025 at 09:26:16AM +0000, Jonathan McDowell wrote:
> On Thu, Feb 06, 2025 at 10:35:32PM +0200, Jarkko Sakkinen wrote:
> > On Wed, Feb 05, 2025 at 02:29:36PM +0000, Jonathan McDowell wrote:
> > > Interesting. TPM2_CC_CONTEXT_LOAD (353) / TPM2_CC_FLUSH_CONTEXT (357) /
> > > TPM2_CC_CONTEXT_SAVE (354) I kinda expect to maybe take a bit longer,
> > > but TPM2_CC_GET_RANDOM (379) is a little surprising.
> > 
> > The whole arithmetic with timeout_a/b/c is mostly gibberish and could
> > be replaced with a single "max" constant without issues (just set it
> > large enough).
> > 
> > They could be all be replaced with let's say 3s timeout in a constant.
> 
> This appears to have come up before:
> 
> https://lore.kernel.org/linux-integrity/358e89ed2b766d51b5f57abf31ab7a925ac63379.1552348123.git.calvinowens@fb.com/
> 
> That patch was deemed overly complex and it was suggested to split it
> up; I can't find any indication that was ever done which I guess is why
> the discussion died off.

Also it comes before TPM_LONG_LONG, it sounds like that would still be
special.

> 
> > > > Failure is observed with another chip type as well:
> > > > 
> > > > localhost kernel: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
> > > > 22)
> > > > 
> > > > TPM Device
> > > >         Vendor ID: IFX
> > > >         Specification Version: 2.0
> > > >         Firmware Revision: 7.83
> > > >         Description: TPM 2.0, ManufacturerID: IFX , Firmware Version: 7.83.3358.0
> > > >         Characteristics:
> > > >                 Family configurable via firmware update
> > > >                 Family configurable via OEM proprietary mechanism
> > > >         OEM-specific Information: 0x00000000
> > > 
> > > That looks like an SLB9670, not running the latest firmware (7.85). I
> > > think that might have the errata I've been trying to work around; my
> > > current patch in testing (with logging to see how effective it is):
> > > 
> > > commit d8c680ec34e7f42f731e7d64605a670fb7b3b4d1
> > > Author: Jonathan McDowell <noodles@meta.com>
> > > Date:   Mon Aug 19 09:22:46 2024 -0700
> > > 
> > >     tpm: Workaround failed command reception on Infineon devices
> > >     
> > >     Some Infineon devices have a issue where the status register will get
> > >     stuck with a quick REQUEST_USE / COMMAND_READY sequence. The work around
> > >     is to retry the command submission. Add appropriate logic to do this in
> > >     the send path.
> > >     
> > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> > > index fdef214b9f6b..561d6801e299 100644
> > > --- a/drivers/char/tpm/tpm_tis_core.c
> > > +++ b/drivers/char/tpm/tpm_tis_core.c
> > > @@ -464,7 +464,12 @@ static int tpm_tis_send_data(struct tpm_chip *chip, const u8 *buf, size_t len)
> > >  
> > >  		if (wait_for_tpm_stat(chip, TPM_STS_VALID, chip->timeout_c,
> > >  					&priv->int_queue, false) < 0) {
> > > -			rc = -ETIME;
> > > +			if (test_bit(TPM_TIS_STATUS_WORKAROUND, &priv->flags)) {
> > > +				dev_err(&chip->dev, "Timed out waiting for status valid in send, retrying\n");
> > > +				rc = -EAGAIN;
> > 
> > 
> > I'm not sure why wait_for_tpm_stat() return value is ignored but it
> > should not be like that. E.g. it can return -ERESTARTSYS. Probably
> > would be better to check all the call sites for it that they do
> > same thing.
> > 
> > I.e. rc = wait_for_tpm_stat(...);
> > 
> > /* ... */
> 
> So just to clarify, this more recent patch is working around a situation
> where the status register gets stuck and needs a complete retry of the
> command send - it's an Infineon errata, not something that would be
> fixed with a longer timeout.
> 
> We see what looks like Michal's issue with timeouts as well, I just
> haven't made the step of instrumenting it all the way he has.

And I haven't seen the issue that needs re-sending the command so far.

Maybe it happens even less frequently than the excessive processing
time.

Fully restarting the syscall would fix both issues but manual restart of
the userspace task reportedly did not work so I have my doubts that
this method with returning from the syscall would be effective.

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-07  9:40                         ` Michal Suchánek
@ 2025-02-07  9:47                           ` Jonathan McDowell
  2025-02-07  9:58                             ` Michal Suchánek
  0 siblings, 1 reply; 38+ messages in thread
From: Jonathan McDowell @ 2025-02-07  9:47 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jarkko Sakkinen, linux-integrity

On Fri, Feb 07, 2025 at 10:40:16AM +0100, Michal Suchánek wrote:
> On Fri, Feb 07, 2025 at 09:26:16AM +0000, Jonathan McDowell wrote:
> > So just to clarify, this more recent patch is working around a situation
> > where the status register gets stuck and needs a complete retry of the
> > command send - it's an Infineon errata, not something that would be
> > fixed with a longer timeout.
> > 
> > We see what looks like Michal's issue with timeouts as well, I just
> > haven't made the step of instrumenting it all the way he has.
> 
> And I haven't seen the issue that needs re-sending the command so far.

Your SLB9672 is on the latest firmware, which I believe fixes the
problem.

> Maybe it happens even less frequently than the excessive processing
> time.
> 
> Fully restarting the syscall would fix both issues but manual restart of
> the userspace task reportedly did not work so I have my doubts that
> this method with returning from the syscall would be effective.

Hmmm. I wonder if e3aaebcbb7c6b403416f442d1de70d437ce313a7 (tpm: Clean
up TPM space after command failure) would help the userspace restart
work. We saw problems with the timeouts affecting the kernel doing the
cleanup of the transient handles, meaning we'd then run out of transient
slots in the TPM.

If you can reproduce the problem then

tpm2_getcap -T device:/dev/tpm0 handles-transient

will show if that is the case - if it doesn't output anything this isn't
what you're hitting.

J.

-- 
Inside every living person there's a dead person trying to get out.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-07  9:47                           ` Jonathan McDowell
@ 2025-02-07  9:58                             ` Michal Suchánek
  2025-02-10 16:13                               ` Jonathan McDowell
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-02-07  9:58 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: Jarkko Sakkinen, linux-integrity

On Fri, Feb 07, 2025 at 09:47:13AM +0000, Jonathan McDowell wrote:
> On Fri, Feb 07, 2025 at 10:40:16AM +0100, Michal Suchánek wrote:
> > On Fri, Feb 07, 2025 at 09:26:16AM +0000, Jonathan McDowell wrote:
> > > So just to clarify, this more recent patch is working around a situation
> > > where the status register gets stuck and needs a complete retry of the
> > > command send - it's an Infineon errata, not something that would be
> > > fixed with a longer timeout.
> > > 
> > > We see what looks like Michal's issue with timeouts as well, I just
> > > haven't made the step of instrumenting it all the way he has.
> > 
> > And I haven't seen the issue that needs re-sending the command so far.
> 
> Your SLB9672 is on the latest firmware, which I believe fixes the
> problem.
> 
> > Maybe it happens even less frequently than the excessive processing
> > time.
> > 
> > Fully restarting the syscall would fix both issues but manual restart of
> > the userspace task reportedly did not work so I have my doubts that
> > this method with returning from the syscall would be effective.
> 
> Hmmm. I wonder if e3aaebcbb7c6b403416f442d1de70d437ce313a7 (tpm: Clean
> up TPM space after command failure) would help the userspace restart

It's backported to the frankenkernel already so it would not help this
particular case.

Unfortunately, it's not clear what the userspace task does, and why it
would not complete after the first failure.

Would need to come up with some way of tracing it.

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-07  9:26                       ` Jonathan McDowell
  2025-02-07  9:40                         ` Michal Suchánek
@ 2025-02-08 20:29                         ` Jarkko Sakkinen
  2025-02-10 16:18                           ` Jonathan McDowell
  1 sibling, 1 reply; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-02-08 20:29 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: Michal Suchánek, linux-integrity

On Fri, Feb 07, 2025 at 09:26:16AM +0000, Jonathan McDowell wrote:
> > The whole arithmetic with timeout_a/b/c is mostly gibberish and could
> > be replaced with a single "max" constant without issues (just set it
> > large enough).
> > 
> > They could be all be replaced with let's say 3s timeout in a constant.
> 
> This appears to have come up before:
> 
> https://lore.kernel.org/linux-integrity/358e89ed2b766d51b5f57abf31ab7a925ac63379.1552348123.git.calvinowens@fb.com/
> 
> That patch was deemed overly complex and it was suggested to split it
> up; I can't find any indication that was ever done which I guess is why
> the discussion died off.

Looking back I suggest splitting timeouts and durations into separate
patches:

https://lore.kernel.org/linux-integrity/20190312145553.GB6682@linux.intel.com/

> So just to clarify, this more recent patch is working around a situation
> where the status register gets stuck and needs a complete retry of the
> command send - it's an Infineon errata, not something that would be
> fixed with a longer timeout.

Hmm... please shout if I ignore something but if we could -ERESTARTSYS
semantics here that should ignite completely new transmit flow, wouldn't
it?

I'm not seeing this locally so far unfortunately.

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-07  9:58                             ` Michal Suchánek
@ 2025-02-10 16:13                               ` Jonathan McDowell
  2025-02-10 17:30                                 ` Jarkko Sakkinen
  0 siblings, 1 reply; 38+ messages in thread
From: Jonathan McDowell @ 2025-02-10 16:13 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jarkko Sakkinen, linux-integrity

[-- Attachment #1: Type: text/plain, Size: 2051 bytes --]

On Fri, Feb 07, 2025 at 10:58:15AM +0100, Michal Suchánek wrote:
> On Fri, Feb 07, 2025 at 09:47:13AM +0000, Jonathan McDowell wrote:
> > On Fri, Feb 07, 2025 at 10:40:16AM +0100, Michal Suchánek wrote:
> > > On Fri, Feb 07, 2025 at 09:26:16AM +0000, Jonathan McDowell wrote:
> > > > So just to clarify, this more recent patch is working around a situation
> > > > where the status register gets stuck and needs a complete retry of the
> > > > command send - it's an Infineon errata, not something that would be
> > > > fixed with a longer timeout.
> > > > 
> > > > We see what looks like Michal's issue with timeouts as well, I just
> > > > haven't made the step of instrumenting it all the way he has.
> > > 
> > > And I haven't seen the issue that needs re-sending the command so far.
> > 
> > Your SLB9672 is on the latest firmware, which I believe fixes the
> > problem.
> > 
> > > Maybe it happens even less frequently than the excessive processing
> > > time.
> > > 
> > > Fully restarting the syscall would fix both issues but manual restart of
> > > the userspace task reportedly did not work so I have my doubts that
> > > this method with returning from the syscall would be effective.
> > 
> > Hmmm. I wonder if e3aaebcbb7c6b403416f442d1de70d437ce313a7 (tpm: Clean
> > up TPM space after command failure) would help the userspace restart
> 
> It's backported to the frankenkernel already so it would not help this
> particular case.
> 
> Unfortunately, it's not clear what the userspace task does, and why it
> would not complete after the first failure.
> 
> Would need to come up with some way of tracing it.

FWIW bpftrace is great for this. I'm using the attached script as a
basis for investigations. Obviously the timings it reports are the whole
command rather than just the wait for status, but over a long period it
can then show you the latency histogram. (Run it in one terminal, do TPM
stuff elsewhere, hit Ctrl-C when you're done and it dumps the
histogram.)

J.

-- 
101 things you can't have too much of : 47 - More coffee.

[-- Attachment #2: tpm-tracing.bt --]
[-- Type: text/plain, Size: 1040 bytes --]

#!/usr/bin/bpftrace

struct tpm_header {
        uint8_t tag[2];
        uint8_t length[4];
        uint8_t ordinal[4];
}
struct tpm_buf {
        uint32_t flags;
        uint32_t length;
        struct tpm_header *hdr;
}

kprobe:tpm_transmit_cmd
{
        $buf = (struct tpm_buf *) arg1;
        printf("%s (0x%02X%02X%02X%02X) (%s)\n", comm,
                $buf->hdr->ordinal[0],
                $buf->hdr->ordinal[1],
                $buf->hdr->ordinal[2],
                $buf->hdr->ordinal[3],
                str(arg3));
}

kprobe:tpm_transmit
{
        $hdr = (struct tpm_header *) arg1;
        $cmd = $hdr->ordinal[0] << 24 | $hdr->ordinal[1] << 16 |
                $hdr->ordinal[2] << 8 | $hdr->ordinal[3];
        @start[tid] = nsecs;
        @tpmcmd[tid] = $cmd;
}

kretprobe:tpm_transmit
/@start[tid]/ {
        $dur = nsecs - @start[tid];
        @cmdhist[@tpmcmd[tid]] = hist($dur);
        printf("%s (0x%08X): %d (%d ns)\n", comm, @tpmcmd[tid], retval, $dur);
        delete(@start[tid]);
        delete(@tpmcmd[tid]);
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-08 20:29                         ` Jarkko Sakkinen
@ 2025-02-10 16:18                           ` Jonathan McDowell
  2025-02-10 17:32                             ` Jarkko Sakkinen
  0 siblings, 1 reply; 38+ messages in thread
From: Jonathan McDowell @ 2025-02-10 16:18 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Michal Suchánek, linux-integrity

On Sat, Feb 08, 2025 at 10:29:55PM +0200, Jarkko Sakkinen wrote:
> On Fri, Feb 07, 2025 at 09:26:16AM +0000, Jonathan McDowell wrote:
> > > The whole arithmetic with timeout_a/b/c is mostly gibberish and could
> > > be replaced with a single "max" constant without issues (just set it
> > > large enough).
> > > 
> > > They could be all be replaced with let's say 3s timeout in a constant.
> > 
> > This appears to have come up before:
> > 
> > https://lore.kernel.org/linux-integrity/358e89ed2b766d51b5f57abf31ab7a925ac63379.1552348123.git.calvinowens@fb.com/
> > 
> > That patch was deemed overly complex and it was suggested to split it
> > up; I can't find any indication that was ever done which I guess is why
> > the discussion died off.
> 
> Looking back I suggest splitting timeouts and durations into separate
> patches:
> 
> https://lore.kernel.org/linux-integrity/20190312145553.GB6682@linux.intel.com/

Yup, that seems to be where it all died out from what I could find.

> > So just to clarify, this more recent patch is working around a situation
> > where the status register gets stuck and needs a complete retry of the
> > command send - it's an Infineon errata, not something that would be
> > fixed with a longer timeout.
> 
> Hmm... please shout if I ignore something but if we could -ERESTARTSYS
> semantics here that should ignite completely new transmit flow, wouldn't
> it?

Who then handles the ERESTARTSYS though? Part of the issues we've seen
is the failure happens in a context save or load, which is all within
the kernel rather than directly under the control of userspace. I'm
guessing the HMAC changes are likely to hit similar problems. I think
some level of timeout improvement in tpm_transmit is appropriate, if we
can work out what it should be.

Likewise for the Infineon errata I think an internal retry makes sense,
because it could happen for a number of internally generated kernel
reasons as well as a userspace sent command.

J.

-- 
How I wish, how I wish you were here.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-10 16:13                               ` Jonathan McDowell
@ 2025-02-10 17:30                                 ` Jarkko Sakkinen
  0 siblings, 0 replies; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-02-10 17:30 UTC (permalink / raw)
  To: Jonathan McDowell, Michal Suchánek; +Cc: linux-integrity

On Mon Feb 10, 2025 at 6:13 PM EET, Jonathan McDowell wrote:
> FWIW bpftrace is great for this. I'm using the attached script as a
> basis for investigations. Obviously the timings it reports are the whole
> command rather than just the wait for status, but over a long period it
> can then show you the latency histogram. (Run it in one terminal, do TPM
> stuff elsewhere, hit Ctrl-C when you're done and it dumps the
> histogram.)
>
> J.

Yes, I used that to trace and spot issues with HMAC integrity protection
in the near past.

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-10 16:18                           ` Jonathan McDowell
@ 2025-02-10 17:32                             ` Jarkko Sakkinen
  2025-02-24 13:04                               ` Michal Suchánek
  0 siblings, 1 reply; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-02-10 17:32 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: Michal Suchánek, linux-integrity

On Mon Feb 10, 2025 at 6:18 PM EET, Jonathan McDowell wrote:
> Who then handles the ERESTARTSYS though? Part of the issues we've seen
> is the failure happens in a context save or load, which is all within
> the kernel rather than directly under the control of userspace. I'm
> guessing the HMAC changes are likely to hit similar problems. I think
> some level of timeout improvement in tpm_transmit is appropriate, if we
> can work out what it should be.

Right I get what you mean, not all transmits initiate from syscalls
And obviously this can happen without hmac too with tpmrm0.

Hmm... so I'm open for a patch that radically simplifies the state
change timeouts, i.e. sort of part of that old patch.

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-01-29 15:27 TPM operation times out (very rarely) Michal Suchánek
  2025-01-29 16:02 ` Jonathan McDowell
@ 2025-02-19 22:29 ` Jonathan McDowell
  2025-02-20  8:42   ` Michal Suchánek
  2025-02-24 12:56   ` Michal Suchánek
  1 sibling, 2 replies; 38+ messages in thread
From: Jonathan McDowell @ 2025-02-19 22:29 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: linux-integrity, Jarkko Sakkinen, Lino Sanfilippo

On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> Hello,
> 
> there is a problem report that booting a specific type of system about
> 0.1% of the time encrypted volume (using a PCR to release the key) fails
> to unlock because of TPM operation timeout.
> 
> Minimizing the test case failed so far.
> 
> For example, booting into text mode as opposed to graphical desktop
> makes the problem unreproducible.
> 
> The test is done with a frankenkernel that has TPM drivers about on par
> with Linux 6.4 but using actual Linux 6.4 the problem is not
> reproducible, either.
> 
> However, given the problem takes up to a day to reproduce I do not have
> much confidence in the negative results.

Michal, can you possibly try the below and see if it helps out? There
seems to be a timing bug introduced in 6.4+ that I think might be
related, and matches up with some of our internal metrics that showed an
increase in timeouts in 6.4 onwards.

commit 79041fba797d0fe907e227012767f56dd93fac32
Author: Jonathan McDowell <noodles@meta.com>
Date:   Wed Feb 19 16:20:44 2025 -0600

    tpm, tpm_tis: Fix timeout handling when waiting for TPM status
    
    The change to only use interrupts to handle supported status changes,
    then switch to polling for the rest, inverted the status test and sleep
    such that we can end up sleeping beyond our timeout and not actually
    checking the status. This can result in spurious TPM timeouts,
    especially on a more loaded system. Fix by switching the order back so
    we sleep *then* check. We've done a up front check when we enter the
    function so this won't cause an additional delay when the status is
    already what we're looking for.
    
    Cc: stable@vger.kernel.org # v6.4+
    Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
    Signed-off-by: Jonathan McDowell <noodles@meta.com>

diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index fdef214b9f6b..167d71747666 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -114,11 +114,11 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
 		return 0;
 	/* process status changes without irq support */
 	do {
+		usleep_range(priv->timeout_min,
+			     priv->timeout_max);
 		status = chip->ops->status(chip);
 		if ((status & mask) == mask)
 			return 0;
-		usleep_range(priv->timeout_min,
-			     priv->timeout_max);
 	} while (time_before(jiffies, stop));
 	return -ETIME;
 }

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-19 22:29 ` Jonathan McDowell
@ 2025-02-20  8:42   ` Michal Suchánek
  2025-02-21 12:44     ` Jonathan McDowell
  2025-02-24 12:56   ` Michal Suchánek
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-02-20  8:42 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: linux-integrity, Jarkko Sakkinen, Lino Sanfilippo

Hello,

On Wed, Feb 19, 2025 at 10:29:45PM +0000, Jonathan McDowell wrote:
> On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > Hello,
> > 
> > there is a problem report that booting a specific type of system about
> > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > to unlock because of TPM operation timeout.
> > 
> > Minimizing the test case failed so far.
> > 
> > For example, booting into text mode as opposed to graphical desktop
> > makes the problem unreproducible.
> > 
> > The test is done with a frankenkernel that has TPM drivers about on par
> > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > reproducible, either.
> > 
> > However, given the problem takes up to a day to reproduce I do not have
> > much confidence in the negative results.
> 
> Michal, can you possibly try the below and see if it helps out? There
> seems to be a timing bug introduced in 6.4+ that I think might be
> related, and matches up with some of our internal metrics that showed an
> increase in timeouts in 6.4 onwards.

Thanks for looking into this

> 
> commit 79041fba797d0fe907e227012767f56dd93fac32
> Author: Jonathan McDowell <noodles@meta.com>
> Date:   Wed Feb 19 16:20:44 2025 -0600
> 
>     tpm, tpm_tis: Fix timeout handling when waiting for TPM status
>     
>     The change to only use interrupts to handle supported status changes,
>     then switch to polling for the rest, inverted the status test and sleep
>     such that we can end up sleeping beyond our timeout and not actually
>     checking the status. This can result in spurious TPM timeouts,
>     especially on a more loaded system. Fix by switching the order back so
>     we sleep *then* check. We've done a up front check when we enter the
>     function so this won't cause an additional delay when the status is
>     already what we're looking for.
>     
>     Cc: stable@vger.kernel.org # v6.4+
>     Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
>     Signed-off-by: Jonathan McDowell <noodles@meta.com>
> 
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index fdef214b9f6b..167d71747666 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -114,11 +114,11 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
>  		return 0;
>  	/* process status changes without irq support */
>  	do {
> +		usleep_range(priv->timeout_min,
> +			     priv->timeout_max);

What would be the priv->timeout_min and priv->timeout_max here?

Note that there are timeouts that are 200ms, and are overblown by 2s.

If the 200ms timeout relies on the sleep during the wait for the timeout
being much longer than the timeout itself then the timeout is arguably
bogus regardless of this change helping.

Thanks

Michal

>  		status = chip->ops->status(chip);
>  		if ((status & mask) == mask)
>  			return 0;
> -		usleep_range(priv->timeout_min,
> -			     priv->timeout_max);
>  	} while (time_before(jiffies, stop));
>  	return -ETIME;
>  }

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-20  8:42   ` Michal Suchánek
@ 2025-02-21 12:44     ` Jonathan McDowell
  2025-02-24 12:21       ` Michal Suchánek
  0 siblings, 1 reply; 38+ messages in thread
From: Jonathan McDowell @ 2025-02-21 12:44 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: linux-integrity, Jarkko Sakkinen, Lino Sanfilippo

On Thu, Feb 20, 2025 at 09:42:28AM +0100, Michal Suchánek wrote:
> On Wed, Feb 19, 2025 at 10:29:45PM +0000, Jonathan McDowell wrote:
> > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > Hello,
> > > 
> > > there is a problem report that booting a specific type of system about
> > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > to unlock because of TPM operation timeout.
> > > 
> > > Minimizing the test case failed so far.
> > > 
> > > For example, booting into text mode as opposed to graphical desktop
> > > makes the problem unreproducible.
> > > 
> > > The test is done with a frankenkernel that has TPM drivers about on par
> > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > reproducible, either.
> > > 
> > > However, given the problem takes up to a day to reproduce I do not have
> > > much confidence in the negative results.
> > 
> > Michal, can you possibly try the below and see if it helps out? There
> > seems to be a timing bug introduced in 6.4+ that I think might be
> > related, and matches up with some of our internal metrics that showed an
> > increase in timeouts in 6.4 onwards.
> 
> Thanks for looking into this

No problem. It's something we've seen in our fleet and I've been trying
to get to the bottom of, so having some additional data from someone
else is really helpful.

> > commit 79041fba797d0fe907e227012767f56dd93fac32
> > Author: Jonathan McDowell <noodles@meta.com>
> > Date:   Wed Feb 19 16:20:44 2025 -0600
> > 
> >     tpm, tpm_tis: Fix timeout handling when waiting for TPM status
> >     
> >     The change to only use interrupts to handle supported status changes,
> >     then switch to polling for the rest, inverted the status test and sleep
> >     such that we can end up sleeping beyond our timeout and not actually
> >     checking the status. This can result in spurious TPM timeouts,
> >     especially on a more loaded system. Fix by switching the order back so
> >     we sleep *then* check. We've done a up front check when we enter the
> >     function so this won't cause an additional delay when the status is
> >     already what we're looking for.
> >     
> >     Cc: stable@vger.kernel.org # v6.4+
> >     Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
> >     Signed-off-by: Jonathan McDowell <noodles@meta.com>
> > 
> > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> > index fdef214b9f6b..167d71747666 100644
> > --- a/drivers/char/tpm/tpm_tis_core.c
> > +++ b/drivers/char/tpm/tpm_tis_core.c
> > @@ -114,11 +114,11 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
> >  		return 0;
> >  	/* process status changes without irq support */
> >  	do {
> > +		usleep_range(priv->timeout_min,
> > +			     priv->timeout_max);
> 
> What would be the priv->timeout_min and priv->timeout_max here?
> 
> Note that there are timeouts that are 200ms, and are overblown by 2s.
> 
> If the 200ms timeout relies on the sleep during the wait for the timeout
> being much longer than the timeout itself then the timeout is arguably
> bogus regardless of this change helping.

Ah, I thought your major issue was the 2s timeout that was only slightly
exceeded.

However in my initial tracing I've seen wait_for_tpm_stat take much
longer than the timeout that's passed in, which is what caused me to go
and investigate this code path and note it had been changed in 6.4. It
seems like a bug either way, but I've been at the TCG meeting this week
and not had time to do further instrumentation and confirmation. Given
you seem to have a more reliable reproducer I thought it might be easy
enough for you to see if it made any difference.

J.

-- 
Why do I get the feeling I'm going to regret this?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-21 12:44     ` Jonathan McDowell
@ 2025-02-24 12:21       ` Michal Suchánek
  0 siblings, 0 replies; 38+ messages in thread
From: Michal Suchánek @ 2025-02-24 12:21 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: linux-integrity, Jarkko Sakkinen, Lino Sanfilippo

On Fri, Feb 21, 2025 at 12:44:45PM +0000, Jonathan McDowell wrote:
> On Thu, Feb 20, 2025 at 09:42:28AM +0100, Michal Suchánek wrote:
> > On Wed, Feb 19, 2025 at 10:29:45PM +0000, Jonathan McDowell wrote:
> > > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > > Hello,
> > > > 
> > > > there is a problem report that booting a specific type of system about
> > > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > > to unlock because of TPM operation timeout.
> > > > 
> > > > Minimizing the test case failed so far.
> > > > 
> > > > For example, booting into text mode as opposed to graphical desktop
> > > > makes the problem unreproducible.
> > > > 
> > > > The test is done with a frankenkernel that has TPM drivers about on par
> > > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > > reproducible, either.
> > > > 
> > > > However, given the problem takes up to a day to reproduce I do not have
> > > > much confidence in the negative results.
> > > 
> > > Michal, can you possibly try the below and see if it helps out? There
> > > seems to be a timing bug introduced in 6.4+ that I think might be
> > > related, and matches up with some of our internal metrics that showed an
> > > increase in timeouts in 6.4 onwards.
> > 
> > Thanks for looking into this
> 
> No problem. It's something we've seen in our fleet and I've been trying
> to get to the bottom of, so having some additional data from someone
> else is really helpful.
> 
> > > commit 79041fba797d0fe907e227012767f56dd93fac32
> > > Author: Jonathan McDowell <noodles@meta.com>
> > > Date:   Wed Feb 19 16:20:44 2025 -0600
> > > 
> > >     tpm, tpm_tis: Fix timeout handling when waiting for TPM status
> > >     
> > >     The change to only use interrupts to handle supported status changes,
> > >     then switch to polling for the rest, inverted the status test and sleep
> > >     such that we can end up sleeping beyond our timeout and not actually
> > >     checking the status. This can result in spurious TPM timeouts,
> > >     especially on a more loaded system. Fix by switching the order back so
> > >     we sleep *then* check. We've done a up front check when we enter the
> > >     function so this won't cause an additional delay when the status is
> > >     already what we're looking for.
> > >     
> > >     Cc: stable@vger.kernel.org # v6.4+
> > >     Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
> > >     Signed-off-by: Jonathan McDowell <noodles@meta.com>
> > > 
> > > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> > > index fdef214b9f6b..167d71747666 100644
> > > --- a/drivers/char/tpm/tpm_tis_core.c
> > > +++ b/drivers/char/tpm/tpm_tis_core.c
> > > @@ -114,11 +114,11 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
> > >  		return 0;
> > >  	/* process status changes without irq support */
> > >  	do {
> > > +		usleep_range(priv->timeout_min,
> > > +			     priv->timeout_max);
> > 
> > What would be the priv->timeout_min and priv->timeout_max here?
> > 
> > Note that there are timeouts that are 200ms, and are overblown by 2s.
> > 
> > If the 200ms timeout relies on the sleep during the wait for the timeout
> > being much longer than the timeout itself then the timeout is arguably
> > bogus regardless of this change helping.
> 
> Ah, I thought your major issue was the 2s timeout that was only slightly
> exceeded.
> 
> However in my initial tracing I've seen wait_for_tpm_stat take much
> longer than the timeout that's passed in, which is what caused me to go
> and investigate this code path and note it had been changed in 6.4. It
> seems like a bug either way, but I've been at the TCG meeting this week
> and not had time to do further instrumentation and confirmation. Given
> you seem to have a more reliable reproducer I thought it might be easy
> enough for you to see if it made any difference.

The problem is no longer reproducible, probably due to some other hcange
in the test environment. So much for reliable reproducer.

Yes, I think this is a bug either way and should be addressed although
the effect on this problem is minor at best.

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-19 22:29 ` Jonathan McDowell
  2025-02-20  8:42   ` Michal Suchánek
@ 2025-02-24 12:56   ` Michal Suchánek
  2025-03-01  2:03     ` Jarkko Sakkinen
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-02-24 12:56 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: linux-integrity, Jarkko Sakkinen, Lino Sanfilippo

On Wed, Feb 19, 2025 at 10:29:45PM +0000, Jonathan McDowell wrote:
> On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > Hello,
> > 
> > there is a problem report that booting a specific type of system about
> > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > to unlock because of TPM operation timeout.
> > 
> > Minimizing the test case failed so far.
> > 
> > For example, booting into text mode as opposed to graphical desktop
> > makes the problem unreproducible.
> > 
> > The test is done with a frankenkernel that has TPM drivers about on par
> > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > reproducible, either.
> > 
> > However, given the problem takes up to a day to reproduce I do not have
> > much confidence in the negative results.
> 
> Michal, can you possibly try the below and see if it helps out? There
> seems to be a timing bug introduced in 6.4+ that I think might be
> related, and matches up with some of our internal metrics that showed an
> increase in timeouts in 6.4 onwards.
> 
> commit 79041fba797d0fe907e227012767f56dd93fac32
> Author: Jonathan McDowell <noodles@meta.com>
> Date:   Wed Feb 19 16:20:44 2025 -0600
> 
>     tpm, tpm_tis: Fix timeout handling when waiting for TPM status
>     
>     The change to only use interrupts to handle supported status changes,
>     then switch to polling for the rest, inverted the status test and sleep
>     such that we can end up sleeping beyond our timeout and not actually
>     checking the status. This can result in spurious TPM timeouts,
>     especially on a more loaded system. Fix by switching the order back so
>     we sleep *then* check. We've done a up front check when we enter the
>     function so this won't cause an additional delay when the status is
>     already what we're looking for.
>     
>     Cc: stable@vger.kernel.org # v6.4+
>     Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
>     Signed-off-by: Jonathan McDowell <noodles@meta.com>

Reviewed-by: Michal Suchánek <msuchanek@suse.de>

Thanks

> 
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index fdef214b9f6b..167d71747666 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -114,11 +114,11 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask,
>  		return 0;
>  	/* process status changes without irq support */
>  	do {
> +		usleep_range(priv->timeout_min,
> +			     priv->timeout_max);
>  		status = chip->ops->status(chip);
>  		if ((status & mask) == mask)
>  			return 0;
> -		usleep_range(priv->timeout_min,
> -			     priv->timeout_max);
>  	} while (time_before(jiffies, stop));
>  	return -ETIME;
>  }

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-10 17:32                             ` Jarkko Sakkinen
@ 2025-02-24 13:04                               ` Michal Suchánek
  2025-03-01  2:13                                 ` Jarkko Sakkinen
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-02-24 13:04 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Jonathan McDowell, linux-integrity

On Mon, Feb 10, 2025 at 07:32:53PM +0200, Jarkko Sakkinen wrote:
> On Mon Feb 10, 2025 at 6:18 PM EET, Jonathan McDowell wrote:
> > Who then handles the ERESTARTSYS though? Part of the issues we've seen
> > is the failure happens in a context save or load, which is all within
> > the kernel rather than directly under the control of userspace. I'm
> > guessing the HMAC changes are likely to hit similar problems. I think
> > some level of timeout improvement in tpm_transmit is appropriate, if we
> > can work out what it should be.
> 
> Right I get what you mean, not all transmits initiate from syscalls
> And obviously this can happen without hmac too with tpmrm0.
> 
> Hmm... so I'm open for a patch that radically simplifies the state
> change timeouts, i.e. sort of part of that old patch.

There is also another aspect to this:

What happens when the context save/load result is dropped on the floor?

There was other call that can take a very long time: get random number.
And while dropping a random number that took a long time ito fabricate
on the floor is wasteful it does not cause any real problem.

With the context save/load, however, the context state gets
desynchronized between TPM and kernel when the result of the call is
ignored.

If the kernel cannot correct the state by examining return value from
later calls the TPM can effectively become defunct because the kernel
will call the wrong context operation, it will return unexpected value
which the kernel interprets as failure, and no operation can be
performed.

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-24 12:56   ` Michal Suchánek
@ 2025-03-01  2:03     ` Jarkko Sakkinen
  0 siblings, 0 replies; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-03-01  2:03 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jonathan McDowell, linux-integrity, Lino Sanfilippo

On Mon, Feb 24, 2025 at 01:56:56PM +0100, Michal Suchánek wrote:
> On Wed, Feb 19, 2025 at 10:29:45PM +0000, Jonathan McDowell wrote:
> > On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > > Hello,
> > > 
> > > there is a problem report that booting a specific type of system about
> > > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > > to unlock because of TPM operation timeout.
> > > 
> > > Minimizing the test case failed so far.
> > > 
> > > For example, booting into text mode as opposed to graphical desktop
> > > makes the problem unreproducible.
> > > 
> > > The test is done with a frankenkernel that has TPM drivers about on par
> > > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > > reproducible, either.
> > > 
> > > However, given the problem takes up to a day to reproduce I do not have
> > > much confidence in the negative results.
> > 
> > Michal, can you possibly try the below and see if it helps out? There
> > seems to be a timing bug introduced in 6.4+ that I think might be
> > related, and matches up with some of our internal metrics that showed an
> > increase in timeouts in 6.4 onwards.
> > 
> > commit 79041fba797d0fe907e227012767f56dd93fac32
> > Author: Jonathan McDowell <noodles@meta.com>
> > Date:   Wed Feb 19 16:20:44 2025 -0600
> > 
> >     tpm, tpm_tis: Fix timeout handling when waiting for TPM status
> >     
> >     The change to only use interrupts to handle supported status changes,
> >     then switch to polling for the rest, inverted the status test and sleep
> >     such that we can end up sleeping beyond our timeout and not actually
> >     checking the status. This can result in spurious TPM timeouts,
> >     especially on a more loaded system. Fix by switching the order back so
> >     we sleep *then* check. We've done a up front check when we enter the
> >     function so this won't cause an additional delay when the status is
> >     already what we're looking for.
> >     
> >     Cc: stable@vger.kernel.org # v6.4+
> >     Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
> >     Signed-off-by: Jonathan McDowell <noodles@meta.com>
> 
> Reviewed-by: Michal Suchánek <msuchanek@suse.de>

Send as a proper patch. I've this month mostly on holiday but will be
back at work on Monday.

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-24 13:04                               ` Michal Suchánek
@ 2025-03-01  2:13                                 ` Jarkko Sakkinen
  2025-03-05 12:20                                   ` Michal Suchánek
  0 siblings, 1 reply; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-03-01  2:13 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jonathan McDowell, linux-integrity

On Mon, Feb 24, 2025 at 02:04:13PM +0100, Michal Suchánek wrote:
> On Mon, Feb 10, 2025 at 07:32:53PM +0200, Jarkko Sakkinen wrote:
> > On Mon Feb 10, 2025 at 6:18 PM EET, Jonathan McDowell wrote:
> > > Who then handles the ERESTARTSYS though? Part of the issues we've seen
> > > is the failure happens in a context save or load, which is all within
> > > the kernel rather than directly under the control of userspace. I'm
> > > guessing the HMAC changes are likely to hit similar problems. I think
> > > some level of timeout improvement in tpm_transmit is appropriate, if we
> > > can work out what it should be.
> > 
> > Right I get what you mean, not all transmits initiate from syscalls
> > And obviously this can happen without hmac too with tpmrm0.
> > 
> > Hmm... so I'm open for a patch that radically simplifies the state
> > change timeouts, i.e. sort of part of that old patch.
> 
> There is also another aspect to this:
> 
> What happens when the context save/load result is dropped on the floor?

Trying to understand what are you meaning by this. Are you speaking
about scenario where after the operation context operations fail
inside kernel?

> 
> There was other call that can take a very long time: get random number.
> And while dropping a random number that took a long time ito fabricate
> on the floor is wasteful it does not cause any real problem.
> 
> With the context save/load, however, the context state gets
> desynchronized between TPM and kernel when the result of the call is
> ignored.
> 
> If the kernel cannot correct the state by examining return value from
> later calls the TPM can effectively become defunct because the kernel
> will call the wrong context operation, it will return unexpected value
> which the kernel interprets as failure, and no operation can be
> performed.

I got lost in the beginning so not yet sure about this (not same
as rejecting, also I'm responding from holiday).

Not really sure if this related but I'll just this here.

We could possibly identify some expensive TPM commands that
fill these requirements:

1. Don't run in a session.
2. Don't cause handles to be created (neither sessions nor
   objects).

and executed as they were put to /dev/tpm0. At least we might
be able to do this for TPM2_GetRandom.

> 
> Thanks
> 
> Michal


BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-03-01  2:13                                 ` Jarkko Sakkinen
@ 2025-03-05 12:20                                   ` Michal Suchánek
  2025-03-06 22:29                                     ` Jarkko Sakkinen
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-03-05 12:20 UTC (permalink / raw)
  To: Jarkko Sakkinen; +Cc: Jonathan McDowell, linux-integrity

On Sat, Mar 01, 2025 at 04:13:23AM +0200, Jarkko Sakkinen wrote:
> On Mon, Feb 24, 2025 at 02:04:13PM +0100, Michal Suchánek wrote:
> > On Mon, Feb 10, 2025 at 07:32:53PM +0200, Jarkko Sakkinen wrote:
> > > On Mon Feb 10, 2025 at 6:18 PM EET, Jonathan McDowell wrote:
> > > > Who then handles the ERESTARTSYS though? Part of the issues we've seen
> > > > is the failure happens in a context save or load, which is all within
> > > > the kernel rather than directly under the control of userspace. I'm
> > > > guessing the HMAC changes are likely to hit similar problems. I think
> > > > some level of timeout improvement in tpm_transmit is appropriate, if we
> > > > can work out what it should be.
> > > 
> > > Right I get what you mean, not all transmits initiate from syscalls
> > > And obviously this can happen without hmac too with tpmrm0.
> > > 
> > > Hmm... so I'm open for a patch that radically simplifies the state
> > > change timeouts, i.e. sort of part of that old patch.
> > 
> > There is also another aspect to this:
> > 
> > What happens when the context save/load result is dropped on the floor?
> 
> Trying to understand what are you meaning by this. Are you speaking
> about scenario where after the operation context operations fail
> inside kernel?

I am speaking about the scenario when the opration succeeds but the
kernel declares it a failure because it did not happen within the
timeout.

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-03-05 12:20                                   ` Michal Suchánek
@ 2025-03-06 22:29                                     ` Jarkko Sakkinen
  0 siblings, 0 replies; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-03-06 22:29 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jonathan McDowell, linux-integrity

On Wed, Mar 05, 2025 at 01:20:45PM +0100, Michal Suchánek wrote:
> On Sat, Mar 01, 2025 at 04:13:23AM +0200, Jarkko Sakkinen wrote:
> > On Mon, Feb 24, 2025 at 02:04:13PM +0100, Michal Suchánek wrote:
> > > On Mon, Feb 10, 2025 at 07:32:53PM +0200, Jarkko Sakkinen wrote:
> > > > On Mon Feb 10, 2025 at 6:18 PM EET, Jonathan McDowell wrote:
> > > > > Who then handles the ERESTARTSYS though? Part of the issues we've seen
> > > > > is the failure happens in a context save or load, which is all within
> > > > > the kernel rather than directly under the control of userspace. I'm
> > > > > guessing the HMAC changes are likely to hit similar problems. I think
> > > > > some level of timeout improvement in tpm_transmit is appropriate, if we
> > > > > can work out what it should be.
> > > > 
> > > > Right I get what you mean, not all transmits initiate from syscalls
> > > > And obviously this can happen without hmac too with tpmrm0.
> > > > 
> > > > Hmm... so I'm open for a patch that radically simplifies the state
> > > > change timeouts, i.e. sort of part of that old patch.
> > > 
> > > There is also another aspect to this:
> > > 
> > > What happens when the context save/load result is dropped on the floor?
> > 
> > Trying to understand what are you meaning by this. Are you speaking
> > about scenario where after the operation context operations fail
> > inside kernel?
> 
> I am speaking about the scenario when the opration succeeds but the
> kernel declares it a failure because it did not happen within the
> timeout.

OK got it, thanks.

I guess theoretically we could even never do timeout and let the caller
SIGINT.

> 
> Thanks
> 
> Michal

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-02-05 14:29                   ` Jonathan McDowell
  2025-02-05 15:29                     ` Michal Suchánek
  2025-02-06 20:35                     ` Jarkko Sakkinen
@ 2025-03-27 12:57                     ` Michal Suchánek
  2025-03-27 13:15                       ` Jarkko Sakkinen
  2 siblings, 1 reply; 38+ messages in thread
From: Michal Suchánek @ 2025-03-27 12:57 UTC (permalink / raw)
  To: Jonathan McDowell; +Cc: Jarkko Sakkinen, linux-integrity

Hello,

On Wed, Feb 05, 2025 at 02:29:36PM +0000, Jonathan McDowell wrote:
> On Wed, Feb 05, 2025 at 02:26:05PM +0100, Michal Suchánek wrote:
> > On Fri, Jan 31, 2025 at 09:31:30PM +0200, Jarkko Sakkinen wrote:
> > > On Fri Jan 31, 2025 at 7:28 PM EET, Michal Suchánek wrote:
> > > > On Fri, Jan 31, 2025 at 07:12:06PM +0200, Jarkko Sakkinen wrote:
> > > > > On Fri Jan 31, 2025 at 3:02 PM EET, Michal Suchánek wrote:
> > > > > > It looks like the timeout_b is used exclusively as the ready timeout *),
> > > > > > with various sources of the value depending on chip type.
> > > > > >
> > > > > > Then increasing it should not cause any problem other than the kernel
> > > > > > waiting longer when the TPM chip is really stuck.
> > > > > >
> > > > > > * There is one instance of use of timeout_b for TPM_STS_VALID in
> > > > > > st33zp24_pm_resume.
> > > > > 
> > > > > Possible for you to give a shot for patch and try it out for a while?
> > > > > I'm fine with 2x, or even 4x in this case.
> > > >
> > > > I will see what I can do. It will definitely take a while.
> > > >
> > > > How would you like to multiply it?
> > > >
> > > > At the sime the timeout_b is assigned, or at the time it's used?
> > > >
> > > > Any specific patch that you have in mind?
> > > 
> > > I'll think about this a bit and send a patch with RFC tag. Might take
> > > to late next week.
> > 
> > The ready timeout is not the only one exceeded:
> > 
> > > Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2232 of 2000 ms)
> > > Jan 29 19:01:55 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2232 of 2000 ms)
> > > Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Timed out (2228 of 2000 ms)
> > > Jan 30 09:08:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: ready: Took (2228 of 2000 ms)
> > > Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > > Jan 30 14:26:16 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > > Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (2224 of 200 ms)
> > > Jan 30 23:25:13 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (2224 of 200 ms)
> > > Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2228 of 2000 ms)
> > > Feb 01 05:25:33 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2228 of 2000 ms)
> > > Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (556 of 200 ms)
> > > Feb 01 07:02:53 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (556 of 200 ms)
> > > Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > > Feb 01 09:26:22 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > > Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Timed out (272 of 200 ms)
> > > Feb 02 02:45:35 localhost kernel: tpm tpm0: tpm_tis_send_data: 379: stat: Took (272 of 200 ms)
> > > Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> > > Feb 02 03:40:04 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> > > Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Timed out (2236 of 2000 ms)
> > > Feb 02 04:09:50 localhost kernel: tpm tpm0: tpm_tis_send_data: 357: ready: Took (2236 of 2000 ms)
> > > Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (540 of 200 ms)
> > > Feb 02 09:57:41 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (540 of 200 ms)
> > > Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Timed out (536 of 200 ms)
> > > Feb 02 10:59:00 localhost kernel: tpm tpm0: tpm_tis_send_data: 353: stat: Took (536 of 200 ms)
> > > Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Timed out (540 of 200 ms)
> > > Feb 03 03:58:09 localhost kernel: tpm tpm0: tpm_tis_send_data: 354: stat: Took (540 of 200 ms)

Another new failure:

Mar 25 23:06:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 305: valid: Timed out (212 of 200 ms)
Mar 25 23:06:20 localhost kernel: tpm tpm0: tpm_tis_send_data: 305: valid: Took (212 of 200 ms)

which is CREATE PRIMARY, and happens on

 tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1A, rev-id 16)

TPM Device
        Vendor ID: 
        Specification Version: 2.0
        Firmware Revision: 5.63
        Description: INFINEON
        Characteristics:
                Family configurable via platform software support
        OEM-specific Information: 0x00000000

With this we have ready timoeout which is 2000ms but the TPM takes
up to 2236ms, stat timeout which is 200ms and the TPM takes up to
2224ms, and valid timeout which is 200 ms and the TPM takes up to 212ms.

Given that I think it makes sense to replace all these timeouts with one
timeout that is something like 4 or 5s.

This does not address the other problem that when the rimeout triggers
on something like LOAD CONTEXT then the kernel and the TPM don't agree
on what the state of the context is.

Thanks

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: TPM operation times out (very rarely)
  2025-03-27 12:57                     ` Michal Suchánek
@ 2025-03-27 13:15                       ` Jarkko Sakkinen
  0 siblings, 0 replies; 38+ messages in thread
From: Jarkko Sakkinen @ 2025-03-27 13:15 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Jonathan McDowell, linux-integrity

On Thu, Mar 27, 2025 at 01:57:48PM +0100, Michal Suchánek wrote:
> This does not address the other problem that when the rimeout triggers
> on something like LOAD CONTEXT then the kernel and the TPM don't agree
> on what the state of the context is.

In the case of tpm2-sessions, and in-kernel callers, we could try to
improve the situation in some of the edge cases.

Out-of-my sleeve ideas:

1. E.g., tpm2_load_null() could potentially have retries instead of
   single trial.
2. Perhaps timeout in tpm_tis should be adaptive? It could be by
   defaault 2s and would grow on trials.

> 
> Thanks
> 
> Michal

BR, Jarkko

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-03-27 13:15 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-29 15:27 TPM operation times out (very rarely) Michal Suchánek
2025-01-29 16:02 ` Jonathan McDowell
2025-01-29 16:20   ` Michal Suchánek
2025-01-29 17:14     ` Jonathan McDowell
2025-01-29 17:25       ` Michal Suchánek
2025-01-30 23:31   ` Jarkko Sakkinen
2025-01-31  8:35     ` Michal Suchánek
2025-01-31 10:25       ` Jarkko Sakkinen
2025-01-31 13:02         ` Michal Suchánek
2025-01-31 17:12           ` Jarkko Sakkinen
2025-01-31 17:28             ` Michal Suchánek
2025-01-31 19:31               ` Jarkko Sakkinen
2025-02-05 13:26                 ` Michal Suchánek
2025-02-05 13:45                   ` Michal Suchánek
2025-02-05 14:29                   ` Jonathan McDowell
2025-02-05 15:29                     ` Michal Suchánek
2025-02-06 20:35                     ` Jarkko Sakkinen
2025-02-07  9:26                       ` Jonathan McDowell
2025-02-07  9:40                         ` Michal Suchánek
2025-02-07  9:47                           ` Jonathan McDowell
2025-02-07  9:58                             ` Michal Suchánek
2025-02-10 16:13                               ` Jonathan McDowell
2025-02-10 17:30                                 ` Jarkko Sakkinen
2025-02-08 20:29                         ` Jarkko Sakkinen
2025-02-10 16:18                           ` Jonathan McDowell
2025-02-10 17:32                             ` Jarkko Sakkinen
2025-02-24 13:04                               ` Michal Suchánek
2025-03-01  2:13                                 ` Jarkko Sakkinen
2025-03-05 12:20                                   ` Michal Suchánek
2025-03-06 22:29                                     ` Jarkko Sakkinen
2025-03-27 12:57                     ` Michal Suchánek
2025-03-27 13:15                       ` Jarkko Sakkinen
2025-02-19 22:29 ` Jonathan McDowell
2025-02-20  8:42   ` Michal Suchánek
2025-02-21 12:44     ` Jonathan McDowell
2025-02-24 12:21       ` Michal Suchánek
2025-02-24 12:56   ` Michal Suchánek
2025-03-01  2:03     ` Jarkko Sakkinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).