sdhci-omap: additional PM issue since 5.16

Linux MultiMedia Card development

* sdhci-omap: additional PM issue since 5.16
@ 2025-01-23 22:09 David Owens
  2025-01-24 10:36 ` Romain Naour
  0 siblings, 1 reply; 15+ messages in thread
From: David Owens @ 2025-01-23 22:09 UTC (permalink / raw)
  To: linux-omap; +Cc: linux-mmc

Hello,

I have an AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38.  The eMMC is using mmc-hs200 powered at 1.8 V.  Reads from /dev/mmcblk1boot0 return the expected data except when a delay of several seconds is inserted between reads.  With a delay between reads, a read will occasionally (~50% of the time) return garbage data.  Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0.  The same thing happens when reading from /dev/mmcblk1boot1.

Much like a previous report on the linux-omap mailing list [1], I was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2].  Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue.  Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0.  I also don't see the I/O errors mentioned in the previous posting.  Reads always succeed and return the correct amount of data; it's just from the wrong device.

[1] https://lore.kernel.org/all/2e5f1997-564c-44e4-b357-6343e0dae7ab@smile.fr/

[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442

[3] https://lore.kernel.org/linux-omap/20240315234444.816978-1-romain.naour@smile.fr/T/#u

Regards,

Dave


* Re: sdhci-omap: additional PM issue since 5.16
  2025-01-23 22:09 sdhci-omap: additional PM issue since 5.16 David Owens
@ 2025-01-24 10:36 ` Romain Naour
  2025-01-24 17:15   ` David Owens
  0 siblings, 1 reply; 15+ messages in thread
From: Romain Naour @ 2025-01-24 10:36 UTC (permalink / raw)
  To: David Owens, linux-omap; +Cc: linux-mmc

Hello David,

On 23/01/2025 at 23:09, David Owens wrote:
> Hello,
> 
> I have a AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38.  The eMMC is using mmc-hs200 powered at 1.8v.  Reads from /dev/mmcblk1boot0 will return expected data except when a delay of several seconds is inserted between reads.  With a delay between reads, the read will occasionally (~50% of the time) return garbage data.  Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0.  The same thing happens when reading from /dev/mmcblk1boot1.
> 
> Much like a previous report in the linux-omap mailing list [1], I too was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2].  Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue.  Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0.  I also don't see the same I/O errors mentioned in the previous posting.  Reads always succeed and return the correct amount of data, its just from the wrong device.

Interesting, can you share a test script to reproduce your issue?

Why 6.1.38? The latest 6.1.x stable release is already 6.1.127.
Can you test with the latest stable release?

I believe this issue could be reproduced on the beaglebone-ai board (I don't
have it).

[1] https://www.beagleboard.org/boards/beaglebone-ai

Best regards,
Romain


> 
> [1] https://lore.kernel.org/all/2e5f1997-564c-44e4-b357-6343e0dae7ab@smile.fr/
> 
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
> 
> [3] https://lore.kernel.org/linux-omap/20240315234444.816978-1-romain.naour@smile.fr/T/#u
> 
> Regards,
> 
> Dave
> 
> 



* Re: sdhci-omap: additional PM issue since 5.16
  2025-01-24 10:36 ` Romain Naour
@ 2025-01-24 17:15   ` David Owens
  2025-01-24 18:49     ` David
  0 siblings, 1 reply; 15+ messages in thread
From: David Owens @ 2025-01-24 17:15 UTC (permalink / raw)
  To: Romain Naour, linux-omap; +Cc: linux-mmc

Hi Romain

On 1/24/25 04:36, Romain Naour wrote:
> Hello David,
>
> Le 23/01/2025 à 23:09, David Owens a écrit :
>> Hello,
>>
>> I have a AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38.  The eMMC is using mmc-hs200 powered at 1.8v.  Reads from /dev/mmcblk1boot0 will return expected data except when a delay of several seconds is inserted between reads.  With a delay between reads, the read will occasionally (~50% of the time) return garbage data.  Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0.  The same thing happens when reading from /dev/mmcblk1boot1.
>>
>> Much like a previous report in the linux-omap mailing list [1], I too was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2].  Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue.  Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0.  I also don't see the same I/O errors mentioned in the previous posting.  Reads always succeed and return the correct amount of data, its just from the wrong device.
> Interesting, can you share a test script to reproduce your issue?

Here is a test script I've been running on my devices.  A failure is typically
detected after a minute or two.  I include the eMMC part type in the output as
we've used a couple of different parts in production, all claiming to be
compatible, and I'm starting to wonder if the failure is a combination of the
aggressive PM _and_ specific eMMC parts.  The offset used in hexdump was just a
place in both mmcblk1 and mmcblk1boot0 that was non-zero.  The issue happens
using any offset.

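#!/bin/bash
# Reproducer: re-read the eMMC boot partition after idle periods and stop
# when a read returns the user-area (mmcblk1) data instead.

echo "Kernel:    $(uname -r)"
echo "eMMC part: $(dmesg | grep 'mmcblk1: mmc1:0001' | awk '{print $5}')"
# Reference samples, taken back to back at the same offset
BLK1=$(hexdump -C /dev/mmcblk1 -s 0x3fc000 -n 10 | head -n 1)
BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)

echo "/dev/mmcblk1:      ${BLK1}"
echo "/dev/mmcblk1boot0: ${BOOT}"

# Loop until a boot0 read matches the mmcblk1 sample (the failure case)
while [[ "$BLK1" != "$BOOT" ]]; do
    sleep 20    # idle period between reads
    BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
    echo "/dev/mmcblk1boot0: ${BOOT}"
done

echo "/dev/mmcblk1boot0 read failure"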
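
A variant of the same check can be sketched so that each read is labelled rather than only stopping at the first failure. This is a minimal sketch, not a tested reproducer: the helper names read_sample and classify_read, the "wrong-partition" label, and the BOOT_DEV/USER_DEV/OFFSET variables are hypothetical, and it assumes the first boot-partition read is a good one.

```shell
#!/bin/bash
# Sketch only: helper names and labels are hypothetical, not from the thread.

# Print LEN bytes at byte offset OFF of FILE as one lowercase hex string.
read_sample() {
    local file="$1" off="$2" len="$3"
    dd if="$file" bs=1 skip="$((off))" count="$((len))" 2>/dev/null \
        | od -An -tx1 | tr -d ' \n'
}

# Classify a boot-partition sample against the known-good boot data and the
# user-area data read from the same offset.
classify_read() {
    local good="$1" user_area="$2" current="$3"
    if [[ "$current" == "$good" ]]; then
        echo "ok"
    elif [[ "$current" == "$user_area" ]]; then
        echo "wrong-partition"
    else
        echo "unknown"
    fi
}

BOOT_DEV=${BOOT_DEV:-/dev/mmcblk1boot0}
USER_DEV=${USER_DEV:-/dev/mmcblk1}
OFFSET=${OFFSET:-0x3fc000}

# Only run the loop when both block devices are actually present.
if [[ -b "$BOOT_DEV" && -b "$USER_DEV" ]]; then
    good=$(read_sample "$BOOT_DEV" "$OFFSET" 10)
    user=$(read_sample "$USER_DEV" "$OFFSET" 10)
    while sleep 20; do
        cur=$(read_sample "$BOOT_DEV" "$OFFSET" 10)
        echo "$(date +%T) $(classify_read "$good" "$user" "$cur") $cur"
    done
fi
```

Logging every read with a timestamp makes it easier to compare the failure rate between kernels or eMMC parts than a script that exits on the first mismatch.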

>
> Why 6.1.38? nowadays the 6.1.x stable is 6.1.127 already.
> Can you test with the latest stable release?

Good question.  I can certainly update to .127 but at the time we were shipping
units we were on .38 so that's where I've been doing all my testing.  I'll let
you know how running under .127 compares.

>
> I believe this issue could be reproduced on the beaglebone-ai board (I don't
> have it).
>
> [1] https://www.beagleboard.org/boards/beaglebone-ai

Thanks for the suggestion, I'll see if I can dig one up.

>
> Best regards,
> Romain
>
>
>> [1] https://lore.kernel.org/all/2e5f1997-564c-44e4-b357-6343e0dae7ab@smile.fr/
>>
>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
>>
>> [3] https://lore.kernel.org/linux-omap/20240315234444.816978-1-romain.naour@smile.fr/T/#u
>>
>> Regards,
>>
>> Dave
>>
>>


* Re: sdhci-omap: additional PM issue since 5.16
  2025-01-24 17:15   ` David Owens
@ 2025-01-24 18:49     ` David
  2025-01-27  9:06       ` Romain Naour
  0 siblings, 1 reply; 15+ messages in thread
From: David @ 2025-01-24 18:49 UTC (permalink / raw)
  To: Romain Naour, linux-omap; +Cc: linux-mmc


On 1/24/25 11:15, David Owens wrote:
> Hi Romain
>
> On 1/24/25 04:36, Romain Naour wrote:
>> Hello David,
>>
>> Le 23/01/2025 à 23:09, David Owens a écrit :
>>> Hello,
>>>
>>> I have a AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38.  The eMMC is using mmc-hs200 powered at 1.8v.  Reads from /dev/mmcblk1boot0 will return expected data except when a delay of several seconds is inserted between reads.  With a delay between reads, the read will occasionally (~50% of the time) return garbage data.  Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0.  The same thing happens when reading from /dev/mmcblk1boot1.
>>>
>>> Much like a previous report in the linux-omap mailing list [1], I too was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2].  Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue.  Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0.  I also don't see the same I/O errors mentioned in the previous posting.  Reads always succeed and return the correct amount of data, its just from the wrong device.
>> Interesting, can you share a test script to reproduce your issue?
> Here is a test script I've been running on my devices.  A failure is typically
> detected after a minute or two.  I include the eMMC part type in the output as
> we've used a couple different parts in production, all claiming to be compatible
> and I'm starting to wonder if the failure is a combination of the aggressive
> PM _and_ specific emmc parts.  The offset used in hexdump was just a place in
> both mmcblk1 and mmcblk1boot0 that was non-zero.  The issue happens using any
> offset.
>
> #!/bin/bash
>
> echo "Kernel:    $(uname -r)"
> echo "eMMC part: $(dmesg | grep 'mmcblk1: mmc1:0001' | awk '{print $5}')"
> BLK1=$(hexdump -C /dev/mmcblk1 -s 0x3fc000 -n 10 | head -n 1)
> BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
>
> echo "/dev/mmcblk1:      ${BLK1}"
> echo "/dev/mmcblk1boot0: ${BOOT}"
>
> while [[ "$BLK1" != "$BOOT" ]]; do
>     sleep 20
>     BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
>     echo "/dev/mmcblk1boot0: ${BOOT}"
> done
>
> echo "/dev/mmcblk1boot0 read failure"
>
>> Why 6.1.38? nowadays the 6.1.x stable is 6.1.127 already.
>> Can you test with the latest stable release?
> Good question.  I can certainly update to .127 but at the time we were shipping
> units we were on .38 so that's where I've been doing all my testing.  I'll let
> you know how running under .127 compares.

Testing with 6.1.127 shows the same behavior.

>> I believe this issue could be reproduced on the beaglebone-ai board (I don't
>> have it).
>>
>> [1] https://www.beagleboard.org/boards/beaglebone-ai
> Thanks for the suggestion, I'll see if I can dig one up.
>
>> Best regards,
>> Romain
>>
>>
>>> [1] https://lore.kernel.org/all/2e5f1997-564c-44e4-b357-6343e0dae7ab@smile.fr/
>>>
>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
>>>
>>> [3] https://lore.kernel.org/linux-omap/20240315234444.816978-1-romain.naour@smile.fr/T/#u
>>>
>>> Regards,
>>>
>>> Dave
>>>
Thanks,

Dave



* Re: sdhci-omap: additional PM issue since 5.16
  2025-01-24 18:49     ` David
@ 2025-01-27  9:06       ` Romain Naour
  2025-01-27 21:20         ` Robert Nelson
  0 siblings, 1 reply; 15+ messages in thread
From: Romain Naour @ 2025-01-27  9:06 UTC (permalink / raw)
  To: David, linux-omap; +Cc: linux-mmc, Ulf Hansson, Tony Lindgren

Hello David, All,

On 24/01/2025 at 19:49, David wrote:
> 
> On 1/24/25 11:15, David Owens wrote:
>> Hi Romain
>>
>> On 1/24/25 04:36, Romain Naour wrote:
>>> Hello David,
>>>
>>> Le 23/01/2025 à 23:09, David Owens a écrit :
>>>> Hello,
>>>>
>>>> I have a AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38.  The eMMC is using mmc-hs200 powered at 1.8v.  Reads from /dev/mmcblk1boot0 will return expected data except when a delay of several seconds is inserted between reads.  With a delay between reads, the read will occasionally (~50% of the time) return garbage data.  Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0.  The same thing happens when reading from /dev/mmcblk1boot1.
>>>>
>>>> Much like a previous report in the linux-omap mailing list [1], I too was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2].  Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue.  Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0.  I also don't see the same I/O errors mentioned in the previous posting.  Reads always succeed and return the correct amount of data, its just from the wrong device.
>>> Interesting, can you share a test script to reproduce your issue?
>> Here is a test script I've been running on my devices.  A failure is typically
>> detected after a minute or two.  I include the eMMC part type in the output as
>> we've used a couple different parts in production, all claiming to be compatible
>> and I'm starting to wonder if the failure is a combination of the aggressive
>> PM _and_ specific emmc parts.  The offset used in hexdump was just a place in
>> both mmcblk1 and mmcblk1boot0 that was non-zero.  The issue happens using any
>> offset.
>>
>> #!/bin/bash
>>
>> echo "Kernel:    $(uname -r)"
>> echo "eMMC part: $(dmesg | grep 'mmcblk1: mmc1:0001' | awk '{print $5}')"
>> BLK1=$(hexdump -C /dev/mmcblk1 -s 0x3fc000 -n 10 | head -n 1)
>> BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
>>
>> echo "/dev/mmcblk1:      ${BLK1}"
>> echo "/dev/mmcblk1boot0: ${BOOT}"
>>
>> while [[ "$BLK1" != "$BOOT" ]]; do
>>     sleep 20
>>     BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
>>     echo "/dev/mmcblk1boot0: ${BOOT}"
>> done
>>
>> echo "/dev/mmcblk1boot0 read failure"
>>
>>> Why 6.1.38? nowadays the 6.1.x stable is 6.1.127 already.
>>> Can you test with the latest stable release?
>> Good question.  I can certainly update to .127 but at the time we were shipping
>> units we were on .38 so that's where I've been doing all my testing.  I'll let
>> you know how running under .127 compares.
> 
> Testing with 6.1.127 shows the same behavior.

Thanks for testing.

I'm able to reproduce the issue locally (using kernel 6.1.112).
It fails after the first sleep 20...

If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver, the issue is gone.

About the sdhci-omap driver: it's one of the few drivers enabling
MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC,
but the eMMC driver doesn't even set MMC_CAP_AGGRESSIVE_PM.

I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for
HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM was added to the
sdhci-omap driver to support SDIO WLAN device PM [1].

I've found another similar report on the BeagleBone Black (AM335x SoC) [2].

It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled for SDIO cards.

The TRM (SoC manual) says that "Suspend-Resume Flow" is only supported for SDIO
cards:

  26.5.1.2.1.6 Suspend-Resume Flow
    The suspend-and-resume feature is supported only by SDIO cards.

Thoughts?
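
For anyone who wants to look at the runtime PM state from user space while testing, a minimal sketch follows. It relies on the generic power/control and power/runtime_status sysfs attributes; the card path /sys/bus/mmc/devices/mmc1:0001 is an assumption based on the "mmc1:0001" name in the dmesg output, and the function names are hypothetical. Whether pinning the card active avoids the bad reads has not been verified in this thread; removing MMC_CAP_AGGRESSIVE_PM from the driver is the only change reported to help.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical; the card path is an assumption.

# Print the runtime PM settings of a device directory in sysfs.
show_runtime_pm() {
    local dev="$1"
    echo "control=$(cat "$dev/power/control")"
    # runtime_status may be absent, e.g. on kernels built without PM support
    if [[ -r "$dev/power/runtime_status" ]]; then
        echo "status=$(cat "$dev/power/runtime_status")"
    fi
}

# Ask the PM core to keep the device active: "on" forbids runtime suspend,
# "auto" allows it again.
pin_active() {
    local dev="$1"
    echo on > "$dev/power/control"
}

CARD=${CARD:-/sys/bus/mmc/devices/mmc1:0001}

# Only touch sysfs when the card is present and the attribute is writable.
if [[ -w "$CARD/power/control" ]]; then
    show_runtime_pm "$CARD"
    pin_active "$CARD"
    show_runtime_pm "$CARD"
fi
```

Running the reproducer once with control set to "auto" and once with "on" would show whether the card's runtime suspend is involved, without rebuilding the kernel.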

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442

[2]
https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1332523/beagl-bone-black-problems-reading-from-emmc-boot-partitions-on-beaglebone-with-kernel-6-1

Best regards,
Romain


> 
>>> I believe this issue could be reproduced on the beaglebone-ai board (I don't
>>> have it).
>>>
>>> [1] https://www.beagleboard.org/boards/beaglebone-ai
>> Thanks for the suggestion, I'll see if I can dig one up.
>>
>>> Best regards,
>>> Romain
>>>
>>>
>>>> [1] https://lore.kernel.org/all/2e5f1997-564c-44e4-b357-6343e0dae7ab@smile.fr/
>>>>
>>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
>>>>
>>>> [3] https://lore.kernel.org/linux-omap/20240315234444.816978-1-romain.naour@smile.fr/T/#u
>>>>
>>>> Regards,
>>>>
>>>> Dave
>>>>
> Thanks,
> 
> Dave
> 



* Re: sdhci-omap: additional PM issue since 5.16
  2025-01-27  9:06       ` Romain Naour
@ 2025-01-27 21:20         ` Robert Nelson
  2025-02-26 15:36           ` Robert Nelson
  0 siblings, 1 reply; 15+ messages in thread
From: Robert Nelson @ 2025-01-27 21:20 UTC (permalink / raw)
  To: Romain Naour; +Cc: David, linux-omap, linux-mmc, Ulf Hansson, Tony Lindgren

> Thanks for testing.
>
> I'm able to reproduce the issue locally (using a kernel 6.1.112).
> It fail after the first sleep 20...
>
> If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver the issue is gone.
>
> About sdhci-omap driver, It's one of the only few enabling
> MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC
> but the eMMC driver doesn't event set MMC_CAP_AGGRESSIVE_PM.
>
> I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for
> HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM has been added to
> sdhci-omap driver to support SDIO WLAN device PM [1].
>
> I've found another similar report on the Beaglebone-black (AM335x SoC) [2].
>
> It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled to SDIO cards.

We've been chasing this bug in BeagleLand for a while. We had Kingston
run it through their hardware debuggers. On the BBB, once the eMMC is
suspended during idle, the proper 'wakeup' cmd is NOT sent;
instead a full reset is forced. Eventually this kills the eMMC. I've been
playing with this same revert for a day or so. With my personal setup,
it takes 3-4 weeks (at idle every day) for it to finally die, so I
won't be able to verify that this 'really' fixes it until next month.

Regards,


--
Robert Nelson
https://rcn-ee.com/


* Re: sdhci-omap: additional PM issue since 5.16
  2025-01-27 21:20         ` Robert Nelson
@ 2025-02-26 15:36           ` Robert Nelson
  2025-02-26 16:06             ` Andreas Kemnade
  0 siblings, 1 reply; 15+ messages in thread
From: Robert Nelson @ 2025-02-26 15:36 UTC (permalink / raw)
  To: Romain Naour, Aaro Koskinen, Andreas Kemnade, Kevin Hilman,
	Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei
  Cc: David, linux-omap, linux-mmc, Tony Lindgren

On Mon, Jan 27, 2025 at 3:20 PM Robert Nelson <robertcnelson@gmail.com> wrote:
>
> > Thanks for testing.
> >
> > I'm able to reproduce the issue locally (using a kernel 6.1.112).
> > It fail after the first sleep 20...
> >
> > If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver the issue is gone.
> >
> > About sdhci-omap driver, It's one of the only few enabling
> > MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC
> > but the eMMC driver doesn't event set MMC_CAP_AGGRESSIVE_PM.
> >
> > I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for
> > HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM has been added to
> > sdhci-omap driver to support SDIO WLAN device PM [1].
> >
> > I've found another similar report on the Beaglebone-black (AM335x SoC) [2].
> >
> > It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled to SDIO cards.
>
> We've been chasing this Bug in BeagleLand for a while. Had Kingston
> run it thru their hardware debuggers.. On the BBB, once the eMMC is
> suspended during idle, the proper 'wakeup' cmd is NOT sent over,
> instead it forces a full reset. Eventually this kills the eMMC. Been
> playing with this same revert for a day or so, with my personal setup,
> it takes 3-4 Weeks (at idle every day) for it to finally die.. So i
> won't be able to verify this 'really' fixes it till next month..

Okay, it survived 4 weeks.. We really need to revert:
3edf588e7fe00e90d1dc7fb9e599861b2c2cf442

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442

On every stable kernel back to v6.1.x, this commit is `killing`
Kingston eMMCs on BeagleBone Blacks in under 21 days.

By reverting the commit, I finally have a board that has survived the
3-week timeline (and a week more) with no issues.

Normally on the MK2704, EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B changes to 0x02
and the eMMC never works again.

[44-am335x-bbb: 6.1.83-ti-r37 (up 4 weeks, 18 hours, 35 minutes)]

*************************************************
cat /sys/kernel/debug/mmc1/ios
clock: 52000000 Hz
vdd: 21 (3.3 ~ 3.4 V)
bus mode: 2 (push-pull)
chip select: 0 (don't care)
power mode: 2 (on)
bus width: 3 (8 bits)
timing spec: 1 (mmc high-speed)
signal voltage: 0 (3.30 V)
driver type: 0 (driver type B)
*************************************************
dmesg | grep boot0
[    5.362457] mmcblk1boot0: mmc1:0001 MK2704 2.00 MiB
*************************************************
eMMC Firmware Version:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
*************************************************
cat /tmp/eMMC.txt
eMMC name: MK2704
eMMC date: 04/2023
eMMC rev: 0x7
eMMC hwrev: 0x0
eMMC fwrev: 0x0100000000000000
eMMC oemid: 0x0100
eMMC manfid: 0x000070
eMMC life_time: 0x01 0x01
eMMC serial: 0x5992401d
*************************************************
0x01
0x01 0x01
*************************************************
cat /boot/uEnv.txt
uname_r=6.1.83-ti-r37
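
The life-time check above can be automated with a short helper. This is a minimal sketch: it assumes the card exposes a life_time attribute in sysfs in the "0x01 0x01" form shown in the dump, that the card path is /sys/bus/mmc/devices/mmc1:0001, and the function name life_time_worn and its return convention are hypothetical.

```shell
#!/bin/bash
# Sketch only: helper name and return convention are hypothetical.

# Return 0 if either estimate in a "0xAA 0xBB" life_time string is above the
# baseline (default 0x01), 1 otherwise.
life_time_worn() {
    local life_time="$1" baseline="${2:-0x01}"
    local a b
    read -r a b <<< "$life_time"
    (( a > baseline || b > baseline ))
}

CARD=${CARD:-/sys/bus/mmc/devices/mmc1:0001}

# Only read sysfs when the attribute is present on this system.
if [[ -r "$CARD/life_time" ]]; then
    lt=$(cat "$CARD/life_time")
    if life_time_worn "$lt"; then
        echo "life_time changed: $lt"
    else
        echo "life_time unchanged: $lt"
    fi
fi
```

Run from a daily cron job or timer, this would record the day on which the estimate first moves away from 0x01 instead of relying on the board failing.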

I'm sure someone will argue that we should test this "PM" mode... Well,
on AM335x this is broken. At ~$60 a pop, I'm tired of testing this
regression for the last year and a half, waiting 3 weeks for a board to
die.

This needs to be reverted!

Regards,

-- 
Robert Nelson
https://rcn-ee.com/


* Re: sdhci-omap: additional PM issue since 5.16
  2025-02-26 15:36           ` Robert Nelson
@ 2025-02-26 16:06             ` Andreas Kemnade
  2025-03-07  4:28               ` Tony Lindgren
  0 siblings, 1 reply; 15+ messages in thread
From: Andreas Kemnade @ 2025-02-26 16:06 UTC (permalink / raw)
  To: Robert Nelson
  Cc: Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros,
	Ulf Hansson, Jason Kridner, Aldea, Andrei, David, linux-omap,
	linux-mmc, Tony Lindgren

On Wed, 26 Feb 2025 09:36:40 -0600
Robert Nelson <robertcnelson@gmail.com> wrote:

> On Mon, Jan 27, 2025 at 3:20 PM Robert Nelson <robertcnelson@gmail.com> wrote:
> >  
> > > Thanks for testing.
> > >
> > > I'm able to reproduce the issue locally (using a kernel 6.1.112).
> > > It fail after the first sleep 20...
> > >
> > > If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver the issue is gone.
> > >
> > > About sdhci-omap driver, It's one of the only few enabling
> > > MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC
> > > but the eMMC driver doesn't event set MMC_CAP_AGGRESSIVE_PM.
> > >
> > > I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for
> > > HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM has been added to
> > > sdhci-omap driver to support SDIO WLAN device PM [1].
> > >
> > > I've found another similar report on the Beaglebone-black (AM335x SoC) [2].
> > >
> > > It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled to SDIO cards.  
> >
> > We've been chasing this Bug in BeagleLand for a while. Had Kingston
> > run it thru their hardware debuggers.. On the BBB, once the eMMC is
> > suspended during idle, the proper 'wakeup' cmd is NOT sent over,
> > instead it forces a full reset. Eventually this kills the eMMC. Been
> > playing with this same revert for a day or so, with my personal setup,
> > it takes 3-4 Weeks (at idle every day) for it to finally die.. So i
> > won't be able to verify this 'really' fixes it till next month..  
> 
> Okay, it survived 4 weeks.. We really need to revert:
> 3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
> 
> On every stable kernel back to v6.1.x, this commit is `killing`
> Kingston eMMC's on BeagleBone Black's in under 21 days.
> 
> By reverting the commit, I finally have a board that's survived the 3
> week timeline, (and a week more) with no issues.
> 
Is there any simple way to restrict it to SDIO devices only, so that we
can move forward a bit?

Regards,
Andreas


* Re: sdhci-omap: additional PM issue since 5.16
  2025-02-26 16:06             ` Andreas Kemnade
@ 2025-03-07  4:28               ` Tony Lindgren
  2025-03-07 14:42                 ` Robert Nelson
  2025-03-12 11:55                 ` Ulf Hansson
  0 siblings, 2 replies; 15+ messages in thread
From: Tony Lindgren @ 2025-03-07  4:28 UTC (permalink / raw)
  To: Andreas Kemnade
  Cc: Robert Nelson, Romain Naour, Aaro Koskinen, Kevin Hilman,
	Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei, David,
	linux-omap, linux-mmc

* Andreas Kemnade <andreas@kemnade.info> [250226 16:06]:
> Am Wed, 26 Feb 2025 09:36:40 -0600
> schrieb Robert Nelson <robertcnelson@gmail.com>:
> 
> > On Mon, Jan 27, 2025 at 3:20 PM Robert Nelson <robertcnelson@gmail.com> wrote:
> > >  
> > > > Thanks for testing.
> > > >
> > > > I'm able to reproduce the issue locally (using a kernel 6.1.112).
> > > > It fail after the first sleep 20...
> > > >
> > > > If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver the issue is gone.
> > > >
> > > > About sdhci-omap driver, It's one of the only few enabling
> > > > MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC
> > > > but the eMMC driver doesn't event set MMC_CAP_AGGRESSIVE_PM.
> > > >
> > > > I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for
> > > > HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM has been added to
> > > > sdhci-omap driver to support SDIO WLAN device PM [1].
> > > >
> > > > I've found another similar report on the Beaglebone-black (AM335x SoC) [2].
> > > >
> > > > It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled to SDIO cards.  
> > >
> > > We've been chasing this Bug in BeagleLand for a while. Had Kingston
> > > run it thru their hardware debuggers.. On the BBB, once the eMMC is
> > > suspended during idle, the proper 'wakeup' cmd is NOT sent over,
> > > instead it forces a full reset. Eventually this kills the eMMC. Been
> > > playing with this same revert for a day or so, with my personal setup,
> > > it takes 3-4 Weeks (at idle every day) for it to finally die.. So i
> > > won't be able to verify this 'really' fixes it till next month..  
> > 
> > Okay, it survived 4 weeks.. We really need to revert:
> > 3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
> > 
> > On every stable kernel back to v6.1.x, this commit is `killing`
> > Kingston eMMC's on BeagleBone Black's in under 21 days.
> > 
> > By reverting the commit, I finally have a board that's survived the 3
> > week timeline, (and a week more) with no issues.
> > 
> Is there any simple way to restrain it to only sdio devices to go
> forward a bit?

Best to revert the patch first until the issue has been fixed.

Based on the symptoms, it sounds like there might be a missing flush of
a posted write in the PM runtime suspend/resume path. This could cause
something in the sequence to happen in the wrong order for some of the
related surrounding resources like power, clocks or interrupts.

Regards,

Tony


* Re: sdhci-omap: additional PM issue since 5.16
  2025-03-07  4:28               ` Tony Lindgren
@ 2025-03-07 14:42                 ` Robert Nelson
  2025-03-09 17:58                   ` Andreas Kemnade
  2025-03-12 11:55                 ` Ulf Hansson
  1 sibling, 1 reply; 15+ messages in thread
From: Robert Nelson @ 2025-03-07 14:42 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Andreas Kemnade, Romain Naour, Aaro Koskinen, Kevin Hilman,
	Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei, David,
	linux-omap, linux-mmc

On Thu, Mar 6, 2025 at 10:28 PM Tony Lindgren <tony@atomide.com> wrote:
>
> Best to revert the patch first until the issue has been fixed.
>
> Based on the symptoms, it sounds like there might be a missing flush of
> a posted write in the PM runtime suspend/resume path. This could cause
> something in the sequence happen in the wrong order for some of the
> related surrounding resources like power, clocks or interrupts.
>

Kingston's hardware analyzer showed that about 10us after CMD5/sleep,
instead of CMD5/wakeup being sent, the eMMC just gets reset.  So
someone deep within the sdhc/mmc layer might understand that. ;)

Regards,

-- 
Robert Nelson
https://rcn-ee.com/


* Re: sdhci-omap: additional PM issue since 5.16
  2025-03-07 14:42                 ` Robert Nelson
@ 2025-03-09 17:58                   ` Andreas Kemnade
  0 siblings, 0 replies; 15+ messages in thread
From: Andreas Kemnade @ 2025-03-09 17:58 UTC (permalink / raw)
  To: Robert Nelson
  Cc: Tony Lindgren, Romain Naour, Aaro Koskinen, Kevin Hilman,
	Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei, David,
	linux-omap, linux-mmc

Hi,

On Fri, 7 Mar 2025 08:42:02 -0600
Robert Nelson <robertcnelson@gmail.com> wrote:

> On Thu, Mar 6, 2025 at 10:28 PM Tony Lindgren <tony@atomide.com> wrote:
> >
> > Best to revert the patch first until the issue has been fixed.
> >
> > Based on the symptoms, it sounds like there might be a missing flush of
> > a posted write in the PM runtime suspend/resume path. This could cause
> > something in the sequence happen in the wrong order for some of the
> > related surrounding resources like power, clocks or interrupts.
> >  
> 
> Kington's hardware anaylizer said after CMD5/sleep in about 10us,
> instead of CMD5/wkup being called, it just resets the eMMC..  So
> someone deep within the sdhc/mmc layer might understand that. ;)
> 
Hmm, omap3/4/5 are using omap-hsmmc and have not been switched to sdhci
yet. So I guess that might be the reason why things are not coming to
light that easily. Also, I am wondering whether we can start by destroying
SD cards rather than eMMC. So maybe we can do some stress-testing to
make the cycles shorter. Do we have an overview of what is really working
reliably with that driver?


Regards,
Andreas


* Re: sdhci-omap: additional PM issue since 5.16
  2025-03-07  4:28               ` Tony Lindgren
  2025-03-07 14:42                 ` Robert Nelson
@ 2025-03-12 11:55                 ` Ulf Hansson
  2025-03-20  4:14                   ` Tony Lindgren
  1 sibling, 1 reply; 15+ messages in thread
From: Ulf Hansson @ 2025-03-12 11:55 UTC (permalink / raw)
  To: Robert Nelson, Tony Lindgren
  Cc: Andreas Kemnade, Romain Naour, Aaro Koskinen, Kevin Hilman,
	Roger Quadros, Jason Kridner, Aldea, Andrei, David, linux-omap,
	linux-mmc

On Fri, 7 Mar 2025 at 05:28, Tony Lindgren <tony@atomide.com> wrote:
>
> * Andreas Kemnade <andreas@kemnade.info> [250226 16:06]:
> > Am Wed, 26 Feb 2025 09:36:40 -0600
> > schrieb Robert Nelson <robertcnelson@gmail.com>:
> >
> > > On Mon, Jan 27, 2025 at 3:20 PM Robert Nelson <robertcnelson@gmail.com> wrote:
> > > >
> > > > > Thanks for testing.
> > > > >
> > > > > I'm able to reproduce the issue locally (using a kernel 6.1.112).
> > > > > It fails after the first sleep 20...
> > > > >
> > > > > If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver the issue is gone.
> > > > >
> > > > > About the sdhci-omap driver: it's one of the few drivers enabling
> > > > > MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC
> > > > > but the eMMC driver doesn't even set MMC_CAP_AGGRESSIVE_PM.
> > > > >
> > > > > I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for
> > > > > HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM has been added to
> > > > > sdhci-omap driver to support SDIO WLAN device PM [1].
> > > > >
> > > > > I've found another similar report on the Beaglebone-black (AM335x SoC) [2].
> > > > >
> > > > > It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled for SDIO cards.
> > > >
> > > > We've been chasing this Bug in BeagleLand for a while. Had Kingston
> > > > run it thru their hardware debuggers.. On the BBB, once the eMMC is
> > > > suspended during idle, the proper 'wakeup' cmd is NOT sent over,
> > > > instead it forces a full reset. Eventually this kills the eMMC. Been
> > > > playing with this same revert for a day or so, with my personal setup,
> > > > it takes 3-4 Weeks (at idle every day) for it to finally die.. So i
> > > > won't be able to verify this 'really' fixes it till next month..
> > >
> > > Okay, it survived 4 weeks.. We really need to revert:
> > > 3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
> > >
> > > On every stable kernel back to v6.1.x, this commit is `killing`
> > > Kingston eMMC's on BeagleBone Black's in under 21 days.
> > >
> > > By reverting the commit, I finally have a board that's survived the 3
> > > week timeline, (and a week more) with no issues.
> > >
> > Is there any simple way to restrict it to SDIO devices only, to go
> > forward a bit?
>
> Best to revert the patch first until the issue has been fixed.
>
> Based on the symptoms, it sounds like there might be a missing flush of
> a posted write in the PM runtime suspend/resume path. This could cause
> something in the sequence to happen in the wrong order for some of the
> related surrounding resources like power, clocks or interrupts.

SDIO is entirely different in this regard compared to eMMC/SD. So if
there are no reports of issues I suggest we keep the SDIO part.

Let me help out and cook a patch for this. I'll send it out in a few minutes.

>
> Regards,
>
> Tony

Kind regards
Uffe

* Re: sdhci-omap: additional PM issue since 5.16
  2025-03-12 11:55                 ` Ulf Hansson
@ 2025-03-20  4:14                   ` Tony Lindgren
  2025-03-20  8:47                     ` Ulf Hansson
  0 siblings, 1 reply; 15+ messages in thread
From: Tony Lindgren @ 2025-03-20  4:14 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Robert Nelson, Andreas Kemnade, Romain Naour, Aaro Koskinen,
	Kevin Hilman, Roger Quadros, Jason Kridner, Aldea, Andrei, David,
	linux-omap, linux-mmc

Hi,

* Ulf Hansson <ulf.hansson@linaro.org> [250312 11:56]:
> On Fri, 7 Mar 2025 at 05:28, Tony Lindgren <tony@atomide.com> wrote:
> > Based on the symptoms, it sounds like there might be a missing flush of
> > a posted write in the PM runtime suspend/resume path. This could cause
> > something in the sequence to happen in the wrong order for some of the
> > related surrounding resources like power, clocks or interrupts.
> 
> SDIO is entirely different in this regard compared to eMMC/SD. So if
> there are no reports of issues I suggest we keep the SDIO part.

Hmm just wondering if you have any guesses what causes the eMMC/SD related
PM to break?

Regards,

Tony

* Re: sdhci-omap: additional PM issue since 5.16
  2025-03-20  4:14                   ` Tony Lindgren
@ 2025-03-20  8:47                     ` Ulf Hansson
  2025-04-01  4:00                       ` Tony Lindgren
  0 siblings, 1 reply; 15+ messages in thread
From: Ulf Hansson @ 2025-03-20  8:47 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Robert Nelson, Andreas Kemnade, Romain Naour, Aaro Koskinen,
	Kevin Hilman, Roger Quadros, Jason Kridner, Aldea, Andrei, David,
	linux-omap, linux-mmc

On Thu, 20 Mar 2025 at 05:14, Tony Lindgren <tony@atomide.com> wrote:
>
> Hi,
>
> * Ulf Hansson <ulf.hansson@linaro.org> [250312 11:56]:
> > On Fri, 7 Mar 2025 at 05:28, Tony Lindgren <tony@atomide.com> wrote:
> > > Based on the symptoms, it sounds like there might be a missing flush of
> > > a posted write in the PM runtime suspend/resume path. This could cause
> > > something in the sequence to happen in the wrong order for some of the
> > > related surrounding resources like power, clocks or interrupts.
> >
> > SDIO is entirely different in this regard compared to eMMC/SD. So if
> > there are no reports of issues I suggest we keep the SDIO part.
>
> Hmm just wondering if you have any guesses what causes the eMMC/SD related
> PM to break?
>
> Regards,
>
> Tony

Well, I have recently been looking a bit closer at the runtime PM
support of the eMMC/SD card. We seem to have some kind of related
problems [1], but I am not really sure yet.

That said, I believe I may have found some *potential* issues and I am
working on a few patches for it (for the mmc core), I will keep you
and the people in $subject posted. Although, it's not quite clear to
me, why these problems have turned up at this point and not a lot
earlier. I have a feeling there is something that I am missing.

Also note that, if the problems are sdhci/sdhci-omap specific, it
becomes a bit more difficult for me to help out.

Luckily, it seems like David shared a pretty simple script with us,
which should reproduce the problem in just a few minutes. There are
also debugfs and the runtime PM-sysfs interface one can use to play
with the behaviour of MMC_CAP_AGGRESSIVE_PM.
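
As a rough sketch of the runtime PM sysfs side of this (not something
posted in this thread): the generic power/control, power/runtime_status
and power/autosuspend_delay_ms attributes can be read for the card
device, and writing "on" to power/control keeps the device from being
runtime suspended. The helper names below are hypothetical, and a device
path such as /sys/bus/mmc/devices/mmc1:0001 is only an assumption; the
actual path depends on the board.

```python
"""Sketch: inspect or pin the runtime PM state of a device via sysfs.

The function names are made up for illustration; only the generic
power/ attributes of the runtime PM sysfs interface are relied upon.
"""
from pathlib import Path

# Generic runtime PM attributes found under <device>/power/.
PM_ATTRS = ("control", "runtime_status", "autosuspend_delay_ms")


def read_pm_state(dev_dir):
    """Return a dict of runtime PM attributes for the given device dir."""
    power = Path(dev_dir) / "power"
    state = {}
    for name in PM_ATTRS:
        try:
            state[name] = (power / name).read_text().strip()
        except OSError:
            # Attribute is missing or cannot be read for this device.
            state[name] = None
    return state


def set_pm_control(dev_dir, mode):
    """Write 'on' (never runtime suspend) or 'auto' to power/control."""
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    (Path(dev_dir) / "power" / "control").write_text(mode)
```

Run as root, calling set_pm_control(dev, "on") before the reproducer
and set_pm_control(dev, "auto") afterwards would be one way to check
whether the bad reads only show up once the card has been runtime
suspended; read_pm_state(dev) shows the current state in between.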

Kind regards
Uffe

[1]
https://bugzilla.kernel.org/show_bug.cgi?id=218821
https://lore.kernel.org/all/CAPDyKFq4-fL3oHeT9phThWQJqzicKeA447WBJUbtcKPhdZ2d1A@mail.gmail.com/T/

* Re: sdhci-omap: additional PM issue since 5.16
  2025-03-20  8:47                     ` Ulf Hansson
@ 2025-04-01  4:00                       ` Tony Lindgren
  0 siblings, 0 replies; 15+ messages in thread
From: Tony Lindgren @ 2025-04-01  4:00 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Robert Nelson, Andreas Kemnade, Romain Naour, Aaro Koskinen,
	Kevin Hilman, Roger Quadros, Jason Kridner, Aldea, Andrei, David,
	linux-omap, linux-mmc

* Ulf Hansson <ulf.hansson@linaro.org> [250320 08:48]:
> On Thu, 20 Mar 2025 at 05:14, Tony Lindgren <tony@atomide.com> wrote:
> >
> > Hi,
> >
> > * Ulf Hansson <ulf.hansson@linaro.org> [250312 11:56]:
> > > On Fri, 7 Mar 2025 at 05:28, Tony Lindgren <tony@atomide.com> wrote:
> > > > Based on the symptoms, it sounds like there might be a missing flush of
> > > > a posted write in the PM runtime suspend/resume path. This could cause
> > > > something in the sequence to happen in the wrong order for some of the
> > > > related surrounding resources like power, clocks or interrupts.
> > >
> > > SDIO is entirely different in this regard compared to eMMC/SD. So if
> > > there are no reports of issues I suggest we keep the SDIO part.
> >
> > Hmm just wondering if you have any guesses what causes the eMMC/SD related
> > PM to break?
> >
> > Regards,
> >
> > Tony
> 
> Well, I have recently been looking a bit closer at the runtime PM
> support of the eMMC/SD card. We seem to have some kind of related
> problems [1], but I am not really sure yet.
> 
> That said, I believe I may have found some *potential* issues and I am
> working on a few patches for it (for the mmc core), I will keep you
> and the people in $subject posted. Although, it's not quite clear to
> me, why these problems have turned up at this point and not a lot
> earlier. I have a feeling there is something that I am missing.
> 
> Also note that, if the problems are sdhci/sdhci-omap specific, it
> becomes a bit more difficult for me to help out.
> 
> Luckily, it seems like David shared a pretty simple script with us,
> which should reproduce the problem in just a few minutes. There are
> also debugfs and the runtime PM-sysfs interface one can use to play
> with the behaviour of MMC_CAP_AGGRESSIVE_PM.

OK, thanks for the description. AFAIK this issue has not been happening
with the eMMC and the old omap-hsmmc driver. Sounds like the issue might
also be eMMC specific.

Regards,

Tony

> [1]
> https://bugzilla.kernel.org/show_bug.cgi?id=218821
> https://lore.kernel.org/all/CAPDyKFq4-fL3oHeT9phThWQJqzicKeA447WBJUbtcKPhdZ2d1A@mail.gmail.com/T/

Thread overview: 15+ messages
2025-01-23 22:09 sdhci-omap: additional PM issue since 5.16 David Owens
2025-01-24 10:36 ` Romain Naour
2025-01-24 17:15   ` David Owens
2025-01-24 18:49     ` David
2025-01-27  9:06       ` Romain Naour
2025-01-27 21:20         ` Robert Nelson
2025-02-26 15:36           ` Robert Nelson
2025-02-26 16:06             ` Andreas Kemnade
2025-03-07  4:28               ` Tony Lindgren
2025-03-07 14:42                 ` Robert Nelson
2025-03-09 17:58                   ` Andreas Kemnade
2025-03-12 11:55                 ` Ulf Hansson
2025-03-20  4:14                   ` Tony Lindgren
2025-03-20  8:47                     ` Ulf Hansson
2025-04-01  4:00                       ` Tony Lindgren
