* sdhci-omap: additional PM issue since 5.16
@ 2025-01-23 22:09 David Owens
  0 siblings, 1 reply; 15+ messages in thread
From: David Owens @ 2025-01-23 22:09 UTC
To: linux-omap; +Cc: linux-mmc

Hello,

I have an AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38. The eMMC is using mmc-hs200 powered at 1.8 V. Reads from /dev/mmcblk1boot0 return the expected data except when a delay of several seconds is inserted between reads. With a delay between reads, a read will occasionally (~50% of the time) return garbage data. Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0. The same thing happens when reading from /dev/mmcblk1boot1.

Much like a previous report on the linux-omap mailing list [1], I was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2]. Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue. Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0. I also don't see the I/O errors mentioned in the previous posting. Reads always succeed and return the correct amount of data; it's just from the wrong device.

[1] https://lore.kernel.org/all/2e5f1997-564c-44e4-b357-6343e0dae7ab@smile.fr/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
[3] https://lore.kernel.org/linux-omap/20240315234444.816978-1-romain.naour@smile.fr/T/#u

Regards,
Dave

* Re: sdhci-omap: additional PM issue since 5.16
From: Romain Naour @ 2025-01-24 10:36 UTC
To: David Owens, linux-omap; +Cc: linux-mmc

Hello David,

On 23/01/2025 at 23:09, David Owens wrote:
> I have an AM574x system and encountered an eMMC regression when upgrading from 5.15 to 6.1.38. The eMMC is using mmc-hs200 powered at 1.8 V. Reads from /dev/mmcblk1boot0 return the expected data except when a delay of several seconds is inserted between reads. With a delay between reads, a read will occasionally (~50% of the time) return garbage data. Using hexdump, I was able to determine that the "bad" data is actually coming from /dev/mmcblk1, not /dev/mmcblk1boot0. The same thing happens when reading from /dev/mmcblk1boot1.
>
> Much like a previous report on the linux-omap mailing list [1], I was able to correct the regression by reverting the commit "mmc: sdhci-omap: Allow SDIO card power off and enable aggressive PM" [2]. Unlike the previous report, applying the sdhci-omap patch [3] did not resolve my issue. Only reverting the original commit allowed for reliable reads from /dev/mmcblk1boot0. I also don't see the I/O errors mentioned in the previous posting. Reads always succeed and return the correct amount of data; it's just from the wrong device.

Interesting, can you share a test script to reproduce your issue?

Why 6.1.38? The latest 6.1.x stable release is already 6.1.127.
Can you test with the latest stable release?

I believe this issue could be reproduced on the beaglebone-ai board [1] (I don't have one).

[1] https://www.beagleboard.org/boards/beaglebone-ai

Best regards,
Romain

* Re: sdhci-omap: additional PM issue since 5.16
From: David Owens @ 2025-01-24 17:15 UTC
To: Romain Naour, linux-omap; +Cc: linux-mmc

Hi Romain,

On 1/24/25 04:36, Romain Naour wrote:
> Interesting, can you share a test script to reproduce your issue?

Here is a test script I've been running on my devices. A failure is typically detected after a minute or two. I include the eMMC part type in the output because we've used a couple of different parts in production, all claiming to be compatible, and I'm starting to wonder if the failure is a combination of the aggressive PM _and_ specific eMMC parts. The offset used in hexdump was just a place in both mmcblk1 and mmcblk1boot0 that was non-zero. The issue happens using any offset.

#!/bin/bash

echo "Kernel: $(uname -r)"
echo "eMMC part: $(dmesg | grep 'mmcblk1: mmc1:0001' | awk '{print $5}')"

# Reference reads: user area and boot partition at the same offset.
BLK1=$(hexdump -C /dev/mmcblk1 -s 0x3fc000 -n 10 | head -n 1)
BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)

echo "/dev/mmcblk1: ${BLK1}"
echo "/dev/mmcblk1boot0: ${BOOT}"

# Keep re-reading boot0 after an idle period; the loop ends once the
# boot0 read returns the data that actually lives on mmcblk1.
while [[ "$BLK1" != "$BOOT" ]]; do
    sleep 20
    BOOT=$(hexdump -C /dev/mmcblk1boot0 -s 0x3fc000 -n 10 | head -n 1)
    echo "/dev/mmcblk1boot0: ${BOOT}"
done

echo "/dev/mmcblk1boot0 read failure"

> Why 6.1.38? The latest 6.1.x stable release is already 6.1.127.
> Can you test with the latest stable release?

Good question. I can certainly update to .127, but at the time we were shipping units we were on .38, so that's where I've been doing all my testing. I'll let you know how running under .127 compares.

> I believe this issue could be reproduced on the beaglebone-ai board (I don't
> have it).
>
> [1] https://www.beagleboard.org/boards/beaglebone-ai

Thanks for the suggestion, I'll see if I can dig one up.

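Editor's note: the reproduction loop in the message above runs until the first failure. A bounded variant that also says whether the bad data matches the user area might look like the sketch below. It is only an illustration under stated assumptions: read_sig, classify_read and check_boot_reads are hypothetical helper names, GNU od is swapped in for hexdump, and the first read of the boot partition is assumed to be good and is used as the reference.

```shell
#!/bin/bash
# Sketch only. read_sig, classify_read and check_boot_reads are hypothetical
# helper names; od is used here in place of hexdump.

# read_sig DEV OFFSET LEN: print LEN bytes at OFFSET as one hex string.
read_sig() {
    od -An -tx1 -j "$2" -N "$3" "$1" | tr -d ' \n'
}

# classify_read REF CUR USER: "ok" if CUR equals the reference read,
# "user-area" if it equals the user-area data, otherwise "unknown".
classify_read() {
    if [ "$2" = "$1" ]; then
        echo ok
    elif [ "$2" = "$3" ]; then
        echo user-area
    else
        echo unknown
    fi
}

# check_boot_reads BOOT_DEV USER_DEV OFFSET LEN TRIES DELAY
# Assumes the very first read of BOOT_DEV is good and uses it as reference.
# Returns 1 on the first re-read that differs, 0 if all TRIES reads matched.
check_boot_reads() {
    local boot_dev=$1 user_dev=$2 offset=$3 len=$4 tries=$5 delay=$6
    local ref user cur verdict i=1
    ref=$(read_sig "$boot_dev" "$offset" "$len")
    user=$(read_sig "$user_dev" "$offset" "$len")
    while [ "$i" -le "$tries" ]; do
        sleep "$delay"
        cur=$(read_sig "$boot_dev" "$offset" "$len")
        verdict=$(classify_read "$ref" "$cur" "$user")
        if [ "$verdict" != ok ]; then
            echo "read $i: $verdict data ($cur), expected $ref"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $tries reads matched"
}
```

With the devices and offset from the message, a run of 30 reads spaced 20 seconds apart would be: check_boot_reads /dev/mmcblk1boot0 /dev/mmcblk1 0x3fc000 10 30 20
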
* Re: sdhci-omap: additional PM issue since 5.16
From: David @ 2025-01-24 18:49 UTC
To: Romain Naour, linux-omap; +Cc: linux-mmc

On 1/24/25 11:15, David Owens wrote:
> On 1/24/25 04:36, Romain Naour wrote:
>> Why 6.1.38? nowadays the 6.1.x stable is 6.1.127 already.
>> Can you test with the latest stable release?
> Good question. I can certainly update to .127 but at the time we were shipping
> units we were on .38 so that's where I've been doing all my testing. I'll let
> you know how running under .127 compares.

Testing with 6.1.127 shows the same behavior.

Thanks,
Dave

* Re: sdhci-omap: additional PM issue since 5.16
From: Romain Naour @ 2025-01-27 9:06 UTC
To: David, linux-omap; +Cc: linux-mmc, Ulf Hansson, Tony Lindgren

Hello David, All,

On 24/01/2025 at 19:49, David wrote:
> Testing with 6.1.127 shows the same behavior.

Thanks for testing.

I'm able to reproduce the issue locally (using kernel 6.1.112). It fails after the first sleep 20...

If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver, the issue is gone.

The sdhci-omap driver is one of the few that enable MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC, and its eMMC driver doesn't even set MMC_CAP_AGGRESSIVE_PM.

I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM was added to the sdhci-omap driver to support SDIO WLAN device PM [1].

I've found another similar report on the BeagleBone Black (AM335x SoC) [2].

It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled for SDIO cards. The TRM (SoC manual) says that "Suspend-Resume Flow" is only supported for SDIO cards:

  26.5.1.2.1.6 Suspend-Resume Flow
  The suspend-and-resume feature is supported only by SDIO cards.

Thoughts?

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
[2] https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1332523/beagl-bone-black-problems-reading-from-emmc-boot-partitions-on-beaglebone-with-kernel-6-1

Best regards,
Romain

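Editor's note: the message above suggests that aggressive PM should apply to SDIO cards only. Without patching the driver, a rough userspace experiment along those lines is to look at each card's type and runtime PM state in sysfs and forbid runtime suspend for the non-SDIO ones. The sketch below assumes the generic sysfs attributes type, power/control and power/runtime_status are present under /sys/bus/mmc/devices; report_card_pm and pin_non_sdio_cards are hypothetical helper names. Nobody in this thread tested whether pinning the card "on" avoids the bad reads, so treat it as a diagnostic aid, not a fix.

```shell
#!/bin/bash
# Sketch only. report_card_pm and pin_non_sdio_cards are hypothetical names.

# report_card_pm [DIR]: for every card under DIR (default
# /sys/bus/mmc/devices) print "<card> <type> <control> <runtime_status>".
# A missing attribute is printed as "?".
report_card_pm() {
    local root=${1:-/sys/bus/mmc/devices} card
    for card in "$root"/*; do
        [ -d "$card" ] || continue
        printf '%s %s %s %s\n' \
            "$(basename "$card")" \
            "$(cat "$card/type" 2>/dev/null || echo '?')" \
            "$(cat "$card/power/control" 2>/dev/null || echo '?')" \
            "$(cat "$card/power/runtime_status" 2>/dev/null || echo '?')"
    done
}

# pin_non_sdio_cards [DIR]: write "on" to power/control of every card whose
# type is not SDIO, which forbids runtime suspend for that card device.
# SDIO cards are left untouched. Needs root on a real system.
pin_non_sdio_cards() {
    local root=${1:-/sys/bus/mmc/devices} card type
    for card in "$root"/*; do
        [ -d "$card" ] || continue
        type=$(cat "$card/type" 2>/dev/null)
        if [ "$type" != "SDIO" ] && [ -w "$card/power/control" ]; then
            echo on > "$card/power/control"
        fi
    done
}
```

Running report_card_pm before and after the 20 second idle period from the test script shows whether the eMMC card device actually went to "suspended" in between.
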
* Re: sdhci-omap: additional PM issue since 5.16
From: Robert Nelson @ 2025-01-27 21:20 UTC
To: Romain Naour; +Cc: David, linux-omap, linux-mmc, Ulf Hansson, Tony Lindgren

> Thanks for testing.
>
> I'm able to reproduce the issue locally (using a kernel 6.1.112).
> It fail after the first sleep 20...
>
> If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver the issue is gone.
>
> About sdhci-omap driver, It's one of the only few enabling
> MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC
> but the eMMC driver doesn't event set MMC_CAP_AGGRESSIVE_PM.
>
> I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for
> HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM has been added to
> sdhci-omap driver to support SDIO WLAN device PM [1].
>
> I've found another similar report on the Beaglebone-black (AM335x SoC) [2].
>
> It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled to SDIO cards.

We've been chasing this bug in BeagleLand for a while. We had Kingston run it through their hardware debuggers. On the BBB, once the eMMC is suspended during idle, the proper 'wakeup' command is NOT sent; instead, a full reset is forced. Eventually this kills the eMMC.

I've been playing with this same revert for a day or so. With my personal setup, it takes 3-4 weeks (at idle every day) for the eMMC to finally die, so I won't be able to verify that this 'really' fixes it until next month.

Regards,

--
Robert Nelson
https://rcn-ee.com/

* Re: sdhci-omap: additional PM issue since 5.16
From: Robert Nelson @ 2025-02-26 15:36 UTC
To: Romain Naour, Aaro Koskinen, Andreas Kemnade, Kevin Hilman, Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei
Cc: David, linux-omap, linux-mmc, Tony Lindgren

On Mon, Jan 27, 2025 at 3:20 PM Robert Nelson <robertcnelson@gmail.com> wrote:
> We've been chasing this Bug in BeagleLand for a while. Had Kingston
> run it thru their hardware debuggers.. On the BBB, once the eMMC is
> suspended during idle, the proper 'wakeup' cmd is NOT sent over,
> instead it forces a full reset. Eventually this kills the eMMC. Been
> playing with this same revert for a day or so, with my personal setup,
> it takes 3-4 Weeks (at idle every day) for it to finally die.. So i
> won't be able to verify this 'really' fixes it till next month..

Okay, it survived 4 weeks. We really need to revert:

3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442

On every stable kernel back to v6.1.x, this commit is `killing` Kingston eMMCs on BeagleBone Blacks in under 21 days.

By reverting the commit, I finally have a board that has survived the 3-week timeline (and a week more) with no issues. Normally, on the MK2704, EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B changes to 0x02 and the eMMC never works again.

[44-am335x-bbb: 6.1.83-ti-r37 (up 4 weeks, 18 hours, 35 minutes)]
*************************************************
cat /sys/kernel/debug/mmc1/ios
clock: 52000000 Hz
vdd: 21 (3.3 ~ 3.4 V)
bus mode: 2 (push-pull)
chip select: 0 (don't care)
power mode: 2 (on)
bus width: 3 (8 bits)
timing spec: 1 (mmc high-speed)
signal voltage: 0 (3.30 V)
driver type: 0 (driver type B)
*************************************************
dmesg | grep boot0
[ 5.362457] mmcblk1boot0: mmc1:0001 MK2704 2.00 MiB
*************************************************
eMMC Firmware Version:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
*************************************************
cat /tmp/eMMC.txt
eMMC name: MK2704
eMMC date: 04/2023
eMMC rev: 0x7
eMMC hwrev: 0x0
eMMC fwrev: 0x0100000000000000
eMMC oemid: 0x0100
eMMC manfid: 0x000070
eMMC life_time: 0x01 0x01
eMMC serial: 0x5992401d
*************************************************
0x01 0x01 0x01
*************************************************
cat /boot/uEnv.txt
uname_r=6.1.83-ti-r37

I'm sure someone will argue that we should test this "PM" mode. Well, on AM335x this is broken. At ~$60 a pop, I'm tired of testing this regression for the last year and a half, waiting 3 weeks for each board to die. This needs to be reverted!

Regards,

--
Robert Nelson
https://rcn-ee.com/

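Editor's note: the message above tracks eMMC wear through the two life-time estimates, which show as "0x01 0x01" on the healthy board. A small check that flags any departure from that baseline could look like the sketch below. It assumes the card's sysfs life_time attribute (for example /sys/bus/mmc/devices/mmc1:0001/life_time) holds the two values on one line, as in the "eMMC life_time" line of the output above; check_life_time is a hypothetical helper name, and using 0x01 as the baseline is an assumption taken from that output, not a general threshold.

```shell
#!/bin/bash
# Sketch only. check_life_time is a hypothetical helper name.

# check_life_time FILE: FILE holds two hex values such as "0x01 0x01".
# Returns 0 if both are 0x01, 1 if either differs, 2 if FILE is unreadable.
check_life_time() {
    local file=$1 a b
    [ -r "$file" ] || return 2
    # "|| [ -n "$a" ]" tolerates a file without a trailing newline.
    read -r a b < "$file" || [ -n "$a" ] || return 2
    if [ "$a" = "0x01" ] && [ "$b" = "0x01" ]; then
        echo "life_time unchanged: $a $b"
        return 0
    fi
    echo "life_time changed: $a $b"
    return 1
}
```

Run periodically during a multi-week soak test (for instance: check_life_time /sys/bus/mmc/devices/mmc1:0001/life_time), this gives an earlier signal than waiting for the part to stop responding.
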
* Re: sdhci-omap: additional PM issue since 5.16
From: Andreas Kemnade @ 2025-02-26 16:06 UTC
To: Robert Nelson
Cc: Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei, David, linux-omap, linux-mmc, Tony Lindgren

On Wed, 26 Feb 2025 09:36:40 -0600, Robert Nelson <robertcnelson@gmail.com> wrote:

> Okay, it survived 4 weeks.. We really need to revert:
> 3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
>
> On every stable kernel back to v6.1.x, this commit is `killing`
> Kingston eMMC's on BeagleBone Black's in under 21 days.
>
> By reverting the commit, I finally have a board that's survived the 3
> week timeline, (and a week more) with no issues.

Is there any simple way to restrict it to SDIO devices only, so we can move forward a bit?

Regards,
Andreas

* Re: sdhci-omap: additional PM issue since 5.16
From: Tony Lindgren @ 2025-03-07 4:28 UTC
To: Andreas Kemnade
Cc: Robert Nelson, Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei, David, linux-omap, linux-mmc

* Andreas Kemnade <andreas@kemnade.info> [250226 16:06]:
> On Wed, 26 Feb 2025 09:36:40 -0600, Robert Nelson <robertcnelson@gmail.com> wrote:
> > Okay, it survived 4 weeks.. We really need to revert:
> > 3edf588e7fe00e90d1dc7fb9e599861b2c2cf442
> >
> > On every stable kernel back to v6.1.x, this commit is `killing`
> > Kingston eMMC's on BeagleBone Black's in under 21 days.
> >
> > By reverting the commit, I finally have a board that's survived the 3
> > week timeline, (and a week more) with no issues.
>
> Is there any simple way to restrain it to only sdio devices to go
> forward a bit?

Best to revert the patch first until the issue has been fixed.

Based on the symptoms, it sounds like there might be a missing flush of a posted write in the PM runtime suspend/resume path. This could cause something in the sequence to happen in the wrong order for some of the related surrounding resources, like power, clocks or interrupts.

Regards,

Tony

* Re: sdhci-omap: additional PM issue since 5.16
From: Robert Nelson @ 2025-03-07 14:42 UTC
To: Tony Lindgren
Cc: Andreas Kemnade, Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei, David, linux-omap, linux-mmc

On Thu, Mar 6, 2025 at 10:28 PM Tony Lindgren <tony@atomide.com> wrote:
>
> Best to revert the patch first until the issue has been fixed.
>
> Based on the symptoms, it sounds like there might be a missing flush of
> a posted write in the PM runtime suspend/resume path. This could cause
> something in the sequence happen in the wrong order for some of the
> related surrounding resources like power, clocks or interrupts.
>

Kingston's hardware analyzer showed that about 10 us after CMD5/sleep, instead of CMD5/wkup being called, the eMMC just gets reset. So someone deep within the sdhc/mmc layer might understand that. ;)

Regards,

--
Robert Nelson
https://rcn-ee.com/

* Re: sdhci-omap: additional PM issue since 5.16
From: Andreas Kemnade @ 2025-03-09 17:58 UTC
To: Robert Nelson
Cc: Tony Lindgren, Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros, Ulf Hansson, Jason Kridner, Aldea, Andrei, David, linux-omap, linux-mmc

Hi,

On Fri, 7 Mar 2025 08:42:02 -0600, Robert Nelson <robertcnelson@gmail.com> wrote:

> Kington's hardware anaylizer said after CMD5/sleep in about 10us,
> instead of CMD5/wkup being called, it just resets the eMMC.. So
> someone deep within the sdhc/mmc layer might understand that. ;)
>

Hmm, omap3/4/5 are using omap-hsmmc and have not been switched to sdhci yet. I guess that might be the reason why things are not coming to light that easily.

Also, I am wondering whether we can start by destroying SD cards rather than eMMC. Maybe we can do some stress testing to make the cycles shorter.

Do we have an overview of what really works reliably with that driver?

Regards,
Andreas

* Re: sdhci-omap: additional PM issue since 5.16 2025-03-07 4:28 ` Tony Lindgren 2025-03-07 14:42 ` Robert Nelson @ 2025-03-12 11:55 ` Ulf Hansson 2025-03-20 4:14 ` Tony Lindgren 1 sibling, 1 reply; 15+ messages in thread From: Ulf Hansson @ 2025-03-12 11:55 UTC (permalink / raw) To: Robert Nelson, Tony Lindgren Cc: Andreas Kemnade, Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros, Jason Kridner, Aldea, Andrei, David, linux-omap, linux-mmc On Fri, 7 Mar 2025 at 05:28, Tony Lindgren <tony@atomide.com> wrote: > > * Andreas Kemnade <andreas@kemnade.info> [250226 16:06]: > > Am Wed, 26 Feb 2025 09:36:40 -0600 > > schrieb Robert Nelson <robertcnelson@gmail.com>: > > > > > On Mon, Jan 27, 2025 at 3:20 PM Robert Nelson <robertcnelson@gmail.com> wrote: > > > > > > > > > Thanks for testing. > > > > > > > > > > I'm able to reproduce the issue locally (using a kernel 6.1.112). > > > > > It fail after the first sleep 20... > > > > > > > > > > If I remove MMC_CAP_AGGRESSIVE_PM from the sdhci-omap driver the issue is gone. > > > > > > > > > > About sdhci-omap driver, It's one of the only few enabling > > > > > MMC_CAP_AGGRESSIVE_PM. I recently switched to a new project using a newer SoC > > > > > but the eMMC driver doesn't event set MMC_CAP_AGGRESSIVE_PM. > > > > > > > > > > I'm wondering if MMC_CAP_AGGRESSIVE_PM is really safe (or compatible) for > > > > > HS200/HS400 eMMC speed. Indeed, MMC_CAP_AGGRESSIVE_PM has been added to > > > > > sdhci-omap driver to support SDIO WLAN device PM [1]. > > > > > > > > > > I've found another similar report on the Beaglebone-black (AM335x SoC) [2]. > > > > > > > > > > It seems the MMC_CAP_AGGRESSIVE_PM feature should only be enabled to SDIO cards. > > > > > > > > We've been chasing this Bug in BeagleLand for a while. Had Kingston > > > > run it thru their hardware debuggers.. On the BBB, once the eMMC is > > > > suspended during idle, the proper 'wakeup' cmd is NOT sent over, > > > > instead it forces a full reset. 
Eventually this kills the eMMC. Been > > > > playing with this same revert for a day or so, with my personal setup, > > > > it takes 3-4 Weeks (at idle every day) for it to finally die.. So I > > > > won't be able to verify this 'really' fixes it till next month.. > > > > > > Okay, it survived 4 weeks.. We really need to revert: > > > 3edf588e7fe00e90d1dc7fb9e599861b2c2cf442 > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3edf588e7fe00e90d1dc7fb9e599861b2c2cf442 > > > > > > On every stable kernel back to v6.1.x, this commit is `killing` > > > Kingston eMMCs on BeagleBone Blacks in under 21 days. > > > > > > By reverting the commit, I finally have a board that's survived the 3 > > > week timeline, (and a week more) with no issues. > > > > > Is there any simple way to restrict it to SDIO devices only to go > > forward a bit? > > Best to revert the patch first until the issue has been fixed. > > Based on the symptoms, it sounds like there might be a missing flush of > a posted write in the PM runtime suspend/resume path. This could cause > something in the sequence happen in the wrong order for some of the > related surrounding resources like power, clocks or interrupts. SDIO is entirely different in this regard compared to eMMC/SD. So if there are no reports of issues I suggest we keep the SDIO part. Let me help out and cook a patch for this. I will send it out in a few minutes. > > Regards, > > Tony Kind regards Uffe ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sdhci-omap: additional PM issue since 5.16 2025-03-12 11:55 ` Ulf Hansson @ 2025-03-20 4:14 ` Tony Lindgren 2025-03-20 8:47 ` Ulf Hansson 0 siblings, 1 reply; 15+ messages in thread From: Tony Lindgren @ 2025-03-20 4:14 UTC (permalink / raw) To: Ulf Hansson Cc: Robert Nelson, Andreas Kemnade, Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros, Jason Kridner, Aldea, Andrei, David, linux-omap, linux-mmc Hi, * Ulf Hansson <ulf.hansson@linaro.org> [250312 11:56]: > On Fri, 7 Mar 2025 at 05:28, Tony Lindgren <tony@atomide.com> wrote: > > Based on the symptoms, it sounds like there might be a missing flush of > > a posted write in the PM runtime suspend/resume path. This could cause > > something in the sequence happen in the wrong order for some of the > > related surrounding resources like power, clocks or interrupts. > > SDIO is entirely different in this regard compared to eMMC/SD. So if > there are no reports of issues I suggest we keep the SDIO part. Hmm just wondering if you have any guesses what causes the eMMC/SD related PM to break? Regards, Tony ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sdhci-omap: additional PM issue since 5.16 2025-03-20 4:14 ` Tony Lindgren @ 2025-03-20 8:47 ` Ulf Hansson 2025-04-01 4:00 ` Tony Lindgren 0 siblings, 1 reply; 15+ messages in thread From: Ulf Hansson @ 2025-03-20 8:47 UTC (permalink / raw) To: Tony Lindgren Cc: Robert Nelson, Andreas Kemnade, Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros, Jason Kridner, Aldea, Andrei, David, linux-omap, linux-mmc On Thu, 20 Mar 2025 at 05:14, Tony Lindgren <tony@atomide.com> wrote: > > Hi, > > * Ulf Hansson <ulf.hansson@linaro.org> [250312 11:56]: > > On Fri, 7 Mar 2025 at 05:28, Tony Lindgren <tony@atomide.com> wrote: > > > Based on the symptoms, it sounds like there might be a missing flush of > > > a posted write in the PM runtime suspend/resume path. This could cause > > > something in the sequence happen in the wrong order for some of the > > > related surrounding resources like power, clocks or interrupts. > > > > SDIO is entirely different in this regard compared to eMMC/SD. So if > > there are no reports of issues I suggest we keep the SDIO part. > > Hmm just wondering if you have any guesses what causes the eMMC/SD related > PM to break? > > Regards, > > Tony Well, I have recently been looking a bit closer at the runtime PM support of the eMMC/SD card. We seem to have some kind of related problems [1], but I am not really sure yet. That said, I believe I may have found some *potential* issues and I am working on a few patches for it (for the mmc core), I will keep you and the people in $subject posted. Although, it's not quite clear to me, why these problems have turned up at this point and not a lot earlier. I have a feeling there is something that I am missing. Also note that, if the problems are sdhci/sdhci-omap specific, it becomes a bit more difficult for me to help out. Luckily, it seems like David shared a pretty simple script with us, which should reproduce the problem in just a few minutes. 
There are also debugfs and the runtime PM-sysfs interface one can use to play with the behaviour of MMC_CAP_AGGRESSIVE_PM. Kind regards Uffe [1] https://bugzilla.kernel.org/show_bug.cgi?id=218821 https://lore.kernel.org/all/CAPDyKFq4-fL3oHeT9phThWQJqzicKeA447WBJUbtcKPhdZ2d1A@mail.gmail.com/T/ ^ permalink raw reply [flat|nested] 15+ messages in thread
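A minimal sketch of how the runtime PM sysfs interface mentioned above could be inspected and toggled while chasing this. The attribute names (`power/control`, `power/runtime_status`, `power/autosuspend_delay_ms`) are the generic runtime PM attributes the kernel exposes per device; the card path `/sys/bus/mmc/devices/mmc1:0001` is only an assumption and will differ per board, and the helper names are hypothetical. Writing `on` to `power/control` keeps the device from being runtime suspended, which may help check whether the bad reads correlate with runtime suspend; `auto` restores the default behaviour. Writing requires root.

```python
from pathlib import Path

# Assumed location of the eMMC card device; adjust for the board in use.
CARD_DEV = Path("/sys/bus/mmc/devices/mmc1:0001")


def runtime_pm_state(dev: Path) -> dict:
    """Return the generic runtime PM attributes that are readable for a device."""
    names = ("control", "runtime_status", "autosuspend_delay_ms")
    state = {}
    for name in names:
        attr = dev / "power" / name
        try:
            state[name] = attr.read_text().strip()
        except OSError:
            # Attribute missing or unreadable for this device.
            state[name] = None
    return state


def set_runtime_pm(dev: Path, mode: str) -> None:
    """Write 'on' (block runtime suspend) or 'auto' (allow it) to power/control."""
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    (dev / "power" / "control").write_text(mode)


if __name__ == "__main__":
    print(runtime_pm_state(CARD_DEV))
```

Comparing `runtime_status` just before a failing read with and without `set_runtime_pm(CARD_DEV, "on")` is one way to narrow down whether the suspend/resume path is involved.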
* Re: sdhci-omap: additional PM issue since 5.16 2025-03-20 8:47 ` Ulf Hansson @ 2025-04-01 4:00 ` Tony Lindgren 0 siblings, 0 replies; 15+ messages in thread From: Tony Lindgren @ 2025-04-01 4:00 UTC (permalink / raw) To: Ulf Hansson Cc: Robert Nelson, Andreas Kemnade, Romain Naour, Aaro Koskinen, Kevin Hilman, Roger Quadros, Jason Kridner, Aldea, Andrei, David, linux-omap, linux-mmc * Ulf Hansson <ulf.hansson@linaro.org> [250320 08:48]: > On Thu, 20 Mar 2025 at 05:14, Tony Lindgren <tony@atomide.com> wrote: > > > > Hi, > > > > * Ulf Hansson <ulf.hansson@linaro.org> [250312 11:56]: > > > On Fri, 7 Mar 2025 at 05:28, Tony Lindgren <tony@atomide.com> wrote: > > > > Based on the symptoms, it sounds like there might be a missing flush of > > > > a posted write in the PM runtime suspend/resume path. This could cause > > > > something in the sequence happen in the wrong order for some of the > > > > related surrounding resources like power, clocks or interrupts. > > > > > > SDIO is entirely different in this regard compared to eMMC/SD. So if > > > there are no reports of issues I suggest we keep the SDIO part. > > > > Hmm just wondering if you have any guesses what causes the eMMC/SD related > > PM to break? > > > > Regards, > > > > Tony > > Well, I have recently been looking a bit closer at the runtime PM > support of the eMMC/SD card. We seem to have some kind of related > problems [1], but I am not really sure yet. > > That said, I believe I may have found some *potential* issues and I am > working on a few patches for it (for the mmc core), I will keep you > and the people in $subject posted. Although, it's not quite clear to > me, why these problems have turned up at this point and not a lot > earlier. I have a feeling there is something that I am missing. > > Also note that, if the problems are sdhci/sdhci-omap specific, it > becomes a bit more difficult for me to help out. 
> > Luckily, it seems like David shared a pretty simple script with us, > which should reproduce the problem in just a few minutes. There are > also debugfs and the runtime PM-sysfs interface one can use to play > with the behaviour of MMC_CAP_AGGRESSIVE_PM. OK thanks for the description. AFAIK this issue has not been happening with the eMMC and the old mmc-omap-hs driver. Sounds like the issue might also be eMMC specific. Regards, Tony > [1] > https://bugzilla.kernel.org/show_bug.cgi?id=218821 > https://lore.kernel.org/all/CAPDyKFq4-fL3oHeT9phThWQJqzicKeA447WBJUbtcKPhdZ2d1A@mail.gmail.com/T/ ^ permalink raw reply [flat|nested] 15+ messages in thread
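For readers without the script at hand, the reproduction pattern described in this thread (read a boot partition, stay idle for several seconds so the card can be runtime suspended, then read again and compare) could be sketched roughly as follows. This is not David's script, only an illustration; the function names, the 4096-byte read length and the round count are assumptions, while `/dev/mmcblk1boot0` and the 20 second idle period are taken from the reports above. The `posix_fadvise` call is a best-effort hint to drop cached pages so that the read reaches the device; it is not guaranteed to bypass the page cache.

```python
import hashlib
import os
import time


def read_digest(path: str, length: int = 4096) -> str:
    """Return the SHA-256 of the first `length` bytes of `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # Best-effort: ask the kernel to drop cached pages for this file.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass
        data = os.read(fd, length)
    finally:
        os.close(fd)
    return hashlib.sha256(data).hexdigest()


def count_mismatches(path: str, rounds: int = 10, idle_s: float = 20.0) -> int:
    """Read once as a reference, then re-read after each idle period."""
    reference = read_digest(path)
    mismatches = 0
    for _ in range(rounds):
        time.sleep(idle_s)  # idle long enough for runtime suspend to kick in
        if read_digest(path) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(count_mismatches("/dev/mmcblk1boot0"))
```

A non-zero result on an otherwise idle system would match the symptom reported at the start of the thread; zero mismatches do not prove the absence of the problem, since the first read is taken as the reference.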