* [REGRESSION] sdhci no longer detects SD cards on LX2160A @ 2019-09-16 17:15 Russell King - ARM Linux admin 2019-09-16 22:57 ` Russell King - ARM Linux admin 2019-09-17 8:06 ` Marc Gonzalez 0 siblings, 2 replies; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-16 17:15 UTC (permalink / raw) To: Adrian Hunter, linux-mmc, linux-arm-kernel Hi, It seems that somewhere between v5.2 and v5.3, sdhci fails to detect SD cards on the NXP LX2160A, but continues to work with eMMC. This uses the sdhci-esdhc driver. v5.2: sdhci: Secure Digital Host Controller Interface driver sdhci: Copyright(c) Pierre Ossman sdhci-pltfm: SDHCI platform and OF driver helper mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit mmc1: SDHCI controller on 2150000.esdhc [2150000.esdhc] using ADMA 64-bit mmc0: new high speed SDHC card at address aaaa mmcblk0: mmc0:aaaa SU04G 3.69 GiB mmc1: new HS400 MMC card at address 0001 mmcblk1: mmc1:0001 DF4064 58.2 GiB mmcblk1boot0: mmc1:0001 DF4064 partition 1 4.00 MiB mmcblk1boot1: mmc1:0001 DF4064 partition 2 4.00 MiB mmcblk1rpmb: mmc1:0001 DF4064 partition 3 4.00 MiB, chardev (247:0) mmcblk1: p1 v5.3: sdhci: Secure Digital Host Controller Interface driver sdhci: Copyright(c) Pierre Ossman sdhci-pltfm: SDHCI platform and OF driver helper mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit mmc1: SDHCI controller on 2150000.esdhc [2150000.esdhc] using ADMA 64-bit mmc0: ADMA error mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x000040d8 mmc0: sdhci: Timeout: 0x00000003 | Int stat: 0x00000001 mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 mmc0: sdhci: Host ctl2: 0x00000000 mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c mmc0: sdhci: ============================================ mmc0: error -5 whilst initialising SD card mmc0: ADMA error mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x00008098 mmc0: sdhci: Timeout: 0x00000002 | Int stat: 0x00000001 mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 mmc0: sdhci: Host ctl2: 0x00000000 mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c mmc0: sdhci: ============================================ mmc0: error -5 whilst initialising SD card mmc1: new HS400 MMC card at address 0001 mmcblk1: mmc1:0001 DF4064 58.2 GiB mmcblk1boot0: mmc1:0001 DF4064 partition 1 4.00 MiB mmcblk1boot1: mmc1:0001 DF4064 partition 2 4.00 MiB mmcblk1rpmb: mmc1:0001 DF4064 partition 3 4.00 MiB, chardev (247:0) mmcblk1: p1 mmc0: ADMA error mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x000080d8 mmc0: sdhci: Timeout: 0x00000002 | Int stat: 0x00000001 mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 mmc0: sdhci: Host ctl2: 0x00000000 mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c mmc0: sdhci: ============================================ mmc0: error -5 whilst initialising SD card mmc0: ADMA error mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x000080f8 mmc0: sdhci: Timeout: 0x00000002 | Int stat: 0x00000001 mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 mmc0: sdhci: Host ctl2: 0x00000000 mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c mmc0: sdhci: ============================================ mmc0: error -5 whilst initialising SD card The platform has an iommu, which is in pass-through mode, via arm_smmu.disable_bypass=0. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-16 17:15 [REGRESSION] sdhci no longer detects SD cards on LX2160A Russell King - ARM Linux admin @ 2019-09-16 22:57 ` Russell King - ARM Linux admin 2019-09-17 8:06 ` Marc Gonzalez 1 sibling, 0 replies; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-16 22:57 UTC (permalink / raw) To: Adrian Hunter, linux-mmc, linux-arm-kernel The problem below seems to be sporadic with v5.3, which makes finding its cause quite difficult - and makes bisecting useless. I'd describe it as "likely" to affect either mmc0 or mmc1 on v5.3. I haven't seen it at all with v5.2. I'll see whether I can pin anything down tomorrow. On Mon, Sep 16, 2019 at 06:15:10PM +0100, Russell King - ARM Linux admin wrote: > Hi, > > It seems that somewhere between v5.2 and v5.3, sdhci fails to detect > SD cards on the NXP LX2160A, but continues to work with eMMC. > This uses the sdhci-esdhc driver. > > v5.2: > > sdhci: Secure Digital Host Controller Interface driver > sdhci: Copyright(c) Pierre Ossman > sdhci-pltfm: SDHCI platform and OF driver helper > mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit > mmc1: SDHCI controller on 2150000.esdhc [2150000.esdhc] using ADMA 64-bit > mmc0: new high speed SDHC card at address aaaa > mmcblk0: mmc0:aaaa SU04G 3.69 GiB > mmc1: new HS400 MMC card at address 0001 > mmcblk1: mmc1:0001 DF4064 58.2 GiB > mmcblk1boot0: mmc1:0001 DF4064 partition 1 4.00 MiB > mmcblk1boot1: mmc1:0001 DF4064 partition 2 4.00 MiB > mmcblk1rpmb: mmc1:0001 DF4064 partition 3 4.00 MiB, chardev (247:0) > mmcblk1: p1 > > v5.3: > > sdhci: Secure Digital Host Controller Interface driver > sdhci: Copyright(c) Pierre Ossman > sdhci-pltfm: SDHCI platform and OF driver helper > mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit > mmc1: SDHCI controller on 2150000.esdhc [2150000.esdhc] using ADMA 64-bit > mmc0: ADMA error > mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== > mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 > mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 > mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 > mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x000040d8 > mmc0: sdhci: Timeout: 0x00000003 | Int stat: 0x00000001 > mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b > mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 > mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 > mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 > mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 > mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 > mmc0: sdhci: Host ctl2: 0x00000000 > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c > mmc0: sdhci: ============================================ > mmc0: error -5 whilst initialising SD card > mmc0: ADMA error > mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== > mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 > mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 > mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 > mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x00008098 > mmc0: sdhci: Timeout: 0x00000002 | Int stat: 0x00000001 > mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b > mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 > mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 > mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 > mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 > mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 > mmc0: sdhci: Host ctl2: 0x00000000 > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c > mmc0: sdhci: ============================================ > mmc0: error -5 whilst initialising SD card > mmc1: new HS400 MMC card at address 0001 > mmcblk1: mmc1:0001 DF4064 58.2 GiB > mmcblk1boot0: mmc1:0001 DF4064 partition 1 4.00 MiB > mmcblk1boot1: mmc1:0001 DF4064 partition 2 4.00 MiB > mmcblk1rpmb: mmc1:0001 DF4064 partition 3 4.00 MiB, chardev (247:0) > mmcblk1: p1 > mmc0: ADMA error > mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== > mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 > mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 > mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 > mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x000080d8 > mmc0: sdhci: Timeout: 0x00000002 | Int stat: 0x00000001 > mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b > mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 > mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 > mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 > mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 > mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 > mmc0: sdhci: Host ctl2: 0x00000000 > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c > mmc0: sdhci: ============================================ > mmc0: error -5 whilst initialising SD card > mmc0: ADMA error > mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== > mmc0: sdhci: Sys addr: 0x00000000 | Version: 0x00002202 > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x00000013 > mmc0: sdhci: Present: 0x01f50008 | Host ctl: 0x00000038 > mmc0: sdhci: Power: 0x00000003 | Blk gap: 0x00000000 > mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x000080f8 > mmc0: sdhci: Timeout: 0x00000002 | Int stat: 0x00000001 > mmc0: sdhci: Int enab: 0x037f108f | Sig enab: 0x037f108b > mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00002202 > mmc0: sdhci: Caps: 0x35fa0000 | Caps_1: 0x0000af00 > mmc0: sdhci: Cmd: 0x0000333a | Max curr: 0x00000000 > mmc0: sdhci: Resp[0]: 0x00000920 | Resp[1]: 0x001d8a33 > mmc0: sdhci: Resp[2]: 0x325b5900 | Resp[3]: 0x3f400e00 > mmc0: sdhci: Host ctl2: 0x00000000 > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236d43820c > mmc0: sdhci: ============================================ > mmc0: error -5 whilst initialising SD card > > The platform has an iommu, which is in pass-through mode, via > arm_smmu.disable_bypass=0. > > -- > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up > According to speedtest.net: 11.9Mbps down 500kbps up > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-16 17:15 [REGRESSION] sdhci no longer detects SD cards on LX2160A Russell King - ARM Linux admin 2019-09-16 22:57 ` Russell King - ARM Linux admin @ 2019-09-17 8:06 ` Marc Gonzalez 2019-09-17 8:19 ` Russell King - ARM Linux admin 1 sibling, 1 reply; 31+ messages in thread From: Marc Gonzalez @ 2019-09-17 8:06 UTC (permalink / raw) To: Russell King - ARM Linux admin, Adrian Hunter; +Cc: linux-mmc, Linux ARM On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > The platform has an iommu, which is in pass-through mode, via > arm_smmu.disable_bypass=0. Could be 954a03be033c7cef80ddc232e7cbdb17df735663 "iommu/arm-smmu: Break insecure users by disabling bypass by default" Although it had already landed in v5.2 Regards. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 8:06 ` Marc Gonzalez @ 2019-09-17 8:19 ` Russell King - ARM Linux admin 2019-09-17 10:42 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 8:19 UTC (permalink / raw) To: Marc Gonzalez; +Cc: linux-mmc, Adrian Hunter, Linux ARM On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > The platform has an iommu, which is in pass-through mode, via > > arm_smmu.disable_bypass=0. > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > Although it had already landed in v5.2 It is not - and the two lines that you quoted above are sufficient to negate that as a cause. (Please read the help for the option that the commit referrs to.) In fact, with bypass disabled, the SoC fails due to other masters. That's already been discussed privately between myself and Will Deacon. arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of the default setting in the Kconfig. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 8:19 ` Russell King - ARM Linux admin @ 2019-09-17 10:42 ` Russell King - ARM Linux admin 2019-09-17 11:16 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 10:42 UTC (permalink / raw) To: Adrian Hunter; +Cc: linux-mmc, Linux ARM On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > > > The platform has an iommu, which is in pass-through mode, via > > > arm_smmu.disable_bypass=0. > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > > > Although it had already landed in v5.2 > > It is not - and the two lines that you quoted above are sufficient > to negate that as a cause. (Please read the help for the option that > the commit referrs to.) > > In fact, with bypass disabled, the SoC fails due to other masters. > That's already been discussed privately between myself and Will > Deacon. > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of > the default setting in the Kconfig. Adding some further debugging, and fixing the existing ADMA debugging shows: mmc0: ADMA error: 0x02000000 So this is an ADMA error without the transfer having completed. mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 The block size is 8, with one block. mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c The ADMA error is a descriptor error at address 0x000000236df1d20c. The descriptor table contains (including the following entry): mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 The descriptor table contains one descriptor of 8 bytes, is marked as the last (END bit set) and is at DMA address 0x236df1d200. The following descriptor is empty, with VALID=0. One may be tempted to blame it on the following descriptor, but having had another example on eMMC while userspace was booting (rootfs on eMMC): mmc1: ADMA error: 0x02000000 mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 ... which is interesting for several reasons: - The ADMA error register indicates a length mismatch error. The transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. Summing the ADMA lengths up to the last descriptor (length=0 is 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more bytes than the requested transfer. - The ADMA error register indicates ST_CADR, which is described as "This state is never set because do not generate ADMA error in this state." - The error descriptor is again after the descriptor with END=1, but this time has VALID=1. This _feels_ like a coherency issue, where the SDHCI engine is not correctly seeing the descriptor table, but then I would have expected userspace (which is basically debian stable) to fail to boot every time given that its rootfs is on eMMC. The other weird thing is if I wind the core MMC code back via: $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R and fix the lack of dma_max_pfn(), then SDHCI is more stable - not completely stable, but way better than plain v5.3. I don't see much in that diff which would be responsible for this - although it does seem that hch's DMA changes do make the problem more likely. (going from 1 in 3 boots with a problem to being not able to boot.) Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled bypass mode on the IOMMU (but then I saw global smmu errors right from when the IOMMU had bypass disabled before MMC was probed - the reason being is the SoC is not currently setup to have the MMU bypass mode disabled.) -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 10:42 ` Russell King - ARM Linux admin @ 2019-09-17 11:16 ` Russell King - ARM Linux admin 2019-09-17 11:42 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 11:16 UTC (permalink / raw) To: Adrian Hunter, Will Deacon; +Cc: linux-mmc, Linux ARM On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > > > > > The platform has an iommu, which is in pass-through mode, via > > > > arm_smmu.disable_bypass=0. > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > > > > > Although it had already landed in v5.2 > > > > It is not - and the two lines that you quoted above are sufficient > > to negate that as a cause. (Please read the help for the option that > > the commit referrs to.) > > > > In fact, with bypass disabled, the SoC fails due to other masters. > > That's already been discussed privately between myself and Will > > Deacon. > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of > > the default setting in the Kconfig. > > Adding some further debugging, and fixing the existing ADMA debugging > shows: > > mmc0: ADMA error: 0x02000000 > > So this is an ADMA error without the transfer having completed. > > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > > The block size is 8, with one block. > > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c > > The ADMA error is a descriptor error at address 0x000000236df1d20c. > The descriptor table contains (including the following entry): > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 > > The descriptor table contains one descriptor of 8 bytes, is marked > as the last (END bit set) and is at DMA address 0x236df1d200. The > following descriptor is empty, with VALID=0. > > One may be tempted to blame it on the following descriptor, but having > had another example on eMMC while userspace was booting (rootfs on > eMMC): > > mmc1: ADMA error: 0x02000000 > mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 > mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 > > ... which is interesting for several reasons: > - The ADMA error register indicates a length mismatch error. The > transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. > Summing the ADMA lengths up to the last descriptor (length=0 is > 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more > bytes than the requested transfer. > > - The ADMA error register indicates ST_CADR, which is described as > "This state is never set because do not generate ADMA error in this > state." > > - The error descriptor is again after the descriptor with END=1, but > this time has VALID=1. > > This _feels_ like a coherency issue, where the SDHCI engine is not > correctly seeing the descriptor table, but then I would have expected > userspace (which is basically debian stable) to fail to boot every > time given that its rootfs is on eMMC. > > The other weird thing is if I wind the core MMC code back via: > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not > completely stable, but way better than plain v5.3. I don't see > much in that diff which would be responsible for this - although it > does seem that hch's DMA changes do make the problem more likely. > (going from 1 in 3 boots with a problem to being not able to boot.) > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled > bypass mode on the IOMMU (but then I saw global smmu errors right > from when the IOMMU had bypass disabled before MMC was probed - the > reason being is the SoC is not currently setup to have the MMU > bypass mode disabled.) This looks like an ARM64 coherency issue. I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), which had no effect. I then tried adding: + __dma_flush_area(host->adma_table, desc - host->adma_table); + dma_wmb(); and so far I haven't had any further ADMA errors. Adding Will Deacon to the thread. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 11:16 ` Russell King - ARM Linux admin @ 2019-09-17 11:42 ` Russell King - ARM Linux admin 2019-09-17 12:33 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 11:42 UTC (permalink / raw) To: Adrian Hunter, Will Deacon; +Cc: linux-mmc, Linux ARM On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > > > > > > > The platform has an iommu, which is in pass-through mode, via > > > > > arm_smmu.disable_bypass=0. > > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > > > > > > > Although it had already landed in v5.2 > > > > > > It is not - and the two lines that you quoted above are sufficient > > > to negate that as a cause. (Please read the help for the option that > > > the commit referrs to.) > > > > > > In fact, with bypass disabled, the SoC fails due to other masters. > > > That's already been discussed privately between myself and Will > > > Deacon. > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of > > > the default setting in the Kconfig. > > > > Adding some further debugging, and fixing the existing ADMA debugging > > shows: > > > > mmc0: ADMA error: 0x02000000 > > > > So this is an ADMA error without the transfer having completed. > > > > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > > > > The block size is 8, with one block. > > > > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c. > > The descriptor table contains (including the following entry): > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 > > > > The descriptor table contains one descriptor of 8 bytes, is marked > > as the last (END bit set) and is at DMA address 0x236df1d200. The > > following descriptor is empty, with VALID=0. > > > > One may be tempted to blame it on the following descriptor, but having > > had another example on eMMC while userspace was booting (rootfs on > > eMMC): > > > > mmc1: ADMA error: 0x02000000 > > mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 > > mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 > > > > ... which is interesting for several reasons: > > - The ADMA error register indicates a length mismatch error. The > > transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. > > Summing the ADMA lengths up to the last descriptor (length=0 is > > 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more > > bytes than the requested transfer. > > > > - The ADMA error register indicates ST_CADR, which is described as > > "This state is never set because do not generate ADMA error in this > > state." > > > > - The error descriptor is again after the descriptor with END=1, but > > this time has VALID=1. > > > > This _feels_ like a coherency issue, where the SDHCI engine is not > > correctly seeing the descriptor table, but then I would have expected > > userspace (which is basically debian stable) to fail to boot every > > time given that its rootfs is on eMMC. > > > > The other weird thing is if I wind the core MMC code back via: > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not > > completely stable, but way better than plain v5.3. I don't see > > much in that diff which would be responsible for this - although it > > does seem that hch's DMA changes do make the problem more likely. > > (going from 1 in 3 boots with a problem to being not able to boot.) > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled > > bypass mode on the IOMMU (but then I saw global smmu errors right > > from when the IOMMU had bypass disabled before MMC was probed - the > > reason being is the SoC is not currently setup to have the MMU > > bypass mode disabled.) > > This looks like an ARM64 coherency issue. > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), > which had no effect. I then tried adding: > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > + dma_wmb(); > > and so far I haven't had any further ADMA errors. Adding Will Deacon > to the thread. These are the changes to sdhci that I'm currently running. I think some of the debugging related changes are probably worth adding to the driver, particularly printing the intmask on ADMA error (which is not printed by the register dump, as the value is lost) and printing the DMA addresses of the descriptor table entries which can be tied up with the DMA address error register. Also, maybe printing the DMA descriptor table with the register dump, rather than having to resort to enabling debug would be a good idea? diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c index a5dc5aae973e..884dcaa9cad5 100644 --- a/drivers/mmc/host/sdhci.c +++ b/drivers/mmc/host/sdhci.c @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, /* Add a terminating entry - nop, end, valid */ __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); } + __dma_flush_area(host->adma_table, desc - host->adma_table); + dma_wmb(); } static void sdhci_adma_table_post(struct sdhci_host *host, @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) static void sdhci_adma_show_error(struct sdhci_host *host) { void *desc = host->adma_table; + dma_addr_t dma = host->adma_addr; + bool end = false; sdhci_dumpregs(host); @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) struct sdhci_adma2_64_desc *dma_desc = desc; if (host->flags & SDHCI_USE_64_BIT_DMA) - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", - desc, le32_to_cpu(dma_desc->addr_hi), + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", + (unsigned long long)dma, + le32_to_cpu(dma_desc->addr_hi), le32_to_cpu(dma_desc->addr_lo), le16_to_cpu(dma_desc->len), le16_to_cpu(dma_desc->cmd)); else - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", - desc, le32_to_cpu(dma_desc->addr_lo), + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", + (unsigned long long)dma, + le32_to_cpu(dma_desc->addr_lo), le16_to_cpu(dma_desc->len), le16_to_cpu(dma_desc->cmd)); + if (end) break; + desc += host->desc_sz; + dma += host->desc_sz; if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) - break; + end = true; } } @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) != MMC_BUS_TEST_R) host->data->error = -EILSEQ; else if (intmask & SDHCI_INT_ADMA_ERROR) { - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); sdhci_adma_show_error(host); host->data->error = -EIO; if (host->ops->adma_workaround) -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 11:42 ` Russell King - ARM Linux admin @ 2019-09-17 12:33 ` Russell King - ARM Linux admin 2019-09-17 13:03 ` Robin Murphy 2019-09-17 13:07 ` Russell King - ARM Linux admin 0 siblings, 2 replies; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 12:33 UTC (permalink / raw) To: Adrian Hunter, Will Deacon; +Cc: linux-mmc, Linux ARM On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote: > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > > > > > > > > > The platform has an iommu, which is in pass-through mode, via > > > > > > arm_smmu.disable_bypass=0. > > > > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > > > > > > > > > Although it had already landed in v5.2 > > > > > > > > It is not - and the two lines that you quoted above are sufficient > > > > to negate that as a cause. (Please read the help for the option that > > > > the commit referrs to.) > > > > > > > > In fact, with bypass disabled, the SoC fails due to other masters. > > > > That's already been discussed privately between myself and Will > > > > Deacon. > > > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of > > > > the default setting in the Kconfig. > > > > > > Adding some further debugging, and fixing the existing ADMA debugging > > > shows: > > > > > > mmc0: ADMA error: 0x02000000 > > > > > > So this is an ADMA error without the transfer having completed. > > > > > > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > > > > > > The block size is 8, with one block. > > > > > > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c > > > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c. > > > The descriptor table contains (including the following entry): > > > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 > > > > > > The descriptor table contains one descriptor of 8 bytes, is marked > > > as the last (END bit set) and is at DMA address 0x236df1d200. The > > > following descriptor is empty, with VALID=0. > > > > > > One may be tempted to blame it on the following descriptor, but having > > > had another example on eMMC while userspace was booting (rootfs on > > > eMMC): > > > > > > mmc1: ADMA error: 0x02000000 > > > mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 > > > mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 > > > > > > ... which is interesting for several reasons: > > > - The ADMA error register indicates a length mismatch error. The > > > transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. > > > Summing the ADMA lengths up to the last descriptor (length=0 is > > > 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more > > > bytes than the requested transfer. > > > > > > - The ADMA error register indicates ST_CADR, which is described as > > > "This state is never set because do not generate ADMA error in this > > > state." > > > > > > - The error descriptor is again after the descriptor with END=1, but > > > this time has VALID=1. > > > > > > This _feels_ like a coherency issue, where the SDHCI engine is not > > > correctly seeing the descriptor table, but then I would have expected > > > userspace (which is basically debian stable) to fail to boot every > > > time given that its rootfs is on eMMC. > > > > > > The other weird thing is if I wind the core MMC code back via: > > > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R > > > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not > > > completely stable, but way better than plain v5.3. I don't see > > > much in that diff which would be responsible for this - although it > > > does seem that hch's DMA changes do make the problem more likely. > > > (going from 1 in 3 boots with a problem to being not able to boot.) > > > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled > > > bypass mode on the IOMMU (but then I saw global smmu errors right > > > from when the IOMMU had bypass disabled before MMC was probed - the > > > reason being is the SoC is not currently setup to have the MMU > > > bypass mode disabled.) > > > > This looks like an ARM64 coherency issue. > > > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), > > which had no effect. I then tried adding: > > > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > + dma_wmb(); > > > > and so far I haven't had any further ADMA errors. Adding Will Deacon > > to the thread. > > These are the changes to sdhci that I'm currently running. I think > some of the debugging related changes are probably worth adding to > the driver, particularly printing the intmask on ADMA error (which > is not printed by the register dump, as the value is lost) and printing > the DMA addresses of the descriptor table entries which can be tied > up with the DMA address error register. Also, maybe printing the > DMA descriptor table with the register dump, rather than having to > resort to enabling debug would be a good idea? > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c > index a5dc5aae973e..884dcaa9cad5 100644 > --- a/drivers/mmc/host/sdhci.c > +++ b/drivers/mmc/host/sdhci.c > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, > /* Add a terminating entry - nop, end, valid */ > __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); > } > + __dma_flush_area(host->adma_table, desc - host->adma_table); > + dma_wmb(); > } > > static void sdhci_adma_table_post(struct sdhci_host *host, > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) > static void sdhci_adma_show_error(struct sdhci_host *host) > { > void *desc = host->adma_table; > + dma_addr_t dma = host->adma_addr; > + bool end = false; > > sdhci_dumpregs(host); > > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) > struct sdhci_adma2_64_desc *dma_desc = desc; > > if (host->flags & SDHCI_USE_64_BIT_DMA) > - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > - desc, le32_to_cpu(dma_desc->addr_hi), > + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > + (unsigned long long)dma, > + le32_to_cpu(dma_desc->addr_hi), > le32_to_cpu(dma_desc->addr_lo), > le16_to_cpu(dma_desc->len), > le16_to_cpu(dma_desc->cmd)); > else > - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > - desc, le32_to_cpu(dma_desc->addr_lo), > + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > + (unsigned long long)dma, > + le32_to_cpu(dma_desc->addr_lo), > le16_to_cpu(dma_desc->len), > le16_to_cpu(dma_desc->cmd)); > > + if (end) break; > + > desc += host->desc_sz; > + dma += host->desc_sz; > > if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) > - break; > + end = true; > } > } > > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) > != MMC_BUS_TEST_R) > host->data->error = -EILSEQ; > else if (intmask & SDHCI_INT_ADMA_ERROR) { > - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); > + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); > sdhci_adma_show_error(host); > host->data->error = -EIO; > if (host->ops->adma_workaround) Further debug shows: coherent=0 - sdhci device is not cache coherent swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, pmd=000000237fffb003, pte=00e800236d62270f The mapping for the ADMA table seems to be using MAIR index 3, which is MT_MEMORY_NC, so should be non-cacheable. vmallocinfo: 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 user So this memory has been remapped. Could there be an alias that has cache lines still in the cache for the physical address, and could we be hitting those cache lines while accessing through a non-cacheable mapping? (On 32-bit ARM, this is "unpredictable" and this problem definitely _feels_ like it has unpredictable attributes!) Also, given that this memory is mapped NC, then surely __dma_flush_area() should have no effect? However, it _does_ have the effect of reliably solving the problem, which to me implies that there _are_ cache lines in this NC mapping. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 12:33 ` Russell King - ARM Linux admin @ 2019-09-17 13:03 ` Robin Murphy 2019-09-17 13:28 ` Russell King - ARM Linux admin 2019-09-17 13:07 ` Russell King - ARM Linux admin 1 sibling, 1 reply; 31+ messages in thread From: Robin Murphy @ 2019-09-17 13:03 UTC (permalink / raw) To: Russell King - ARM Linux admin, Adrian Hunter, Will Deacon Cc: linux-mmc, Linux ARM On 17/09/2019 13:33, Russell King - ARM Linux admin wrote: [...] > Further debug shows: > > coherent=0 - sdhci device is not cache coherent > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, > pmd=000000237fffb003, pte=00e800236d62270f > > The mapping for the ADMA table seems to be using MAIR index 3, which is > MT_MEMORY_NC, so should be non-cacheable. > > vmallocinfo: > 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 > user > > So this memory has been remapped. Could there be an alias that has > cache lines still in the cache for the physical address, and could we > be hitting those cache lines while accessing through a non-cacheable > mapping? (On 32-bit ARM, this is "unpredictable" and this problem > definitely _feels_ like it has unpredictable attributes!) > > Also, given that this memory is mapped NC, then surely > __dma_flush_area() should have no effect? However, it _does_ have the > effect of reliably solving the problem, which to me implies that there > _are_ cache lines in this NC mapping. The non-cacheable mapping of the descriptor table will still have its cacheable linear map alias, so it's quite likely that the invalidate aspect of __dma_flush_area(), rather than the clean, is what's making the difference - if using __dma_clean_area() instead doesn't help, it would more or less confirm that. One possibility in that case is that you might actually have the rare backwards coherency problem - if the device *is* actually snooping the cache, then it could hit lines which were speculatively prefetched via the cacheable alias before the descriptors were fully written, rather than the up-to-date data which went straight to RAM via the NC mapping. I'd try declaring the device as "dma-coherent" to see if that's actually true (and it should become pretty obvious if it isn't). Robin. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:03 ` Robin Murphy @ 2019-09-17 13:28 ` Russell King - ARM Linux admin 0 siblings, 0 replies; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 13:28 UTC (permalink / raw) To: Robin Murphy; +Cc: linux-mmc, Will Deacon, Adrian Hunter, Linux ARM On Tue, Sep 17, 2019 at 02:03:04PM +0100, Robin Murphy wrote: > On 17/09/2019 13:33, Russell King - ARM Linux admin wrote: > [...] > > Further debug shows: > > > > coherent=0 - sdhci device is not cache coherent > > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 > > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, > > pmd=000000237fffb003, pte=00e800236d62270f > > > > The mapping for the ADMA table seems to be using MAIR index 3, which is > > MT_MEMORY_NC, so should be non-cacheable. > > > > vmallocinfo: > > 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 > > user > > > > So this memory has been remapped. Could there be an alias that has > > cache lines still in the cache for the physical address, and could we > > be hitting those cache lines while accessing through a non-cacheable > > mapping? (On 32-bit ARM, this is "unpredictable" and this problem > > definitely _feels_ like it has unpredictable attributes!) > > > > Also, given that this memory is mapped NC, then surely > > __dma_flush_area() should have no effect? However, it _does_ have the > > effect of reliably solving the problem, which to me implies that there > > _are_ cache lines in this NC mapping. > > The non-cacheable mapping of the descriptor table will still have its > cacheable linear map alias, so it's quite likely that the invalidate aspect > of __dma_flush_area(), rather than the clean, is what's making the > difference - if using __dma_clean_area() instead doesn't help, it would more > or less confirm that. > > One possibility in that case is that you might actually have the rare > backwards coherency problem - if the device *is* actually snooping the > cache, then it could hit lines which were speculatively prefetched via the > cacheable alias before the descriptors were fully written, rather than the > up-to-date data which went straight to RAM via the NC mapping. I'd try > declaring the device as "dma-coherent" to see if that's actually true (and > it should become pretty obvious if it isn't). As just mentioned in my previous reply, there's a commit to the dma-contiguous which changes where the CMA memory comes from. [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, pmd=000000237fffb003, pte=00e800236d62270f vs [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, pmd=000000237fffb003, pte=00e80000f9c9a70f Former is with the patch applied, latter is with it reverted. This makes me question whether the cache handling for a page that is remapped is being performed. If there's cache lines present for a page that is being remapped as non-cacheable, what prevents those cache lines from being dirty and possibly being written-back at some point after the non-cacheable mapping as been started to be used? And yes, it looks like adding "dma-coherent" to the SDHCI controller with the SD card in resolves the issue, so your hypothesis may be true. On the other hand, I haven't added "dma-coherent" to the eMMC side, and that's also working fine over several reboots without the offending commit reverted _nor_ with my __dma_flush_area() hack in place. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 12:33 ` Russell King - ARM Linux admin 2019-09-17 13:03 ` Robin Murphy @ 2019-09-17 13:07 ` Russell King - ARM Linux admin 2019-09-17 13:24 ` Fabio Estevam 2019-09-17 13:38 ` Robin Murphy 1 sibling, 2 replies; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 13:07 UTC (permalink / raw) To: Adrian Hunter, Will Deacon, Nicolin Chen, dann frazier, Christoph Hellwig Cc: linux-mmc, Linux ARM On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote: > On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote: > > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: > > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: > > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: > > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > > > > > > > > > > > The platform has an iommu, which is in pass-through mode, via > > > > > > > arm_smmu.disable_bypass=0. > > > > > > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > > > > > > > > > > > Although it had already landed in v5.2 > > > > > > > > > > It is not - and the two lines that you quoted above are sufficient > > > > > to negate that as a cause. (Please read the help for the option that > > > > > the commit referrs to.) > > > > > > > > > > In fact, with bypass disabled, the SoC fails due to other masters. > > > > > That's already been discussed privately between myself and Will > > > > > Deacon. > > > > > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of > > > > > the default setting in the Kconfig. > > > > > > > > Adding some further debugging, and fixing the existing ADMA debugging > > > > shows: > > > > > > > > mmc0: ADMA error: 0x02000000 > > > > > > > > So this is an ADMA error without the transfer having completed. > > > > > > > > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > > > > > > > > The block size is 8, with one block. > > > > > > > > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c > > > > > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c. > > > > The descriptor table contains (including the following entry): > > > > > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 > > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 > > > > > > > > The descriptor table contains one descriptor of 8 bytes, is marked > > > > as the last (END bit set) and is at DMA address 0x236df1d200. The > > > > following descriptor is empty, with VALID=0. > > > > > > > > One may be tempted to blame it on the following descriptor, but having > > > > had another example on eMMC while userspace was booting (rootfs on > > > > eMMC): > > > > > > > > mmc1: ADMA error: 0x02000000 > > > > mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 > > > > mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c > > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 > > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 > > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 > > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 > > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 > > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 > > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 > > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 > > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 > > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 > > > > > > > > ... which is interesting for several reasons: > > > > - The ADMA error register indicates a length mismatch error. The > > > > transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. > > > > Summing the ADMA lengths up to the last descriptor (length=0 is > > > > 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more > > > > bytes than the requested transfer. > > > > > > > > - The ADMA error register indicates ST_CADR, which is described as > > > > "This state is never set because do not generate ADMA error in this > > > > state." > > > > > > > > - The error descriptor is again after the descriptor with END=1, but > > > > this time has VALID=1. > > > > > > > > This _feels_ like a coherency issue, where the SDHCI engine is not > > > > correctly seeing the descriptor table, but then I would have expected > > > > userspace (which is basically debian stable) to fail to boot every > > > > time given that its rootfs is on eMMC. > > > > > > > > The other weird thing is if I wind the core MMC code back via: > > > > > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R > > > > > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not > > > > completely stable, but way better than plain v5.3. I don't see > > > > much in that diff which would be responsible for this - although it > > > > does seem that hch's DMA changes do make the problem more likely. > > > > (going from 1 in 3 boots with a problem to being not able to boot.) > > > > > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled > > > > bypass mode on the IOMMU (but then I saw global smmu errors right > > > > from when the IOMMU had bypass disabled before MMC was probed - the > > > > reason being is the SoC is not currently setup to have the MMU > > > > bypass mode disabled.) > > > > > > This looks like an ARM64 coherency issue. > > > > > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), > > > which had no effect. I then tried adding: > > > > > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > > + dma_wmb(); > > > > > > and so far I haven't had any further ADMA errors. Adding Will Deacon > > > to the thread. > > > > These are the changes to sdhci that I'm currently running. I think > > some of the debugging related changes are probably worth adding to > > the driver, particularly printing the intmask on ADMA error (which > > is not printed by the register dump, as the value is lost) and printing > > the DMA addresses of the descriptor table entries which can be tied > > up with the DMA address error register. Also, maybe printing the > > DMA descriptor table with the register dump, rather than having to > > resort to enabling debug would be a good idea? > > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c > > index a5dc5aae973e..884dcaa9cad5 100644 > > --- a/drivers/mmc/host/sdhci.c > > +++ b/drivers/mmc/host/sdhci.c > > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, > > /* Add a terminating entry - nop, end, valid */ > > __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); > > } > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > + dma_wmb(); > > } > > > > static void sdhci_adma_table_post(struct sdhci_host *host, > > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) > > static void sdhci_adma_show_error(struct sdhci_host *host) > > { > > void *desc = host->adma_table; > > + dma_addr_t dma = host->adma_addr; > > + bool end = false; > > > > sdhci_dumpregs(host); > > > > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) > > struct sdhci_adma2_64_desc *dma_desc = desc; > > > > if (host->flags & SDHCI_USE_64_BIT_DMA) > > - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > > - desc, le32_to_cpu(dma_desc->addr_hi), > > + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > > + (unsigned long long)dma, > > + le32_to_cpu(dma_desc->addr_hi), > > le32_to_cpu(dma_desc->addr_lo), > > le16_to_cpu(dma_desc->len), > > le16_to_cpu(dma_desc->cmd)); > > else > > - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > > - desc, le32_to_cpu(dma_desc->addr_lo), > > + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > > + (unsigned long long)dma, > > + le32_to_cpu(dma_desc->addr_lo), > > le16_to_cpu(dma_desc->len), > > le16_to_cpu(dma_desc->cmd)); > > > > + if (end) break; > > + > > desc += host->desc_sz; > > + dma += host->desc_sz; > > > > if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) > > - break; > > + end = true; > > } > > } > > > > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) > > != MMC_BUS_TEST_R) > > host->data->error = -EILSEQ; > > else if (intmask & SDHCI_INT_ADMA_ERROR) { > > - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); > > + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); > > sdhci_adma_show_error(host); > > host->data->error = -EIO; > > if (host->ops->adma_workaround) > > Further debug shows: > > coherent=0 - sdhci device is not cache coherent > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, > pmd=000000237fffb003, pte=00e800236d62270f > > The mapping for the ADMA table seems to be using MAIR index 3, which is > MT_MEMORY_NC, so should be non-cacheable. > > vmallocinfo: > 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 > user > > So this memory has been remapped. Could there be an alias that has > cache lines still in the cache for the physical address, and could we > be hitting those cache lines while accessing through a non-cacheable > mapping? (On 32-bit ARM, this is "unpredictable" and this problem > definitely _feels_ like it has unpredictable attributes!) > > Also, given that this memory is mapped NC, then surely > __dma_flush_area() should have no effect? However, it _does_ have the > effect of reliably solving the problem, which to me implies that there > _are_ cache lines in this NC mapping. Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback alloc_pages for single pages") which has been implicated in the same problem here: https://www.spinics.net/lists/arm-kernel/msg750623.html Although reverting the commit is not clean, this also fixes the issue for me. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:07 ` Russell King - ARM Linux admin @ 2019-09-17 13:24 ` Fabio Estevam 2019-09-17 13:33 ` Russell King - ARM Linux admin 2019-09-17 13:38 ` Robin Murphy 1 sibling, 1 reply; 31+ messages in thread From: Fabio Estevam @ 2019-09-17 13:24 UTC (permalink / raw) To: Russell King - ARM Linux admin Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM Hi Russell, On Tue, Sep 17, 2019 at 10:10 AM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback > alloc_pages for single pages") which has been implicated in the same > problem here: > > https://www.spinics.net/lists/arm-kernel/msg750623.html It seems that the final fix was: https://lore.kernel.org/patchwork/patch/1121600/ Does it work for you if we follow the same idea? --- a/drivers/mmc/host/sdhci-of-esdhc.c +++ b/drivers/mmc/host/sdhci-of-esdhc.c @@ -1105,6 +1105,7 @@ static int sdhci_esdhc_probe(struct platform_device *pdev) esdhc_init(pdev, host); + sdhci_enable_v4_mode(host); sdhci_get_of_property(pdev); pltfm_host = sdhci_priv(host); _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:24 ` Fabio Estevam @ 2019-09-17 13:33 ` Russell King - ARM Linux admin 2019-09-17 13:43 ` Fabio Estevam 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 13:33 UTC (permalink / raw) To: Fabio Estevam Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM On Tue, Sep 17, 2019 at 10:24:57AM -0300, Fabio Estevam wrote: > Hi Russell, > > On Tue, Sep 17, 2019 at 10:10 AM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback > > alloc_pages for single pages") which has been implicated in the same > > problem here: > > > > https://www.spinics.net/lists/arm-kernel/msg750623.html > > It seems that the final fix was: > https://lore.kernel.org/patchwork/patch/1121600/ > > Does it work for you if we follow the same idea? > > --- a/drivers/mmc/host/sdhci-of-esdhc.c > +++ b/drivers/mmc/host/sdhci-of-esdhc.c > @@ -1105,6 +1105,7 @@ static int sdhci_esdhc_probe(struct platform_device *pdev) > > esdhc_init(pdev, host); > > + sdhci_enable_v4_mode(host); > sdhci_get_of_property(pdev); > > pltfm_host = sdhci_priv(host); That attempts to set bit 12 of the host control register 2 (0x3e). The LX2160A documentation states that bit 28 of 0x3c (they're 32-bit wide registers there) is "reserved". So, you're asking for a documented reserved bit to be set... -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:33 ` Russell King - ARM Linux admin @ 2019-09-17 13:43 ` Fabio Estevam 2019-09-17 13:51 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Fabio Estevam @ 2019-09-17 13:43 UTC (permalink / raw) To: Russell King - ARM Linux admin Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM On Tue, Sep 17, 2019 at 10:33 AM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > That attempts to set bit 12 of the host control register 2 (0x3e). > The LX2160A documentation states that bit 28 of 0x3c (they're 32-bit > wide registers there) is "reserved". > > So, you're asking for a documented reserved bit to be set... Correct, v4 is not supported here indeed. From the LX2160 doc: "Conforms to the SD Host Controller Standard Specification version 3.0" _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:43 ` Fabio Estevam @ 2019-09-17 13:51 ` Russell King - ARM Linux admin 2019-09-17 13:56 ` Fabio Estevam 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 13:51 UTC (permalink / raw) To: Fabio Estevam Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM On Tue, Sep 17, 2019 at 10:43:35AM -0300, Fabio Estevam wrote: > On Tue, Sep 17, 2019 at 10:33 AM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > > That attempts to set bit 12 of the host control register 2 (0x3e). > > The LX2160A documentation states that bit 28 of 0x3c (they're 32-bit > > wide registers there) is "reserved". > > > > So, you're asking for a documented reserved bit to be set... > > Correct, v4 is not supported here indeed. > > From the LX2160 doc: > "Conforms to the SD Host Controller Standard Specification version 3.0" The pressing question seems to be this: Are the eSDHC on the LX2160A DMA coherent or are they not? Any chances of finding out internally what the true answer to that, rather than me poking about trying stuff experimentally? Having a definitive answer for a potentially data-corrupting change would be really good... -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:51 ` Russell King - ARM Linux admin @ 2019-09-17 13:56 ` Fabio Estevam [not found] ` <CADRPPNQ-WTY0QC7_bX=N0QeueKve=k0SaMvbjOrByyvzFojz2g@mail.gmail.com> 0 siblings, 1 reply; 31+ messages in thread From: Fabio Estevam @ 2019-09-17 13:56 UTC (permalink / raw) To: Russell King - ARM Linux admin, Li Yang Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM [Adding Li Yang] On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > The pressing question seems to be this: > > Are the eSDHC on the LX2160A DMA coherent or are they not? > > Any chances of finding out internally what the true answer to that, > rather than me poking about trying stuff experimentally? Having a > definitive answer for a potentially data-corrupting change would > be really good... Li Yang, Could you please help to confirm Russell's question? _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CADRPPNQ-WTY0QC7_bX=N0QeueKve=k0SaMvbjOrByyvzFojz2g@mail.gmail.com>]
* RE: [REGRESSION] sdhci no longer detects SD cards on LX2160A [not found] ` <CADRPPNQ-WTY0QC7_bX=N0QeueKve=k0SaMvbjOrByyvzFojz2g@mail.gmail.com> @ 2019-09-19 4:13 ` Y.b. Lu 2019-09-19 7:04 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Y.b. Lu @ 2019-09-19 4:13 UTC (permalink / raw) To: Leo Li, Fabio Estevam, Russell King - ARM Linux admin Cc: dann frazier, Will Deacon, Adrian Hunter, Nicolin Chen, linux-mmc, Christoph Hellwig, Linux ARM Sorry. My email was rejected by mailing lists. Let me re-send. Hi Russell, I’m not sure what board you were using for LX2160A. We had an known issue for eSDHC controller and all NXP Layerscape RDB boards. eSDHC couldn’t provide power-cycle to SD card, and even worse, board reset couldn’t provide power-cycle to SD card either. But for UHS-I SD card, it’s required to have a power-cycle to reset card if it goes into UHS-I mode. Otherwise, we don’t know what will happen when kernel initializes SD card after a reboot/reset. I could reproduce that issue with below steps on latest mainline kernel. 1. Power off board, and power on board. 2. Start up kernel, the SD card works fine in UHS-I mode. 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card) 4. Start up kernel, the SD card gets that ADMA error issue. So could you have a try to power off/power on the board, and then start up kernel. Don’t use reboot, or board reset button. Or you can remove SD card and start up kernel, and insert SD card when kernel has been started up. Thanks a lot. Best regards, Yangbo Lu From: Li Yang <leoyang.li@nxp.com> Sent: Wednesday, September 18, 2019 1:48 AM To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com> Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann frazier <dann.frazier@canonical.com>; linux-mmc <linux-mmc@vger.kernel.org> Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam <mailto:festevam@gmail.com> wrote: [Adding Li Yang] On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin <mailto:linux@armlinux.org.uk> wrote: > The pressing question seems to be this: > > Are the eSDHC on the LX2160A DMA coherent or are they not? > > Any chances of finding out internally what the true answer to that, > rather than me poking about trying stuff experimentally? Having a > definitive answer for a potentially data-corrupting change would > be really good... Li Yang, Could you please help to confirm Russell's question? Adding Yangbo who is working on SDHC. Regards, Leo _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-19 4:13 ` Y.b. Lu @ 2019-09-19 7:04 ` Russell King - ARM Linux admin 2019-09-19 8:15 ` Y.b. Lu 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-19 7:04 UTC (permalink / raw) To: Y.b. Lu Cc: dann frazier, Will Deacon, Adrian Hunter, Leo Li, Nicolin Chen, linux-mmc, Fabio Estevam, Christoph Hellwig, Linux ARM Hi, This is not the issue, since the problem has been observed with eMMC too, and is sporadic in nature. Please could you answer the question posed: are the eSDHC controllers DMA coherent or are they not coherent? Thanks. On Thu, Sep 19, 2019 at 04:13:20AM +0000, Y.b. Lu wrote: > Sorry. My email was rejected by mailing lists. Let me re-send. > > Hi Russell, > > I’m not sure what board you were using for LX2160A. > We had an known issue for eSDHC controller and all NXP Layerscape RDB boards. > eSDHC couldn’t provide power-cycle to SD card, and even worse, board reset couldn’t provide power-cycle to SD card either. > But for UHS-I SD card, it’s required to have a power-cycle to reset card if it goes into UHS-I mode. Otherwise, we don’t know what will happen when kernel initializes SD card after a reboot/reset. > > I could reproduce that issue with below steps on latest mainline kernel. > 1. Power off board, and power on board. > 2. Start up kernel, the SD card works fine in UHS-I mode. > 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card) > 4. Start up kernel, the SD card gets that ADMA error issue. > > So could you have a try to power off/power on the board, and then start up kernel. Don’t use reboot, or board reset button. > Or you can remove SD card and start up kernel, and insert SD card when kernel has been started up. > Thanks a lot. > > Best regards, > Yangbo Lu > > > From: Li Yang <leoyang.li@nxp.com> > Sent: Wednesday, September 18, 2019 1:48 AM > To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com> > Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann frazier <dann.frazier@canonical.com>; linux-mmc <linux-mmc@vger.kernel.org> > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A > > > > On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam <mailto:festevam@gmail.com> wrote: > [Adding Li Yang] > > On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin > <mailto:linux@armlinux.org.uk> wrote: > > > The pressing question seems to be this: > > > > Are the eSDHC on the LX2160A DMA coherent or are they not? > > > > Any chances of finding out internally what the true answer to that, > > rather than me poking about trying stuff experimentally? Having a > > definitive answer for a potentially data-corrupting change would > > be really good... > > Li Yang, > > Could you please help to confirm Russell's question? > Adding Yangbo who is working on SDHC. > > Regards, > Leo -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-19 7:04 ` Russell King - ARM Linux admin @ 2019-09-19 8:15 ` Y.b. Lu 2019-09-19 8:38 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Y.b. Lu @ 2019-09-19 8:15 UTC (permalink / raw) To: Russell King - ARM Linux admin Cc: dann frazier, Will Deacon, Adrian Hunter, Leo Li, Nicolin Chen, linux-mmc, Fabio Estevam, Christoph Hellwig, Linux ARM Hi Russell, The ESDHC_DMA_SNOOP bit is always set in eSDHC driver for DMA. 1b - DMA transactions are snooped by the CPU data cache. 0b - DMA transactions are not snooped by the CPU data cache. Thanks a lot. Best regards, Yangbo Lu > -----Original Message----- > From: Russell King - ARM Linux admin <linux@armlinux.org.uk> > Sent: Thursday, September 19, 2019 3:05 PM > To: Y.b. Lu <yangbo.lu@nxp.com> > Cc: Leo Li <leoyang.li@nxp.com>; Fabio Estevam <festevam@gmail.com>; > Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>; > Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen > <nicoleotsuka@gmail.com>; Will Deacon <will.deacon@arm.com>; dann > frazier <dann.frazier@canonical.com>; linux-mmc > <linux-mmc@vger.kernel.org> > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A > > Hi, > > This is not the issue, since the problem has been observed with eMMC too, > and is sporadic in nature. > > Please could you answer the question posed: are the eSDHC controllers DMA > coherent or are they not coherent? > > Thanks. > > On Thu, Sep 19, 2019 at 04:13:20AM +0000, Y.b. Lu wrote: > > Sorry. My email was rejected by mailing lists. Let me re-send. > > > > Hi Russell, > > > > I’m not sure what board you were using for LX2160A. > > We had an known issue for eSDHC controller and all NXP Layerscape RDB > boards. > > eSDHC couldn’t provide power-cycle to SD card, and even worse, board > reset couldn’t provide power-cycle to SD card either. > > But for UHS-I SD card, it’s required to have a power-cycle to reset card if it > goes into UHS-I mode. Otherwise, we don’t know what will happen when > kernel initializes SD card after a reboot/reset. > > > > I could reproduce that issue with below steps on latest mainline kernel. > > 1. Power off board, and power on board. > > 2. Start up kernel, the SD card works fine in UHS-I mode. > > 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card) > > 4. Start up kernel, the SD card gets that ADMA error issue. > > > > So could you have a try to power off/power on the board, and then start up > kernel. Don’t use reboot, or board reset button. > > Or you can remove SD card and start up kernel, and insert SD card when > kernel has been started up. > > Thanks a lot. > > > > Best regards, > > Yangbo Lu > > > > > > From: Li Yang <leoyang.li@nxp.com> > > Sent: Wednesday, September 18, 2019 1:48 AM > > To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com> > > Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig > > <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>; > > Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin > > <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann > > frazier <dann.frazier@canonical.com>; linux-mmc > > <linux-mmc@vger.kernel.org> > > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A > > > > > > > > On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam > <mailto:festevam@gmail.com> wrote: > > [Adding Li Yang] > > > > On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin > > <mailto:linux@armlinux.org.uk> wrote: > > > > > The pressing question seems to be this: > > > > > > Are the eSDHC on the LX2160A DMA coherent or are they not? > > > > > > Any chances of finding out internally what the true answer to that, > > > rather than me poking about trying stuff experimentally? Having a > > > definitive answer for a potentially data-corrupting change would be > > > really good... > > > > Li Yang, > > > > Could you please help to confirm Russell's question? > > Adding Yangbo who is working on SDHC. > > > > Regards, > > Leo > > -- > RMK's Patch system: > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ar > mlinux.org.uk%2Fdeveloper%2Fpatches%2F&data=02%7C01%7Cyangbo.l > u%40nxp.com%7C7eca2b9652104c95a52008d73ccfa99a%7C686ea1d3bc2b4 > c6fa92cd99c5c301635%7C0%7C0%7C637044734911465102&sdata=QB > SEzA9L2HC99gm65P965E3o%2FhNM18u2SouOZxTEs6s%3D&reserved=0 > FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps > up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-19 8:15 ` Y.b. Lu @ 2019-09-19 8:38 ` Russell King - ARM Linux admin 2019-09-19 9:22 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-19 8:38 UTC (permalink / raw) To: Y.b. Lu Cc: dann frazier, Will Deacon, Adrian Hunter, Leo Li, Nicolin Chen, linux-mmc, Fabio Estevam, Christoph Hellwig, Linux ARM Hi, Thanks for the reply. I see that this bit is marked "reserved" in the LX2160A reference manual. This brings up some further questions. The DT property "dma-coherent" is used to tell the OS whether the device is DMA coherent or not. If this property is missing, but the device is set as DMA coherent, and the OS uses "normal, non-cacheable" memory for the ADMA table, then errors can occur if there are stale cache lines corresponding to the memory used. The eSDHC controller will see the stale cache lines, but the CPU will not. Adding "dma-coherent" to the DT declarations alone does not seem to be the right solution - if we have an OS that does not set the ESDHC_DMA_SNOOP bit, then we have a similar issue. Shouldn't ESDHC_DMA_SNOOP be set depending on whether the device is DMA coherent or not? Note that the device is _not_ marked as "dma-coherent" in either mainline nor in the LSDK-19.06-V4.19 branch of https://source.codeaurora.org/external/qoriq/qoriq-components/linux to avoid ADMA descriptor fetch errors, which leads to this error that has now been observed with v5.3 kernels - caused precisely as I describe above. Thanks. On Thu, Sep 19, 2019 at 08:15:00AM +0000, Y.b. Lu wrote: > Hi Russell, > > The ESDHC_DMA_SNOOP bit is always set in eSDHC driver for DMA. > > 1b - DMA transactions are snooped by the CPU data cache. > 0b - DMA transactions are not snooped by the CPU data cache. > > Thanks a lot. > > Best regards, > Yangbo Lu > > > -----Original Message----- > > From: Russell King - ARM Linux admin <linux@armlinux.org.uk> > > Sent: Thursday, September 19, 2019 3:05 PM > > To: Y.b. Lu <yangbo.lu@nxp.com> > > Cc: Leo Li <leoyang.li@nxp.com>; Fabio Estevam <festevam@gmail.com>; > > Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>; > > Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen > > <nicoleotsuka@gmail.com>; Will Deacon <will.deacon@arm.com>; dann > > frazier <dann.frazier@canonical.com>; linux-mmc > > <linux-mmc@vger.kernel.org> > > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A > > > > Hi, > > > > This is not the issue, since the problem has been observed with eMMC too, > > and is sporadic in nature. > > > > Please could you answer the question posed: are the eSDHC controllers DMA > > coherent or are they not coherent? > > > > Thanks. > > > > On Thu, Sep 19, 2019 at 04:13:20AM +0000, Y.b. Lu wrote: > > > Sorry. My email was rejected by mailing lists. Let me re-send. > > > > > > Hi Russell, > > > > > > I’m not sure what board you were using for LX2160A. > > > We had an known issue for eSDHC controller and all NXP Layerscape RDB > > boards. > > > eSDHC couldn’t provide power-cycle to SD card, and even worse, board > > reset couldn’t provide power-cycle to SD card either. > > > But for UHS-I SD card, it’s required to have a power-cycle to reset card if it > > goes into UHS-I mode. Otherwise, we don’t know what will happen when > > kernel initializes SD card after a reboot/reset. > > > > > > I could reproduce that issue with below steps on latest mainline kernel. > > > 1. Power off board, and power on board. > > > 2. Start up kernel, the SD card works fine in UHS-I mode. > > > 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card) > > > 4. Start up kernel, the SD card gets that ADMA error issue. > > > > > > So could you have a try to power off/power on the board, and then start up > > kernel. Don’t use reboot, or board reset button. > > > Or you can remove SD card and start up kernel, and insert SD card when > > kernel has been started up. > > > Thanks a lot. > > > > > > Best regards, > > > Yangbo Lu > > > > > > > > > From: Li Yang <leoyang.li@nxp.com> > > > Sent: Wednesday, September 18, 2019 1:48 AM > > > To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com> > > > Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig > > > <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>; > > > Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin > > > <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann > > > frazier <dann.frazier@canonical.com>; linux-mmc > > > <linux-mmc@vger.kernel.org> > > > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A > > > > > > > > > > > > On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam > > <mailto:festevam@gmail.com> wrote: > > > [Adding Li Yang] > > > > > > On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin > > > <mailto:linux@armlinux.org.uk> wrote: > > > > > > > The pressing question seems to be this: > > > > > > > > Are the eSDHC on the LX2160A DMA coherent or are they not? > > > > > > > > Any chances of finding out internally what the true answer to that, > > > > rather than me poking about trying stuff experimentally? Having a > > > > definitive answer for a potentially data-corrupting change would be > > > > really good... > > > > > > Li Yang, > > > > > > Could you please help to confirm Russell's question? > > > Adding Yangbo who is working on SDHC. > > > > > > Regards, > > > Leo > > > > -- > > RMK's Patch system: > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ar > > mlinux.org.uk%2Fdeveloper%2Fpatches%2F&data=02%7C01%7Cyangbo.l > > u%40nxp.com%7C7eca2b9652104c95a52008d73ccfa99a%7C686ea1d3bc2b4 > > c6fa92cd99c5c301635%7C0%7C0%7C637044734911465102&sdata=QB > > SEzA9L2HC99gm65P965E3o%2FhNM18u2SouOZxTEs6s%3D&reserved=0 > > FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps > > up According to speedtest.net: 11.9Mbps down 500kbps up -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-19 8:38 ` Russell King - ARM Linux admin @ 2019-09-19 9:22 ` Russell King - ARM Linux admin 0 siblings, 0 replies; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-19 9:22 UTC (permalink / raw) To: Y.b. Lu Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Leo Li, Nicolin Chen, Fabio Estevam, Christoph Hellwig, Linux ARM On Thu, Sep 19, 2019 at 09:38:04AM +0100, Russell King - ARM Linux admin wrote: > Hi, > > Thanks for the reply. > > I see that this bit is marked "reserved" in the LX2160A reference > manual. Sorry, I was lookin at bit 5, it's actually bit 6. Rest of my mail is still relevent. > > This brings up some further questions. > > The DT property "dma-coherent" is used to tell the OS whether the > device is DMA coherent or not. If this property is missing, but the > device is set as DMA coherent, and the OS uses "normal, non-cacheable" > memory for the ADMA table, then errors can occur if there are stale > cache lines corresponding to the memory used. The eSDHC controller > will see the stale cache lines, but the CPU will not. > > Adding "dma-coherent" to the DT declarations alone does not seem to > be the right solution - if we have an OS that does not set the > ESDHC_DMA_SNOOP bit, then we have a similar issue. > > Shouldn't ESDHC_DMA_SNOOP be set depending on whether the device is > DMA coherent or not? > > Note that the device is _not_ marked as "dma-coherent" in either > mainline nor in the LSDK-19.06-V4.19 branch of > https://source.codeaurora.org/external/qoriq/qoriq-components/linux > to avoid ADMA descriptor fetch errors, which leads to this error that > has now been observed with v5.3 kernels - caused precisely as I > describe above. > > Thanks. > > On Thu, Sep 19, 2019 at 08:15:00AM +0000, Y.b. Lu wrote: > > Hi Russell, > > > > The ESDHC_DMA_SNOOP bit is always set in eSDHC driver for DMA. > > > > 1b - DMA transactions are snooped by the CPU data cache. > > 0b - DMA transactions are not snooped by the CPU data cache. > > > > Thanks a lot. > > > > Best regards, > > Yangbo Lu > > > > > -----Original Message----- > > > From: Russell King - ARM Linux admin <linux@armlinux.org.uk> > > > Sent: Thursday, September 19, 2019 3:05 PM > > > To: Y.b. Lu <yangbo.lu@nxp.com> > > > Cc: Leo Li <leoyang.li@nxp.com>; Fabio Estevam <festevam@gmail.com>; > > > Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig <hch@lst.de>; > > > Linux ARM <linux-arm-kernel@lists.infradead.org>; Nicolin Chen > > > <nicoleotsuka@gmail.com>; Will Deacon <will.deacon@arm.com>; dann > > > frazier <dann.frazier@canonical.com>; linux-mmc > > > <linux-mmc@vger.kernel.org> > > > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A > > > > > > Hi, > > > > > > This is not the issue, since the problem has been observed with eMMC too, > > > and is sporadic in nature. > > > > > > Please could you answer the question posed: are the eSDHC controllers DMA > > > coherent or are they not coherent? > > > > > > Thanks. > > > > > > On Thu, Sep 19, 2019 at 04:13:20AM +0000, Y.b. Lu wrote: > > > > Sorry. My email was rejected by mailing lists. Let me re-send. > > > > > > > > Hi Russell, > > > > > > > > I’m not sure what board you were using for LX2160A. > > > > We had an known issue for eSDHC controller and all NXP Layerscape RDB > > > boards. > > > > eSDHC couldn’t provide power-cycle to SD card, and even worse, board > > > reset couldn’t provide power-cycle to SD card either. > > > > But for UHS-I SD card, it’s required to have a power-cycle to reset card if it > > > goes into UHS-I mode. Otherwise, we don’t know what will happen when > > > kernel initializes SD card after a reboot/reset. > > > > > > > > I could reproduce that issue with below steps on latest mainline kernel. > > > > 1. Power off board, and power on board. > > > > 2. Start up kernel, the SD card works fine in UHS-I mode. > > > > 3. Reboot/reset board. (This couldn’t provide power-cycle to SD card) > > > > 4. Start up kernel, the SD card gets that ADMA error issue. > > > > > > > > So could you have a try to power off/power on the board, and then start up > > > kernel. Don’t use reboot, or board reset button. > > > > Or you can remove SD card and start up kernel, and insert SD card when > > > kernel has been started up. > > > > Thanks a lot. > > > > > > > > Best regards, > > > > Yangbo Lu > > > > > > > > > > > > From: Li Yang <leoyang.li@nxp.com> > > > > Sent: Wednesday, September 18, 2019 1:48 AM > > > > To: Fabio Estevam <festevam@gmail.com>; Y.b. Lu <yangbo.lu@nxp.com> > > > > Cc: Adrian Hunter <adrian.hunter@intel.com>; Christoph Hellwig > > > > <hch@lst.de>; Linux ARM <linux-arm-kernel@lists.infradead.org>; > > > > Nicolin Chen <nicoleotsuka@gmail.com>; Russell King - ARM Linux admin > > > > <linux@armlinux.org.uk>; Will Deacon <will.deacon@arm.com>; dann > > > > frazier <dann.frazier@canonical.com>; linux-mmc > > > > <linux-mmc@vger.kernel.org> > > > > Subject: Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A > > > > > > > > > > > > > > > > On Tue, Sep 17, 2019 at 6:31 PM Fabio Estevam > > > <mailto:festevam@gmail.com> wrote: > > > > [Adding Li Yang] > > > > > > > > On Tue, Sep 17, 2019 at 10:52 AM Russell King - ARM Linux admin > > > > <mailto:linux@armlinux.org.uk> wrote: > > > > > > > > > The pressing question seems to be this: > > > > > > > > > > Are the eSDHC on the LX2160A DMA coherent or are they not? > > > > > > > > > > Any chances of finding out internally what the true answer to that, > > > > > rather than me poking about trying stuff experimentally? Having a > > > > > definitive answer for a potentially data-corrupting change would be > > > > > really good... > > > > > > > > Li Yang, > > > > > > > > Could you please help to confirm Russell's question? > > > > Adding Yangbo who is working on SDHC. > > > > > > > > Regards, > > > > Leo > > > > > > -- > > > RMK's Patch system: > > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ar > > > mlinux.org.uk%2Fdeveloper%2Fpatches%2F&data=02%7C01%7Cyangbo.l > > > u%40nxp.com%7C7eca2b9652104c95a52008d73ccfa99a%7C686ea1d3bc2b4 > > > c6fa92cd99c5c301635%7C0%7C0%7C637044734911465102&sdata=QB > > > SEzA9L2HC99gm65P965E3o%2FhNM18u2SouOZxTEs6s%3D&reserved=0 > > > FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps > > > up According to speedtest.net: 11.9Mbps down 500kbps up > > -- > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up > According to speedtest.net: 11.9Mbps down 500kbps up > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:07 ` Russell King - ARM Linux admin 2019-09-17 13:24 ` Fabio Estevam @ 2019-09-17 13:38 ` Robin Murphy 2019-09-17 13:49 ` Russell King - ARM Linux admin 2019-09-17 13:50 ` Will Deacon 1 sibling, 2 replies; 31+ messages in thread From: Robin Murphy @ 2019-09-17 13:38 UTC (permalink / raw) To: Russell King - ARM Linux admin, Adrian Hunter, Will Deacon, Nicolin Chen, dann frazier, Christoph Hellwig Cc: linux-mmc, Linux ARM On 17/09/2019 14:07, Russell King - ARM Linux admin wrote: > On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote: >> On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote: >>> On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: >>>> On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: >>>>> On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: >>>>>> On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: >>>>>>> On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: >>>>>>> >>>>>>>> The platform has an iommu, which is in pass-through mode, via >>>>>>>> arm_smmu.disable_bypass=0. >>>>>>> >>>>>>> Could be 954a03be033c7cef80ddc232e7cbdb17df735663 >>>>>>> "iommu/arm-smmu: Break insecure users by disabling bypass by default" >>>>>>> >>>>>>> Although it had already landed in v5.2 >>>>>> >>>>>> It is not - and the two lines that you quoted above are sufficient >>>>>> to negate that as a cause. (Please read the help for the option that >>>>>> the commit referrs to.) >>>>>> >>>>>> In fact, with bypass disabled, the SoC fails due to other masters. >>>>>> That's already been discussed privately between myself and Will >>>>>> Deacon. >>>>>> >>>>>> arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of >>>>>> the default setting in the Kconfig. >>>>> >>>>> Adding some further debugging, and fixing the existing ADMA debugging >>>>> shows: >>>>> >>>>> mmc0: ADMA error: 0x02000000 >>>>> >>>>> So this is an ADMA error without the transfer having completed. >>>>> >>>>> mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 >>>>> >>>>> The block size is 8, with one block. >>>>> >>>>> mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c >>>>> >>>>> The ADMA error is a descriptor error at address 0x000000236df1d20c. >>>>> The descriptor table contains (including the following entry): >>>>> >>>>> mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 >>>>> mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 >>>>> >>>>> The descriptor table contains one descriptor of 8 bytes, is marked >>>>> as the last (END bit set) and is at DMA address 0x236df1d200. The >>>>> following descriptor is empty, with VALID=0. >>>>> >>>>> One may be tempted to blame it on the following descriptor, but having >>>>> had another example on eMMC while userspace was booting (rootfs on >>>>> eMMC): >>>>> >>>>> mmc1: ADMA error: 0x02000000 >>>>> mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 >>>>> mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c >>>>> mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 >>>>> mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 >>>>> mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 >>>>> mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 >>>>> mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 >>>>> mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 >>>>> mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 >>>>> mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 >>>>> mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 >>>>> mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 >>>>> >>>>> ... which is interesting for several reasons: >>>>> - The ADMA error register indicates a length mismatch error. The >>>>> transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. >>>>> Summing the ADMA lengths up to the last descriptor (length=0 is >>>>> 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more >>>>> bytes than the requested transfer. >>>>> >>>>> - The ADMA error register indicates ST_CADR, which is described as >>>>> "This state is never set because do not generate ADMA error in this >>>>> state." >>>>> >>>>> - The error descriptor is again after the descriptor with END=1, but >>>>> this time has VALID=1. >>>>> >>>>> This _feels_ like a coherency issue, where the SDHCI engine is not >>>>> correctly seeing the descriptor table, but then I would have expected >>>>> userspace (which is basically debian stable) to fail to boot every >>>>> time given that its rootfs is on eMMC. >>>>> >>>>> The other weird thing is if I wind the core MMC code back via: >>>>> >>>>> $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R >>>>> >>>>> and fix the lack of dma_max_pfn(), then SDHCI is more stable - not >>>>> completely stable, but way better than plain v5.3. I don't see >>>>> much in that diff which would be responsible for this - although it >>>>> does seem that hch's DMA changes do make the problem more likely. >>>>> (going from 1 in 3 boots with a problem to being not able to boot.) >>>>> >>>>> Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled >>>>> bypass mode on the IOMMU (but then I saw global smmu errors right >>>>> from when the IOMMU had bypass disabled before MMC was probed - the >>>>> reason being is the SoC is not currently setup to have the MMU >>>>> bypass mode disabled.) >>>> >>>> This looks like an ARM64 coherency issue. >>>> >>>> I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), >>>> which had no effect. I then tried adding: >>>> >>>> + __dma_flush_area(host->adma_table, desc - host->adma_table); >>>> + dma_wmb(); >>>> >>>> and so far I haven't had any further ADMA errors. Adding Will Deacon >>>> to the thread. >>> >>> These are the changes to sdhci that I'm currently running. I think >>> some of the debugging related changes are probably worth adding to >>> the driver, particularly printing the intmask on ADMA error (which >>> is not printed by the register dump, as the value is lost) and printing >>> the DMA addresses of the descriptor table entries which can be tied >>> up with the DMA address error register. Also, maybe printing the >>> DMA descriptor table with the register dump, rather than having to >>> resort to enabling debug would be a good idea? >>> >>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c >>> index a5dc5aae973e..884dcaa9cad5 100644 >>> --- a/drivers/mmc/host/sdhci.c >>> +++ b/drivers/mmc/host/sdhci.c >>> @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, >>> /* Add a terminating entry - nop, end, valid */ >>> __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); >>> } >>> + __dma_flush_area(host->adma_table, desc - host->adma_table); >>> + dma_wmb(); >>> } >>> >>> static void sdhci_adma_table_post(struct sdhci_host *host, >>> @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) >>> static void sdhci_adma_show_error(struct sdhci_host *host) >>> { >>> void *desc = host->adma_table; >>> + dma_addr_t dma = host->adma_addr; >>> + bool end = false; >>> >>> sdhci_dumpregs(host); >>> >>> @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) >>> struct sdhci_adma2_64_desc *dma_desc = desc; >>> >>> if (host->flags & SDHCI_USE_64_BIT_DMA) >>> - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", >>> - desc, le32_to_cpu(dma_desc->addr_hi), >>> + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", >>> + (unsigned long long)dma, >>> + le32_to_cpu(dma_desc->addr_hi), >>> le32_to_cpu(dma_desc->addr_lo), >>> le16_to_cpu(dma_desc->len), >>> le16_to_cpu(dma_desc->cmd)); >>> else >>> - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", >>> - desc, le32_to_cpu(dma_desc->addr_lo), >>> + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", >>> + (unsigned long long)dma, >>> + le32_to_cpu(dma_desc->addr_lo), >>> le16_to_cpu(dma_desc->len), >>> le16_to_cpu(dma_desc->cmd)); >>> >>> + if (end) break; >>> + >>> desc += host->desc_sz; >>> + dma += host->desc_sz; >>> >>> if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) >>> - break; >>> + end = true; >>> } >>> } >>> >>> @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) >>> != MMC_BUS_TEST_R) >>> host->data->error = -EILSEQ; >>> else if (intmask & SDHCI_INT_ADMA_ERROR) { >>> - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); >>> + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); >>> sdhci_adma_show_error(host); >>> host->data->error = -EIO; >>> if (host->ops->adma_workaround) >> >> Further debug shows: >> >> coherent=0 - sdhci device is not cache coherent >> swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 >> [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, >> pmd=000000237fffb003, pte=00e800236d62270f >> >> The mapping for the ADMA table seems to be using MAIR index 3, which is >> MT_MEMORY_NC, so should be non-cacheable. >> >> vmallocinfo: >> 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 >> user >> >> So this memory has been remapped. Could there be an alias that has >> cache lines still in the cache for the physical address, and could we >> be hitting those cache lines while accessing through a non-cacheable >> mapping? (On 32-bit ARM, this is "unpredictable" and this problem >> definitely _feels_ like it has unpredictable attributes!) >> >> Also, given that this memory is mapped NC, then surely >> __dma_flush_area() should have no effect? However, it _does_ have the >> effect of reliably solving the problem, which to me implies that there >> _are_ cache lines in this NC mapping. > > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback > alloc_pages for single pages") which has been implicated in the same > problem here: > > https://www.spinics.net/lists/arm-kernel/msg750623.html > > Although reverting the commit is not clean, this also fixes the issue > for me. Note that that one turned out to be something totally different, namely that the single-page allocations, in difference to CMA, came from physical addresses that the controller needed additional configuration to be able to access[1] - no amount of cache maintenance would affect that. However, the other difference between getting a single page directly from the page allocator vs. the CMA area is that accesses to the linear mapping of the CMA area are probably pretty rare, whereas for the single-page case it's much more likely that kernel tasks using adjacent pages could lead to prefetching of the descriptor page's cacheable alias. That could certainly explain how reverting that commit manages to hide an apparent coherency issue. Robin. [1] https://lore.kernel.org/patchwork/patch/1121600/ _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:38 ` Robin Murphy @ 2019-09-17 13:49 ` Russell King - ARM Linux admin 2019-09-17 14:03 ` Robin Murphy 2019-09-17 13:50 ` Will Deacon 1 sibling, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 13:49 UTC (permalink / raw) To: Robin Murphy Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote: > On 17/09/2019 14:07, Russell King - ARM Linux admin wrote: > > On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote: > > > On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote: > > > > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: > > > > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: > > > > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: > > > > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > > > > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > > > > > > > > > > > > > > > The platform has an iommu, which is in pass-through mode, via > > > > > > > > > arm_smmu.disable_bypass=0. > > > > > > > > > > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > > > > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > > > > > > > > > > > > > > > Although it had already landed in v5.2 > > > > > > > > > > > > > > It is not - and the two lines that you quoted above are sufficient > > > > > > > to negate that as a cause. (Please read the help for the option that > > > > > > > the commit referrs to.) > > > > > > > > > > > > > > In fact, with bypass disabled, the SoC fails due to other masters. > > > > > > > That's already been discussed privately between myself and Will > > > > > > > Deacon. > > > > > > > > > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of > > > > > > > the default setting in the Kconfig. > > > > > > > > > > > > Adding some further debugging, and fixing the existing ADMA debugging > > > > > > shows: > > > > > > > > > > > > mmc0: ADMA error: 0x02000000 > > > > > > > > > > > > So this is an ADMA error without the transfer having completed. > > > > > > > > > > > > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > > > > > > > > > > > > The block size is 8, with one block. > > > > > > > > > > > > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c > > > > > > > > > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c. > > > > > > The descriptor table contains (including the following entry): > > > > > > > > > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 > > > > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 > > > > > > > > > > > > The descriptor table contains one descriptor of 8 bytes, is marked > > > > > > as the last (END bit set) and is at DMA address 0x236df1d200. The > > > > > > following descriptor is empty, with VALID=0. > > > > > > > > > > > > One may be tempted to blame it on the following descriptor, but having > > > > > > had another example on eMMC while userspace was booting (rootfs on > > > > > > eMMC): > > > > > > > > > > > > mmc1: ADMA error: 0x02000000 > > > > > > mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 > > > > > > mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c > > > > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 > > > > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 > > > > > > > > > > > > ... which is interesting for several reasons: > > > > > > - The ADMA error register indicates a length mismatch error. The > > > > > > transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. > > > > > > Summing the ADMA lengths up to the last descriptor (length=0 is > > > > > > 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more > > > > > > bytes than the requested transfer. > > > > > > > > > > > > - The ADMA error register indicates ST_CADR, which is described as > > > > > > "This state is never set because do not generate ADMA error in this > > > > > > state." > > > > > > > > > > > > - The error descriptor is again after the descriptor with END=1, but > > > > > > this time has VALID=1. > > > > > > > > > > > > This _feels_ like a coherency issue, where the SDHCI engine is not > > > > > > correctly seeing the descriptor table, but then I would have expected > > > > > > userspace (which is basically debian stable) to fail to boot every > > > > > > time given that its rootfs is on eMMC. > > > > > > > > > > > > The other weird thing is if I wind the core MMC code back via: > > > > > > > > > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R > > > > > > > > > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not > > > > > > completely stable, but way better than plain v5.3. I don't see > > > > > > much in that diff which would be responsible for this - although it > > > > > > does seem that hch's DMA changes do make the problem more likely. > > > > > > (going from 1 in 3 boots with a problem to being not able to boot.) > > > > > > > > > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled > > > > > > bypass mode on the IOMMU (but then I saw global smmu errors right > > > > > > from when the IOMMU had bypass disabled before MMC was probed - the > > > > > > reason being is the SoC is not currently setup to have the MMU > > > > > > bypass mode disabled.) > > > > > > > > > > This looks like an ARM64 coherency issue. > > > > > > > > > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), > > > > > which had no effect. I then tried adding: > > > > > > > > > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > > > > + dma_wmb(); > > > > > > > > > > and so far I haven't had any further ADMA errors. Adding Will Deacon > > > > > to the thread. > > > > > > > > These are the changes to sdhci that I'm currently running. I think > > > > some of the debugging related changes are probably worth adding to > > > > the driver, particularly printing the intmask on ADMA error (which > > > > is not printed by the register dump, as the value is lost) and printing > > > > the DMA addresses of the descriptor table entries which can be tied > > > > up with the DMA address error register. Also, maybe printing the > > > > DMA descriptor table with the register dump, rather than having to > > > > resort to enabling debug would be a good idea? > > > > > > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c > > > > index a5dc5aae973e..884dcaa9cad5 100644 > > > > --- a/drivers/mmc/host/sdhci.c > > > > +++ b/drivers/mmc/host/sdhci.c > > > > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, > > > > /* Add a terminating entry - nop, end, valid */ > > > > __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); > > > > } > > > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > > > + dma_wmb(); > > > > } > > > > static void sdhci_adma_table_post(struct sdhci_host *host, > > > > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) > > > > static void sdhci_adma_show_error(struct sdhci_host *host) > > > > { > > > > void *desc = host->adma_table; > > > > + dma_addr_t dma = host->adma_addr; > > > > + bool end = false; > > > > sdhci_dumpregs(host); > > > > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) > > > > struct sdhci_adma2_64_desc *dma_desc = desc; > > > > if (host->flags & SDHCI_USE_64_BIT_DMA) > > > > - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > - desc, le32_to_cpu(dma_desc->addr_hi), > > > > + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > + (unsigned long long)dma, > > > > + le32_to_cpu(dma_desc->addr_hi), > > > > le32_to_cpu(dma_desc->addr_lo), > > > > le16_to_cpu(dma_desc->len), > > > > le16_to_cpu(dma_desc->cmd)); > > > > else > > > > - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > - desc, le32_to_cpu(dma_desc->addr_lo), > > > > + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > + (unsigned long long)dma, > > > > + le32_to_cpu(dma_desc->addr_lo), > > > > le16_to_cpu(dma_desc->len), > > > > le16_to_cpu(dma_desc->cmd)); > > > > + if (end) break; > > > > + > > > > desc += host->desc_sz; > > > > + dma += host->desc_sz; > > > > if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) > > > > - break; > > > > + end = true; > > > > } > > > > } > > > > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) > > > > != MMC_BUS_TEST_R) > > > > host->data->error = -EILSEQ; > > > > else if (intmask & SDHCI_INT_ADMA_ERROR) { > > > > - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); > > > > + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); > > > > sdhci_adma_show_error(host); > > > > host->data->error = -EIO; > > > > if (host->ops->adma_workaround) > > > > > > Further debug shows: > > > > > > coherent=0 - sdhci device is not cache coherent > > > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 > > > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, > > > pmd=000000237fffb003, pte=00e800236d62270f > > > > > > The mapping for the ADMA table seems to be using MAIR index 3, which is > > > MT_MEMORY_NC, so should be non-cacheable. > > > > > > vmallocinfo: > > > 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 > > > user > > > > > > So this memory has been remapped. Could there be an alias that has > > > cache lines still in the cache for the physical address, and could we > > > be hitting those cache lines while accessing through a non-cacheable > > > mapping? (On 32-bit ARM, this is "unpredictable" and this problem > > > definitely _feels_ like it has unpredictable attributes!) > > > > > > Also, given that this memory is mapped NC, then surely > > > __dma_flush_area() should have no effect? However, it _does_ have the > > > effect of reliably solving the problem, which to me implies that there > > > _are_ cache lines in this NC mapping. > > > > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback > > alloc_pages for single pages") which has been implicated in the same > > problem here: > > > > https://www.spinics.net/lists/arm-kernel/msg750623.html > > > > Although reverting the commit is not clean, this also fixes the issue > > for me. > > Note that that one turned out to be something totally different, namely that > the single-page allocations, in difference to CMA, came from physical > addresses that the controller needed additional configuration to be able to > access[1] - no amount of cache maintenance would affect that. As already replied, v4 mode is not documented as being available on the LX2160A - the bit in the control register is marked as "reserved". This is as expected as it is documented that it is using a v3.00 of the SDHCI standard, rather than v4.00. So, sorry, enabling "v4 mode" isn't a workaround in this scenario. Given that v4 mode is not mandatory, this shouldn't be a work-around. Given that it _does_ work some of the time with the table >4GB, then this is not an addressing limitation. > However, the other difference between getting a single page directly from > the page allocator vs. the CMA area is that accesses to the linear mapping > of the CMA area are probably pretty rare, whereas for the single-page case > it's much more likely that kernel tasks using adjacent pages could lead to > prefetching of the descriptor page's cacheable alias. That could certainly > explain how reverting that commit manages to hide an apparent coherency > issue. Right, so how do we fix this? -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:49 ` Russell King - ARM Linux admin @ 2019-09-17 14:03 ` Robin Murphy 2019-09-19 9:16 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Robin Murphy @ 2019-09-17 14:03 UTC (permalink / raw) To: Russell King - ARM Linux admin Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM On 17/09/2019 14:49, Russell King - ARM Linux admin wrote: > On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote: >> On 17/09/2019 14:07, Russell King - ARM Linux admin wrote: >>> On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote: >>>> On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote: >>>>> On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: >>>>>> On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: >>>>>>> On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: >>>>>>>> On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: >>>>>>>>> On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: >>>>>>>>> >>>>>>>>>> The platform has an iommu, which is in pass-through mode, via >>>>>>>>>> arm_smmu.disable_bypass=0. >>>>>>>>> >>>>>>>>> Could be 954a03be033c7cef80ddc232e7cbdb17df735663 >>>>>>>>> "iommu/arm-smmu: Break insecure users by disabling bypass by default" >>>>>>>>> >>>>>>>>> Although it had already landed in v5.2 >>>>>>>> >>>>>>>> It is not - and the two lines that you quoted above are sufficient >>>>>>>> to negate that as a cause. (Please read the help for the option that >>>>>>>> the commit referrs to.) >>>>>>>> >>>>>>>> In fact, with bypass disabled, the SoC fails due to other masters. >>>>>>>> That's already been discussed privately between myself and Will >>>>>>>> Deacon. >>>>>>>> >>>>>>>> arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of >>>>>>>> the default setting in the Kconfig. >>>>>>> >>>>>>> Adding some further debugging, and fixing the existing ADMA debugging >>>>>>> shows: >>>>>>> >>>>>>> mmc0: ADMA error: 0x02000000 >>>>>>> >>>>>>> So this is an ADMA error without the transfer having completed. >>>>>>> >>>>>>> mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 >>>>>>> >>>>>>> The block size is 8, with one block. >>>>>>> >>>>>>> mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c >>>>>>> >>>>>>> The ADMA error is a descriptor error at address 0x000000236df1d20c. >>>>>>> The descriptor table contains (including the following entry): >>>>>>> >>>>>>> mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 >>>>>>> mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 >>>>>>> >>>>>>> The descriptor table contains one descriptor of 8 bytes, is marked >>>>>>> as the last (END bit set) and is at DMA address 0x236df1d200. The >>>>>>> following descriptor is empty, with VALID=0. >>>>>>> >>>>>>> One may be tempted to blame it on the following descriptor, but having >>>>>>> had another example on eMMC while userspace was booting (rootfs on >>>>>>> eMMC): >>>>>>> >>>>>>> mmc1: ADMA error: 0x02000000 >>>>>>> mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 >>>>>>> mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c >>>>>>> mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 >>>>>>> mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 >>>>>>> >>>>>>> ... which is interesting for several reasons: >>>>>>> - The ADMA error register indicates a length mismatch error. The >>>>>>> transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. >>>>>>> Summing the ADMA lengths up to the last descriptor (length=0 is >>>>>>> 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more >>>>>>> bytes than the requested transfer. >>>>>>> >>>>>>> - The ADMA error register indicates ST_CADR, which is described as >>>>>>> "This state is never set because do not generate ADMA error in this >>>>>>> state." >>>>>>> >>>>>>> - The error descriptor is again after the descriptor with END=1, but >>>>>>> this time has VALID=1. >>>>>>> >>>>>>> This _feels_ like a coherency issue, where the SDHCI engine is not >>>>>>> correctly seeing the descriptor table, but then I would have expected >>>>>>> userspace (which is basically debian stable) to fail to boot every >>>>>>> time given that its rootfs is on eMMC. >>>>>>> >>>>>>> The other weird thing is if I wind the core MMC code back via: >>>>>>> >>>>>>> $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R >>>>>>> >>>>>>> and fix the lack of dma_max_pfn(), then SDHCI is more stable - not >>>>>>> completely stable, but way better than plain v5.3. I don't see >>>>>>> much in that diff which would be responsible for this - although it >>>>>>> does seem that hch's DMA changes do make the problem more likely. >>>>>>> (going from 1 in 3 boots with a problem to being not able to boot.) >>>>>>> >>>>>>> Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled >>>>>>> bypass mode on the IOMMU (but then I saw global smmu errors right >>>>>>> from when the IOMMU had bypass disabled before MMC was probed - the >>>>>>> reason being is the SoC is not currently setup to have the MMU >>>>>>> bypass mode disabled.) >>>>>> >>>>>> This looks like an ARM64 coherency issue. >>>>>> >>>>>> I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), >>>>>> which had no effect. I then tried adding: >>>>>> >>>>>> + __dma_flush_area(host->adma_table, desc - host->adma_table); >>>>>> + dma_wmb(); >>>>>> >>>>>> and so far I haven't had any further ADMA errors. Adding Will Deacon >>>>>> to the thread. >>>>> >>>>> These are the changes to sdhci that I'm currently running. I think >>>>> some of the debugging related changes are probably worth adding to >>>>> the driver, particularly printing the intmask on ADMA error (which >>>>> is not printed by the register dump, as the value is lost) and printing >>>>> the DMA addresses of the descriptor table entries which can be tied >>>>> up with the DMA address error register. Also, maybe printing the >>>>> DMA descriptor table with the register dump, rather than having to >>>>> resort to enabling debug would be a good idea? >>>>> >>>>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c >>>>> index a5dc5aae973e..884dcaa9cad5 100644 >>>>> --- a/drivers/mmc/host/sdhci.c >>>>> +++ b/drivers/mmc/host/sdhci.c >>>>> @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, >>>>> /* Add a terminating entry - nop, end, valid */ >>>>> __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); >>>>> } >>>>> + __dma_flush_area(host->adma_table, desc - host->adma_table); >>>>> + dma_wmb(); >>>>> } >>>>> static void sdhci_adma_table_post(struct sdhci_host *host, >>>>> @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) >>>>> static void sdhci_adma_show_error(struct sdhci_host *host) >>>>> { >>>>> void *desc = host->adma_table; >>>>> + dma_addr_t dma = host->adma_addr; >>>>> + bool end = false; >>>>> sdhci_dumpregs(host); >>>>> @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) >>>>> struct sdhci_adma2_64_desc *dma_desc = desc; >>>>> if (host->flags & SDHCI_USE_64_BIT_DMA) >>>>> - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", >>>>> - desc, le32_to_cpu(dma_desc->addr_hi), >>>>> + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", >>>>> + (unsigned long long)dma, >>>>> + le32_to_cpu(dma_desc->addr_hi), >>>>> le32_to_cpu(dma_desc->addr_lo), >>>>> le16_to_cpu(dma_desc->len), >>>>> le16_to_cpu(dma_desc->cmd)); >>>>> else >>>>> - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", >>>>> - desc, le32_to_cpu(dma_desc->addr_lo), >>>>> + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", >>>>> + (unsigned long long)dma, >>>>> + le32_to_cpu(dma_desc->addr_lo), >>>>> le16_to_cpu(dma_desc->len), >>>>> le16_to_cpu(dma_desc->cmd)); >>>>> + if (end) break; >>>>> + >>>>> desc += host->desc_sz; >>>>> + dma += host->desc_sz; >>>>> if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) >>>>> - break; >>>>> + end = true; >>>>> } >>>>> } >>>>> @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) >>>>> != MMC_BUS_TEST_R) >>>>> host->data->error = -EILSEQ; >>>>> else if (intmask & SDHCI_INT_ADMA_ERROR) { >>>>> - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); >>>>> + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); >>>>> sdhci_adma_show_error(host); >>>>> host->data->error = -EIO; >>>>> if (host->ops->adma_workaround) >>>> >>>> Further debug shows: >>>> >>>> coherent=0 - sdhci device is not cache coherent >>>> swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 >>>> [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, >>>> pmd=000000237fffb003, pte=00e800236d62270f >>>> >>>> The mapping for the ADMA table seems to be using MAIR index 3, which is >>>> MT_MEMORY_NC, so should be non-cacheable. >>>> >>>> vmallocinfo: >>>> 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 >>>> user >>>> >>>> So this memory has been remapped. Could there be an alias that has >>>> cache lines still in the cache for the physical address, and could we >>>> be hitting those cache lines while accessing through a non-cacheable >>>> mapping? (On 32-bit ARM, this is "unpredictable" and this problem >>>> definitely _feels_ like it has unpredictable attributes!) >>>> >>>> Also, given that this memory is mapped NC, then surely >>>> __dma_flush_area() should have no effect? However, it _does_ have the >>>> effect of reliably solving the problem, which to me implies that there >>>> _are_ cache lines in this NC mapping. >>> >>> Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback >>> alloc_pages for single pages") which has been implicated in the same >>> problem here: >>> >>> https://www.spinics.net/lists/arm-kernel/msg750623.html >>> >>> Although reverting the commit is not clean, this also fixes the issue >>> for me. >> >> Note that that one turned out to be something totally different, namely that >> the single-page allocations, in difference to CMA, came from physical >> addresses that the controller needed additional configuration to be able to >> access[1] - no amount of cache maintenance would affect that. > > As already replied, v4 mode is not documented as being available on > the LX2160A - the bit in the control register is marked as "reserved". > This is as expected as it is documented that it is using a v3.00 of > the SDHCI standard, rather than v4.00. > > So, sorry, enabling "v4 mode" isn't a workaround in this scenario. > > Given that v4 mode is not mandatory, this shouldn't be a work-around. > > Given that it _does_ work some of the time with the table >4GB, then > this is not an addressing limitation. Yes, that's what "something totally different" usually means. >> However, the other difference between getting a single page directly from >> the page allocator vs. the CMA area is that accesses to the linear mapping >> of the CMA area are probably pretty rare, whereas for the single-page case >> it's much more likely that kernel tasks using adjacent pages could lead to >> prefetching of the descriptor page's cacheable alias. That could certainly >> explain how reverting that commit manages to hide an apparent coherency >> issue. > > Right, so how do we fix this? By describing the hardware correctly in the DT. Robin. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 14:03 ` Robin Murphy @ 2019-09-19 9:16 ` Russell King - ARM Linux admin 2019-09-19 14:02 ` Robin Murphy 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-19 9:16 UTC (permalink / raw) To: Robin Murphy, Y.b. Lu Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM On Tue, Sep 17, 2019 at 03:03:29PM +0100, Robin Murphy wrote: > On 17/09/2019 14:49, Russell King - ARM Linux admin wrote: > > As already replied, v4 mode is not documented as being available on > > the LX2160A - the bit in the control register is marked as "reserved". > > This is as expected as it is documented that it is using a v3.00 of > > the SDHCI standard, rather than v4.00. > > > > So, sorry, enabling "v4 mode" isn't a workaround in this scenario. > > > > Given that v4 mode is not mandatory, this shouldn't be a work-around. > > > > Given that it _does_ work some of the time with the table >4GB, then > > this is not an addressing limitation. > > Yes, that's what "something totally different" usually means. > > > > However, the other difference between getting a single page directly from > > > the page allocator vs. the CMA area is that accesses to the linear mapping > > > of the CMA area are probably pretty rare, whereas for the single-page case > > > it's much more likely that kernel tasks using adjacent pages could lead to > > > prefetching of the descriptor page's cacheable alias. That could certainly > > > explain how reverting that commit manages to hide an apparent coherency > > > issue. > > > > Right, so how do we fix this? > > By describing the hardware correctly in the DT. It would appear that it _is_ correctly described given the default hardware configuration, but the driver sets a bit in a control register that enables cache snooping. Adding "dma-coherent" to the DT description does not seem to be the correct solution, as we are reliant on the DT description and driver implementation both agreeing, which is fragile. From what I can see, there isn't a way for a driver to say "I've made this device is coherent now" and I suspect making the driver set the DMA snoop bit depending on whether "dma-coherent" is present in DT or not will cause data-corrupting regressions for other people. So, we're back to where we started - what is the right solution to this problem? The only thing I can think is that the driver needs to do something like: WARN_ON(!dev_is_dma_coherent(dev)); in esdhc_of_enable_dma() as a first step, and ensuring that the snoop bit matches the state of dev_is_dma_coherent(dev)? Is it permitted to use dev_is_dma_coherent() in drivers - it doesn't seem to be part of the normal DMA API? -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-19 9:16 ` Russell King - ARM Linux admin @ 2019-09-19 14:02 ` Robin Murphy 2019-09-19 17:23 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Robin Murphy @ 2019-09-19 14:02 UTC (permalink / raw) To: Russell King - ARM Linux admin, Y.b. Lu Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Christoph Hellwig, Linux ARM On 19/09/2019 10:16, Russell King - ARM Linux admin wrote: > On Tue, Sep 17, 2019 at 03:03:29PM +0100, Robin Murphy wrote: >> On 17/09/2019 14:49, Russell King - ARM Linux admin wrote: >>> As already replied, v4 mode is not documented as being available on >>> the LX2160A - the bit in the control register is marked as "reserved". >>> This is as expected as it is documented that it is using a v3.00 of >>> the SDHCI standard, rather than v4.00. >>> >>> So, sorry, enabling "v4 mode" isn't a workaround in this scenario. >>> >>> Given that v4 mode is not mandatory, this shouldn't be a work-around. >>> >>> Given that it _does_ work some of the time with the table >4GB, then >>> this is not an addressing limitation. >> >> Yes, that's what "something totally different" usually means. >> >>>> However, the other difference between getting a single page directly from >>>> the page allocator vs. the CMA area is that accesses to the linear mapping >>>> of the CMA area are probably pretty rare, whereas for the single-page case >>>> it's much more likely that kernel tasks using adjacent pages could lead to >>>> prefetching of the descriptor page's cacheable alias. That could certainly >>>> explain how reverting that commit manages to hide an apparent coherency >>>> issue. >>> >>> Right, so how do we fix this? >> >> By describing the hardware correctly in the DT. > > It would appear that it _is_ correctly described given the default > hardware configuration, but the driver sets a bit in a control > register that enables cache snooping. Oh, fun. FWIW, the more general form of that statement would be "by ensuring that the device behaviour and the DT description are consistent", it's just rare to have both degrees of freedom. Even in these cases, though, it tends to be ultimately necessary to defer to what the DT says, because there can be situations where the IP believes itself capable of enabling snoops, but the integration failed to wire things up correctly for them to actually work. I know we have to deal with that in arm-smmu, for one example. > Adding "dma-coherent" to the DT description does not seem to be the > correct solution, as we are reliant on the DT description and driver > implementation both agreeing, which is fragile. > > From what I can see, there isn't a way for a driver to say "I've made > this device is coherent now" and I suspect making the driver set the > DMA snoop bit depending on whether "dma-coherent" is present in DT or > not will cause data-corrupting regressions for other people. > > So, we're back to where we started - what is the right solution to > this problem? > > The only thing I can think is that the driver needs to do something > like: > > WARN_ON(!dev_is_dma_coherent(dev)); > > in esdhc_of_enable_dma() as a first step, and ensuring that the snoop > bit matches the state of dev_is_dma_coherent(dev)? Is it permitted to > use dev_is_dma_coherent() in drivers - it doesn't seem to be part of > the normal DMA API? The safest option would be to query the firmware property layer via device_get_dma_attr() - or potentially short-cut to of_dma_is_coherent() for a pure DT driver. Even disregarding API purity, I don't think the DMA API internals are really generic enough yet to reliably poke at (although FWIW, *certain* cases like dma_direct_ops would now actually work as expected if one did the unspeakable and flipped dev->dma_coherent from a driver, but that would definitely not win any friends). Robin. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-19 14:02 ` Robin Murphy @ 2019-09-19 17:23 ` Russell King - ARM Linux admin 2019-09-20 9:55 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-19 17:23 UTC (permalink / raw) To: Robin Murphy Cc: dann frazier, Will Deacon, Adrian Hunter, Nicolin Chen, Y.b. Lu, linux-mmc, Christoph Hellwig, Linux ARM On Thu, Sep 19, 2019 at 03:02:39PM +0100, Robin Murphy wrote: > On 19/09/2019 10:16, Russell King - ARM Linux admin wrote: > > On Tue, Sep 17, 2019 at 03:03:29PM +0100, Robin Murphy wrote: > > > On 17/09/2019 14:49, Russell King - ARM Linux admin wrote: > > > > As already replied, v4 mode is not documented as being available on > > > > the LX2160A - the bit in the control register is marked as "reserved". > > > > This is as expected as it is documented that it is using a v3.00 of > > > > the SDHCI standard, rather than v4.00. > > > > > > > > So, sorry, enabling "v4 mode" isn't a workaround in this scenario. > > > > > > > > Given that v4 mode is not mandatory, this shouldn't be a work-around. > > > > > > > > Given that it _does_ work some of the time with the table >4GB, then > > > > this is not an addressing limitation. > > > > > > Yes, that's what "something totally different" usually means. > > > > > > > > However, the other difference between getting a single page directly from > > > > > the page allocator vs. the CMA area is that accesses to the linear mapping > > > > > of the CMA area are probably pretty rare, whereas for the single-page case > > > > > it's much more likely that kernel tasks using adjacent pages could lead to > > > > > prefetching of the descriptor page's cacheable alias. That could certainly > > > > > explain how reverting that commit manages to hide an apparent coherency > > > > > issue. > > > > > > > > Right, so how do we fix this? > > > > > > By describing the hardware correctly in the DT. > > > > It would appear that it _is_ correctly described given the default > > hardware configuration, but the driver sets a bit in a control > > register that enables cache snooping. > > Oh, fun. FWIW, the more general form of that statement would be "by ensuring > that the device behaviour and the DT description are consistent", it's just > rare to have both degrees of freedom. > > Even in these cases, though, it tends to be ultimately necessary to defer to > what the DT says, because there can be situations where the IP believes > itself capable of enabling snoops, but the integration failed to wire things > up correctly for them to actually work. I know we have to deal with that in > arm-smmu, for one example. > > > Adding "dma-coherent" to the DT description does not seem to be the > > correct solution, as we are reliant on the DT description and driver > > implementation both agreeing, which is fragile. > > > > From what I can see, there isn't a way for a driver to say "I've made > > this device is coherent now" and I suspect making the driver set the > > DMA snoop bit depending on whether "dma-coherent" is present in DT or > > not will cause data-corrupting regressions for other people. > > > > So, we're back to where we started - what is the right solution to > > this problem? > > > > The only thing I can think is that the driver needs to do something > > like: > > > > WARN_ON(!dev_is_dma_coherent(dev)); > > > > in esdhc_of_enable_dma() as a first step, and ensuring that the snoop > > bit matches the state of dev_is_dma_coherent(dev)? Is it permitted to > > use dev_is_dma_coherent() in drivers - it doesn't seem to be part of > > the normal DMA API? > > The safest option would be to query the firmware property layer via > device_get_dma_attr() - or potentially short-cut to of_dma_is_coherent() for > a pure DT driver. Even disregarding API purity, I don't think the DMA API > internals are really generic enough yet to reliably poke at (although FWIW, > *certain* cases like dma_direct_ops would now actually work as expected if > one did the unspeakable and flipped dev->dma_coherent from a driver, but > that would definitely not win any friends). So, I prepared a few options, and option 2 was: drivers/mmc/host/sdhci-of-esdhc.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/mmc/host/sdhci-of-esdhc.c b/drivers/mmc/host/sdhci-of-esdhc.c index 4dd43b1adf2c..8076a1322499 100644 --- a/drivers/mmc/host/sdhci-of-esdhc.c +++ b/drivers/mmc/host/sdhci-of-esdhc.c @@ -19,6 +19,7 @@ #include <linux/clk.h> #include <linux/ktime.h> #include <linux/dma-mapping.h> +#include <linux/dma-noncoherent.h> #include <linux/mmc/host.h> #include <linux/mmc/mmc.h> #include "sdhci-pltfm.h" @@ -495,7 +496,12 @@ static int esdhc_of_enable_dma(struct sdhci_host *host) dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40)); value = sdhci_readl(host, ESDHC_DMA_SYSCTL); - value |= ESDHC_DMA_SNOOP; + + if (dev_is_dma_coherent(dev)) + value |= ESDHC_DMA_SNOOP; + else + value &= ~ESDHC_DMA_SNOOP; + sdhci_writel(host, value, ESDHC_DMA_SYSCTL); return 0; } The dev_is_dma_coherent() could be changed to something like device_get_dma_attr() if that's the correct thing to base this off of. However, if it returns DEV_DMA_NOT_SUPPORTED, then what? Assume non-coherent or assume coherent? What will the DMA API layer assume? It seems to me that we want the DMA API layer and the driver to both agree whether the device is to be coherent or not, and for the sake of data integrity, we do not want any possibility for them to deviate in that decision making process. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-19 17:23 ` Russell King - ARM Linux admin @ 2019-09-20 9:55 ` Russell King - ARM Linux admin 0 siblings, 0 replies; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-20 9:55 UTC (permalink / raw) To: Robin Murphy Cc: dann frazier, linux-mmc, Adrian Hunter, Will Deacon, Nicolin Chen, Y.b. Lu, Christoph Hellwig, Linux ARM On Thu, Sep 19, 2019 at 06:23:49PM +0100, Russell King - ARM Linux admin wrote: > On Thu, Sep 19, 2019 at 03:02:39PM +0100, Robin Murphy wrote: > > On 19/09/2019 10:16, Russell King - ARM Linux admin wrote: > > > On Tue, Sep 17, 2019 at 03:03:29PM +0100, Robin Murphy wrote: > > > > On 17/09/2019 14:49, Russell King - ARM Linux admin wrote: > > > > > As already replied, v4 mode is not documented as being available on > > > > > the LX2160A - the bit in the control register is marked as "reserved". > > > > > This is as expected as it is documented that it is using a v3.00 of > > > > > the SDHCI standard, rather than v4.00. > > > > > > > > > > So, sorry, enabling "v4 mode" isn't a workaround in this scenario. > > > > > > > > > > Given that v4 mode is not mandatory, this shouldn't be a work-around. > > > > > > > > > > Given that it _does_ work some of the time with the table >4GB, then > > > > > this is not an addressing limitation. > > > > > > > > Yes, that's what "something totally different" usually means. > > > > > > > > > > However, the other difference between getting a single page directly from > > > > > > the page allocator vs. the CMA area is that accesses to the linear mapping > > > > > > of the CMA area are probably pretty rare, whereas for the single-page case > > > > > > it's much more likely that kernel tasks using adjacent pages could lead to > > > > > > prefetching of the descriptor page's cacheable alias. That could certainly > > > > > > explain how reverting that commit manages to hide an apparent coherency > > > > > > issue. > > > > > > > > > > Right, so how do we fix this? > > > > > > > > By describing the hardware correctly in the DT. > > > > > > It would appear that it _is_ correctly described given the default > > > hardware configuration, but the driver sets a bit in a control > > > register that enables cache snooping. > > > > Oh, fun. FWIW, the more general form of that statement would be "by ensuring > > that the device behaviour and the DT description are consistent", it's just > > rare to have both degrees of freedom. > > > > Even in these cases, though, it tends to be ultimately necessary to defer to > > what the DT says, because there can be situations where the IP believes > > itself capable of enabling snoops, but the integration failed to wire things > > up correctly for them to actually work. I know we have to deal with that in > > arm-smmu, for one example. > > > > > Adding "dma-coherent" to the DT description does not seem to be the > > > correct solution, as we are reliant on the DT description and driver > > > implementation both agreeing, which is fragile. > > > > > > From what I can see, there isn't a way for a driver to say "I've made > > > this device is coherent now" and I suspect making the driver set the > > > DMA snoop bit depending on whether "dma-coherent" is present in DT or > > > not will cause data-corrupting regressions for other people. > > > > > > So, we're back to where we started - what is the right solution to > > > this problem? > > > > > > The only thing I can think is that the driver needs to do something > > > like: > > > > > > WARN_ON(!dev_is_dma_coherent(dev)); > > > > > > in esdhc_of_enable_dma() as a first step, and ensuring that the snoop > > > bit matches the state of dev_is_dma_coherent(dev)? Is it permitted to > > > use dev_is_dma_coherent() in drivers - it doesn't seem to be part of > > > the normal DMA API? > > > > The safest option would be to query the firmware property layer via > > device_get_dma_attr() - or potentially short-cut to of_dma_is_coherent() for > > a pure DT driver. Even disregarding API purity, I don't think the DMA API > > internals are really generic enough yet to reliably poke at (although FWIW, > > *certain* cases like dma_direct_ops would now actually work as expected if > > one did the unspeakable and flipped dev->dma_coherent from a driver, but > > that would definitely not win any friends). > > So, I prepared a few options, and option 2 was: > > drivers/mmc/host/sdhci-of-esdhc.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/mmc/host/sdhci-of-esdhc.c b/drivers/mmc/host/sdhci-of-esdhc.c > index 4dd43b1adf2c..8076a1322499 100644 > --- a/drivers/mmc/host/sdhci-of-esdhc.c > +++ b/drivers/mmc/host/sdhci-of-esdhc.c > @@ -19,6 +19,7 @@ > #include <linux/clk.h> > #include <linux/ktime.h> > #include <linux/dma-mapping.h> > +#include <linux/dma-noncoherent.h> > #include <linux/mmc/host.h> > #include <linux/mmc/mmc.h> > #include "sdhci-pltfm.h" > @@ -495,7 +496,12 @@ static int esdhc_of_enable_dma(struct sdhci_host *host) > dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40)); > > value = sdhci_readl(host, ESDHC_DMA_SYSCTL); > - value |= ESDHC_DMA_SNOOP; > + > + if (dev_is_dma_coherent(dev)) > + value |= ESDHC_DMA_SNOOP; > + else > + value &= ~ESDHC_DMA_SNOOP; > + > sdhci_writel(host, value, ESDHC_DMA_SYSCTL); > return 0; > } > > The dev_is_dma_coherent() could be changed to something like > device_get_dma_attr() if that's the correct thing to base this > off of. However, if it returns DEV_DMA_NOT_SUPPORTED, then what? > Assume non-coherent or assume coherent? What will the DMA API > layer assume? > > It seems to me that we want the DMA API layer and the driver to > both agree whether the device is to be coherent or not, and for > the sake of data integrity, we do not want any possibility for > them to deviate in that decision making process. I think using of_dma_is_coherent() is the safest, as if the driver needs to be updated to ACPI, the problem will need to be readdressed. The conditions on which dev->dma_coherent is set by the ACPI code differs from the conditions that determine the return value of acpi_get_dma_attr(). So, how about this: drivers/mmc/host/sdhci-of-esdhc.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/mmc/host/sdhci-of-esdhc.c b/drivers/mmc/host/sdhci-of-esdhc.c index 4dd43b1adf2c..74de5e8c45c8 100644 --- a/drivers/mmc/host/sdhci-of-esdhc.c +++ b/drivers/mmc/host/sdhci-of-esdhc.c @@ -495,7 +495,12 @@ static int esdhc_of_enable_dma(struct sdhci_host *host) dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40)); value = sdhci_readl(host, ESDHC_DMA_SYSCTL); - value |= ESDHC_DMA_SNOOP; + + if (of_dma_is_coherent(dev->of_node)) + value |= ESDHC_DMA_SNOOP; + else + value &= ~ESDHC_DMA_SNOOP; + sdhci_writel(host, value, ESDHC_DMA_SYSCTL); return 0; } -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:38 ` Robin Murphy 2019-09-17 13:49 ` Russell King - ARM Linux admin @ 2019-09-17 13:50 ` Will Deacon 2019-09-17 13:55 ` Robin Murphy 1 sibling, 1 reply; 31+ messages in thread From: Will Deacon @ 2019-09-17 13:50 UTC (permalink / raw) To: Robin Murphy Cc: dann frazier, Will Deacon, Russell King - ARM Linux admin, Adrian Hunter, Nicolin Chen, linux-mmc, Christoph Hellwig, Linux ARM On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote: > On 17/09/2019 14:07, Russell King - ARM Linux admin wrote: > > On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote: > > > On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote: > > > > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: > > > > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: > > > > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: > > > > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > > > > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > > > > > > > > > > > > > > > The platform has an iommu, which is in pass-through mode, via > > > > > > > > > arm_smmu.disable_bypass=0. > > > > > > > > > > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > > > > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > > > > > > > > > > > > > > > Although it had already landed in v5.2 > > > > > > > > > > > > > > It is not - and the two lines that you quoted above are sufficient > > > > > > > to negate that as a cause. (Please read the help for the option that > > > > > > > the commit referrs to.) > > > > > > > > > > > > > > In fact, with bypass disabled, the SoC fails due to other masters. > > > > > > > That's already been discussed privately between myself and Will > > > > > > > Deacon. > > > > > > > > > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of > > > > > > > the default setting in the Kconfig. > > > > > > > > > > > > Adding some further debugging, and fixing the existing ADMA debugging > > > > > > shows: > > > > > > > > > > > > mmc0: ADMA error: 0x02000000 > > > > > > > > > > > > So this is an ADMA error without the transfer having completed. > > > > > > > > > > > > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > > > > > > > > > > > > The block size is 8, with one block. > > > > > > > > > > > > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c > > > > > > > > > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c. > > > > > > The descriptor table contains (including the following entry): > > > > > > > > > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 > > > > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 > > > > > > > > > > > > The descriptor table contains one descriptor of 8 bytes, is marked > > > > > > as the last (END bit set) and is at DMA address 0x236df1d200. The > > > > > > following descriptor is empty, with VALID=0. > > > > > > > > > > > > One may be tempted to blame it on the following descriptor, but having > > > > > > had another example on eMMC while userspace was booting (rootfs on > > > > > > eMMC): > > > > > > > > > > > > mmc1: ADMA error: 0x02000000 > > > > > > mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 > > > > > > mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c > > > > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 > > > > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 > > > > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 > > > > > > > > > > > > ... which is interesting for several reasons: > > > > > > - The ADMA error register indicates a length mismatch error. The > > > > > > transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. > > > > > > Summing the ADMA lengths up to the last descriptor (length=0 is > > > > > > 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more > > > > > > bytes than the requested transfer. > > > > > > > > > > > > - The ADMA error register indicates ST_CADR, which is described as > > > > > > "This state is never set because do not generate ADMA error in this > > > > > > state." > > > > > > > > > > > > - The error descriptor is again after the descriptor with END=1, but > > > > > > this time has VALID=1. > > > > > > > > > > > > This _feels_ like a coherency issue, where the SDHCI engine is not > > > > > > correctly seeing the descriptor table, but then I would have expected > > > > > > userspace (which is basically debian stable) to fail to boot every > > > > > > time given that its rootfs is on eMMC. > > > > > > > > > > > > The other weird thing is if I wind the core MMC code back via: > > > > > > > > > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R > > > > > > > > > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not > > > > > > completely stable, but way better than plain v5.3. I don't see > > > > > > much in that diff which would be responsible for this - although it > > > > > > does seem that hch's DMA changes do make the problem more likely. > > > > > > (going from 1 in 3 boots with a problem to being not able to boot.) > > > > > > > > > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled > > > > > > bypass mode on the IOMMU (but then I saw global smmu errors right > > > > > > from when the IOMMU had bypass disabled before MMC was probed - the > > > > > > reason being is the SoC is not currently setup to have the MMU > > > > > > bypass mode disabled.) > > > > > > > > > > This looks like an ARM64 coherency issue. > > > > > > > > > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), > > > > > which had no effect. I then tried adding: > > > > > > > > > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > > > > + dma_wmb(); > > > > > > > > > > and so far I haven't had any further ADMA errors. Adding Will Deacon > > > > > to the thread. > > > > > > > > These are the changes to sdhci that I'm currently running. I think > > > > some of the debugging related changes are probably worth adding to > > > > the driver, particularly printing the intmask on ADMA error (which > > > > is not printed by the register dump, as the value is lost) and printing > > > > the DMA addresses of the descriptor table entries which can be tied > > > > up with the DMA address error register. Also, maybe printing the > > > > DMA descriptor table with the register dump, rather than having to > > > > resort to enabling debug would be a good idea? > > > > > > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c > > > > index a5dc5aae973e..884dcaa9cad5 100644 > > > > --- a/drivers/mmc/host/sdhci.c > > > > +++ b/drivers/mmc/host/sdhci.c > > > > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, > > > > /* Add a terminating entry - nop, end, valid */ > > > > __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); > > > > } > > > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > > > + dma_wmb(); > > > > } > > > > static void sdhci_adma_table_post(struct sdhci_host *host, > > > > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) > > > > static void sdhci_adma_show_error(struct sdhci_host *host) > > > > { > > > > void *desc = host->adma_table; > > > > + dma_addr_t dma = host->adma_addr; > > > > + bool end = false; > > > > sdhci_dumpregs(host); > > > > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) > > > > struct sdhci_adma2_64_desc *dma_desc = desc; > > > > if (host->flags & SDHCI_USE_64_BIT_DMA) > > > > - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > - desc, le32_to_cpu(dma_desc->addr_hi), > > > > + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > + (unsigned long long)dma, > > > > + le32_to_cpu(dma_desc->addr_hi), > > > > le32_to_cpu(dma_desc->addr_lo), > > > > le16_to_cpu(dma_desc->len), > > > > le16_to_cpu(dma_desc->cmd)); > > > > else > > > > - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > - desc, le32_to_cpu(dma_desc->addr_lo), > > > > + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > + (unsigned long long)dma, > > > > + le32_to_cpu(dma_desc->addr_lo), > > > > le16_to_cpu(dma_desc->len), > > > > le16_to_cpu(dma_desc->cmd)); > > > > + if (end) break; > > > > + > > > > desc += host->desc_sz; > > > > + dma += host->desc_sz; > > > > if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) > > > > - break; > > > > + end = true; > > > > } > > > > } > > > > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) > > > > != MMC_BUS_TEST_R) > > > > host->data->error = -EILSEQ; > > > > else if (intmask & SDHCI_INT_ADMA_ERROR) { > > > > - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); > > > > + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); > > > > sdhci_adma_show_error(host); > > > > host->data->error = -EIO; > > > > if (host->ops->adma_workaround) > > > > > > Further debug shows: > > > > > > coherent=0 - sdhci device is not cache coherent > > > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 > > > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, > > > pmd=000000237fffb003, pte=00e800236d62270f > > > > > > The mapping for the ADMA table seems to be using MAIR index 3, which is > > > MT_MEMORY_NC, so should be non-cacheable. > > > > > > vmallocinfo: > > > 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 > > > user > > > > > > So this memory has been remapped. Could there be an alias that has > > > cache lines still in the cache for the physical address, and could we > > > be hitting those cache lines while accessing through a non-cacheable > > > mapping? (On 32-bit ARM, this is "unpredictable" and this problem > > > definitely _feels_ like it has unpredictable attributes!) > > > > > > Also, given that this memory is mapped NC, then surely > > > __dma_flush_area() should have no effect? However, it _does_ have the > > > effect of reliably solving the problem, which to me implies that there > > > _are_ cache lines in this NC mapping. > > > > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback > > alloc_pages for single pages") which has been implicated in the same > > problem here: > > > > https://www.spinics.net/lists/arm-kernel/msg750623.html > > > > Although reverting the commit is not clean, this also fixes the issue > > for me. > > Note that that one turned out to be something totally different, namely that > the single-page allocations, in difference to CMA, came from physical > addresses that the controller needed additional configuration to be able to > access[1] - no amount of cache maintenance would affect that. To be honest, the conclusion in that other thread wasn't exactly satisfying. The reporter says "Probably, my device is not 64-bit capable." and the fix changes the buffer allocation enough that I wouldn't rule out the same coherency issue as being the root cause. I don't think we ever tried adding cache flushing to see if it helped. Will _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:50 ` Will Deacon @ 2019-09-17 13:55 ` Robin Murphy 2019-09-17 14:12 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 31+ messages in thread From: Robin Murphy @ 2019-09-17 13:55 UTC (permalink / raw) To: Will Deacon Cc: dann frazier, Will Deacon, Russell King - ARM Linux admin, Adrian Hunter, Nicolin Chen, linux-mmc, Christoph Hellwig, Linux ARM On 17/09/2019 14:50, Will Deacon wrote: > On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote: >> On 17/09/2019 14:07, Russell King - ARM Linux admin wrote: >>> On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote: >>>> On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote: >>>>> On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: >>>>>> On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: >>>>>>> On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: >>>>>>>> On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: >>>>>>>>> On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: >>>>>>>>> >>>>>>>>>> The platform has an iommu, which is in pass-through mode, via >>>>>>>>>> arm_smmu.disable_bypass=0. >>>>>>>>> >>>>>>>>> Could be 954a03be033c7cef80ddc232e7cbdb17df735663 >>>>>>>>> "iommu/arm-smmu: Break insecure users by disabling bypass by default" >>>>>>>>> >>>>>>>>> Although it had already landed in v5.2 >>>>>>>> >>>>>>>> It is not - and the two lines that you quoted above are sufficient >>>>>>>> to negate that as a cause. (Please read the help for the option that >>>>>>>> the commit referrs to.) >>>>>>>> >>>>>>>> In fact, with bypass disabled, the SoC fails due to other masters. >>>>>>>> That's already been discussed privately between myself and Will >>>>>>>> Deacon. >>>>>>>> >>>>>>>> arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of >>>>>>>> the default setting in the Kconfig. >>>>>>> >>>>>>> Adding some further debugging, and fixing the existing ADMA debugging >>>>>>> shows: >>>>>>> >>>>>>> mmc0: ADMA error: 0x02000000 >>>>>>> >>>>>>> So this is an ADMA error without the transfer having completed. >>>>>>> >>>>>>> mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 >>>>>>> >>>>>>> The block size is 8, with one block. >>>>>>> >>>>>>> mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c >>>>>>> >>>>>>> The ADMA error is a descriptor error at address 0x000000236df1d20c. >>>>>>> The descriptor table contains (including the following entry): >>>>>>> >>>>>>> mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 >>>>>>> mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 >>>>>>> >>>>>>> The descriptor table contains one descriptor of 8 bytes, is marked >>>>>>> as the last (END bit set) and is at DMA address 0x236df1d200. The >>>>>>> following descriptor is empty, with VALID=0. >>>>>>> >>>>>>> One may be tempted to blame it on the following descriptor, but having >>>>>>> had another example on eMMC while userspace was booting (rootfs on >>>>>>> eMMC): >>>>>>> >>>>>>> mmc1: ADMA error: 0x02000000 >>>>>>> mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 >>>>>>> mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c >>>>>>> mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 >>>>>>> mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 >>>>>>> mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 >>>>>>> >>>>>>> ... which is interesting for several reasons: >>>>>>> - The ADMA error register indicates a length mismatch error. The >>>>>>> transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. >>>>>>> Summing the ADMA lengths up to the last descriptor (length=0 is >>>>>>> 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more >>>>>>> bytes than the requested transfer. >>>>>>> >>>>>>> - The ADMA error register indicates ST_CADR, which is described as >>>>>>> "This state is never set because do not generate ADMA error in this >>>>>>> state." >>>>>>> >>>>>>> - The error descriptor is again after the descriptor with END=1, but >>>>>>> this time has VALID=1. >>>>>>> >>>>>>> This _feels_ like a coherency issue, where the SDHCI engine is not >>>>>>> correctly seeing the descriptor table, but then I would have expected >>>>>>> userspace (which is basically debian stable) to fail to boot every >>>>>>> time given that its rootfs is on eMMC. >>>>>>> >>>>>>> The other weird thing is if I wind the core MMC code back via: >>>>>>> >>>>>>> $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R >>>>>>> >>>>>>> and fix the lack of dma_max_pfn(), then SDHCI is more stable - not >>>>>>> completely stable, but way better than plain v5.3. I don't see >>>>>>> much in that diff which would be responsible for this - although it >>>>>>> does seem that hch's DMA changes do make the problem more likely. >>>>>>> (going from 1 in 3 boots with a problem to being not able to boot.) >>>>>>> >>>>>>> Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled >>>>>>> bypass mode on the IOMMU (but then I saw global smmu errors right >>>>>>> from when the IOMMU had bypass disabled before MMC was probed - the >>>>>>> reason being is the SoC is not currently setup to have the MMU >>>>>>> bypass mode disabled.) >>>>>> >>>>>> This looks like an ARM64 coherency issue. >>>>>> >>>>>> I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), >>>>>> which had no effect. I then tried adding: >>>>>> >>>>>> + __dma_flush_area(host->adma_table, desc - host->adma_table); >>>>>> + dma_wmb(); >>>>>> >>>>>> and so far I haven't had any further ADMA errors. Adding Will Deacon >>>>>> to the thread. >>>>> >>>>> These are the changes to sdhci that I'm currently running. I think >>>>> some of the debugging related changes are probably worth adding to >>>>> the driver, particularly printing the intmask on ADMA error (which >>>>> is not printed by the register dump, as the value is lost) and printing >>>>> the DMA addresses of the descriptor table entries which can be tied >>>>> up with the DMA address error register. Also, maybe printing the >>>>> DMA descriptor table with the register dump, rather than having to >>>>> resort to enabling debug would be a good idea? >>>>> >>>>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c >>>>> index a5dc5aae973e..884dcaa9cad5 100644 >>>>> --- a/drivers/mmc/host/sdhci.c >>>>> +++ b/drivers/mmc/host/sdhci.c >>>>> @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, >>>>> /* Add a terminating entry - nop, end, valid */ >>>>> __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); >>>>> } >>>>> + __dma_flush_area(host->adma_table, desc - host->adma_table); >>>>> + dma_wmb(); >>>>> } >>>>> static void sdhci_adma_table_post(struct sdhci_host *host, >>>>> @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) >>>>> static void sdhci_adma_show_error(struct sdhci_host *host) >>>>> { >>>>> void *desc = host->adma_table; >>>>> + dma_addr_t dma = host->adma_addr; >>>>> + bool end = false; >>>>> sdhci_dumpregs(host); >>>>> @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) >>>>> struct sdhci_adma2_64_desc *dma_desc = desc; >>>>> if (host->flags & SDHCI_USE_64_BIT_DMA) >>>>> - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", >>>>> - desc, le32_to_cpu(dma_desc->addr_hi), >>>>> + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", >>>>> + (unsigned long long)dma, >>>>> + le32_to_cpu(dma_desc->addr_hi), >>>>> le32_to_cpu(dma_desc->addr_lo), >>>>> le16_to_cpu(dma_desc->len), >>>>> le16_to_cpu(dma_desc->cmd)); >>>>> else >>>>> - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", >>>>> - desc, le32_to_cpu(dma_desc->addr_lo), >>>>> + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", >>>>> + (unsigned long long)dma, >>>>> + le32_to_cpu(dma_desc->addr_lo), >>>>> le16_to_cpu(dma_desc->len), >>>>> le16_to_cpu(dma_desc->cmd)); >>>>> + if (end) break; >>>>> + >>>>> desc += host->desc_sz; >>>>> + dma += host->desc_sz; >>>>> if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) >>>>> - break; >>>>> + end = true; >>>>> } >>>>> } >>>>> @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) >>>>> != MMC_BUS_TEST_R) >>>>> host->data->error = -EILSEQ; >>>>> else if (intmask & SDHCI_INT_ADMA_ERROR) { >>>>> - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); >>>>> + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); >>>>> sdhci_adma_show_error(host); >>>>> host->data->error = -EIO; >>>>> if (host->ops->adma_workaround) >>>> >>>> Further debug shows: >>>> >>>> coherent=0 - sdhci device is not cache coherent >>>> swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 >>>> [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, >>>> pmd=000000237fffb003, pte=00e800236d62270f >>>> >>>> The mapping for the ADMA table seems to be using MAIR index 3, which is >>>> MT_MEMORY_NC, so should be non-cacheable. >>>> >>>> vmallocinfo: >>>> 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 >>>> user >>>> >>>> So this memory has been remapped. Could there be an alias that has >>>> cache lines still in the cache for the physical address, and could we >>>> be hitting those cache lines while accessing through a non-cacheable >>>> mapping? (On 32-bit ARM, this is "unpredictable" and this problem >>>> definitely _feels_ like it has unpredictable attributes!) >>>> >>>> Also, given that this memory is mapped NC, then surely >>>> __dma_flush_area() should have no effect? However, it _does_ have the >>>> effect of reliably solving the problem, which to me implies that there >>>> _are_ cache lines in this NC mapping. >>> >>> Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback >>> alloc_pages for single pages") which has been implicated in the same >>> problem here: >>> >>> https://www.spinics.net/lists/arm-kernel/msg750623.html >>> >>> Although reverting the commit is not clean, this also fixes the issue >>> for me. >> >> Note that that one turned out to be something totally different, namely that >> the single-page allocations, in difference to CMA, came from physical >> addresses that the controller needed additional configuration to be able to >> access[1] - no amount of cache maintenance would affect that. > > To be honest, the conclusion in that other thread wasn't exactly satisfying. > The reporter says "Probably, my device is not 64-bit capable." and the fix > changes the buffer allocation enough that I wouldn't rule out the same > coherency issue as being the root cause. I don't think we ever tried adding > cache flushing to see if it helped. Huh? The conclusion of that thread seemed pretty clear to me: https://lore.kernel.org/linux-arm-kernel/CAK7LNASs2qkpGY_BkL--hvmKm3FJ9sEK4+v5VVYc1_CrowAB4w@mail.gmail.com/ which is where I linked that patch from :/ Robin. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [REGRESSION] sdhci no longer detects SD cards on LX2160A 2019-09-17 13:55 ` Robin Murphy @ 2019-09-17 14:12 ` Russell King - ARM Linux admin 0 siblings, 0 replies; 31+ messages in thread From: Russell King - ARM Linux admin @ 2019-09-17 14:12 UTC (permalink / raw) To: Robin Murphy Cc: dann frazier, Will Deacon, Adrian Hunter, Nicolin Chen, linux-mmc, Will Deacon, Christoph Hellwig, Linux ARM On Tue, Sep 17, 2019 at 02:55:51PM +0100, Robin Murphy wrote: > On 17/09/2019 14:50, Will Deacon wrote: > > On Tue, Sep 17, 2019 at 02:38:00PM +0100, Robin Murphy wrote: > > > On 17/09/2019 14:07, Russell King - ARM Linux admin wrote: > > > > On Tue, Sep 17, 2019 at 01:33:26PM +0100, Russell King - ARM Linux admin wrote: > > > > > On Tue, Sep 17, 2019 at 12:42:10PM +0100, Russell King - ARM Linux admin wrote: > > > > > > On Tue, Sep 17, 2019 at 12:16:31PM +0100, Russell King - ARM Linux admin wrote: > > > > > > > On Tue, Sep 17, 2019 at 11:42:00AM +0100, Russell King - ARM Linux admin wrote: > > > > > > > > On Tue, Sep 17, 2019 at 09:19:31AM +0100, Russell King - ARM Linux admin wrote: > > > > > > > > > On Tue, Sep 17, 2019 at 10:06:12AM +0200, Marc Gonzalez wrote: > > > > > > > > > > On 16/09/2019 19:15, Russell King - ARM Linux admin wrote: > > > > > > > > > > > > > > > > > > > > > The platform has an iommu, which is in pass-through mode, via > > > > > > > > > > > arm_smmu.disable_bypass=0. > > > > > > > > > > > > > > > > > > > > Could be 954a03be033c7cef80ddc232e7cbdb17df735663 > > > > > > > > > > "iommu/arm-smmu: Break insecure users by disabling bypass by default" > > > > > > > > > > > > > > > > > > > > Although it had already landed in v5.2 > > > > > > > > > > > > > > > > > > It is not - and the two lines that you quoted above are sufficient > > > > > > > > > to negate that as a cause. (Please read the help for the option that > > > > > > > > > the commit referrs to.) > > > > > > > > > > > > > > > > > > In fact, with bypass disabled, the SoC fails due to other masters. > > > > > > > > > That's already been discussed privately between myself and Will > > > > > > > > > Deacon. > > > > > > > > > > > > > > > > > > arm_smmu.disable_bypass=0 re-enables bypass mode irrespective of > > > > > > > > > the default setting in the Kconfig. > > > > > > > > > > > > > > > > Adding some further debugging, and fixing the existing ADMA debugging > > > > > > > > shows: > > > > > > > > > > > > > > > > mmc0: ADMA error: 0x02000000 > > > > > > > > > > > > > > > > So this is an ADMA error without the transfer having completed. > > > > > > > > > > > > > > > > mmc0: sdhci: Blk size: 0x00000008 | Blk cnt: 0x00000001 > > > > > > > > > > > > > > > > The block size is 8, with one block. > > > > > > > > > > > > > > > > mmc0: sdhci: ADMA Err: 0x00000009 | ADMA Ptr: 0x000000236df1d20c > > > > > > > > > > > > > > > > The ADMA error is a descriptor error at address 0x000000236df1d20c. > > > > > > > > The descriptor table contains (including the following entry): > > > > > > > > > > > > > > > > mmc0: sdhci: 236df1d200: DMA 0x000000236d40e980, LEN 0x0008, Attr=0x23 > > > > > > > > mmc0: sdhci: 236df1d20c: DMA 0x0000000000000000, LEN 0x0000, Attr=0x00 > > > > > > > > > > > > > > > > The descriptor table contains one descriptor of 8 bytes, is marked > > > > > > > > as the last (END bit set) and is at DMA address 0x236df1d200. The > > > > > > > > following descriptor is empty, with VALID=0. > > > > > > > > > > > > > > > > One may be tempted to blame it on the following descriptor, but having > > > > > > > > had another example on eMMC while userspace was booting (rootfs on > > > > > > > > eMMC): > > > > > > > > > > > > > > > > mmc1: ADMA error: 0x02000000 > > > > > > > > mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000099 > > > > > > > > mmc1: sdhci: ADMA Err: 0x00000006 | ADMA Ptr: 0x000000236dbfa26c > > > > > > > > mmc1: sdhci: 236dbfa200: DMA 0x000000236c25c000, LEN 0x2000, Attr=0x21 > > > > > > > > mmc1: sdhci: 236dbfa20c: DMA 0x000000236938c000, LEN 0x0000, Attr=0x21 > > > > > > > > mmc1: sdhci: 236dbfa218: DMA 0x000000236939c000, LEN 0x5000, Attr=0x21 > > > > > > > > mmc1: sdhci: 236dbfa224: DMA 0x0000002368545000, LEN 0x1000, Attr=0x21 > > > > > > > > mmc1: sdhci: 236dbfa230: DMA 0x00000023684f1000, LEN 0x1000, Attr=0x21 > > > > > > > > mmc1: sdhci: 236dbfa23c: DMA 0x0000002368504000, LEN 0x2000, Attr=0x21 > > > > > > > > mmc1: sdhci: 236dbfa248: DMA 0x0000002368546000, LEN 0x2000, Attr=0x21 > > > > > > > > mmc1: sdhci: 236dbfa254: DMA 0x00000023684f2000, LEN 0x2000, Attr=0x21 > > > > > > > > mmc1: sdhci: 236dbfa260: DMA 0x0000002368500000, LEN 0x1000, Attr=0x23 > > > > > > > > mmc1: sdhci: 236dbfa26c: DMA 0x000000236b55d000, LEN 0x1000, Attr=0x21 > > > > > > > > > > > > > > > > ... which is interesting for several reasons: > > > > > > > > - The ADMA error register indicates a length mismatch error. The > > > > > > > > transfer was for 0x99 blocks of 0x200, which is 0x13200 bytes. > > > > > > > > Summing the ADMA lengths up to the last descriptor (length=0 is > > > > > > > > 0x10000 bytes) gives 0x20000 bytes. So the DMA table contains more > > > > > > > > bytes than the requested transfer. > > > > > > > > > > > > > > > > - The ADMA error register indicates ST_CADR, which is described as > > > > > > > > "This state is never set because do not generate ADMA error in this > > > > > > > > state." > > > > > > > > > > > > > > > > - The error descriptor is again after the descriptor with END=1, but > > > > > > > > this time has VALID=1. > > > > > > > > > > > > > > > > This _feels_ like a coherency issue, where the SDHCI engine is not > > > > > > > > correctly seeing the descriptor table, but then I would have expected > > > > > > > > userspace (which is basically debian stable) to fail to boot every > > > > > > > > time given that its rootfs is on eMMC. > > > > > > > > > > > > > > > > The other weird thing is if I wind the core MMC code back via: > > > > > > > > > > > > > > > > $ git diff -u 7559d612dff0..v5.3 drivers/mmc/core | patch -p1 -R > > > > > > > > > > > > > > > > and fix the lack of dma_max_pfn(), then SDHCI is more stable - not > > > > > > > > completely stable, but way better than plain v5.3. I don't see > > > > > > > > much in that diff which would be responsible for this - although it > > > > > > > > does seem that hch's DMA changes do make the problem more likely. > > > > > > > > (going from 1 in 3 boots with a problem to being not able to boot.) > > > > > > > > > > > > > > > > Note, with v5.2, I _never_ saw any ADMA errors, except if I disabled > > > > > > > > bypass mode on the IOMMU (but then I saw global smmu errors right > > > > > > > > from when the IOMMU had bypass disabled before MMC was probed - the > > > > > > > > reason being is the SoC is not currently setup to have the MMU > > > > > > > > bypass mode disabled.) > > > > > > > > > > > > > > This looks like an ARM64 coherency issue. > > > > > > > > > > > > > > I first tried adding a dma_wmb() to the end of sdhci_adma_table_pre(), > > > > > > > which had no effect. I then tried adding: > > > > > > > > > > > > > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > > > > > > + dma_wmb(); > > > > > > > > > > > > > > and so far I haven't had any further ADMA errors. Adding Will Deacon > > > > > > > to the thread. > > > > > > > > > > > > These are the changes to sdhci that I'm currently running. I think > > > > > > some of the debugging related changes are probably worth adding to > > > > > > the driver, particularly printing the intmask on ADMA error (which > > > > > > is not printed by the register dump, as the value is lost) and printing > > > > > > the DMA addresses of the descriptor table entries which can be tied > > > > > > up with the DMA address error register. Also, maybe printing the > > > > > > DMA descriptor table with the register dump, rather than having to > > > > > > resort to enabling debug would be a good idea? > > > > > > > > > > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c > > > > > > index a5dc5aae973e..884dcaa9cad5 100644 > > > > > > --- a/drivers/mmc/host/sdhci.c > > > > > > +++ b/drivers/mmc/host/sdhci.c > > > > > > @@ -773,6 +773,8 @@ static void sdhci_adma_table_pre(struct sdhci_host *host, > > > > > > /* Add a terminating entry - nop, end, valid */ > > > > > > __sdhci_adma_write_desc(host, &desc, 0, 0, ADMA2_NOP_END_VALID); > > > > > > } > > > > > > + __dma_flush_area(host->adma_table, desc - host->adma_table); > > > > > > + dma_wmb(); > > > > > > } > > > > > > static void sdhci_adma_table_post(struct sdhci_host *host, > > > > > > @@ -2855,6 +2857,8 @@ static void sdhci_cmd_irq(struct sdhci_host *host, u32 intmask, u32 *intmask_p) > > > > > > static void sdhci_adma_show_error(struct sdhci_host *host) > > > > > > { > > > > > > void *desc = host->adma_table; > > > > > > + dma_addr_t dma = host->adma_addr; > > > > > > + bool end = false; > > > > > > sdhci_dumpregs(host); > > > > > > @@ -2862,21 +2866,26 @@ static void sdhci_adma_show_error(struct sdhci_host *host) > > > > > > struct sdhci_adma2_64_desc *dma_desc = desc; > > > > > > if (host->flags & SDHCI_USE_64_BIT_DMA) > > > > > > - DBG("%p: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > > > - desc, le32_to_cpu(dma_desc->addr_hi), > > > > > > + SDHCI_DUMP("%08llx: DMA 0x%08x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > > > + (unsigned long long)dma, > > > > > > + le32_to_cpu(dma_desc->addr_hi), > > > > > > le32_to_cpu(dma_desc->addr_lo), > > > > > > le16_to_cpu(dma_desc->len), > > > > > > le16_to_cpu(dma_desc->cmd)); > > > > > > else > > > > > > - DBG("%p: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > > > - desc, le32_to_cpu(dma_desc->addr_lo), > > > > > > + SDHCI_DUMP("%08llx: DMA 0x%08x, LEN 0x%04x, Attr=0x%02x\n", > > > > > > + (unsigned long long)dma, > > > > > > + le32_to_cpu(dma_desc->addr_lo), > > > > > > le16_to_cpu(dma_desc->len), > > > > > > le16_to_cpu(dma_desc->cmd)); > > > > > > + if (end) break; > > > > > > + > > > > > > desc += host->desc_sz; > > > > > > + dma += host->desc_sz; > > > > > > if (dma_desc->cmd & cpu_to_le16(ADMA2_END)) > > > > > > - break; > > > > > > + end = true; > > > > > > } > > > > > > } > > > > > > @@ -2949,7 +2958,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) > > > > > > != MMC_BUS_TEST_R) > > > > > > host->data->error = -EILSEQ; > > > > > > else if (intmask & SDHCI_INT_ADMA_ERROR) { > > > > > > - pr_err("%s: ADMA error\n", mmc_hostname(host->mmc)); > > > > > > + pr_err("%s: ADMA error: 0x%08x\n", mmc_hostname(host->mmc), intmask); > > > > > > sdhci_adma_show_error(host); > > > > > > host->data->error = -EIO; > > > > > > if (host->ops->adma_workaround) > > > > > > > > > > Further debug shows: > > > > > > > > > > coherent=0 - sdhci device is not cache coherent > > > > > swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081cac000 > > > > > [ffffff8010fd5200] pgd=000000237ffff003, pud=000000237ffff003, > > > > > pmd=000000237fffb003, pte=00e800236d62270f > > > > > > > > > > The mapping for the ADMA table seems to be using MAIR index 3, which is > > > > > MT_MEMORY_NC, so should be non-cacheable. > > > > > > > > > > vmallocinfo: > > > > > 0xffffff8010fd5000-0xffffff8010fd7000 8192 dma_direct_alloc+0x4c/0x54 > > > > > user > > > > > > > > > > So this memory has been remapped. Could there be an alias that has > > > > > cache lines still in the cache for the physical address, and could we > > > > > be hitting those cache lines while accessing through a non-cacheable > > > > > mapping? (On 32-bit ARM, this is "unpredictable" and this problem > > > > > definitely _feels_ like it has unpredictable attributes!) > > > > > > > > > > Also, given that this memory is mapped NC, then surely > > > > > __dma_flush_area() should have no effect? However, it _does_ have the > > > > > effect of reliably solving the problem, which to me implies that there > > > > > _are_ cache lines in this NC mapping. > > > > > > > > Will suggested reverting bd2e75633c80 ("dma-contiguous: use fallback > > > > alloc_pages for single pages") which has been implicated in the same > > > > problem here: > > > > > > > > https://www.spinics.net/lists/arm-kernel/msg750623.html > > > > > > > > Although reverting the commit is not clean, this also fixes the issue > > > > for me. > > > > > > Note that that one turned out to be something totally different, namely that > > > the single-page allocations, in difference to CMA, came from physical > > > addresses that the controller needed additional configuration to be able to > > > access[1] - no amount of cache maintenance would affect that. > > > > To be honest, the conclusion in that other thread wasn't exactly satisfying. > > The reporter says "Probably, my device is not 64-bit capable." and the fix > > changes the buffer allocation enough that I wouldn't rule out the same > > coherency issue as being the root cause. I don't think we ever tried adding > > cache flushing to see if it helped. > > Huh? The conclusion of that thread seemed pretty clear to me: > > https://lore.kernel.org/linux-arm-kernel/CAK7LNASs2qkpGY_BkL--hvmKm3FJ9sEK4+v5VVYc1_CrowAB4w@mail.gmail.com/ > > which is where I linked that patch from :/ ... and it is looking like the conclusion of that was incorrect given that I'm seeing the same issue on hardware that isn't capable of v4 operation but is capable of addressing the full range of system memory. I can confirm that the registers that set the ADMA table base are 64-bit wide in the LX2160A reference manual. I can also confirm that the eSDHC driver is using 96-bit descriptor format, which is what is required to support 64-bit DMA addresses, and the eSDHC hardware supports this format. So, this does _not_ appear to be some addressing limitation of the device. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2019-09-20 9:58 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-09-16 17:15 [REGRESSION] sdhci no longer detects SD cards on LX2160A Russell King - ARM Linux admin
2019-09-16 22:57 ` Russell King - ARM Linux admin
2019-09-17 8:06 ` Marc Gonzalez
2019-09-17 8:19 ` Russell King - ARM Linux admin
2019-09-17 10:42 ` Russell King - ARM Linux admin
2019-09-17 11:16 ` Russell King - ARM Linux admin
2019-09-17 11:42 ` Russell King - ARM Linux admin
2019-09-17 12:33 ` Russell King - ARM Linux admin
2019-09-17 13:03 ` Robin Murphy
2019-09-17 13:28 ` Russell King - ARM Linux admin
2019-09-17 13:07 ` Russell King - ARM Linux admin
2019-09-17 13:24 ` Fabio Estevam
2019-09-17 13:33 ` Russell King - ARM Linux admin
2019-09-17 13:43 ` Fabio Estevam
2019-09-17 13:51 ` Russell King - ARM Linux admin
2019-09-17 13:56 ` Fabio Estevam
[not found] ` <CADRPPNQ-WTY0QC7_bX=N0QeueKve=k0SaMvbjOrByyvzFojz2g@mail.gmail.com>
2019-09-19 4:13 ` Y.b. Lu
2019-09-19 7:04 ` Russell King - ARM Linux admin
2019-09-19 8:15 ` Y.b. Lu
2019-09-19 8:38 ` Russell King - ARM Linux admin
2019-09-19 9:22 ` Russell King - ARM Linux admin
2019-09-17 13:38 ` Robin Murphy
2019-09-17 13:49 ` Russell King - ARM Linux admin
2019-09-17 14:03 ` Robin Murphy
2019-09-19 9:16 ` Russell King - ARM Linux admin
2019-09-19 14:02 ` Robin Murphy
2019-09-19 17:23 ` Russell King - ARM Linux admin
2019-09-20 9:55 ` Russell King - ARM Linux admin
2019-09-17 13:50 ` Will Deacon
2019-09-17 13:55 ` Robin Murphy
2019-09-17 14:12 ` Russell King - ARM Linux admin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).