Linux MultiMedia Card development
 help / color / mirror / Atom feed
* eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC?
@ 2025-04-16  0:13 Prasad Koya
  2025-04-16  9:13 ` Christian Loehle
  0 siblings, 1 reply; 8+ messages in thread
From: Prasad Koya @ 2025-04-16  0:13 UTC (permalink / raw)
  To: linux-mmc; +Cc: Baptiste Covolato, Sushrut Shirole, ulf.hansson

Hi

We use eMMC as a boot drive. At a random time, maybe a month or an
year of uptime, we run into an issue where we see the eMMC driver
prints this in the kernel logs and the drive no longer responds after
that. We see this issue in the field on linux kernel 4.19.142 and
5.10.165. From the SDHCI register dump, it looks like CMD25 ran into a
timeout first. After the first timeout, we see CMD12 in the second
SDHCI register dump. Status returned by the card at that point in
card_busy_detect() is 0xE00.

I looked at the recent commits up to 6.15 and I do not see anything
obvious that addresses issue like below.

We are trying to reproduce the issue in our lab and understand the
driver code. Is there a way I can send a block of data as part of
CMD25 with invalid CRC?  Appreciate any pointers.

Below are the kernel logs and some more information about our drive
from debugfs.

[43976878.027119] mmc0: Timeout waiting for hardware interrupt.
[43976878.094848] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[43976878.175087] mmc0: sdhci: Sys addr:  0xd3c00600 | Version:  0x00001002
[43976878.255333] mmc0: sdhci: Blk size:  0x00007200 | Blk cnt:  0x0000000d
[43976878.335565] mmc0: sdhci: Argument:  0x00460810 | Trn mode: 0x00000023
[43976878.415798] mmc0: sdhci: Present:   0x01ef0106 | Host ctl: 0x00000007
[43976878.496029] mmc0: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
[43976878.576261] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000207
[43976878.656492] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
[43976878.736724] mmc0: sdhci: Int enab:  0x02ff008b | Sig enab: 0x02ff008b
[43976878.816958] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x000000ff
[43976878.897189] mmc0: sdhci: Caps:      0x75fec8b2 | Caps_1:   0x00002501
[43976878.977421] mmc0: sdhci: Cmd:       0x0000193a | Max curr: 0x00c80064
[43976879.057651] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0xffffffff
[43976879.137882] mmc0: sdhci: Resp[2]:   0x320f5903 | Resp[3]:  0x00d04f01
[43976879.218116] mmc0: sdhci: Host ctl2: 0x00000000
[43976879.274385] mmc0: sdhci: ============================================
[43976888.267487] mmc0: Timeout waiting for hardware interrupt.
[43976888.335221] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[43976888.415455] mmc0: sdhci: Sys addr:  0xd3c00600 | Version:  0x00001002
[43976888.495687] mmc0: sdhci: Blk size:  0x00007200 | Blk cnt:  0x0000000d
[43976888.575920] mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x00000000
[43976888.656150] mmc0: sdhci: Present:   0x01ef0006 | Host ctl: 0x00000007
[43976888.736377] mmc0: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
[43976888.816606] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000207
[43976888.896829] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
[43976888.977051] mmc0: sdhci: Int enab:  0x02ff008b | Sig enab: 0x02ff008b
[43976889.057271] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x000000ff
[43976889.137494] mmc0: sdhci: Caps:      0x75fec8b2 | Caps_1:   0x00002501
[43976889.217718] mmc0: sdhci: Cmd:       0x00000c1b | Max curr: 0x00c80064
[43976889.297948] mmc0: sdhci: Resp[0]:   0x00000c00 | Resp[1]:  0xffffffff
[43976889.378178] mmc0: sdhci: Resp[2]:   0x320f5903 | Resp[3]:  0x00d04f01
[43976889.458408] mmc0: sdhci: Host ctl2: 0x00000000
[43976889.514671] mmc0: sdhci: ============================================
[43976891.203641] sdhci-pci 0000:00:14.7: Card stuck in wrong state!
card_busy_detect status: 0xe00
[43976891.309110] mmc0: cache flush error -110
[43976892.599663] mmc0: tried to HW reset card, got error -110
[43976892.599668] mmcblk0: recovery failed!
[43976892.656500] print_req_error: I/O error, dev mmcblk0, sector 4589584
[43976892.663560] sdhci-pci 0000:00:14.7: error -110 requesting status
[43976892.734746] EXT4-fs warning (device mmcblk0p1):
ext4_end_bio:317: I/O error 10 writing to inode 129797 (offset 0 size
8192 starting block 573700)
[43976892.809713] Buffer I/O error on device mmcblk0p1, logical block 573442
[43976892.809741] mmcblk0: recovery failed!


/sys/kernel/debug/mmc0/caps
0x4007040f
/sys/kernel/debug/mmc0/caps2
0x00024000
/sys/kernel/debug/mmc0/clock
52000000
/sys/kernel/debug/mmc0/ios
clock:          52000000 Hz
actual clock:   50000000 Hz
vdd:            21 (3.3 ~ 3.4 V)
bus mode:       2 (push-pull)
chip select:    0 (don't care)
power mode:     2 (on)
bus width:      2 (4 bits)
timing spec:    1 (mmc high-speed)
signal voltage: 0 (3.30 V)
driver type:    0 (driver type B)

-bash-4.2# cat /sys/kernel/debug/mmc0/mmc0\:0001/ext_csd
00000000000000000000000000000000010100c07400000000000000000000000001010000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000002000000000010000000000000000000000000000000000000000000000000000d70100070100000000050020000000000000010000000000000001000100000000000007000200571f04030000000000080808080808000080eb000f131408080101111007200007010155110000000000000001640800000000ff19000400000002000000000000000000010801010b020000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001ffff00000000010307050003013c3c01010100000000000000
-bash-4.2#

-bash-4.2# ./mmcutils extcsd read /dev/mmcblk0
=============================================
  Extended CSD rev 1.7 (MMC 5.0)
=============================================

Card Supported Command sets [S_CMD_SET: 0x01]
HPI Features [HPI_FEATURE: 0x01]: implementation based on CMD13
Background operations support [BKOPS_SUPPORT: 0x01]
Max Packet Read Cmd [MAX_PACKED_READS: 0x3c]
Max Packet Write Cmd [MAX_PACKED_WRITES: 0x3c]
Data TAG support [DATA_TAG_SUPPORT: 0x01]
Data TAG Unit Size [TAG_UNIT_SIZE: 0x03]
Tag Resources Size [TAG_RES_SIZE: 0x00]
Context Management Capabilities [CONTEXT_CAPABILITIES: 0x05]
Large Unit Size [LARGE_UNIT_SIZE_M1: 0x07]
Extended partition attribute support [EXT_SUPPORT: 0x03]
Generic CMD6 Timer [GENERIC_CMD6_TIME: 0x19]
Power off notification [POWER_OFF_LONG_TIME: 0xff]
Cache Size [CACHE_SIZE] is 128 KiB
Background operations status [BKOPS_STATUS: 0x00]
1st Initialisation Time after programmed sector [INI_TIMEOUT_AP: 0x64]
Power class for 52MHz, DDR at 3.6V [PWR_CL_DDR_52_360: 0x00]
Power class for 52MHz, DDR at 1.95V [PWR_CL_DDR_52_195: 0x00]
Power class for 200MHz at 3.6V [PWR_CL_200_360: 0x00]
Power class for 200MHz, at 1.95V [PWR_CL_200_195: 0x00]
Minimum Performance for 8bit at 52MHz in DDR mode:
 [MIN_PERF_DDR_W_8_52: 0x00]
 [MIN_PERF_DDR_R_8_52: 0x00]
TRIM Multiplier [TRIM_MULT: 0x11]
Secure Feature support [SEC_FEATURE_SUPPORT: 0x55]
Boot Information [BOOT_INFO: 0x07]
 Device supports alternative boot method
 Device supports dual data rate during boot
 Device supports high speed timing during boot
Boot partition size [BOOT_SIZE_MULTI: 0x20]
Access size [ACC_SIZE: 0x07]
High-capacity erase unit size [HC_ERASE_GRP_SIZE: 0x10]
 i.e. 8192 KiB
High-capacity erase timeout [ERASE_TIMEOUT_MULT: 0x11]
Reliable write sector count [REL_WR_SEC_C: 0x01]
High-capacity W protect group size [HC_WP_GRP_SIZE: 0x01]
 i.e. 8192 KiB
Sleep current (VCC) [S_C_VCC: 0x08]
Sleep current (VCCQ) [S_C_VCCQ: 0x08]
Sleep/awake timeout [S_A_TIMEOUT: 0x13]
Sector Count [SEC_COUNT: 0x00eb8000]
 Device is block-addressed
Minimum Write Performance for 8bit:
 [MIN_PERF_W_8_52: 0x08]
 [MIN_PERF_R_8_52: 0x08]
 [MIN_PERF_W_8_26_4_52: 0x08]
 [MIN_PERF_R_8_26_4_52: 0x08]
Minimum Write Performance for 4bit:
 [MIN_PERF_W_4_26: 0x08]
 [MIN_PERF_R_4_26: 0x08]
Power classes registers:
 [PWR_CL_26_360: 0x00]
 [PWR_CL_52_360: 0x00]
 [PWR_CL_26_195: 0x00]
 [PWR_CL_52_195: 0x00]
Partition switching timing [PARTITION_SWITCH_TIME: 0x03]
Out-of-interrupt busy timing [OUT_OF_INTERRUPT_TIME: 0x04]
I/O Driver Strength [DRIVER_STRENGTH: 0x1f]
Card Type [CARD_TYPE: 0x57]
 HS400 Dual Data Rate eMMC @200MHz 1.8VI/O
 HS200 Single Data Rate eMMC @200MHz 1.8VI/O
 HS Dual Data Rate eMMC @52MHz 1.8V or 3VI/O
 HS eMMC @52MHz - at rated device voltage(s)
 HS eMMC @26MHz - at rated device voltage(s)
CSD structure version [CSD_STRUCTURE: 0x02]
Command set [CMD_SET: 0x00]
Command set revision [CMD_SET_REV: 0x00]
Power class [POWER_CLASS: 0x00]
High-speed interface timing [HS_TIMING: 0x01]
Erased memory content [ERASED_MEM_CONT: 0x00]
Boot configuration bytes [PARTITION_CONFIG: 0x00]
 Not boot enable
 No access to boot partition
Boot config protection [BOOT_CONFIG_PROT: 0x00]
Boot bus Conditions [BOOT_BUS_CONDITIONS: 0x00]
High-density erase group definition [ERASE_GROUP_DEF: 0x01]
Boot write protection status registers [BOOT_WP_STATUS]: 0x00
Boot Area Write protection [BOOT_WP]: 0x00
 Power ro locking: possible
 Permanent ro locking: possible
 partition 0 ro lock status: not locked
 partition 1 ro lock status: not locked
User area write protection register [USER_WP]: 0x00
FW configuration [FW_CONFIG]: 0x00
RPMB Size [RPMB_SIZE_MULT]: 0x20
Write reliability setting register [WR_REL_SET]: 0x00
 user area: existing data is at risk if a power failure occurs during
a write operation
 partition 1: existing data is at risk if a power failure occurs
during a write operation
 partition 2: existing data is at risk if a power failure occurs
during a write operation
 partition 3: existing data is at risk if a power failure occurs
during a write operation
 partition 4: existing data is at risk if a power failure occurs
during a write operation
Write reliability parameter register [WR_REL_PARAM]: 0x05
 Device supports writing EXT_CSD_WR_REL_SET
 Device supports the enhanced def. of reliable write
Enable background operations handshake [BKOPS_EN]: 0x00
H/W reset function [RST_N_FUNCTION]: 0x00
HPI management [HPI_MGMT]: 0x01
Partitioning Support [PARTITIONING_SUPPORT]: 0x07
 Device support partitioning feature
 Device can have enhanced tech.
Max Enhanced Area Size [MAX_ENH_SIZE_MULT]: 0x0001d7
 i.e. 3858432 KiB
Partitions attribute [PARTITIONS_ATTRIBUTE]: 0x00
Partitioning Setting [PARTITION_SETTING_COMPLETED]: 0x00
 Device partition setting NOT complete
General Purpose Partition Size
 [GP_SIZE_MULT_4]: 0x000000
 [GP_SIZE_MULT_3]: 0x000000
 [GP_SIZE_MULT_2]: 0x000000
 [GP_SIZE_MULT_1]: 0x000000
Enhanced User Data Area Size [ENH_SIZE_MULT]: 0x000000
 i.e. 0 KiB
Enhanced User Data Start Address [ENH_START_ADDR]: 0x00000000
 i.e. 0 bytes offset
Bad Block Management mode [SEC_BAD_BLK_MGMNT]: 0x00
Periodic Wake-up [PERIODIC_WAKEUP]: 0x00
Program CID/CSD in DDR mode support [PROGRAM_CID_CSD_DDR_SUPPORT]: 0x01
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[127]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[126]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[125]]: 0x20
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[124]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[123]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[122]]: 0x20
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[121]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[120]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[119]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[118]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[117]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[116]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[115]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[114]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[113]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[112]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[111]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[112]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[111]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[110]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[109]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[108]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[107]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[106]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[105]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[104]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[103]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[102]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[101]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[100]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[99]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[98]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[97]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[96]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[95]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[94]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[93]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[92]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[91]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[90]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[89]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[88]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[87]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[86]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[85]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[84]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[83]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[82]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[81]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[80]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[79]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[78]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[77]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[76]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[75]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[74]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[73]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[72]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[71]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[70]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[69]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[68]]: 0x01
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[67]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[66]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[65]]: 0x00
Vendor Specific Fields [VENDOR_SPECIFIC_FIELD[64]]: 0x00
Native sector size [NATIVE_SECTOR_SIZE]: 0x00
Sector size emulation [USE_NATIVE_SECTOR]: 0x00
Sector size emulation [USE_NATIVE_SECTOR]: 0x00
Sector size [DATA_SECTOR_SIZE]: 0x00
1st initialization after disabling sector size emulation [INI_TIMEOUT_EMU]: 0x00
Class 6 commands control [CLASS_6_CTRL]: 0x00
Number of addressed group to be Released[DYNCAP_NEEDED]: 0x00
Exception events control [EXCEPTION_EVENTS_CTRL]: 0x0000
Exception events status[EXCEPTION_EVENTS_STATUS]: 0x0000
Extended Partitions Attribute [EXT_PARTITIONS_ATTRIBUTE]: 0x0000
Context configuration [CONTEXT_CONF[51]]: 0x00
Context configuration [CONTEXT_CONF[50]]: 0x00
Context configuration [CONTEXT_CONF[49]]: 0x00
Context configuration [CONTEXT_CONF[48]]: 0x00
Context configuration [CONTEXT_CONF[47]]: 0x00
Context configuration [CONTEXT_CONF[46]]: 0x00
Context configuration [CONTEXT_CONF[45]]: 0x00
Context configuration [CONTEXT_CONF[44]]: 0x00
Context configuration [CONTEXT_CONF[43]]: 0x00
Context configuration [CONTEXT_CONF[42]]: 0x00
Context configuration [CONTEXT_CONF[41]]: 0x00
Context configuration [CONTEXT_CONF[40]]: 0x00
Context configuration [CONTEXT_CONF[39]]: 0x00
Context configuration [CONTEXT_CONF[38]]: 0x00
Context configuration [CONTEXT_CONF[37]]: 0x00
Packed command status [PACKED_COMMAND_STATUS]: 0x00
Packed command failure index [PACKED_FAILURE_INDEX]: 0x00
Power Off Notification [POWER_OFF_NOTIFICATION]: 0x01
Control to turn the Cache ON/OFF [CACHE_CTRL]: 0x01
Control to turn the Cache Barrier ON/OFF [BARRIER_CTRL]: 0x00
eMMC Firmware Version:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x0b
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x02
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
Secure Removal Type [SECURE_REMOVAL_TYPE]: 0x01
 information is configured to be removed by an erase of the physical memory
 Supported Secure Removal Type:
  information removed by an erase of the physical memory
-bash-4.2#

Thank you.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC?
  2025-04-16  0:13 eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC? Prasad Koya
@ 2025-04-16  9:13 ` Christian Loehle
  2025-04-22 17:37   ` Prasad Koya
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Loehle @ 2025-04-16  9:13 UTC (permalink / raw)
  To: Prasad Koya, linux-mmc; +Cc: Baptiste Covolato, Sushrut Shirole, ulf.hansson

On 4/16/25 01:13, Prasad Koya wrote:
> Hi
> 
> We use eMMC as a boot drive. At a random time, maybe a month or an
> year of uptime, we run into an issue where we see the eMMC driver
> prints this in the kernel logs and the drive no longer responds after
> that. We see this issue in the field on linux kernel 4.19.142 and
> 5.10.165. From the SDHCI register dump, it looks like CMD25 ran into a
> timeout first. After the first timeout, we see CMD12 in the second
> SDHCI register dump. Status returned by the card at that point in
> card_busy_detect() is 0xE00.
> 
> I looked at the recent commits up to 6.15 and I do not see anything
> obvious that addresses issue like below.
> 
> We are trying to reproduce the issue in our lab and understand the
> driver code. Is there a way I can send a block of data as part of
> CMD25 with invalid CRC?  Appreciate any pointers.

CRC is appended by hardware, so it will always be correct.
What you can do though is change the blocksize of the transfer on the
host side (e.g. from 512 to 514 bytes) without changing it using CMD16.
The 2 bytes will be latched in as CRC by the card. Of course you won't
see the CRC ACK from the card in that case (as it clashes with the
hardware CRC signal from the host). 
You can do all this using the ioctl interface in userspace btw.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC?
  2025-04-16  9:13 ` Christian Loehle
@ 2025-04-22 17:37   ` Prasad Koya
  2025-04-22 21:53     ` Christian Loehle
  0 siblings, 1 reply; 8+ messages in thread
From: Prasad Koya @ 2025-04-22 17:37 UTC (permalink / raw)
  To: Christian Loehle; +Cc: Baptiste Covolato, Sushrut Shirole, linux-mmc

Hi Christian

Thanks for going over and replying.

I couldn't find any ioctl() that changed host's block size in the 6.14
tree. We are using 5.10 kernel.

I did try mmc_test module so I can modify an existing test case and
make it a simple one. Hacking this module, I could call CMD16 via
mmc_test_set_blksize() and set the block size to 514 (0x202). Below
are the eMMC traces for this test. A CMD16, followed by CMD25 with
data, stop opcode and then CMD13. The multi-block write seems to go
fine.

            bash-3958  [001] .... 34861.450688: mmc_request_start:
mmc0: start struct mmc_request[00000000fe878338]: cmd_opcode=16
cmd_arg=0x202 cmd_flags=0x95 cmd_retries=5 stop_opcode=0 stop_arg=0x0
stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
retune_period=0
       in:imklog-2971  [000] ..s. 34861.450737: mmc_request_done:
mmc0: end struct mmc_request[00000000fe878338]: cmd_opcode=16
cmd_err=0 cmd_resp=0x20000900 0x0 0x0 0x0 cmd_retries=5 stop_opcode=0
stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
            bash-3958  [001] .... 34861.451064: mmc_request_start:
mmc0: start struct mmc_request[00000000c2f30356]: cmd_opcode=25
cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
sbc_flags=0x0 sbc_retires=0 blocks=128 block_size=512 blk_addr=0
data_flags=0x100 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
          <idle>-0     [000] ..s1 34861.456533: mmc_request_done:
mmc0: end struct mmc_request[00000000c2f30356]: cmd_opcode=25
cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=65536
data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
            bash-3958  [001] .... 34861.456562: mmc_request_start:
mmc0: start struct mmc_request[00000000e93d6161]: cmd_opcode=13
cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
   rs:main Q:Reg-2972  [000] ..s. 34861.456688: mmc_request_done:
mmc0: end struct mmc_request[00000000e93d6161]: cmd_opcode=13
cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0

What we believe could be happening is an SEU that flips a bit in the
data block after the host controller has sent the block on the DAT
lines. Card should reject that after CRC mismatch. Does the card
reject the command right when it sees a CRC mismatch or does it have
to wait till all blocks and CMD12 is sent by the host?

Thank you.



On Wed, Apr 16, 2025 at 2:13 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/16/25 01:13, Prasad Koya wrote:
> > Hi
> >
> > We use eMMC as a boot drive. At a random time, maybe a month or an
> > year of uptime, we run into an issue where we see the eMMC driver
> > prints this in the kernel logs and the drive no longer responds after
> > that. We see this issue in the field on linux kernel 4.19.142 and
> > 5.10.165. From the SDHCI register dump, it looks like CMD25 ran into a
> > timeout first. After the first timeout, we see CMD12 in the second
> > SDHCI register dump. Status returned by the card at that point in
> > card_busy_detect() is 0xE00.
> >
> > I looked at the recent commits up to 6.15 and I do not see anything
> > obvious that addresses issue like below.
> >
> > We are trying to reproduce the issue in our lab and understand the
> > driver code. Is there a way I can send a block of data as part of
> > CMD25 with invalid CRC?  Appreciate any pointers.
>
> CRC is appended by hardware, so it will always be correct.
> What you can do though is change the blocksize of the transfer on the
> host side (e.g. from 512 to 514 bytes) without changing it using CMD16.
> The 2 bytes will be latched in as CRC by the card. Of course you won't
> see the CRC ACK from the card in that case (as it clashes with the
> hardware CRC signal from the host).
> You can do all this using the ioctl interface in userspace btw.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC?
  2025-04-22 17:37   ` Prasad Koya
@ 2025-04-22 21:53     ` Christian Loehle
  2025-04-25  4:59       ` Prasad Koya
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Loehle @ 2025-04-22 21:53 UTC (permalink / raw)
  To: Prasad Koya; +Cc: Baptiste Covolato, Sushrut Shirole, linux-mmc

On 4/22/25 18:37, Prasad Koya wrote:
> Hi Christian
> 
> Thanks for going over and replying.
> 
> I couldn't find any ioctl() that changed host's block size in the 6.14
> tree. We are using 5.10 kernel.

Works on any kernel since a long time.
You can find a caller instance in mmc-utils: mmc_cmds.c, the field is
called blksz, if you've already familiarized yourself with the test
module I'd say don't bother though.

> 
> I did try mmc_test module so I can modify an existing test case and
> make it a simple one. Hacking this module, I could call CMD16 via
> mmc_test_set_blksize() and set the block size to 514 (0x202). Below
> are the eMMC traces for this test. A CMD16, followed by CMD25 with
> data, stop opcode and then CMD13. The multi-block write seems to go
> fine.

Sorry, misunderstanding here. I meant changing the blocksize of the
transfer for the host *without* letting the card know (by issuing CMD16).
So leave out the CMD16 part and you should achieve what you want.

> 
>             bash-3958  [001] .... 34861.450688: mmc_request_start:
> mmc0: start struct mmc_request[00000000fe878338]: cmd_opcode=16
> cmd_arg=0x202 cmd_flags=0x95 cmd_retries=5 stop_opcode=0 stop_arg=0x0
> stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
> sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
> can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
> retune_period=0
>        in:imklog-2971  [000] ..s. 34861.450737: mmc_request_done:
> mmc0: end struct mmc_request[00000000fe878338]: cmd_opcode=16
> cmd_err=0 cmd_resp=0x20000900 0x0 0x0 0x0 cmd_retries=5 stop_opcode=0
> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>             bash-3958  [001] .... 34861.451064: mmc_request_start:
> mmc0: start struct mmc_request[00000000c2f30356]: cmd_opcode=25
> cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
> stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> sbc_flags=0x0 sbc_retires=0 blocks=128 block_size=512 blk_addr=0
> data_flags=0x100 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>           <idle>-0     [000] ..s1 34861.456533: mmc_request_done:
> mmc0: end struct mmc_request[00000000c2f30356]: cmd_opcode=25
> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
> stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=65536
> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>             bash-3958  [001] .... 34861.456562: mmc_request_start:
> mmc0: start struct mmc_request[00000000e93d6161]: cmd_opcode=13
> cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
> stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
> data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>    rs:main Q:Reg-2972  [000] ..s. 34861.456688: mmc_request_done:
> mmc0: end struct mmc_request[00000000e93d6161]: cmd_opcode=13
> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
> 
> What we believe could be happening is an SEU that flips a bit in the
> data block after the host controller has sent the block on the DAT
> lines. Card should reject that after CRC mismatch. Does the card
> reject the command right when it sees a CRC mismatch or does it have
> to wait till all blocks and CMD12 is sent by the host?

Open writes with CMD25-CMD12 indeed send all blocks until the card
can signal CRC error.

> 
> Thank you.
> 
> 
> 
> On Wed, Apr 16, 2025 at 2:13 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 4/16/25 01:13, Prasad Koya wrote:
>>> Hi
>>>
>>> We use eMMC as a boot drive. At a random time, maybe a month or an
>>> year of uptime, we run into an issue where we see the eMMC driver
>>> prints this in the kernel logs and the drive no longer responds after
>>> that. We see this issue in the field on linux kernel 4.19.142 and
>>> 5.10.165. From the SDHCI register dump, it looks like CMD25 ran into a
>>> timeout first. After the first timeout, we see CMD12 in the second
>>> SDHCI register dump. Status returned by the card at that point in
>>> card_busy_detect() is 0xE00.
>>>
>>> I looked at the recent commits up to 6.15 and I do not see anything
>>> obvious that addresses issue like below.
>>>
>>> We are trying to reproduce the issue in our lab and understand the
>>> driver code. Is there a way I can send a block of data as part of
>>> CMD25 with invalid CRC?  Appreciate any pointers.
>>
>> CRC is appended by hardware, so it will always be correct.
>> What you can do though is change the blocksize of the transfer on the
>> host side (e.g. from 512 to 514 bytes) without changing it using CMD16.
>> The 2 bytes will be latched in as CRC by the card. Of course you won't
>> see the CRC ACK from the card in that case (as it clashes with the
>> hardware CRC signal from the host).
>> You can do all this using the ioctl interface in userspace btw.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC?
  2025-04-22 21:53     ` Christian Loehle
@ 2025-04-25  4:59       ` Prasad Koya
  2025-04-25  8:38         ` Christian Loehle
  0 siblings, 1 reply; 8+ messages in thread
From: Prasad Koya @ 2025-04-25  4:59 UTC (permalink / raw)
  To: Christian Loehle; +Cc: Baptiste Covolato, Sushrut Shirole, linux-mmc

Thanks Christian.

I tried CMD25 with a different block size (mrq->data->blksz = 514) and
I can see that the command was rejected with -EILSEQ (data->error =
-84). But the card wasn't stuck. After the command was rejected, I see
that the driver is sending CMD12 to signal the card the end of write.
I see this comment in sdhci.c. Is CMD25 called an "open-ended
multiblock transfer"?

1291     /*
1292      * Need to send CMD12 if -
1293      * a) open-ended multiblock transfer (no CMD23)
1294      * b) error in multiblock transfer
1295      */
1296     if (data->stop &&
1297         (data->error ||
1298          !data->mrq->sbc)) {
:
1308         } else {
1309             /* Avoid triggering warning in sdhci_send_command() */
1310             host->cmd = NULL;
1311             sdhci_send_command(host, data->stop);
1312         }

Here are my emmc traces:

            bash-10454 [000] ....  6902.122069: mmc_request_start:
mmc0: start struct mmc_request[0000000001eb0468]: cmd_opcode=25
cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
sbc_flags=0x0 sbc_retires=0 blocks=17 block_size=1719
blk_addr=15433700 data_flags=0x100 tag=0 can_retune=0 doing_retune=0
retune_now=0 need_retune=0 hold_retune=1 retune_period=0
   rs:main Q:Reg-3221  [000] ..s.  6902.125240: mmc_request_done:
mmc0: end struct mmc_request[0000000001eb0468]: cmd_opcode=25
cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
data_err=-84 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
            bash-10454 [000] ....  6902.126438: mmc_request_start:
mmc0: start struct mmc_request[000000001c8ae402]: cmd_opcode=13
cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
            bash-10454 [000] ..s1  6902.126911: mmc_request_done:
mmc0: end struct mmc_request[000000001c8ae402]: cmd_opcode=13
cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0

For our customers, something from the host to the card gets the card
stuck. The status returned at that point is 0xe00. After CMD25 and
CMD12 that took 10s each to timeout, all subsequent commands from the
host fail immediately with -110 error.

Is there any other forum where I can post such questions. I'm a newbie
to the driver. I'm looking at the JEDEC spec and trying to figure out
the implementation.

Thanks.


On Tue, Apr 22, 2025 at 2:54 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/22/25 18:37, Prasad Koya wrote:
> > Hi Christian
> >
> > Thanks for going over and replying.
> >
> > I couldn't find any ioctl() that changed host's block size in the 6.14
> > tree. We are using 5.10 kernel.
>
> Works on any kernel since a long time.
> You can find a caller instance in mmc-utils: mmc_cmds.c, the field is
> called blksz, if you've already familiarized yourself with the test
> module I'd say don't bother though.
>
> >
> > I did try mmc_test module so I can modify an existing test case and
> > make it a simple one. Hacking this module, I could call CMD16 via
> > mmc_test_set_blksize() and set the block size to 514 (0x202). Below
> > are the eMMC traces for this test. A CMD16, followed by CMD25 with
> > data, stop opcode and then CMD13. The multi-block write seems to go
> > fine.
>
> Sorry, misunderstanding here. I meant changing the blocksize of the
> transfer for the host *without* letting the card know (by issuing CMD16).
> So leave out the CMD16 part and you should achieve what you want.
>
> >
> >             bash-3958  [001] .... 34861.450688: mmc_request_start:
> > mmc0: start struct mmc_request[00000000fe878338]: cmd_opcode=16
> > cmd_arg=0x202 cmd_flags=0x95 cmd_retries=5 stop_opcode=0 stop_arg=0x0
> > stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
> > sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
> > can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
> > retune_period=0
> >        in:imklog-2971  [000] ..s. 34861.450737: mmc_request_done:
> > mmc0: end struct mmc_request[00000000fe878338]: cmd_opcode=16
> > cmd_err=0 cmd_resp=0x20000900 0x0 0x0 0x0 cmd_retries=5 stop_opcode=0
> > stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> > sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> > data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> > need_retune=0 hold_retune=1 retune_period=0
> >             bash-3958  [001] .... 34861.451064: mmc_request_start:
> > mmc0: start struct mmc_request[00000000c2f30356]: cmd_opcode=25
> > cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
> > stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> > sbc_flags=0x0 sbc_retires=0 blocks=128 block_size=512 blk_addr=0
> > data_flags=0x100 tag=0 can_retune=0 doing_retune=0 retune_now=0
> > need_retune=0 hold_retune=1 retune_period=0
> >           <idle>-0     [000] ..s1 34861.456533: mmc_request_done:
> > mmc0: end struct mmc_request[00000000c2f30356]: cmd_opcode=25
> > cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
> > stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> > sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=65536
> > data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> > need_retune=0 hold_retune=1 retune_period=0
> >             bash-3958  [001] .... 34861.456562: mmc_request_start:
> > mmc0: start struct mmc_request[00000000e93d6161]: cmd_opcode=13
> > cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
> > stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> > sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
> > data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> > need_retune=0 hold_retune=1 retune_period=0
> >    rs:main Q:Reg-2972  [000] ..s. 34861.456688: mmc_request_done:
> > mmc0: end struct mmc_request[00000000e93d6161]: cmd_opcode=13
> > cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
> > stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> > sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> > data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> > need_retune=0 hold_retune=1 retune_period=0
> >
> > What we believe could be happening is an SEU that flips a bit in the
> > data block after the host controller has sent the block on the DAT
> > lines. Card should reject that after CRC mismatch. Does the card
> > reject the command right when it sees a CRC mismatch or does it have
> > to wait till all blocks and CMD12 is sent by the host?
>
> Open writes with CMD25-CMD12 indeed send all blocks until the card
> can signal CRC error.
>
> >
> > Thank you.
> >
> >
> >
> > On Wed, Apr 16, 2025 at 2:13 AM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 4/16/25 01:13, Prasad Koya wrote:
> >>> Hi
> >>>
> >>> We use eMMC as a boot drive. At a random time, maybe a month or an
> >>> year of uptime, we run into an issue where we see the eMMC driver
> >>> prints this in the kernel logs and the drive no longer responds after
> >>> that. We see this issue in the field on linux kernel 4.19.142 and
> >>> 5.10.165. From the SDHCI register dump, it looks like CMD25 ran into a
> >>> timeout first. After the first timeout, we see CMD12 in the second
> >>> SDHCI register dump. Status returned by the card at that point in
> >>> card_busy_detect() is 0xE00.
> >>>
> >>> I looked at the recent commits up to 6.15 and I do not see anything
> >>> obvious that addresses issue like below.
> >>>
> >>> We are trying to reproduce the issue in our lab and understand the
> >>> driver code. Is there a way I can send a block of data as part of
> >>> CMD25 with invalid CRC?  Appreciate any pointers.
> >>
> >> CRC is appended by hardware, so it will always be correct.
> >> What you can do though is change the blocksize of the transfer on the
> >> host side (e.g. from 512 to 514 bytes) without changing it using CMD16.
> >> The 2 bytes will be latched in as CRC by the card. Of course you won't
> >> see the CRC ACK from the card in that case (as it clashes with the
> >> hardware CRC signal from the host).
> >> You can do all this using the ioctl interface in userspace btw.
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC?
  2025-04-25  4:59       ` Prasad Koya
@ 2025-04-25  8:38         ` Christian Loehle
  2025-04-29 18:53           ` Prasad Koya
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Loehle @ 2025-04-25  8:38 UTC (permalink / raw)
  To: Prasad Koya; +Cc: Baptiste Covolato, Sushrut Shirole, linux-mmc

On 4/25/25 05:59, Prasad Koya wrote:
> Thanks Christian.
> 
> I tried CMD25 with a different block size (mrq->data->blksz = 514) and
> I can see that the command was rejected with -EILSEQ (data->error =
> -84). But the card wasn't stuck. After the command was rejected, I see
> that the driver is sending CMD12 to signal the card the end of write.
> I see this comment in sdhci.c. Is CMD25 called an "open-ended
> multiblock transfer"?

Yes indeed that's how that works.
IIUC your hypothesis was "card gets stuck in busy when CMD25 data was sent
with an invalid CRC"? That hypothesis you can now reject. (For open-ended
CMD25 anyway, you can also try CMD23+CMD25 and CMD24 writes).

> 
> 1291     /*
> 1292      * Need to send CMD12 if -
> 1293      * a) open-ended multiblock transfer (no CMD23)
> 1294      * b) error in multiblock transfer
> 1295      */
> 1296     if (data->stop &&
> 1297         (data->error ||
> 1298          !data->mrq->sbc)) {
> :
> 1308         } else {
> 1309             /* Avoid triggering warning in sdhci_send_command() */
> 1310             host->cmd = NULL;
> 1311             sdhci_send_command(host, data->stop);
> 1312         }
> 
> Here are my emmc traces:
> 
>             bash-10454 [000] ....  6902.122069: mmc_request_start:
> mmc0: start struct mmc_request[0000000001eb0468]: cmd_opcode=25
> cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
> stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> sbc_flags=0x0 sbc_retires=0 blocks=17 block_size=1719
> blk_addr=15433700 data_flags=0x100 tag=0 can_retune=0 doing_retune=0
> retune_now=0 need_retune=0 hold_retune=1 retune_period=0
>    rs:main Q:Reg-3221  [000] ..s.  6902.125240: mmc_request_done:
> mmc0: end struct mmc_request[0000000001eb0468]: cmd_opcode=25
> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
> stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> data_err=-84 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>             bash-10454 [000] ....  6902.126438: mmc_request_start:
> mmc0: start struct mmc_request[000000001c8ae402]: cmd_opcode=13
> cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
> stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
> data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>             bash-10454 [000] ..s1  6902.126911: mmc_request_done:
> mmc0: end struct mmc_request[000000001c8ae402]: cmd_opcode=13
> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
> 
> For our customers, something from the host to the card gets the card
> stuck. The status returned at that point is 0xe00. After CMD25 and
> CMD12 that took 10s each to timeout, all subsequent commands from the
> host fail immediately with -110 error.

If the card is in busy it wont accept any new commands, so that's in
line. The only question is why is it stuck in busy? (The spec
mandates maximum busy times, they are sometimes violated by some card
vendors (especially SD cards), but nothing on the order of 10s should be
observed.)
Have you contacted the card vendor about this? I don't see any hint
on wrongdoing of the host here.


> Is there any other forum where I can post such questions. I'm a newbie
> to the driver. I'm looking at the JEDEC spec and trying to figure out
> the implementation.

Not that I'm aware of.

> 
> Thanks.
> 
> 
> On Tue, Apr 22, 2025 at 2:54 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 4/22/25 18:37, Prasad Koya wrote:
>>> Hi Christian
>>>
>>> Thanks for going over and replying.
>>>
>>> I couldn't find any ioctl() that changed host's block size in the 6.14
>>> tree. We are using 5.10 kernel.
>>
>> Works on any kernel since a long time.
>> You can find a caller instance in mmc-utils: mmc_cmds.c, the field is
>> called blksz, if you've already familiarized yourself with the test
>> module I'd say don't bother though.
>>
>>>
>>> I did try mmc_test module so I can modify an existing test case and
>>> make it a simple one. Hacking this module, I could call CMD16 via
>>> mmc_test_set_blksize() and set the block size to 514 (0x202). Below
>>> are the eMMC traces for this test. A CMD16, followed by CMD25 with
>>> data, stop opcode and then CMD13. The multi-block write seems to go
>>> fine.
>>
>> Sorry, misunderstanding here. I meant changing the blocksize of the
>> transfer for the host *without* letting the card know (by issuing CMD16).
>> So leave out the CMD16 part and you should achieve what you want.
>>
>>>
>>>             bash-3958  [001] .... 34861.450688: mmc_request_start:
>>> mmc0: start struct mmc_request[00000000fe878338]: cmd_opcode=16
>>> cmd_arg=0x202 cmd_flags=0x95 cmd_retries=5 stop_opcode=0 stop_arg=0x0
>>> stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
>>> sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
>>> can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
>>> retune_period=0
>>>        in:imklog-2971  [000] ..s. 34861.450737: mmc_request_done:
>>> mmc0: end struct mmc_request[00000000fe878338]: cmd_opcode=16
>>> cmd_err=0 cmd_resp=0x20000900 0x0 0x0 0x0 cmd_retries=5 stop_opcode=0
>>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>> need_retune=0 hold_retune=1 retune_period=0
>>>             bash-3958  [001] .... 34861.451064: mmc_request_start:
>>> mmc0: start struct mmc_request[00000000c2f30356]: cmd_opcode=25
>>> cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
>>> stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
>>> sbc_flags=0x0 sbc_retires=0 blocks=128 block_size=512 blk_addr=0
>>> data_flags=0x100 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>> need_retune=0 hold_retune=1 retune_period=0
>>>           <idle>-0     [000] ..s1 34861.456533: mmc_request_done:
>>> mmc0: end struct mmc_request[00000000c2f30356]: cmd_opcode=25
>>> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
>>> stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=65536
>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>> need_retune=0 hold_retune=1 retune_period=0
>>>             bash-3958  [001] .... 34861.456562: mmc_request_start:
>>> mmc0: start struct mmc_request[00000000e93d6161]: cmd_opcode=13
>>> cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
>>> stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
>>> sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
>>> data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>> need_retune=0 hold_retune=1 retune_period=0
>>>    rs:main Q:Reg-2972  [000] ..s. 34861.456688: mmc_request_done:
>>> mmc0: end struct mmc_request[00000000e93d6161]: cmd_opcode=13
>>> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
>>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>> need_retune=0 hold_retune=1 retune_period=0
>>>
>>> What we believe could be happening is an SEU that flips a bit in the
>>> data block after the host controller has sent the block on the DAT
>>> lines. Card should reject that after CRC mismatch. Does the card
>>> reject the command right when it sees a CRC mismatch or does it have
>>> to wait till all blocks and CMD12 is sent by the host?
>>
>> Open writes with CMD25-CMD12 indeed send all blocks until the card
>> can signal CRC error.
>>
>>>
>>> Thank you.
>>>
>>>
>>>
>>> On Wed, Apr 16, 2025 at 2:13 AM Christian Loehle
>>> <christian.loehle@arm.com> wrote:
>>>>
>>>> On 4/16/25 01:13, Prasad Koya wrote:
>>>>> Hi
>>>>>
>>>>> We use eMMC as a boot drive. At a random time, maybe a month or an
>>>>> year of uptime, we run into an issue where we see the eMMC driver
>>>>> prints this in the kernel logs and the drive no longer responds after
>>>>> that. We see this issue in the field on linux kernel 4.19.142 and
>>>>> 5.10.165. From the SDHCI register dump, it looks like CMD25 ran into a
>>>>> timeout first. After the first timeout, we see CMD12 in the second
>>>>> SDHCI register dump. Status returned by the card at that point in
>>>>> card_busy_detect() is 0xE00.
>>>>>
>>>>> I looked at the recent commits up to 6.15 and I do not see anything
>>>>> obvious that addresses issue like below.
>>>>>
>>>>> We are trying to reproduce the issue in our lab and understand the
>>>>> driver code. Is there a way I can send a block of data as part of
>>>>> CMD25 with invalid CRC?  Appreciate any pointers.
>>>>
>>>> CRC is appended by hardware, so it will always be correct.
>>>> What you can do though is change the blocksize of the transfer on the
>>>> host side (e.g. from 512 to 514 bytes) without changing it using CMD16.
>>>> The 2 bytes will be latched in as CRC by the card. Of course you won't
>>>> see the CRC ACK from the card in that case (as it clashes with the
>>>> hardware CRC signal from the host).
>>>> You can do all this using the ioctl interface in userspace btw.
>>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC?
  2025-04-25  8:38         ` Christian Loehle
@ 2025-04-29 18:53           ` Prasad Koya
  2025-04-29 18:58             ` Christian Loehle
  0 siblings, 1 reply; 8+ messages in thread
From: Prasad Koya @ 2025-04-29 18:53 UTC (permalink / raw)
  To: Christian Loehle; +Cc: Baptiste Covolato, Sushrut Shirole, linux-mmc

Thanks.

mmc_test module sets CMD23 for both CMD24 and CMD25.

I ran into something weird, which we also saw in the field is a CMD25
with block count as 0.

Here are the kernel logs from the field:

[Sat Mar 29 01:31:07 2025] mmc0: Timeout waiting for hardware interrupt.
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: ============ SDHCI REGISTER
DUMP ===========
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: Sys addr:  0xd3c01000 |
Version:  0x00001002
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: Blk size:  0x00007200 | Blk
cnt:  0x00000000
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: Argument:  0x0045d1a8 | Trn
mode: 0x00000023
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: Present:   0x01ef0006 | Host
ctl: 0x00000007
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: Power:     0x0000000f | Blk
gap:  0x00000000
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: Wake-up:   0x00000000 | Clock:
   0x00000207
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: Timeout:   0x0000000e | Int
stat: 0x00000000
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: Int enab:  0x02ff008b | Sig
enab: 0x02ff008b
[Sat Mar 29 01:31:07 2025] mmc0: sdhci: ACmd stat: 0x00000000 | Slot
int: 0x000000ff
[Sat Mar 29 01:31:08 2025] mmc0: sdhci: Caps:      0x75fec8b2 |
Caps_1:   0x00002501
[Sat Mar 29 01:31:08 2025] mmc0: sdhci: Cmd:       0x0000193a | Max
curr: 0x00c80064
[Sat Mar 29 01:31:08 2025] mmc0: sdhci: Resp[0]:   0x00000900 |
Resp[1]:  0xffffffff
[Sat Mar 29 01:31:08 2025] mmc0: sdhci: Resp[2]:   0x320f5903 |
Resp[3]:  0x00d04f01
[Sat Mar 29 01:31:08 2025] mmc0: sdhci: Host ctl2: 0x00000000
[Sat Mar 29 01:31:08 2025] mmc0: sdhci:
============================================
[Sat Mar 29 01:31:10 2025] sdhci-pci 0000:00:14.7: Card stuck in wrong
state! card_busy_detect status: 0xe00
[Sat Mar 29 01:31:10 2025] mmc0: cache flush error -110
[Sat Mar 29 01:31:11 2025] mmc0: tried to HW reset card, got error -110

If I send a CMD25 with data->blocks as 0 and with the same 'Argument:'
value above for cmd->arg, I get a 10s timeout. I see that the cmd->arg
is the "device address" or the logical block address as seen by the
kernel.

Subsequent commands are failing with cmd->err as -110. CMD13 result
comes with resp as 0x400d00. I see that the irq count for the sdhci
controller goes up by 1 for each of these commands. Does it mean that
the host controller knows the state of the card and does not forward
the commands to the card?

Below are the eMMC traces on my box with CMD25 and above 'Argument:'
or dev_addr value:

            bash-3281  [000] .... 77826.075030: mmc_request_start:
mmc0: start struct mmc_request[00000000e720518c]: cmd_opcode=25
cmd_arg=0x45d1a8 cmd_flags=0x35 cmd_retries=0 stop_opcode=0
stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=512 blk_addr=0
data_flags=0x100 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
            bash-3281  [000] ..s1 77826.076781: mmc_request_done:
mmc0: end struct mmc_request[00000000e720518c]: cmd_opcode=25
cmd_err=-110 cmd_resp=0x0 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
            bash-3281  [000] .... 77826.076841: mmc_request_start:
mmc0: start struct mmc_request[00000000e8466c12]: cmd_opcode=13
cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0
            bash-3281  [000] ..s. 77826.076879: mmc_request_done:
mmc0: end struct mmc_request[00000000e8466c12]: cmd_opcode=13
cmd_err=0 cmd_resp=0x400d00 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
need_retune=0 hold_retune=1 retune_period=0

Resetting the controller by the reset test (number 45) gets the
controller out of that state.

We are following up with the card vendor as well.

In what situations would the kernel frame up a CMD25 with 0 blocks?
Why does the controller or the card not like the dev_addr in the
cmd->arg field?

Thank you.


On Fri, Apr 25, 2025 at 1:38 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/25/25 05:59, Prasad Koya wrote:
> > Thanks Christian.
> >
> > I tried CMD25 with a different block size (mrq->data->blksz = 514) and
> > I can see that the command was rejected with -EILSEQ (data->error =
> > -84). But the card wasn't stuck. After the command was rejected, I see
> > that the driver is sending CMD12 to signal the card the end of write.
> > I see this comment in sdhci.c. Is CMD25 called an "open-ended
> > multiblock transfer"?
>
> Yes indeed that's how that works.
> IIUC your hypothesis was "card gets stuck in busy when CMD25 data was sent
> with an invalid CRC"? That hypothesis you can now reject. (For open-ended
> CMD25 anyway, you can also try CMD23+CMD25 and CMD24 writes).
>
> >
> > 1291     /*
> > 1292      * Need to send CMD12 if -
> > 1293      * a) open-ended multiblock transfer (no CMD23)
> > 1294      * b) error in multiblock transfer
> > 1295      */
> > 1296     if (data->stop &&
> > 1297         (data->error ||
> > 1298          !data->mrq->sbc)) {
> > :
> > 1308         } else {
> > 1309             /* Avoid triggering warning in sdhci_send_command() */
> > 1310             host->cmd = NULL;
> > 1311             sdhci_send_command(host, data->stop);
> > 1312         }
> >
> > Here are my emmc traces:
> >
> >             bash-10454 [000] ....  6902.122069: mmc_request_start:
> > mmc0: start struct mmc_request[0000000001eb0468]: cmd_opcode=25
> > cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
> > stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> > sbc_flags=0x0 sbc_retires=0 blocks=17 block_size=1719
> > blk_addr=15433700 data_flags=0x100 tag=0 can_retune=0 doing_retune=0
> > retune_now=0 need_retune=0 hold_retune=1 retune_period=0
> >    rs:main Q:Reg-3221  [000] ..s.  6902.125240: mmc_request_done:
> > mmc0: end struct mmc_request[0000000001eb0468]: cmd_opcode=25
> > cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
> > stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> > sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> > data_err=-84 tag=0 can_retune=0 doing_retune=0 retune_now=0
> > need_retune=0 hold_retune=1 retune_period=0
> >             bash-10454 [000] ....  6902.126438: mmc_request_start:
> > mmc0: start struct mmc_request[000000001c8ae402]: cmd_opcode=13
> > cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
> > stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> > sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
> > data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> > need_retune=0 hold_retune=1 retune_period=0
> >             bash-10454 [000] ..s1  6902.126911: mmc_request_done:
> > mmc0: end struct mmc_request[000000001c8ae402]: cmd_opcode=13
> > cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
> > stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> > sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> > data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> > need_retune=0 hold_retune=1 retune_period=0
> >
> > For our customers, something from the host to the card gets the card
> > stuck. The status returned at that point is 0xe00. After CMD25 and
> > CMD12 that took 10s each to timeout, all subsequent commands from the
> > host fail immediately with -110 error.
>
> If the card is in busy it wont accept any new commands, so that's in
> line. The only question is why is it stuck in busy? (The spec
> mandates maximum busy times, they are sometimes violated by some card
> vendors (especially SD cards), but nothing on the order of 10s should be
> observed.)
> Have you contacted the card vendor about this? I don't see any hint
> on wrongdoing of the host here.
>
>
> > Is there any other forum where I can post such questions. I'm a newbie
> > to the driver. I'm looking at the JEDEC spec and trying to figure out
> > the implementation.
>
> Not that I'm aware of.
>
> >
> > Thanks.
> >
> >
> > On Tue, Apr 22, 2025 at 2:54 PM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 4/22/25 18:37, Prasad Koya wrote:
> >>> Hi Christian
> >>>
> >>> Thanks for going over and replying.
> >>>
> >>> I couldn't find any ioctl() that changed host's block size in the 6.14
> >>> tree. We are using 5.10 kernel.
> >>
> >> Works on any kernel since a long time.
> >> You can find a caller instance in mmc-utils: mmc_cmds.c, the field is
> >> called blksz, if you've already familiarized yourself with the test
> >> module I'd say don't bother though.
> >>
> >>>
> >>> I did try mmc_test module so I can modify an existing test case and
> >>> make it a simple one. Hacking this module, I could call CMD16 via
> >>> mmc_test_set_blksize() and set the block size to 514 (0x202). Below
> >>> are the eMMC traces for this test. A CMD16, followed by CMD25 with
> >>> data, stop opcode and then CMD13. The multi-block write seems to go
> >>> fine.
> >>
> >> Sorry, misunderstanding here. I meant changing the blocksize of the
> >> transfer for the host *without* letting the card know (by issuing CMD16).
> >> So leave out the CMD16 part and you should achieve what you want.
> >>
> >>>
> >>>             bash-3958  [001] .... 34861.450688: mmc_request_start:
> >>> mmc0: start struct mmc_request[00000000fe878338]: cmd_opcode=16
> >>> cmd_arg=0x202 cmd_flags=0x95 cmd_retries=5 stop_opcode=0 stop_arg=0x0
> >>> stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
> >>> sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
> >>> can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
> >>> retune_period=0
> >>>        in:imklog-2971  [000] ..s. 34861.450737: mmc_request_done:
> >>> mmc0: end struct mmc_request[00000000fe878338]: cmd_opcode=16
> >>> cmd_err=0 cmd_resp=0x20000900 0x0 0x0 0x0 cmd_retries=5 stop_opcode=0
> >>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> >>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> >>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> >>> need_retune=0 hold_retune=1 retune_period=0
> >>>             bash-3958  [001] .... 34861.451064: mmc_request_start:
> >>> mmc0: start struct mmc_request[00000000c2f30356]: cmd_opcode=25
> >>> cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
> >>> stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> >>> sbc_flags=0x0 sbc_retires=0 blocks=128 block_size=512 blk_addr=0
> >>> data_flags=0x100 tag=0 can_retune=0 doing_retune=0 retune_now=0
> >>> need_retune=0 hold_retune=1 retune_period=0
> >>>           <idle>-0     [000] ..s1 34861.456533: mmc_request_done:
> >>> mmc0: end struct mmc_request[00000000c2f30356]: cmd_opcode=25
> >>> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
> >>> stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> >>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=65536
> >>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> >>> need_retune=0 hold_retune=1 retune_period=0
> >>>             bash-3958  [001] .... 34861.456562: mmc_request_start:
> >>> mmc0: start struct mmc_request[00000000e93d6161]: cmd_opcode=13
> >>> cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
> >>> stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> >>> sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
> >>> data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> >>> need_retune=0 hold_retune=1 retune_period=0
> >>>    rs:main Q:Reg-2972  [000] ..s. 34861.456688: mmc_request_done:
> >>> mmc0: end struct mmc_request[00000000e93d6161]: cmd_opcode=13
> >>> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
> >>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> >>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> >>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> >>> need_retune=0 hold_retune=1 retune_period=0
> >>>
> >>> What we believe could be happening is an SEU that flips a bit in the
> >>> data block after the host controller has sent the block on the DAT
> >>> lines. Card should reject that after CRC mismatch. Does the card
> >>> reject the command right when it sees a CRC mismatch or does it have
> >>> to wait till all blocks and CMD12 is sent by the host?
> >>
> >> Open writes with CMD25-CMD12 indeed send all blocks until the card
> >> can signal CRC error.
> >>
> >>>
> >>> Thank you.
> >>>
> >>>
> >>>
> >>> On Wed, Apr 16, 2025 at 2:13 AM Christian Loehle
> >>> <christian.loehle@arm.com> wrote:
> >>>>
> >>>> On 4/16/25 01:13, Prasad Koya wrote:
> >>>>> Hi
> >>>>>
> >>>>> We use eMMC as a boot drive. At a random time, maybe a month or an
> >>>>> year of uptime, we run into an issue where we see the eMMC driver
> >>>>> prints this in the kernel logs and the drive no longer responds after
> >>>>> that. We see this issue in the field on linux kernel 4.19.142 and
> >>>>> 5.10.165. From the SDHCI register dump, it looks like CMD25 ran into a
> >>>>> timeout first. After the first timeout, we see CMD12 in the second
> >>>>> SDHCI register dump. Status returned by the card at that point in
> >>>>> card_busy_detect() is 0xE00.
> >>>>>
> >>>>> I looked at the recent commits up to 6.15 and I do not see anything
> >>>>> obvious that addresses issue like below.
> >>>>>
> >>>>> We are trying to reproduce the issue in our lab and understand the
> >>>>> driver code. Is there a way I can send a block of data as part of
> >>>>> CMD25 with invalid CRC?  Appreciate any pointers.
> >>>>
> >>>> CRC is appended by hardware, so it will always be correct.
> >>>> What you can do though is change the blocksize of the transfer on the
> >>>> host side (e.g. from 512 to 514 bytes) without changing it using CMD16.
> >>>> The 2 bytes will be latched in as CRC by the card. Of course you won't
> >>>> see the CRC ACK from the card in that case (as it clashes with the
> >>>> hardware CRC signal from the host).
> >>>> You can do all this using the ioctl interface in userspace btw.
> >>
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC?
  2025-04-29 18:53           ` Prasad Koya
@ 2025-04-29 18:58             ` Christian Loehle
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Loehle @ 2025-04-29 18:58 UTC (permalink / raw)
  To: Prasad Koya; +Cc: Baptiste Covolato, Sushrut Shirole, linux-mmc

On 4/29/25 19:53, Prasad Koya wrote:
> Thanks.
> 
> mmc_test module sets CMD23 for both CMD24 and CMD25.
> 
> I ran into something weird, which we also saw in the field is a CMD25
> with block count as 0.
> 
> Here are the kernel logs from the field:
> 
> [Sat Mar 29 01:31:07 2025] mmc0: Timeout waiting for hardware interrupt.
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: ============ SDHCI REGISTER
> DUMP ===========
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: Sys addr:  0xd3c01000 |
> Version:  0x00001002
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: Blk size:  0x00007200 | Blk
> cnt:  0x00000000
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: Argument:  0x0045d1a8 | Trn
> mode: 0x00000023
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: Present:   0x01ef0006 | Host
> ctl: 0x00000007
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: Power:     0x0000000f | Blk
> gap:  0x00000000
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: Wake-up:   0x00000000 | Clock:
>    0x00000207
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: Timeout:   0x0000000e | Int
> stat: 0x00000000
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: Int enab:  0x02ff008b | Sig
> enab: 0x02ff008b
> [Sat Mar 29 01:31:07 2025] mmc0: sdhci: ACmd stat: 0x00000000 | Slot
> int: 0x000000ff
> [Sat Mar 29 01:31:08 2025] mmc0: sdhci: Caps:      0x75fec8b2 |
> Caps_1:   0x00002501
> [Sat Mar 29 01:31:08 2025] mmc0: sdhci: Cmd:       0x0000193a | Max
> curr: 0x00c80064
> [Sat Mar 29 01:31:08 2025] mmc0: sdhci: Resp[0]:   0x00000900 |
> Resp[1]:  0xffffffff
> [Sat Mar 29 01:31:08 2025] mmc0: sdhci: Resp[2]:   0x320f5903 |
> Resp[3]:  0x00d04f01
> [Sat Mar 29 01:31:08 2025] mmc0: sdhci: Host ctl2: 0x00000000
> [Sat Mar 29 01:31:08 2025] mmc0: sdhci:
> ============================================
> [Sat Mar 29 01:31:10 2025] sdhci-pci 0000:00:14.7: Card stuck in wrong
> state! card_busy_detect status: 0xe00
> [Sat Mar 29 01:31:10 2025] mmc0: cache flush error -110
> [Sat Mar 29 01:31:11 2025] mmc0: tried to HW reset card, got error -110
> 
> If I send a CMD25 with data->blocks as 0 and with the same 'Argument:'
> value above for cmd->arg, I get a 10s timeout. I see that the cmd->arg
> is the "device address" or the logical block address as seen by the
> kernel.
> 
> Subsequent commands are failing with cmd->err as -110. CMD13 result
> comes with resp as 0x400d00. I see that the irq count for the sdhci
> controller goes up by 1 for each of these commands. Does it mean that
> the host controller knows the state of the card and does not forward
> the commands to the card?

It does send the command, the card rejects them as ILLEGAL and does not
respond, rightfully so, because it is still in data receive state.
A card only moves out of data receive after CMD25 if the blocks
set by CMD23 were sent or CMD12 is received.
The card seems to be behaving fine AFAICT.

> 
> Below are the eMMC traces on my box with CMD25 and above 'Argument:'
> or dev_addr value:
> 
>             bash-3281  [000] .... 77826.075030: mmc_request_start:
> mmc0: start struct mmc_request[00000000e720518c]: cmd_opcode=25
> cmd_arg=0x45d1a8 cmd_flags=0x35 cmd_retries=0 stop_opcode=0
> stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=512 blk_addr=0
> data_flags=0x100 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>             bash-3281  [000] ..s1 77826.076781: mmc_request_done:
> mmc0: end struct mmc_request[00000000e720518c]: cmd_opcode=25
> cmd_err=-110 cmd_resp=0x0 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>             bash-3281  [000] .... 77826.076841: mmc_request_start:
> mmc0: start struct mmc_request[00000000e8466c12]: cmd_opcode=13
> cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
> stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
> sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
> data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
>             bash-3281  [000] ..s. 77826.076879: mmc_request_done:
> mmc0: end struct mmc_request[00000000e8466c12]: cmd_opcode=13
> cmd_err=0 cmd_resp=0x400d00 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
> need_retune=0 hold_retune=1 retune_period=0
> 
> Resetting the controller by the reset test (number 45) gets the
> controller out of that state.


CMD12 should suffice too, no?


> 
> We are following up with the card vendor as well.
> 
> In what situations would the kernel frame up a CMD25 with 0 blocks?
> Why does the controller or the card not like the dev_addr in the
> cmd->arg field?

This is the real issue, if that does happen.
Particularly the former, the kernel shouldn't send a CMD25 with 0
blocks.
If this happens during normal operation too (not mmc_test) a trace
would be interesting.

> 
> Thank you.
> 
> 
> On Fri, Apr 25, 2025 at 1:38 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 4/25/25 05:59, Prasad Koya wrote:
>>> Thanks Christian.
>>>
>>> I tried CMD25 with a different block size (mrq->data->blksz = 514) and
>>> I can see that the command was rejected with -EILSEQ (data->error =
>>> -84). But the card wasn't stuck. After the command was rejected, I see
>>> that the driver is sending CMD12 to signal the card the end of write.
>>> I see this comment in sdhci.c. Is CMD25 called an "open-ended
>>> multiblock transfer"?
>>
>> Yes indeed that's how that works.
>> IIUC your hypothesis was "card gets stuck in busy when CMD25 data was sent
>> with an invalid CRC"? That hypothesis you can now reject. (For open-ended
>> CMD25 anyway, you can also try CMD23+CMD25 and CMD24 writes).
>>
>>>
>>> 1291     /*
>>> 1292      * Need to send CMD12 if -
>>> 1293      * a) open-ended multiblock transfer (no CMD23)
>>> 1294      * b) error in multiblock transfer
>>> 1295      */
>>> 1296     if (data->stop &&
>>> 1297         (data->error ||
>>> 1298          !data->mrq->sbc)) {
>>> :
>>> 1308         } else {
>>> 1309             /* Avoid triggering warning in sdhci_send_command() */
>>> 1310             host->cmd = NULL;
>>> 1311             sdhci_send_command(host, data->stop);
>>> 1312         }
>>>
>>> Here are my emmc traces:
>>>
>>>             bash-10454 [000] ....  6902.122069: mmc_request_start:
>>> mmc0: start struct mmc_request[0000000001eb0468]: cmd_opcode=25
>>> cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
>>> stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
>>> sbc_flags=0x0 sbc_retires=0 blocks=17 block_size=1719
>>> blk_addr=15433700 data_flags=0x100 tag=0 can_retune=0 doing_retune=0
>>> retune_now=0 need_retune=0 hold_retune=1 retune_period=0
>>>    rs:main Q:Reg-3221  [000] ..s.  6902.125240: mmc_request_done:
>>> mmc0: end struct mmc_request[0000000001eb0468]: cmd_opcode=25
>>> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
>>> stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>>> data_err=-84 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>> need_retune=0 hold_retune=1 retune_period=0
>>>             bash-10454 [000] ....  6902.126438: mmc_request_start:
>>> mmc0: start struct mmc_request[000000001c8ae402]: cmd_opcode=13
>>> cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
>>> stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
>>> sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
>>> data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>> need_retune=0 hold_retune=1 retune_period=0
>>>             bash-10454 [000] ..s1  6902.126911: mmc_request_done:
>>> mmc0: end struct mmc_request[000000001c8ae402]: cmd_opcode=13
>>> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
>>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>> need_retune=0 hold_retune=1 retune_period=0
>>>
>>> For our customers, something from the host to the card gets the card
>>> stuck. The status returned at that point is 0xe00. After CMD25 and
>>> CMD12 that took 10s each to timeout, all subsequent commands from the
>>> host fail immediately with -110 error.
>>
>> If the card is in busy it wont accept any new commands, so that's in
>> line. The only question is why is it stuck in busy? (The spec
>> mandates maximum busy times, they are sometimes violated by some card
>> vendors (especially SD cards), but nothing on the order of 10s should be
>> observed.)
>> Have you contacted the card vendor about this? I don't see any hint
>> on wrongdoing of the host here.
>>
>>
>>> Is there any other forum where I can post such questions. I'm a newbie
>>> to the driver. I'm looking at the JEDEC spec and trying to figure out
>>> the implementation.
>>
>> Not that I'm aware of.
>>
>>>
>>> Thanks.
>>>
>>>
>>> On Tue, Apr 22, 2025 at 2:54 PM Christian Loehle
>>> <christian.loehle@arm.com> wrote:
>>>>
>>>> On 4/22/25 18:37, Prasad Koya wrote:
>>>>> Hi Christian
>>>>>
>>>>> Thanks for going over and replying.
>>>>>
>>>>> I couldn't find any ioctl() that changed host's block size in the 6.14
>>>>> tree. We are using 5.10 kernel.
>>>>
>>>> Works on any kernel since a long time.
>>>> You can find a caller instance in mmc-utils: mmc_cmds.c, the field is
>>>> called blksz, if you've already familiarized yourself with the test
>>>> module I'd say don't bother though.
>>>>
>>>>>
>>>>> I did try mmc_test module so I can modify an existing test case and
>>>>> make it a simple one. Hacking this module, I could call CMD16 via
>>>>> mmc_test_set_blksize() and set the block size to 514 (0x202). Below
>>>>> are the eMMC traces for this test. A CMD16, followed by CMD25 with
>>>>> data, stop opcode and then CMD13. The multi-block write seems to go
>>>>> fine.
>>>>
>>>> Sorry, misunderstanding here. I meant changing the blocksize of the
>>>> transfer for the host *without* letting the card know (by issuing CMD16).
>>>> So leave out the CMD16 part and you should achieve what you want.
>>>>
>>>>>
>>>>>             bash-3958  [001] .... 34861.450688: mmc_request_start:
>>>>> mmc0: start struct mmc_request[00000000fe878338]: cmd_opcode=16
>>>>> cmd_arg=0x202 cmd_flags=0x95 cmd_retries=5 stop_opcode=0 stop_arg=0x0
>>>>> stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
>>>>> sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
>>>>> can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
>>>>> retune_period=0
>>>>>        in:imklog-2971  [000] ..s. 34861.450737: mmc_request_done:
>>>>> mmc0: end struct mmc_request[00000000fe878338]: cmd_opcode=16
>>>>> cmd_err=0 cmd_resp=0x20000900 0x0 0x0 0x0 cmd_retries=5 stop_opcode=0
>>>>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>>>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>>>> need_retune=0 hold_retune=1 retune_period=0
>>>>>             bash-3958  [001] .... 34861.451064: mmc_request_start:
>>>>> mmc0: start struct mmc_request[00000000c2f30356]: cmd_opcode=25
>>>>> cmd_arg=0x3a0000 cmd_flags=0x35 cmd_retries=0 stop_opcode=12
>>>>> stop_arg=0x0 stop_flags=0x1d stop_retries=0 sbc_opcode=0 sbc_arg=0x0
>>>>> sbc_flags=0x0 sbc_retires=0 blocks=128 block_size=512 blk_addr=0
>>>>> data_flags=0x100 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>>>> need_retune=0 hold_retune=1 retune_period=0
>>>>>           <idle>-0     [000] ..s1 34861.456533: mmc_request_done:
>>>>> mmc0: end struct mmc_request[00000000c2f30356]: cmd_opcode=25
>>>>> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=12
>>>>> stop_err=0 stop_resp=0xc00 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=65536
>>>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>>>> need_retune=0 hold_retune=1 retune_period=0
>>>>>             bash-3958  [001] .... 34861.456562: mmc_request_start:
>>>>> mmc0: start struct mmc_request[00000000e93d6161]: cmd_opcode=13
>>>>> cmd_arg=0x10000 cmd_flags=0x15 cmd_retries=0 stop_opcode=0
>>>>> stop_arg=0x0 stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0
>>>>> sbc_flags=0x0 sbc_retires=0 blocks=0 block_size=0 blk_addr=0
>>>>> data_flags=0x0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>>>> need_retune=0 hold_retune=1 retune_period=0
>>>>>    rs:main Q:Reg-2972  [000] ..s. 34861.456688: mmc_request_done:
>>>>> mmc0: end struct mmc_request[00000000e93d6161]: cmd_opcode=13
>>>>> cmd_err=0 cmd_resp=0x900 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
>>>>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>>>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0
>>>>> need_retune=0 hold_retune=1 retune_period=0
>>>>>
>>>>> What we believe could be happening is an SEU that flips a bit in the
>>>>> data block after the host controller has sent the block on the DAT
>>>>> lines. Card should reject that after CRC mismatch. Does the card
>>>>> reject the command right when it sees a CRC mismatch or does it have
>>>>> to wait till all blocks and CMD12 is sent by the host?
>>>>
>>>> Open writes with CMD25-CMD12 indeed send all blocks until the card
>>>> can signal CRC error.
>>>>
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 16, 2025 at 2:13 AM Christian Loehle
>>>>> <christian.loehle@arm.com> wrote:
>>>>>>
>>>>>> On 4/16/25 01:13, Prasad Koya wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> We use eMMC as a boot drive. At a random time, maybe a month or an
>>>>>>> year of uptime, we run into an issue where we see the eMMC driver
>>>>>>> prints this in the kernel logs and the drive no longer responds after
>>>>>>> that. We see this issue in the field on linux kernel 4.19.142 and
>>>>>>> 5.10.165. From the SDHCI register dump, it looks like CMD25 ran into a
>>>>>>> timeout first. After the first timeout, we see CMD12 in the second
>>>>>>> SDHCI register dump. Status returned by the card at that point in
>>>>>>> card_busy_detect() is 0xE00.
>>>>>>>
>>>>>>> I looked at the recent commits up to 6.15 and I do not see anything
>>>>>>> obvious that addresses issue like below.
>>>>>>>
>>>>>>> We are trying to reproduce the issue in our lab and understand the
>>>>>>> driver code. Is there a way I can send a block of data as part of
>>>>>>> CMD25 with invalid CRC?  Appreciate any pointers.
>>>>>>
>>>>>> CRC is appended by hardware, so it will always be correct.
>>>>>> What you can do though is change the blocksize of the transfer on the
>>>>>> host side (e.g. from 512 to 514 bytes) without changing it using CMD16.
>>>>>> The 2 bytes will be latched in as CRC by the card. Of course you won't
>>>>>> see the CRC ACK from the card in that case (as it clashes with the
>>>>>> hardware CRC signal from the host).
>>>>>> You can do all this using the ioctl interface in userspace btw.
>>>>
>>


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-04-29 18:58 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-16  0:13 eMMC timeout reproduction. Can I send a block of data from the host to the card with invalid CRC? Prasad Koya
2025-04-16  9:13 ` Christian Loehle
2025-04-22 17:37   ` Prasad Koya
2025-04-22 21:53     ` Christian Loehle
2025-04-25  4:59       ` Prasad Koya
2025-04-25  8:38         ` Christian Loehle
2025-04-29 18:53           ` Prasad Koya
2025-04-29 18:58             ` Christian Loehle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox