* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption [not found] ` <26d6d164-5acd-4f85-a7ac-d01f44fb5a87@kernel.org> @ 2025-06-03 14:41 ` Christoph Hellwig 2025-06-03 14:57 ` James Bottomley 0 siblings, 1 reply; 3+ messages in thread From: Christoph Hellwig @ 2025-06-03 14:41 UTC (permalink / raw) To: Damien Le Moal Cc: Yafang Shao, Christoph Hellwig, Matthew Wilcox, Christian Brauner, djwong, cem, linux-xfs, Linux-Fsdevel, Damien Le Moal, Sathya Prakash, Sreekanth Reddy, Suganath Prabu Subramani, Martin K. Petersen, MPT-FusionLinux.pdl, linux-scsi [taking this private to discuss the mpt drivers] > Hmmm... DID_SOFT_ERROR... Normally, this is an immediate retry as this normally > is used to indicate that a command is a collateral abort due to an NCQ error, > and per ATA spec, that command should be retried. However, the *BAD* thing > about Broadcom HBAs using this is that it increments the command retry counter, > so if a command ends up being retried more than 5 times due to other commands > failing, the command runs out of retries and is failed like this. The command > retry counter should *not* be incremented for NCQ collateral aborts. I tried to > fix this, but it is impossible as we actually do not know if this is a > collateral abort or something else. The HBA events used to handle completion do > not allow differentiation. Waiting on Broadcom to do something about this (the > mpi3mr HBA driver has the same nasty issue). Maybe we should just change the mpt3 sas/mr drivers to use DID_SOFT_ERROR less? In fact there's not really a whole lot of DID_SOFT_ERROR users otherwise, and there's probably better status codes whatever they are doing can be translated to that do not increment the retry counter. ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption 2025-06-03 14:41 ` [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption Christoph Hellwig @ 2025-06-03 14:57 ` James Bottomley 2025-06-04 7:29 ` Damien Le Moal 0 siblings, 1 reply; 3+ messages in thread From: James Bottomley @ 2025-06-03 14:57 UTC (permalink / raw) To: Christoph Hellwig, Damien Le Moal Cc: Yafang Shao, Matthew Wilcox, Christian Brauner, djwong, cem, linux-xfs, Linux-Fsdevel, Damien Le Moal, Sathya Prakash, Sreekanth Reddy, Suganath Prabu Subramani, Martin K. Petersen, MPT-FusionLinux.pdl, linux-scsi On Tue, 2025-06-03 at 07:41 -0700, Christoph Hellwig wrote: > [taking this private to discuss the mpt drivers] > > > Hmmm... DID_SOFT_ERROR... Normally, this is an immediate retry as > > this normally is used to indicate that a command is a collateral > > abort due to an NCQ error, and per ATA spec, that command should be > > retried. However, the *BAD* thing about Broadcom HBAs using this is > > that it increments the command retry counter, so if a command ends > > up being retried more than 5 times due to other commands failing, > > the command runs out of retries and is failed like this. The > > command retry counter should *not* be incremented for NCQ > > collateral aborts. I tried to fix this, but it is impossible as we > > actually do not know if this is a collateral abort or something > > else. The HBA events used to handle completion do not allow > > differentiation. Waiting on Broadcom to do something about this > > (the mpi3mr HBA driver has the same nasty issue). > > Maybe we should just change the mpt3 sas/mr drivers to use > DID_SOFT_ERROR less? In fact there's not really a whole lot of > DID_SOFT_ERROR users otherwise, and there's probably better status > codes whatever they are doing can be translated to that do not > increment the retry counter. The status code that does that (retry without incrementing the counter) is DID_IMM_RETRY. The driver has to be a bit careful about using this because we can get into infinite retry loops. Regards, James ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption 2025-06-03 14:57 ` James Bottomley @ 2025-06-04 7:29 ` Damien Le Moal 0 siblings, 0 replies; 3+ messages in thread From: Damien Le Moal @ 2025-06-04 7:29 UTC (permalink / raw) To: James Bottomley, Christoph Hellwig Cc: Yafang Shao, Matthew Wilcox, Christian Brauner, djwong, cem, linux-xfs, Linux-Fsdevel, Damien Le Moal, Sathya Prakash, Sreekanth Reddy, Suganath Prabu Subramani, Martin K. Petersen, MPT-FusionLinux.pdl, linux-scsi On 6/3/25 11:57 PM, James Bottomley wrote: > On Tue, 2025-06-03 at 07:41 -0700, Christoph Hellwig wrote: >> [taking this private to discuss the mpt drivers] >> >>> Hmmm... DID_SOFT_ERROR... Normally, this is an immediate retry as >>> this normally is used to indicate that a command is a collateral >>> abort due to an NCQ error, and per ATA spec, that command should be >>> retried. However, the *BAD* thing about Broadcom HBAs using this is >>> that it increments the command retry counter, so if a command ends >>> up being retried more than 5 times due to other commands failing, >>> the command runs out of retries and is failed like this. The >>> command retry counter should *not* be incremented for NCQ >>> collateral aborts. I tried to fix this, but it is impossible as we >>> actually do not know if this is a collateral abort or something >>> else. The HBA events used to handle completion do not allow >>> differentiation. Waiting on Broadcom to do something about this >>> (the mpi3mr HBA driver has the same nasty issue). >> >> Maybe we should just change the mpt3 sas/mr drivers to use >> DID_SOFT_ERROR less? In fact there's not really a whole lot of >> DID_SOFT_ERROR users otherwise, and there's probably better status >> codes whatever they are doing can be translated to that do not >> increment the retry counter. > > The status code that does that (retry without incrementing the counter) > is DID_IMM_RETRY. The driver has to be a bit careful about using this > because we can get into infinite retry loops. James, Thank you for the information. Will have a try again at changing the driver to use this. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2025-06-04 7:31 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <aD03HeZWLJihqikU@infradead.org>
[not found] ` <CALOAHbDxgvY7Aozf8H9H2OBedcU1efYBQiEvxMg6pj1+arPETQ@mail.gmail.com>
[not found] ` <aD5obj2G58bRMFlB@casper.infradead.org>
[not found] ` <CALOAHbCWra+DskmcWUWJOenTg9EJQfS23Hi-rB1GLYmcRUKf4A@mail.gmail.com>
[not found] ` <aD5ratf3NF_DUnL-@casper.infradead.org>
[not found] ` <CALOAHbB_p=rxT2-7bWudKLUgbD7AvNoBsge90VDgQFpakfTbCQ@mail.gmail.com>
[not found] ` <aD58p4OpY0QhKl3i@infradead.org>
[not found] ` <e2b4db3d-a282-4c96-b333-8d4698e5a705@kernel.org>
[not found] ` <CALOAHbA_ttJmOejYJ+rrRdzKav_BPtwxuKwCSAf2dwLZJ1UyZQ@mail.gmail.com>
[not found] ` <26d6d164-5acd-4f85-a7ac-d01f44fb5a87@kernel.org>
2025-06-03 14:41 ` [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption Christoph Hellwig
2025-06-03 14:57 ` James Bottomley
2025-06-04 7:29 ` Damien Le Moal
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox