* How does libata handles an 'ATA_ABORTED' error?
From: Juergen Beisert @ 2011-12-14 8:48 UTC
To: linux-ide@vger.kernel.org
Hi list,
I have a CF card running in true-IDE mode connected to a regular PC. This CF
card does wear leveling of its flash memory internally, like every other CF
card, with one exception: when the CF's firmware detects a broken NAND page
while writing a sector, it moves the remaining (good) data to other
pages. To do this it must discard the already transmitted sector data in
its SRAM, because it needs this SRAM to move the other flash memory
data.
After the move, the firmware signals 'ATA_ERR' in the status register
and 'ATA_ABORTED' in the error register to force the host to write
the same data again (the next attempt will succeed because the internal
wear leveling is already done).
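To make the register combination described above concrete, here is a minimal sketch that decodes raw status and error register values into bit names. The bit values follow my understanding of the definitions in the kernel's include/linux/ata.h; the helper names `decode` and `is_aborted_command` are hypothetical and exist only for this illustration.

```python
# Bit values as I understand them from include/linux/ata.h (an assumption
# worth double-checking against the kernel tree you actually run).
ATA_STATUS_BITS = {0x80: "BUSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ATA_ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def is_aborted_command(status, error):
    """True if the device reports ERR in status and ABRT in the error register."""
    return bool(status & 0x01) and bool(error & 0x04)


# Example: status 0x51 (DRDY | DSC | ERR) with error 0x04 (ABRT).
print(decode(0x51, ATA_STATUS_BITS), decode(0x04, ATA_ERROR_BITS))
print(is_aborted_command(0x51, 0x04))
```

Note that ABRT alone does not tell the host whether the device wants the write repeated; that meaning is specific to this card's firmware.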
As we see data loss when the systems are running in production, I'm now trying
to find out whether the libata/SCSI layer really repeats the sector write in
this case and does the expected (or required) things. But I'm lost in these
software layers and their error paths.
I found (in Documentation/DocBook/libata.tmpl):
"This is indicated by UNC bit in the ERROR register. ATA
devices reports UNC error only after certain number of
retries cannot recover the data, so there's nothing much
else to do other than notifying upper layer."
which sounds to me as if no retry will happen for write errors. But
the 'ATA_UNC' bit is not used to signal the "wear leveling case" shown above.
As far as I understand, ATA errors are translated to SCSI errors and then
handled in the SCSI layer. But the documentation tells me it is not always
easy to find an adequate SCSI error for an ATA error. So I'm not sure whether,
for the "wear leveling case", the SCSI layer receives a meaningful error message.
Can anybody give me a hint about what really happens when the attached drive
signals an 'ATA_ABORTED'? Does libata/SCSI give up in this case, or will
it repeat the command?
Regards,
Juergen
--
Pengutronix e.K. | Juergen Beisert |
Linux Solutions for Science and Industry | http://www.pengutronix.de/ |
* Re: How does libata handles an 'ATA_ABORTED' error?
From: Robert Hancock @ 2011-12-15 5:51 UTC
To: Juergen Beisert; +Cc: linux-ide@vger.kernel.org
On 12/14/2011 02:48 AM, Juergen Beisert wrote:
> Hi list,
>
> I have a CF card running in true-ide mode connected to regular PC. This CF
> card does wear leveling of its flash memory internally like every other CF
> card. With one exception: When the CF's firmware detects a broken NAND page
> while writing a sector, it moves around the remaining (good) data to other
> pages. To do this job it must discard the already transmitted sector data in
> its SRAM, because it needs this SRAM to move around the other flash memory
> data.
>
> After the movement the firmware signals an 'ATA_ERR' in the status register
> and an 'ATA_ABORTED' in the error register to force the host to repeat to
> write the same data again (next time it will be successfull due to internal
> wear leveling is already done).
>
> As we see data lost when the systems are running in production, I'm now trying
> to find out if the libata/SCSI layer really repeats the sector write for this
> case and does the expected (or required) things. But I'm lost in these
> software layers and their error path.
>
> I found (in Documentation/DocBook/libata.tmpl):
>
> "This is indicated by UNC bit in the ERROR register. ATA
> devices reports UNC error only after certain number of
> retries cannot recover the data, so there's nothing much
> else to do other than notifying upper layer."
>
> which sounds to me as no repeat will happen for write errors, but
> the 'ATA_UNC' bit is not used to signal the "wear leveling case" shown above.
That seems like incorrect behavior by the device; ABRT is normally used
to indicate an invalid or unsupported command. UNC would likely be more
appropriate. But I don't think it ultimately makes a difference in this
case.
>
> As far as I understand the ATA errors are transformed to SCSI errors and then
> handled in the SCSI layer. But the documentation tells me it is not easy to
> always find an adequate SCSI error for an ATA error. So, I'm not sure if for
> the "wear leveling case" the SCSI layer receives a "valuable" error message.
From what I can see the SCSI error that gets returned in this case is
just an "aborted command" error.
>
> Does anybody can give me a hint, what really happens when the attached drive
> signals an 'ATA_ABORTED'? Does the libata/SCSI give up in this case, or will
> it repeat the command?
I don't know that the SCSI or block layers really pay much attention to
the error code in this case - I think it would always attempt some retries.
Certainly any of these errors would result in error messages showing up
in dmesg. Are you seeing any of this?
* Re: How does libata handles an 'ATA_ABORTED' error?
From: Juergen Beisert @ 2011-12-15 11:01 UTC
To: linux-ide; +Cc: Robert Hancock
Hi Robert,
Robert Hancock wrote:
> On 12/14/2011 02:48 AM, Juergen Beisert wrote:
> > I have a CF card running in true-ide mode connected to regular PC. This
> > CF card does wear leveling of its flash memory internally like every
> > other CF card. With one exception: When the CF's firmware detects a
> > broken NAND page while writing a sector, it moves around the remaining
> > (good) data to other pages. To do this job it must discard the already
> > transmitted sector data in its SRAM, because it needs this SRAM to move
> > around the other flash memory data.
> >
> > After the movement the firmware signals an 'ATA_ERR' in the status
> > register and an 'ATA_ABORTED' in the error register to force the host to
> > repeat to write the same data again (next time it will be successfull due
> > to internal wear leveling is already done).
> >
> > As we see data lost when the systems are running in production, I'm now
> > trying to find out if the libata/SCSI layer really repeats the sector
> > write for this case and does the expected (or required) things. But I'm
> > lost in these software layers and their error path.
> >
> > I found (in Documentation/DocBook/libata.tmpl):
> >
> > "This is indicated by UNC bit in the ERROR register. ATA
> > devices reports UNC error only after certain number of
> > retries cannot recover the data, so there's nothing much
> > else to do other than notifying upper layer."
> >
> > which sounds to me as no repeat will happen for write errors, but
> > the 'ATA_UNC' bit is not used to signal the "wear leveling case" shown
> > above.
>
> That seems like incorrect behavior by the device, ABRT is normally used
> to indicate an invalid or unsupported command. UNC would likely be more
> appropriate. But I don't think it ultimately makes a difference in this
> case.
Okay.
> > As far as I understand the ATA errors are transformed to SCSI errors and
> > then handled in the SCSI layer. But the documentation tells me it is not
> > easy to always find an adequate SCSI error for an ATA error. So, I'm not
> > sure if for the "wear leveling case" the SCSI layer receives a "valuable"
> > error message.
>
> From what I can see the SCSI error that gets returned in this case is
> just an "aborted command" error.
>
> > Does anybody can give me a hint, what really happens when the attached
> > drive signals an 'ATA_ABORTED'? Does the libata/SCSI give up in this
> > case, or will it repeat the command?
>
> I don't know that the SCSI or block layers really pay much attention to
> the error code in this case - I think it would always attempt some retries.
As far as I understand, this kind of error matters most for multi-sector
writes. The framework does not know which sectors failed, so the
question is: does it repeat the whole multi-sector sequence, or does it
do something else?
> Certainly any of these errors would result in error messages showing up
> in dmesg. Are you seeing any of this?
Are they enabled by default, or are they more like debug messages? We see
broken filesystems and data loss, but currently no related messages in the
kernel's log. This could mean either that there are no such failures or that
the messages are not enabled.
Regards,
Juergen
--
Pengutronix e.K. | Juergen Beisert |
Linux Solutions for Science and Industry | http://www.pengutronix.de/ |
* Re: How does libata handles an 'ATA_ABORTED' error?
From: Robert Hancock @ 2011-12-15 18:38 UTC
To: Juergen Beisert; +Cc: linux-ide
On Thu, Dec 15, 2011 at 5:01 AM, Juergen Beisert <jbe@pengutronix.de> wrote:
> Hi Robert,
>
> Robert Hancock wrote:
>> On 12/14/2011 02:48 AM, Juergen Beisert wrote:
>> > I have a CF card running in true-ide mode connected to regular PC. This
>> > CF card does wear leveling of its flash memory internally like every
>> > other CF card. With one exception: When the CF's firmware detects a
>> > broken NAND page while writing a sector, it moves around the remaining
>> > (good) data to other pages. To do this job it must discard the already
>> > transmitted sector data in its SRAM, because it needs this SRAM to move
>> > around the other flash memory data.
>> >
>> > After the movement the firmware signals an 'ATA_ERR' in the status
>> > register and an 'ATA_ABORTED' in the error register to force the host to
>> > repeat to write the same data again (next time it will be successfull due
>> > to internal wear leveling is already done).
>> >
>> > As we see data lost when the systems are running in production, I'm now
>> > trying to find out if the libata/SCSI layer really repeats the sector
>> > write for this case and does the expected (or required) things. But I'm
>> > lost in these software layers and their error path.
>> >
>> > I found (in Documentation/DocBook/libata.tmpl):
>> >
>> > "This is indicated by UNC bit in the ERROR register. ATA
>> > devices reports UNC error only after certain number of
>> > retries cannot recover the data, so there's nothing much
>> > else to do other than notifying upper layer."
>> >
>> > which sounds to me as no repeat will happen for write errors, but
>> > the 'ATA_UNC' bit is not used to signal the "wear leveling case" shown
>> > above.
>>
>> That seems like incorrect behavior by the device, ABRT is normally used
>> to indicate an invalid or unsupported command. UNC would likely be more
>> appropriate. But I don't think it ultimately makes a difference in this
>> case.
>
> Okay.
>
>> > As far as I understand the ATA errors are transformed to SCSI errors and
>> > then handled in the SCSI layer. But the documentation tells me it is not
>> > easy to always find an adequate SCSI error for an ATA error. So, I'm not
>> > sure if for the "wear leveling case" the SCSI layer receives a "valuable"
>> > error message.
>>
>> From what I can see the SCSI error that gets returned in this case is
>> just an "aborted command" error.
>>
>> > Does anybody can give me a hint, what really happens when the attached
>> > drive signals an 'ATA_ABORTED'? Does the libata/SCSI give up in this
>> > case, or will it repeat the command?
>>
>> I don't know that the SCSI or block layers really pay much attention to
>> the error code in this case - I think it would always attempt some retries.
>
> As far as I understand the problem of this kind of errors is for the multi
> sector write case. The framework does not know what sectors fails, so the
> question is: does it repeat the whole multi sector sequence or what else it
> does?
The entire request should get retried.
>
>> Certainly any of these errors would result in error messages showing up
>> in dmesg. Are you seeing any of this?
>
> Are they enabled by default? Or more like debug messages? We see broken
> filesystems and data lost, but currently no related messages in the kernel's
> log. This could mean there are no such failures or the messages are not
> enabled.
They should always be enabled. If you don't get any, then presumably
the device is not raising any errors.
* Re: How does libata handles an 'ATA_ABORTED' error?
From: Mark Lord @ 2011-12-16 4:26 UTC
To: Robert Hancock; +Cc: Juergen Beisert, linux-ide
On 11-12-15 01:38 PM, Robert Hancock wrote:
> On Thu, Dec 15, 2011 at 5:01 AM, Juergen Beisert <jbe@pengutronix.de> wrote:
..
>> As far as I understand the problem of this kind of errors is for the multi
>> sector write case. The framework does not know what sectors fails, so the
>> question is: does it repeat the whole multi sector sequence or what else it
>> does?
>
> The entire request should get retried.
I'm not so sure that is correct.
The Linux IDE stack will not retry the completed sectors
(those which were successfully transferred in multiple-sector blocks).
Not sure what libata does here.
Cheers
* Re: How does libata handles an 'ATA_ABORTED' error?
From: Robert Hancock @ 2011-12-17 2:41 UTC
To: Mark Lord; +Cc: Juergen Beisert, linux-ide
On Thu, Dec 15, 2011 at 10:26 PM, Mark Lord <kernel@teksavvy.com> wrote:
> On 11-12-15 01:38 PM, Robert Hancock wrote:
>> On Thu, Dec 15, 2011 at 5:01 AM, Juergen Beisert <jbe@pengutronix.de> wrote:
> ..
>>> As far as I understand the problem of this kind of errors is for the multi
>>> sector write case. The framework does not know what sectors fails, so the
>>> question is: does it repeat the whole multi sector sequence or what else it
>>> does?
>>
>> The entire request should get retried.
>
> I'm not so sure that is correct.
>
> The Linux IDE stack will not retry the completed sectors
> (those which were successfully transfered in multiple-sector blocks).
>
> Not sure what libata does here.
I don't know of any logic in libata that tries to do selective
retries. In many cases we wouldn't know where in the request it failed
in any case. There's no real reason to do this anyway as redoing a bit
of I/O after a device error shouldn't be a big deal.
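To illustrate why retrying the entire request is enough for the card described at the start of this thread, here is a toy model, not kernel code. The class `FlakyCard`, the function `write_with_retry`, and the limit `max_tries=3` are all hypothetical and say nothing about how many retries the SCSI or block layer really performs.

```python
class FlakyCard:
    """Hypothetical device: aborts the first write that hits a bad page,
    relocates its data internally, and accepts the same write afterwards."""

    def __init__(self, bad_lba):
        self.bad_lba = bad_lba
        self.relocated = False
        self.stored = {}

    def write(self, lba, sectors):
        for offset, data in enumerate(sectors):
            if lba + offset == self.bad_lba and not self.relocated:
                self.relocated = True  # internal wear leveling happens here
                return False           # models ERR + ABRT, sector data discarded
            self.stored[lba + offset] = data
        return True


def write_with_retry(dev, lba, sectors, max_tries=3):
    """Reissue the whole request from its first sector until it succeeds."""
    for attempt in range(1, max_tries + 1):
        if dev.write(lba, sectors):
            return attempt
    raise IOError("write failed after %d attempts" % max_tries)


card = FlakyCard(bad_lba=102)
attempts = write_with_retry(card, 100, [b"a", b"b", b"c", b"d"])
print(attempts, sorted(card.stored))
```

Because the request restarts at its first sector, the host does not need to know which sector triggered the abort; the sectors written before the failure are simply written again.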
* Re: How does libata handles an 'ATA_ABORTED' error?
From: Juergen Beisert @ 2012-01-20 15:00 UTC
To: linux-ide; +Cc: Robert Hancock, Mark Lord
Hi,
Robert Hancock wrote:
> On Thu, Dec 15, 2011 at 10:26 PM, Mark Lord <kernel@teksavvy.com> wrote:
> > On 11-12-15 01:38 PM, Robert Hancock wrote:
> >> On Thu, Dec 15, 2011 at 5:01 AM, Juergen Beisert <jbe@pengutronix.de>
> >> wrote:
> >
> > ..
> >
> >>> As far as I understand the problem of this kind of errors is for the
> >>> multi sector write case. The framework does not know what sectors
> >>> fails, so the question is: does it repeat the whole multi sector
> >>> sequence or what else it does?
> >>
> >> The entire request should get retried.
> >
> > I'm not so sure that is correct.
> >
> > The Linux IDE stack will not retry the completed sectors
> > (those which were successfully transfered in multiple-sector blocks).
> >
> > Not sure what libata does here.
>
> I don't know of any logic in libata that tries to do selective
> retries. In many cases we wouldn't know where in the request it failed
> in any case. There's no real reason to do this anyway as redoing a bit
> of I/O after a device error shouldn't be a big deal.
Some info about the real behaviour: I added fault-injection code to libata
to simulate the error the CF card reports when it is in trouble with its
internal flash memory. At least in the PIO mode case, the whole
transfer gets repeated after this error:
[...]
ata_sff_tf_load: feat 0x0 nsect 0x0 lba 0x84 0x8A 0x1 <-- this is the transfer
ata_sff_hsm_move HSM_ST_FIRST <-- state machine starts
FAULT_INJECTION: forcing a failure <-- fault injection happens
ata_sff_hsm_move HSM_ST_ERR <-- state machine enters ERROR detection
ata_sff_tf_read: 'ABORT' flag reported <-- simulated error type
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata1.00: failed command: WRITE SECTOR(S)
ata1.00: cmd 30/00:00:84:8a:01/00:00:00:00:00/e0 tag 0 pio 131072 out
res 58/04:b6:ce:8a:01/00:00:00:00:00/e0 Emask 0x3 (HSM violation)
ata1.00: status: { DRDY DRQ }
ata1.00: error: { ABRT }
ata1: soft resetting link
ata1.00: configured for PIO1
ata1: EH complete
ata_sff_tf_load: feat 0x0 nsect 0x0 lba 0x84 0x8A 0x1 <-- the same transfer again \o/
[...]
Now I'm going to check the DMA case. Currently it looks a bit different to me.
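For readers who want to decode the "cmd" and "res" lines in the log above, here is a small sketch. It assumes the usual libata field layout (command or status, then feature or error, sector count, LBA low/mid/high, the five HOB bytes, and the device register) and an LBA28 command; the helper names `parse_taskfile` and `sector_count` are hypothetical.

```python
SECTOR_SIZE = 512


def parse_taskfile(text):
    """Split 'aa/bb:cc:dd:ee:ff/..:..:..:..:../gg' into named integer fields."""
    first, low, _hob, device = text.split("/")  # HOB bytes unused for LBA28
    second, nsect, lbal, lbam, lbah = (int(part, 16) for part in low.split(":"))
    dev = int(device, 16)
    return {
        "first": int(first, 16),   # command (cmd line) or status (res line)
        "second": second,          # feature (cmd line) or error (res line)
        "nsect": nsect,
        "lba": ((dev & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal,
    }


def sector_count(nsect):
    """In a 28-bit read/write command, a sector count of 0 means 256 sectors."""
    return nsect if nsect else 256


cmd = parse_taskfile("30/00:00:84:8a:01/00:00:00:00:00/e0")
res = parse_taskfile("58/04:b6:ce:8a:01/00:00:00:00:00/e0")

requested = sector_count(cmd["nsect"])
advanced = res["lba"] - cmd["lba"]
print("command 0x%02x, %d sectors = %d bytes at LBA 0x%06x"
      % (cmd["first"], requested, requested * SECTOR_SIZE, cmd["lba"]))
print("status 0x%02x, error 0x%02x, LBA advanced by %d, count register %d"
      % (res["first"], res["second"], advanced, res["nsect"]))
```

The 256 sectors match the "pio 131072 out" in the log. Taken at face value, the result registers say the LBA advanced by 74 sectors with 182 left in the count register, which adds up to 256; since the fault was injected by software, I would not read more than that into those values.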
Regards,
Juergen
--
Pengutronix e.K. | Juergen Beisert |
Linux Solutions for Science and Industry | http://www.pengutronix.de/ |