IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not re-do failed op?

linux-kernel.vger.kernel.org archive mirror

* IDE DMA errors, massive disk corruption:  Why?  Fixed Yet?  Why not  re-do failed op?
@ 2003-10-06 18:42 Daniel B.
  2003-10-06 19:11 ` Bartlomiej Zolnierkiewicz
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel B. @ 2003-10-06 18:42 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org

I just got bitten _again_ by IDE DMA timeout errors and massive 
filesystem corruption in kernel 2.4.22 (on an Asus A7M266-D dual-Athlon 
XP motherboard (AMD 768 chip / amd7441 IDE controller)).

(I had turned DMA off in my init scripts, but apparently Debian 
unstable's k7-smp configuration enables DMA by default before my init
scripts get control.  Ext3 journal "recovery" trashed my system 
partition.)

What's going on with the IDE DMA bugs?  They have existed since 2.2 
(right?), and even at .22 in the 2.4 series they still exist.  Why
have they been around so long?  Is it that few kernel developers use
the combinations of hardware or configuration options that expose
the bugs (like my dual-CPU box with IDE, not SCSI, disks)?

Are the DMA bugs believed to be fixed (for real) yet?  If so, in which 
version?

Is there any consolidated documentation of the combinations of factors
that cause corruption, or of how to reliably avoid corruption (like
all the things to check to make sure your kernel never even tries to 
enable DMA)?

Also, why does a DMA timeout cause such corruption?  Doesn't the kernel 
keep track of uncompleted operations, retain the information needed to
try again, and try again if there's a failure?  If not, why not?

If it can't try again, shouldn't the kernel at least abort after one 
disk-write failure instead of performing additional writes, which
frequently depend on the previous writes?  (E.g., if I try to read 
block 1's data and write it to block 2, and then write something new 
to block 1, if the first write fails but I continue and do the second
write, data gets destroyed.  If the first write fails and I stop right 
away, less is destroyed.)
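
A minimal sketch of the block 1 / block 2 example above, assuming a toy
model in which a disk is a dict of block number to contents and a failed
write is simply lost.  The function name and parameters are hypothetical
and do not correspond to any kernel interface.

```python
def apply_writes(disk, writes, failing_indexes, stop_on_error):
    """Apply (block, data) writes in order to a copy of `disk`.

    Writes whose position is in `failing_indexes` are lost.  With
    stop_on_error=True, nothing after the first failure is written.
    """
    disk = dict(disk)
    for index, (block, data) in enumerate(writes):
        if index in failing_indexes:
            if stop_on_error:
                break      # abort: later writes may depend on this one
            continue       # keep going: the failed write is silently lost
        disk[block] = data
    return disk


start = {1: "old", 2: "empty"}
# Copy block 1's data to block 2, then write something new to block 1.
writes = [(2, start[1]), (1, "new")]

keep_going = apply_writes(start, writes, {0}, stop_on_error=False)
stop_early = apply_writes(start, writes, {0}, stop_on_error=True)

print("old data survives when continuing:", "old" in keep_going.values())
print("old data survives when stopping:  ", "old" in stop_early.values())
```

In this model the only copy of the old data is overwritten when the
second write proceeds after the first one failed; stopping at the first
failure leaves it in block 1.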

Daniel
-- 
Daniel Barclay
dsb@smart.net


* Re: IDE DMA errors, massive disk corruption:  Why?  Fixed Yet?  Why not  re-do failed op?
  2003-10-06 18:42 IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not re-do failed op? Daniel B.
@ 2003-10-06 19:11 ` Bartlomiej Zolnierkiewicz
  0 siblings, 0 replies; 7+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2003-10-06 19:11 UTC (permalink / raw)
  To: Daniel B.; +Cc: linux-kernel@vger.kernel.org


There are different IDE DMA errors.
Please post the error message, dmesg and .config.

On Monday 06 of October 2003 20:42, Daniel B. wrote:
> I just got bitten _again_ by IDE DMA timeout errors and massive
> filesystem corruption in kernel 2.4.22 (on an Asus A7M266-D dual-Athlon
> XP motherboard (AMD 768 chip / amd7441 IDE controller)).
>
> (I had turned DMA off in my init scripts, but apparently Debian
> unstable's k7-smp configuration enables DMA by default before my init
> scripts get control.  Ext3 journal "recovery" trashed my system
> partition.)
>
> What's going on with the IDE DMA bugs?  They have existed since 2.2
> (right?), and even at .22 in the 2.4 series they still exist.  Why
> have they been around so long?  Is it that few kernel developers use
> the combinations of hardware or configuration options that expose
> the bugs (like my dual-CPU box with IDE, not SCSI, disks)?

Well, yes, I have no problems for example :-).

> Are the DMA bugs believed to be fixed (for real) yet?  If so, in which
> version?
>
> Is there any consolidated documentation of the combinations of factors
> that cause corruption, or of how to reliably avoid corruption (like
> all the things to check to make sure your kernel never even tries to
> enable DMA)?
>
>
> Also, why does a DMA timeout cause such corruption?  Doesn't the kernel
> keep track of uncompleted operations, retain the information needed to
> try again, and try again if there's a failure?  If not, why not?
>
> If it can't try again, shouldn't the kernel at least abort after one
> disk-write failure instead of performing additional writes, which
> frequently depend on the previous writes?  (E.g., if I try to read
> block 1's data and write it to block 2, and then write something new
> to block 1, if the first write fails but I continue and do the second
> write, data gets destroyed.  If the first write fails and I stop right
> away, less is destroyed.)

Are you sure you don't have a faulty drive?

--bartlomiej



* RE: IDE DMA errors, massive disk corruption:  Why?  Fixed Yet?  Why not  re-do failed op?
@ 2003-10-06 19:32 Mudama, Eric
  2003-10-06 20:20 ` IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why " Daniel B.
  0 siblings, 1 reply; 7+ messages in thread
From: Mudama, Eric @ 2003-10-06 19:32 UTC (permalink / raw)
  To: 'Daniel B.', linux-kernel

> -----Original Message-----
> From: Daniel B. [mailto:dsb@smart.net]
> Sent: Monday, October 06, 2003 12:42 PM
> To: linux-kernel@vger.kernel.org
> Subject: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why
> not re-do failed op?
> 
> Doesn't the kernel keep track of uncompleted operations,
> retain the information needed to try again, and try again
> if there's a failure?  If not, why not?

If the disk has write cache enabled, this isn't necessarily possible, since
there's nothing in the IDE specification that guarantees the order of writes
to the media without a FLUSH CACHE (EXT) command.

Hypothetically, if you were doing full-pack random writes continuously with
no idle time and no FLUSH CACHE, you can have writes that are days old still
in the drive's buffer and still un-attempted.  A write with write-cache
enabled reports ending status at the completion of the transfer.  There is
no mechanism to tell the host that a cached write failed, other than giving
an error on the next command.
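
The behaviour described above can be illustrated with a toy model.  This
is only a sketch under stated assumptions: the class, its methods and the
error string are hypothetical, and real drives implement this in
vendor-specific firmware.  It shows a write being acknowledged once the
data is in the drive's buffer, and a later media failure surfacing only
as an error on the next command.

```python
class CachedDrive:
    """Toy drive with write cache enabled (hypothetical model)."""

    def __init__(self, bad_blocks=()):
        self.cache = {}             # accepted but not yet on the media
        self.media = {}             # actually written to the platters
        self.bad_blocks = set(bad_blocks)
        self.pending_error = None   # failure not yet reported to the host

    def write(self, block, data):
        # A deferred failure can only be reported on the next command.
        if self.pending_error is not None:
            error, self.pending_error = self.pending_error, None
            return error
        # Ending status is reported at the completion of the transfer,
        # before anything has reached the media.
        self.cache[block] = data
        return "ok"

    def background_destage(self):
        # The drive writes cached blocks out whenever it chooses to.
        for block, data in list(self.cache.items()):
            del self.cache[block]
            if block in self.bad_blocks:
                self.pending_error = f"deferred error: block {block}"
            else:
                self.media[block] = data


drive = CachedDrive(bad_blocks={7})
first = drive.write(7, "a")    # acknowledged immediately
drive.background_destage()     # media write of block 7 fails here
second = drive.write(8, "b")   # an unrelated command receives the error
print(first, "/", second)
```

By the time the host sees the error, the command it is attached to is not
the one that failed, which is why the host cannot simply re-issue "the
failed write" in this model.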

Obviously, drive companies have techniques to prevent this (data staying in
buffer for too long) from happening, but they are all vendor specific and
not part of the specification.

The flip side of this, running your drive with write cache off, is rather
destructive to performance in a modern IDE drive... anywhere from 33% as
fast to .1% as fast, depending on the workload.

> If it can't try again, shouldn't the kernel at least abort after one 
> disk-write failure instead of performing additional writes, which
> frequently depend on the previous writes?  (E.g., if I try to read 
> block 1's data and write it to block 2, and then write something new 
> to block 1, if the first write fails but I continue and do the second
> write, data gets destroyed.  If the first write fails and I stop right
> away, less is destroyed.)

If a modern IDE disk gets a fatal write, it is toast.  The lengths drives go
through attempting to reassign to a new location are rather heroic IMO.

Any drive that gets a "real" fatal write (0x71 status for example) as
opposed to a timeout needs to be RMA'd back to the vendor.  Some drives will
work in a read-only mode if they get power cycled, but it isn't always
guaranteed.  If you can get your data off, do so immediately, and replace
the drive.

--eric


* Re: IDE DMA errors, massive disk corruption:  Why?  Fixed Yet?  Why not   re-do failed op?
  2003-10-06 19:32 IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why " Mudama, Eric
@ 2003-10-06 20:20 ` Daniel B.
  2003-10-06 20:45   ` Valdis.Kletnieks
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel B. @ 2003-10-06 20:20 UTC (permalink / raw)
  Cc: linux-kernel

"Mudama, Eric" wrote:
... 
> > Doesn't the kernel keep track of uncompleted operations,
> > retain the information needed to try again, and try again
> > if there's a failure?  If not, why not?
> 
> If the disk has write cache enabled, this isn't necessarily possible, since
> there's nothing in the IDE specification that guarantees the order of writes
> to the media without a FLUSH CACHE (EXT) command.

Are you sure?  If you issue a write to block 1 and then issue another
write to block 1, it would have to guarantee the relative order of those 
writes (or equivalent optimization in the write cache), wouldn't it?

> Hypothetically, if you were doing full-pack random writes continuously with
> no idle time and no FLUSH CACHE, you can have writes that are days old still
> in the drive's buffer and still un-attempted.  A write with write-cache
> enabled reports ending status at the completion of the transfer.  There is
> no mechanism to tell the host that a cached write failed, other than giving
> an error on the next command.

But we're not talking about errors IN the disk drive after the
communication between the kernel and drive is already done.  We're talking
about errors in the communication BETWEEN the kernel and the drive (lost
DMA interrupts), aren't we?

If the kernel issues a write command to the drive, and never gets a 
response (DMA-complete interrupt?) from the drive that it has accepted 
the command, why can't the kernel repeat the write command?

Daniel
-- 
Daniel Barclay
dsb@smart.net


* Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not re-do failed op?
  2003-10-06 20:20 ` IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why " Daniel B.
@ 2003-10-06 20:45   ` Valdis.Kletnieks
  2003-10-06 21:07     ` Daniel B.
  0 siblings, 1 reply; 7+ messages in thread
From: Valdis.Kletnieks @ 2003-10-06 20:45 UTC (permalink / raw)
  To: Daniel B.; +Cc: linux-kernel


On Mon, 06 Oct 2003 16:20:42 EDT, "Daniel B." said:

> Are you sure?  If you issue a write to block 1 and then issue another
> write to block 1, it would have to guarantee the relative order of those 
> writes (or equivalent optimization in the write cache), wouldn't it?

If the old 'block 1' data is still in the write cache, then another write
should overlay it - that's a very basic optimization.  Consider the case of a
very active block that has a popular inode that's being atime-updated a lot (or
whatever causes a lot of activity - ignore the in-memory cache and sync/fsync
for the moment). You really don't want 34 writes to the same block taking up 34
blocks of space in the write cache....

The ordering issue comes when the following type of thing happens:

1) a write for block 993 is issued (metadata, perhaps)
2) a write for block 10934 is issued - actual file contents or something that
depends on 993 being written.
3) Disk writes 10934 out.
4) Things go bad  (power hit, whatever) before 993 gets written out.
5) fsck. ;)
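
Both points above (a second write to the same block overlaying the
first, and the ordering hazard in steps 1 to 4) can be sketched with a
toy model.  The function and its parameters are hypothetical; the order
in which a real drive destages its cache is vendor specific.

```python
def destage_until_power_loss(cached_writes, drive_order, written_before_loss):
    """Return (cache, media) after the drive wrote only part of its cache.

    cached_writes        -- (block, data) pairs in the order the host issued them
    drive_order          -- order in which the drive chooses to write blocks out
    written_before_loss  -- how many blocks reach the media before power is lost
    """
    cache = {}
    for block, data in cached_writes:
        cache[block] = data    # a later write to the same block overlays the old one
    media = {}
    for block in drive_order[:written_before_loss]:
        media[block] = cache[block]
    return cache, media


writes = [
    (993, "metadata v1"),
    (993, "metadata v2"),      # overlays v1; still one cache slot for block 993
    (10934, "file data"),      # depends on block 993 being written
]
cache, media = destage_until_power_loss(
    writes, drive_order=[10934, 993], written_before_loss=1
)
print("cache slots used:", len(cache))
print("on the media after power loss:", media)
```

Here block 10934 reaches the media while block 993 does not, which is the
inconsistent state that fsck then has to deal with.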



* Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not  re-do failed op?
  2003-10-06 20:45   ` Valdis.Kletnieks
@ 2003-10-06 21:07     ` Daniel B.
  2003-10-06 21:26       ` Jeff Garzik
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel B. @ 2003-10-06 21:07 UTC (permalink / raw)
  Cc: linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> 
> ...
> 
> The ordering issue comes when the following type of thing happens:
> 
> 1) a write for block 993 is issued (metadata, perhaps)
> 2) a write for block 10934 is issued - actual file contents or something that
> depends on 993 being written.
> 3) Disk writes 10934 out.
> 4) Things go bad  (power hit, whatever) before 993 gets written out.
> 5) fsck. ;)

Is that scenario relevant to DMA errors?  

I'm talking about problems in steps 1 and 2, not in later steps.

If the kernel starts a write command for block 993, wouldn't it wait
for a DMA interrupt signalling that the drive has received and accepted
the command before the kernel starts the write command for block 10934?

If it timed out waiting for that interrupt, can't it re-issue the
write for block 993 before proceeding?

Daniel
-- 
Daniel Barclay
dsb@smart.net


* Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not  re-do failed op?
  2003-10-06 21:07     ` Daniel B.
@ 2003-10-06 21:26       ` Jeff Garzik
  2003-10-07  5:24         ` IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not " Daniel B.
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Garzik @ 2003-10-06 21:26 UTC (permalink / raw)
  To: Daniel B.; +Cc: linux-kernel

Daniel B. wrote:
> If the kernel starts a write command for block 993, wouldn't it wait
> for a DMA interrupt signalling that the drive has received and accepted
> the command before the kernel starts the write command for block 10934?

With command queueing, no, it would not wait.


> If it timed out waiting for that interrupt, can't it re-issue the
> write for block 993 before proceeding?

Assuming a large amount of sanity in your OS driver... certainly.

	Jeff





* Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not   re-do failed op?
  2003-10-06 21:26       ` Jeff Garzik
@ 2003-10-07  5:24         ` Daniel B.
  2003-10-07  6:03           ` Valdis.Kletnieks
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel B. @ 2003-10-07  5:24 UTC (permalink / raw)
  Cc: linux-kernel

Jeff Garzik wrote:
> 
> Daniel B. wrote:
> > If the kernel starts a write command for block 993, wouldn't it wait
> > for a DMA interrupt signalling that the drive has received and accepted
> > the command before the kernel starts the write command for block 10934?
> 
> With command queueing, no, it would not wait.

Other than the write-back caching, it's not an open-loop system, 
right?  Regardless of how commands are batched or queued, isn't there 
some acknowledgment back from the drive that some batch of commands
(or some command, or some part of some command) was completed?

Surely the kernel checks for such acknowledgments, right? 

DMA-complete interrupts are probably how some of those acknowledgments 
are communicated, right?

So if the kernel doesn't get an expected DMA interrupt, it should
know that some command(/batch/part) wasn't acknowledged successfully,
right?  And surely it can tell _which_ command/batch/part wasn't
acknowledged (if multiple ones can be outstanding), right?

So if some command/batch/etc. wasn't acknowledged, why can't the 
kernel retry the command/batch/etc.?
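
The retry being asked about can be sketched as follows.  This is an
illustration only, not how the Linux IDE driver works: the exception
class, function names and the "flaky transport" are all hypothetical.
It assumes one outstanding command at a time and that re-issuing a write
of the same data to the same block is harmless.

```python
class CommandTimeout(Exception):
    """No acknowledgment (e.g. DMA-complete interrupt) arrived in time."""


def write_with_retry(issue_write, block, data, max_attempts=3):
    """Issue a write, re-issuing it on timeout; return the attempts used.

    Raises CommandTimeout if no attempt is acknowledged, so the caller
    can stop sending further writes that depend on this one.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            issue_write(block, data)
            return attempt
        except CommandTimeout:
            continue
    raise CommandTimeout(f"block {block}: no acknowledgment after {max_attempts} attempts")


def make_flaky_transport(lost_acks):
    """Hypothetical transport that loses the first `lost_acks` acknowledgments."""
    state = {"remaining": lost_acks}
    media = {}

    def issue_write(block, data):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise CommandTimeout(f"block {block}: timed out")
        media[block] = data

    return issue_write, media


issue_write, media = make_flaky_transport(lost_acks=1)
attempts = write_with_retry(issue_write, 993, "metadata")
print("attempts:", attempts, "media:", media)
```

If every attempt times out, the exception propagates and the write for
block 10934 is never issued, which corresponds to the "stop right away"
behaviour described earlier in the thread.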

> > If it timed out waiting for that interrupt, can't it re-issue the
> > write for block 993 before proceeding?
> 
> Assuming a large amount of sanity in your OS driver... certainly.

Given the seriousness of disk data corruption, why isn't the Linux kernel
more reliable here?  Hasn't this family of IDE problems been around
for a couple of years now?

Daniel
-- 
Daniel Barclay
dsb@smart.net


* Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not re-do failed op?
  2003-10-07  5:24         ` IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not " Daniel B.
@ 2003-10-07  6:03           ` Valdis.Kletnieks
  2003-10-07 13:32             ` IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not " Daniel B.
  0 siblings, 1 reply; 7+ messages in thread
From: Valdis.Kletnieks @ 2003-10-07  6:03 UTC (permalink / raw)
  To: Daniel B.; +Cc: linux-kernel


On Tue, 07 Oct 2003 01:24:19 EDT, "Daniel B." said:

> So if some command/batch/etc. wasn't acknowledged, why can't the 
> kernel retry the command/batch/etc.?

The problem is that the disk ack'ed the command when the block went into the
write cache.  You *DON'T* in general get back another ack when the block
actually hits the platters.

> Given the seriousness of disk data corruption, why isn't the Linux kernel
> more reliable here?  Hasn't this family of IDE problems been around
> for a couple of years now?

It's hard for the kernel to be more reliable unless you just disable the write cache.

The biggest reason we don't see more issues like this is that the average MTBF
really is up in the 100K hours and up range, and most drives probably get
around to actually writing all the blocks out every minute or so - so you're
looking at literally a 1 in a million shot at corruption.  Most of the time,
it's writing back in-order enough that no badness happens - and with the rise
of journaled file systems like ext3 and jfs and reiserfs, the chance of
actually getting bit by it drops even more (you'd have to hit a case where the
blocks were re-ordered *and* the corresponding journal blocks didn't get
written either).
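
The rough arithmetic behind that estimate can be written out.  This is a
back-of-the-envelope sketch only: it takes the 100K-hour MTBF and the
one-minute write-back window from the paragraph above and assumes a
constant failure rate, which is a simplification.

```python
mtbf_hours = 100_000    # MTBF figure quoted above
window_minutes = 1      # assumed time for the drive to write its cache out

# Number of one-minute windows in one MTBF interval.
windows_per_mtbf = mtbf_hours * 60 / window_minutes

# Approximate chance that a failure lands inside any one given window.
chance_per_window = 1 / windows_per_mtbf

print(f"windows per MTBF interval: {windows_per_mtbf:,.0f}")
print(f"chance per window: about 1 in {windows_per_mtbf:,.0f}")
```

Under these assumptions the figure is on the order of one in several
million per write-back window, consistent with the "1 in a million"
order of magnitude given above.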

Yes, this family of problems has been around ever since write caches were
introduced. It's just taken until now that we've got file system code that's
rock solid enough that the write cache is a major reliability issue - for the
longest time, one kernel bug or another has been more of a concern.
See the IDE corruption in early 2.5 kernels that scared a LOT of people
away - I believe that one was done all by the kernel, without any help
from the disk's write cache. ;)



* Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not  re-do failed op?
  2003-10-07  6:03           ` Valdis.Kletnieks
@ 2003-10-07 13:32             ` Daniel B.
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel B. @ 2003-10-07 13:32 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> 
> On Tue, 07 Oct 2003 01:24:19 EDT, "Daniel B." said:
> 
> > So if some command/batch/etc. wasn't acknowledged, why can't the
> > kernel retry the command/batch/etc.?
> 
> The problem is that the disk ack'ed the command when the block went into the
> write cache.  

That's the acknowledgment I'm talking about.

> You *DON'T* in general get back another ack when the block
> actually hits the platters.

I know.  I wasn't talking about any acknowledge after actually writing
the data to the medium.

> > Given the seriousness of disk data corruption, why isn't the Linux kernel
> > more reliable here?  Hasn't this family of IDE problems been around
> > for a couple of years now?
> 
> It's hard for the kernel to be more reliable unless you just disable the write cache.

Again, I'm NOT talking about write-cache problems.  I'm talking about
problems in the communication/handshaking between the kernel and
the drive.

> The biggest reason we don't see more issues like this is that the average MTBF
> really is up in the 100K hours and up range

That reliability figure is for the _drives_.

That figure obviously does not apply to kernel-to-drive communication,
because I've had dozens of DMA-interrupt corruptions in the last two
or so years.

> Yes, this family of problems has been around ever since write caches were
> introduced. 

I'm not talking about problems related to write caches.  I'm talking 
about DMA interrupt problems.  Why do you think I'm talking about
inside-the-black-box write-cache problems?

> It's just taken until now that we've got file system code that's
> rock solid enough 

Rock solid?  Hah!  If file system (and other disk-related) code is so 
solid why did my root partition get screwed so badly it can't boot?  

(Even if it's bad hardware's fault that an interrupt got lost, and 
even if it's unreasonably complicated (or impossible) for the
kernel to retry an unacknowledged command, why didn't the kernel
stop writing to that disk after the first unacknowledged command?)

> that the write cache is a major reliability issue - for the
> longest time, one kernel bug or another has been more of a concern.

It's not "has been"--it is still a problem, in the newest (is .22 
still the newest?) released stable kernel.  

Daniel
-- 
Daniel Barclay
dsb@smart.net

