* Problem with disk
From: David Ronis @ 2006-05-03 20:01 UTC
To: linux-ide

I have a Toshiba Satellite M40 laptop that has a Fujitsu MHU2100AT ATA
disk drive. I've had two instances of major disk corruption and have
brought the laptop back to Toshiba twice. The first time they found a
problem in the power supply, but the second time they said it was fine.
I have Windows installed on another partition and have had no corruption
problems there since the first repair. I've also run SpinRite 6.0 on the
disk, and again it reports no problems.

Here are some symptoms of the problems under Linux: I restore from dump
backups after running mkfs on the Linux partition. During the restore I
get some complaints of read/write errors (fortunately all in nonessential
files). After the restore is completed, the system seems to be fine;
however, after powering down and rebooting, major disk problems are
found, and after running fsck I end up with significant data loss. I've
run with and without journaling turned on, but this doesn't seem to make
a difference.

I notice from hdparm that disk write caching is turned on. Any chance
that there is a problem with the cache not being flushed before powering
down?

I'm running Linux 2.6.15.5 on what is otherwise a Slackware 10.2 install.

David
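For reference, querying and toggling the on-drive write cache looks
roughly like this (the device name /dev/hda is an assumption; substitute
whatever your drive actually is):

	# show the drive's identify data; "Write cache" appears in the
	# enabled-features list when caching is on
	hdparm -I /dev/hda | grep -i 'write cache'
	# disable the on-drive write cache
	hdparm -W 0 /dev/hda
	# re-enable it
	hdparm -W 1 /dev/hda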
* Re: Problem with disk
From: Ric Wheeler @ 2006-05-03 20:08 UTC
To: David.Ronis; +Cc: linux-ide

David Ronis wrote:

> I notice from hdparm that disk write caching is turned on. Any chance
> that there is a problem with the cache not being flushed before powering
> down?
>
> I'm running Linux 2.6.15.5 on what is otherwise a Slackware 10.2
> install.
>
> David

While Linux has support for write barriers (which allow you to run safely
with the write cache enabled), it needs support from the drive. I would
suggest running with the write cache disabled unless you can verify
working barrier support.

The fact that your drive reports IO errors is also worrying - you might
just have a bad drive... You can look at drive health with tools like
smartctl.

Good luck,

ric
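A basic health check with smartmontools might look like this (the device
path is, again, illustrative):

	# dump the SMART attributes and the drive's error log
	smartctl -a /dev/hda
	# run a short (~2 minute) self-test, then read back the result
	smartctl -t short /dev/hda
	smartctl -l selftest /dev/hda

Reallocated, pending, or uncorrectable sector counts that climb over time
are the classic signs of a drive on its way out.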
* Re: Problem with disk
From: Mark Hahn @ 2006-05-05 23:49 UTC
To: David.Ronis; +Cc: linux-ide

>> I notice from hdparm that disk write caching is turned on. Any chance
>> that there is a problem with the cache not being flushed before
>> powering down?

pretty unlikely. linux normally offlines the drive before halting.

> I would suggest running with the write cache disabled unless you can
> verify working barrier support.

this is true, but extremely conservative/paranoid. it makes a lot of
sense if you're handling banking transactions, or if you really see a lot
of abrupt power-offs (yanking the battery). what are the chances of a
drive failing to write its dirty blocks when it is idle and halting?

don't get me wrong: write barriers are A Good Thing. it's just that Linux
survived very nicely for many years before such things were bothered
with.

> The fact that your drive reports IO errors is also worrying - you might
> just have a bad drive... You can look at drive health with tools like
> smartctl.

IO errors trump any concerns about write barriers - there's no need to
even think about barriers or cache settings if the disk is, for instance,
reporting media errors...
* Re: Problem with disk
From: Ric Wheeler @ 2006-05-06 0:51 UTC
To: Mark Hahn; +Cc: David.Ronis, linux-ide

Mark Hahn wrote:

>> I would suggest running with the write cache disabled unless you can
>> verify working barrier support.
>
> this is true, but extremely conservative/paranoid. it makes a lot of
> sense if you're handling banking transactions, or if you really see a
> lot of abrupt power-offs (yanking the battery). what are the chances of
> a drive failing to write its dirty blocks when it is idle and halting?

The write cache in modern drives is multiple megabytes - 8 or 16MB is not
uncommon. The chance that data sitting in the write cache is lost on a
power failure is actually quite high... I agree that most people should
not lose too much sleep over this.

> don't get me wrong: write barriers are A Good Thing. it's just that
> Linux survived very nicely for many years before such things were
> bothered with.
>
> IO errors trump any concerns about write barriers - there's no need to
> even think about barriers or cache settings if the disk is, for
> instance, reporting media errors...

Agreed again ;-)

ric
* Re: Problem with disk
From: Mark Hahn @ 2006-05-06 17:11 UTC
To: Ric Wheeler; +Cc: David.Ronis, linux-ide

>> this is true, but extremely conservative/paranoid. it makes a lot of
>> sense if you're handling banking transactions, or if you really see a
>> lot of abrupt power-offs (yanking the battery). what are the chances
>> of a drive failing to write its dirty blocks when it is idle and
>> halting?
>
> The write cache in modern drives is multiple megabytes - 8 or 16MB is
> not uncommon. The chance that data sitting in the write cache is lost
> on a power failure is actually quite high...

but we're not talking about power failures in the middle of peak
activity. afaict, drives also never dedicate their whole cache to
writeback - they keep plenty available for reads as well. it would also
be rather surprising if the firmware were completely oblivious to
limiting the age of writebacks; after all, always delaying writes until
you run out of cache capacity is _not_ a winning strategy (even ignoring
safety issues).

during a normal shutdown, can you think of some reason the drive would
have LOTS of outstanding writes? that's the real point. depending on
kernel version, linux should be issuing a cache-flush command and a
standby, then eventually calling the bios poweroff. it's very possible
that this is going wrong (rumors of disks that claim to implement, but
ignore, cache-flush; or perhaps ones that stupidly don't flush on
standby; or even a bios poweroff that happens so fast that the disk isn't
done flushing...) but turning off all writeback is overkill (especially
when there's some other obvious sign of distress...)
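Whether a drive even claims to implement the flush command can be read
out of its identify data (the exact wording varies with the hdparm
version, and the device path is an assumption):

	# FLUSH CACHE support shows up in the commands/features list,
	# typically as "Mandatory FLUSH_CACHE" and/or "FLUSH_CACHE_EXT"
	hdparm -I /dev/hda | grep -i flush

A drive that doesn't advertise it is an obvious candidate for running
with the write cache off.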
* Re: Problem with disk
From: Ric Wheeler @ 2006-05-06 18:17 UTC
To: Mark Hahn; +Cc: David.Ronis, linux-ide

Mark Hahn wrote:

> but we're not talking about power failures in the middle of peak
> activity. afaict, drives also never dedicate their whole cache to
> writeback - they keep plenty available for reads as well. it would also
> be rather surprising if the firmware were completely oblivious to
> limiting the age of writebacks; after all, always delaying writes until
> you run out of cache capacity is _not_ a winning strategy (even
> ignoring safety issues).

If you have drives/hardware to test on, you can easily verify (which we
do on a regular basis) that running with barriers through power-fail
testing gets you a solid recovery. Running with the write cache on and no
barriers gets you file system corruption. As I said before, the data you
wrote most recently (or that the file system wrote for you) is exactly
the data you stand to lose on a power loss.

> during a normal shutdown, can you think of some reason the drive would
> have LOTS of outstanding writes? that's the real point. depending on
> kernel version, linux should be issuing a cache-flush command and a
> standby, then eventually calling the bios poweroff. it's very possible
> that this is going wrong (rumors of disks that claim to implement, but
> ignore, cache-flush; or perhaps ones that stupidly don't flush on
> standby; or even a bios poweroff that happens so fast that the disk
> isn't done flushing...) but turning off all writeback is overkill
> (especially when there's some other obvious sign of distress...)

We don't test every make of drive, but the modern drives we do test do
honor the cache flush commands. It is important to note that drive
firmware is like any other bit of code - it can have bugs - so this
support does need to be reverified on each drive (and version of
firmware) before you can trust high-value data ;-)

If there is a hole in the sequence, dropping to standby could be the
source of issues...

ric
* Re: Problem with disk
From: Mark Hahn @ 2006-05-06 18:34 UTC
To: Ric Wheeler; +Cc: David.Ronis, linux-ide

> If you have drives/hardware to test on, you can easily verify (which we
> do on a regular basis) that running with barriers through power-fail
> testing gets you a solid recovery. Running with the write cache on and
> no barriers gets you file system corruption.

in short, "barriers work". never doubted!

> As I said before, the data you wrote most recently (or that the file
> system wrote for you) is exactly the data you stand to lose on a power
> loss.

obviously. so the question is whether the cache still holds dirty
writeback data when the power drops during a normal poweroff. I'd
consider it a bug in the laptop bios to let this happen, but that's not
going to make the affected user happy...

> If there is a hole in the sequence, dropping to standby could be the
> source of issues...

I guess it's a matter of how byzantine the bugs are that you want to
consider. for mass-produced devices, I'm reluctant to assume the disk
vendor has forgotten to _ever_ flush writeback data, for instance. and
don't forget that a bogus drive that entirely forgets writeback may also
not really turn off write caching when you tell it to!

I assume that the disk will indeed do writeback if left idle for a little
while. on machines where this is a real problem, I would start out by
waving relevant chickens like the following to give the best chance of
shutting down cleanly:

	sync
	blockdev --flushbufs
	hdparm -W 0
	sleep 2
	hdparm -y
	sleep 5
	halt -hp

rather than _always_ suffering the penalty of a disabled write cache,
especially on a single slow laptop drive...
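Packaged as a throwaway script, with the device argument each command
actually needs and a comment on what each step is for (the device path
and the sleep durations are guesses, not tuned values):

	#!/bin/sh
	# best-effort clean shutdown for a drive with write caching on
	sync                            # push dirty pages to the block layer
	blockdev --flushbufs /dev/hda   # flush the kernel's buffer cache
	hdparm -W 0 /dev/hda            # disable the write cache, which
	                                # should force a destage to platter
	sleep 2                         # give the drive a moment to settle
	hdparm -y /dev/hda              # STANDBY IMMEDIATE - spin it down
	sleep 5                         # let it finish before power drops
	halt -hp                        # put drives in standby, power off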
* Re: Problem with disk
From: Tejun Heo @ 2006-05-06 22:56 UTC
To: Mark Hahn; +Cc: Ric Wheeler, David.Ronis, linux-ide

Mark Hahn wrote:
[--snip--]
> I assume that the disk will indeed do writeback if left idle for a
> little while. on machines where this is a real problem, I would start
> out by waving relevant chickens like the following to give the best
> chance of shutting down cleanly:
> 	sync
> 	blockdev --flushbufs
> 	hdparm -W 0
> 	sleep 2
> 	hdparm -y
> 	sleep 5
> 	halt -hp
>
> rather than _always_ suffering the penalty of a disabled write cache,
> especially on a single slow laptop drive...

This is slightly OT, as this thread is about a normal power-down, but
disabling the writeback cache has its advantages. When you have a power
fluctuation (e.g. the power source fluctuates, or a new device is
hot-plugged and a crappy PSU can't hold the voltage), the hard disk can
briefly power down while the rest of the system keeps running. If the
disk was under active FS writes, this ends up in inconsistencies between
what the OS thinks the disk has and what the disk actually has.

Unfortunately, this can result in *massive* destruction of the
filesystem. I lost my RAID-1 array earlier this year this way. The FS
code systematically destroyed metadata of the filesystem and, on the
following reboot, fsck did the final blow, I think. I ended up with
100+GB of unorganized data and I had to recover data by grep + bvi.

This is an extreme case, but it shows turning off writeback has its
advantages. After the initial stress & panic attack subsided, I tried to
think about how to prevent such catastrophes, but there doesn't seem to
be a good way. There's no way to tell 1. whether the hard drive actually
lost the writeback cache content and 2. if so, how much it has lost. So,
unless the OS halts the system every time something seems weird with the
disk, turning off the writeback cache seems to be the only solution.

--
tejun
* Re: Problem with disk
From: Ric Wheeler @ 2006-05-07 13:21 UTC
To: Tejun Heo; +Cc: Mark Hahn, David.Ronis, linux-ide, neilb

Tejun Heo wrote:

> Unfortunately, this can result in *massive* destruction of the
> filesystem. I lost my RAID-1 array earlier this year this way. The FS
> code systematically destroyed metadata of the filesystem and, on the
> following reboot, fsck did the final blow, I think. I ended up with
> 100+GB of unorganized data and I had to recover data by grep + bvi.

Were you running with Neil's fixes that make MD devices properly handle
write barrier requests? Until fairly recently (I'm not sure when this was
fixed), MD devices more or less dropped barrier requests.

With properly working barriers, any journaling file system should get you
back to a consistent state after a power drop (although there are many
less common ways that drives can potentially drop data).

> This is an extreme case, but it shows turning off writeback has its
> advantages. [--snip--] So, unless the OS halts the system every time
> something seems weird with the disk, turning off the writeback cache
> seems to be the only solution.

Turning off the writeback cache is definitely the safe and conservative
way to go for mission-critical data unless you can be very certain that
your barriers are properly working on the drive & IO stack. We validate
the cache flush commands with a S-ATA analyzer (making sure that we see
them on sync/transaction commits) and that they take a reasonable amount
of time at the drive...

ric
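For the common journaling filesystems of this era, asking for barriers
is a mount option, and the kernel normally logs a complaint if the drive
or the stack underneath can't honor them (the device and mount point
below are placeholders):

	# ext3: barriers are requested with barrier=1 (off by default)
	mount -o barrier=1 /dev/hda3 /mnt/data
	# if barriers were silently disabled, it usually shows up here
	dmesg | grep -i barrier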
* Re: Problem with disk
From: Tejun Heo @ 2006-05-07 13:41 UTC
To: ric; +Cc: Mark Hahn, David.Ronis, linux-ide, neilb

Ric Wheeler wrote:
[--snip--]
> Were you running with Neil's fixes that make MD devices properly handle
> write barrier requests? Until fairly recently (I'm not sure when this
> was fixed), MD devices more or less dropped barrier requests.
>
> With properly working barriers, any journaling file system should get
> you back to a consistent state after a power drop (although there are
> many less common ways that drives can potentially drop data).

I'm not sure whether the barrier was working or not. Ummm.. Are you
saying that MD is capable of recovering from a data drop *during*
operation? ie. the system didn't go out, just the hard drives. Data is
lost no matter what MD does, and neither MD nor the filesystem has any
way to tell which bits made it to the media and which were lost, whether
barriers are working or not.

To handle such conditions, the device driver should tell the upper layer
that the PHY status has changed (or that something weird happened which
could lead to data loss) and the fs, in return, should perform journal
replay while still online. I'm pretty sure that isn't implemented in the
current kernel.

> Turning off the writeback cache is definitely the safe and conservative
> way to go for mission-critical data unless you can be very certain that
> your barriers are properly working on the drive & IO stack. We validate
> the cache flush commands with a S-ATA analyzer (making sure that we see
> them on sync/transaction commits) and that they take a reasonable
> amount of time at the drive...

One thing I'm curious about is how much performance benefit can be
obtained from write-back caching. With NCQ/TCQ, latency is much less of
an issue, and I don't think scheduling and/or buffering inside the drive
would result in a significant performance increase when so much is done
by the vm and block layer (aside from scheduling of currently queued
commands).

Some Linux elevators try pretty hard not to mix read and write requests,
as they mess up the statistics (a write-back cache absorbs write requests
very fast and then affects the following read requests). So, they
basically try to eliminate the effect of write-back caching.

Well, benchmark time, it seems. :)

--
tejun
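A crude first pass at such a benchmark might look like this (sizes,
paths, and the device name are arbitrary; conv=fsync needs a reasonably
recent GNU dd):

	# write cache off
	hdparm -W 0 /dev/hda
	time dd if=/dev/zero of=/mnt/test.img bs=1M count=512 conv=fsync
	# write cache on
	hdparm -W 1 /dev/hda
	time dd if=/dev/zero of=/mnt/test.img bs=1M count=512 conv=fsync

Small synchronous writes (fsync-heavy workloads like mail spools) are
where the difference tends to be most dramatic.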
* Re: Problem with disk
From: Ric Wheeler @ 2006-05-08 14:33 UTC
To: Tejun Heo; +Cc: Mark Hahn, David.Ronis, linux-ide, neilb

Tejun Heo wrote:
[--snip--]
> One thing I'm curious about is how much performance benefit can be
> obtained from write-back caching. With NCQ/TCQ, latency is much less of
> an issue, and I don't think scheduling and/or buffering inside the
> drive would result in a significant performance increase when so much
> is done by the vm and block layer (aside from scheduling of currently
> queued commands).
>
> Some Linux elevators try pretty hard not to mix read and write
> requests, as they mess up the statistics (a write-back cache absorbs
> write requests very fast and then affects the following read requests).
> So, they basically try to eliminate the effect of write-back caching.
>
> Well, benchmark time, it seems. :)

My own benchmarks showed a clear win for a write-intensive workload with
the write cache + barriers enabled using reiserfs. I think that NCQ/TCQ
wins mostly in the read case.

ric
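The reiserfs configuration Ric describes would presumably be set up along
these lines (device, mount point, and the choice to leave the cache on
are all assumptions; reiserfs takes barrier=flush rather than ext3's
barrier=1):

	# leave the on-drive write cache enabled...
	hdparm -W 1 /dev/hda
	# ...and request barriers from reiserfs at mount time
	mount -t reiserfs -o barrier=flush /dev/hda3 /mnt/bench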
* Re: Problem with disk
From: Tejun Heo @ 2006-05-10 22:21 UTC
To: Ric Wheeler; +Cc: Mark Hahn, David.Ronis, linux-ide, neilb

Ric Wheeler wrote:

> I think that MD will do the right thing if the IO terminates with an
> error condition. If the error is silent (and that can happen during a
> write), then it clearly cannot recover.

The condition I've described results in silent loss of data. Depending on
the type and implementation, the LLDD might be able to detect the
condition (PHY RDY status changed, for SATA), but the event happens after
the affected writes have completed successfully. For example,

1. fs issues writes for block #x, #y and then barrier #b.
2. #x gets written to the write-back cache and completed successfully.
3. a power glitch occurs while #y is in progress. The LLDD detects the
   condition, recovers the drive, and retries #y.
4. #y gets written to the write-back cache and completed successfully.
5. barrier #b gets executed and #y gets written to the media, but #x is
   lost and nobody knows about it.

I'm worried about the problem because, with libata, hotplug is becoming
available to the masses, and when average Joe hot-plugs a new drive into
his machine which has an $8 power supply (really, they sell 300W ATX
power supplies at 8000 KRW, which is about $8), this is going to happen.
I had a pretty decent power supply from a reputable maker, but I still
got hit by the problem.

Maybe the correct approach is to establish a warm-plug protocol: the
kernel provides a way to plug IOs, and a user helper program plugs all
IOs until the new device settles.

Thanks.

--
tejun
* Re: Problem with disk
From: Ric Wheeler @ 2006-05-13 19:31 UTC
To: Tejun Heo; +Cc: Mark Hahn, David.Ronis, linux-ide, neilb

Tejun Heo wrote:

> The condition I've described results in silent loss of data. Depending
> on the type and implementation, the LLDD might be able to detect the
> condition (PHY RDY status changed, for SATA), but the event happens
> after the affected writes have completed successfully. For example,
>
> 1. fs issues writes for block #x, #y and then barrier #b.
> 2. #x gets written to the write-back cache and completed successfully.
> 3. a power glitch occurs while #y is in progress. The LLDD detects the
>    condition, recovers the drive, and retries #y.
> 4. #y gets written to the write-back cache and completed successfully.
> 5. barrier #b gets executed and #y gets written to the media, but #x is
>    lost and nobody knows about it.

The promise that you get from a barrier is pretty simple: after a
successful one, all IOs that were submitted before it are on the platter,
if the barrier works. In your example, if you mean a power glitch as in a
power loss, #x will be lost (and probably lots of other write cache
state), but the application should expect that (or add extra barriers)...

> I'm worried about the problem because, with libata, hotplug is becoming
> available to the masses, and when average Joe hot-plugs a new drive
> into his machine which has an $8 power supply (really, they sell 300W
> ATX power supplies at 8000 KRW, which is about $8), this is going to
> happen. I had a pretty decent power supply from a reputable maker, but
> I still got hit by the problem.

Not sure that I understand exactly how a glitch (as opposed to a full
loss) would cause #x to get lost - the drive firmware should track the
fact that #x was in the write cache and not yet destaged to the platter.

> Maybe the correct approach is to establish a warm-plug protocol: the
> kernel provides a way to plug IOs, and a user helper program plugs all
> IOs until the new device settles.