* Journaling pointless with today's hard disks?
@ 2001-11-24 13:03 Florian Weimer
2001-11-24 13:40 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 81+ messages in thread
From: Florian Weimer @ 2001-11-24 13:03 UTC (permalink / raw)
To: linux-kernel
In the German computer community, a statement from IBM[1] is
circulating which describes a rather peculiar behavior of certain IBM
IDE hard drives (the DTLA series):
When the drive is powered down during a write operation, the sector
which was being written ends up with an incorrect checksum on disk.
So far, so good---but if the sector is read later, the drive returns a
*permanent*, *hard* error, which can only be removed by a low-level
format (IBM provides a tool for this). The drive does not
automatically map out such sectors.
IBM claims this isn't a firmware error, but thinks that it explains
the failures frequently observed with DTLA drives (which may or may
not reflect reality, I don't know, but that's not the point anyway).
Now my question: Obviously, journaling file systems do not work
correctly on drives with such behavior. Worse, a vital data structure
(the journal) is written to frequently, so such file systems
*increase* the probability of complete failure: with a bad sector in
the journal, the file system is probably unusable, whereas for
non-journaling file systems only part of the data becomes unavailable.
Is the DTLA behavior on aborted writes common among contemporary hard
drives? Wouldn't this make journaling pretty pointless?
1. http://www.cooling-solutions.de/dtla-faq (German)
--
Florian Weimer Florian.Weimer@RUS.Uni-Stuttgart.DE
University of Stuttgart http://cert.uni-stuttgart.de/
RUS-CERT +49-711-685-5973/fax +49-711-685-5898
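Florian's probability argument can be put in rough numbers. A minimal
back-of-envelope sketch (the write counts, interruption probability,
and journal fraction are illustrative assumptions, not measurements):
if every write has some small fixed chance of being cut short and
leaving a permanent bad sector, a journaling file system funnels a
large share of all writes through one small region the whole file
system depends on.

```python
# Illustrative model of the DTLA failure mode described above: each
# write has the same small chance of being interrupted by a power cut,
# and an interrupted write leaves a permanent bad sector.
total_writes = 1_000_000
p_interrupt = 1e-6          # assumed chance any single write is cut short

# Non-journaling FS: writes are spread over the whole disk; a bad
# sector typically costs only the file that happened to live there.
expected_bad_sectors = total_writes * p_interrupt

# Journaling FS: assume half of all writes touch the journal area.
journal_fraction = 0.5      # assumption
p_journal_hit = 1 - (1 - p_interrupt) ** (total_writes * journal_fraction)

print(f"expected bad sectors overall: {expected_bad_sectors:.2f}")
print(f"chance the journal area takes a hit: {p_journal_hit:.2%}")
```

The same expected number of bad sectors lands somewhere either way;
the journal merely concentrates where they land.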
^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Journaling pointless with today's hard disks?
From: Rik van Riel @ 2001-11-24 13:40 UTC (permalink / raw)
To: Florian Weimer; +Cc: linux-kernel

On 24 Nov 2001, Florian Weimer wrote:

> In the German computer community, a statement from IBM[1] is
> circulating which describes a rather peculiar behavior of certain IBM
> IDE hard drives (the DTLA series):

That seems more like a case of "hard drives being pointless for people
wanting to store their data" ;)

The disks which _do_ store your data right also tend to work great
with journaling; in fact, they tend to work better with journaling if
you make a habit of crashing your system by hacking the kernel...

The article you point to seems more like a "if you value your data,
don't use IBM DTLA" thingy.

regards,

Rik
--
Shortwave goes a long way:  irc.starchat.net #swl
http://www.surriel.com/     http://distro.conectiva.com/
* Re: Journaling pointless with today's hard disks?
From: Phil Howard @ 2001-11-24 16:36 UTC (permalink / raw)
To: linux-kernel

On Sat, Nov 24, 2001 at 11:40:11AM -0200, Rik van Riel wrote:
| On 24 Nov 2001, Florian Weimer wrote:
|
| > In the German computer community, a statement from IBM[1] is
| > circulating which describes a rather peculiar behavior of certain IBM
| > IDE hard drives (the DTLA series):
|
| That seems more like a case of "hard drives being pointless
| for people wanting to store their data" ;)

Or at least "powering down IBM DTLA series hard drives is pointless
for people wanting to store their data".

Now I can see a problem if the drive can't flush a write-back cache
during the "power fade".  With the pretty big caches many drives have
these days (although I wonder just how useful that is, with OS caches
being as good as they are), the time it takes to flush could be long
(a few seconds ... and lights are out by then).  I sure hope all my
drives do write-through caching or don't cache writes at all.

I would think that, as fast as these drives spin these days, they
could finish a sector between the time the power fade is detected and
the time the voltage is too low to maintain the correct write current
and servo speed.  Obviously one problem with lighter-weight platters
is that the momentum advantage for keeping the speed right as the
power declines is reduced (if speed is an issue at all, which I am not
sure of).

| The disks which _do_ store your data right also tend to work
| great with journaling; in fact, they tend to work better with
| journaling if you make a habit of crashing your system by
| hacking the kernel...

OOC, do you think there is any real advantage to the 1M to 4M cache
that drives have these days, considering the effective caching all
OSes do now ... over adding that much memory to your system RAM?  The
only use I can see for a cache in a drive is if it has physical sector
sizes greater than the logical sector write granularity, which would
require a read-modify-write kind of operation internally.  But that's
not really "cache" anyway.

| The article you point to seems more like a "if you value your
| data, don't use IBM DTLA" thingy.

For now I use Maxtor for new servers.  Fortunately the IBM ones I
have are on UPS and not doing heavy write applications.

--
-----------------------------------------------------------------
| Phil Howard - KA9WGN |   Dallas   | http://linuxhomepage.com/ |
| phil-nospam@ipal.net | Texas, USA | http://phil.ipal.org/     |
-----------------------------------------------------------------
* Re: Journaling pointless with today's hard disks?
From: Charles Marslett @ 2001-11-24 17:19 UTC (permalink / raw)
To: Phil Howard; +Cc: linux-kernel

Phil Howard wrote:

> OOC, do you think there is any real advantage to the 1m to 4m cache
> that drives have these days, considering the effective caching in
> the OS that all OSes these days have ... over adding that much
> memory to your system RAM?

Not asked of me, but as always, I do have an opinion: I think the real
reason for the very large disk caches is that the cost of a track
buffer for simple read-ahead is about the same as the 1 MB "cache" on
cheap modern drives.  And with very simple logic they can "cache"
several physical tracks, say the ones that contain the inode and the
last few sectors of the most recently accessed file.  Sometimes this
saves a rotational delay when reading or writing a sector span, so
the drive can do better than the OS there (I admit, that doesn't
happen often).  And the cost/benefit tradeoff is worth it, because the
cost is so little.  [Someone who really knows may correct me,
however.]

--Charles

/"\
\ /  ASCII Ribbon Campaign
 X   Against HTML Mail      --Charles Marslett
/ \  www.wordmark.org
* Re: Journaling pointless with today's hard disks?
From: Florian Weimer @ 2001-11-24 17:31 UTC (permalink / raw)
To: linux-kernel

Phil Howard <phil-linux-kernel@ipal.net> writes:

> | That seems more like a case of "hard drives being pointless
> | for people wanting to store their data" ;)
>
> Or at least "powering down IBM DTLA series hard drives is pointless
> for people wanting to store their data".

We have got a DTLA drive which shows the typical symptoms without
being powered down regularly.  The defective sectors simply appeared
during normal operation.

But that's not the point; I'm pretty convinced that the DTLA problems
are not caused by aborted writes.  However, I'm scared by a major hard
disk manufacturer using such a faulty approach, and claiming it's
reasonable.  Maybe you can gain some performance this way, maybe the
firmware is easier to write.  If there's such a motivation, other
manufacturers will follow, and soon there won't be any reliable drives
left for us to buy (just being a bit paranoid...).

> Now I can see a problem if the drive can't flush a write-back cache
> during the "power fade".  With some pretty big caches many drives
> have these days (although I wonder just how useful that is with OS
> caches being as good as they are),

They can reorder writes and eliminate dead writes, breaking journaling
(especially if the journal is on a different disk than the actual
data). ;-)

In fact, the "cache" is probably just memory used for quite a few
different purposes: scatter/gather support, command queuing, storing
the firmware, and so on.

Emptying the caches in time is not a problem, BTW.  You just don't get
a full write in this case (and lose some data), but you shouldn't see
any bad sectors.

--
Florian Weimer 	        	  Florian.Weimer@RUS.Uni-Stuttgart.DE
University of Stuttgart 	  http://cert.uni-stuttgart.de/
RUS-CERT                          +49-711-685-5973/fax +49-711-685-5898
* Re: Journaling pointless with today's hard disks?
From: Matthias Andree @ 2001-11-24 17:41 UTC (permalink / raw)
To: linux-kernel

On Sat, 24 Nov 2001, Phil Howard wrote:

> Now I can see a problem if the drive can't flush a write-back cache
> during the "power fade".  With some pretty big caches many drives
> have these days (although I wonder just how useful that is with OS
> caches being as good as they are), the time it takes to flush could
> be long (a few seconds ... and lights are out by then).  I sure hope
> all my drives do write-through caching or don't cache writes at all.

Well, the DTLA drives ship with their write-back cache ENABLED and
transparent remapping DISABLED by default, so putting

    /sbin/hdparm -W0 /dev/hdX

into your boot sequence before mounting the first filesystem r/w, and
before calling upon fsck, is certainly not a bad idea with those.
Alternatively, you can use IBM's Feature Tool to reconfigure the
drive.

On a related issue, I asked a person with access to DARA OEM (2.5"
HDD) data to look up the caching specifications, and IBM does not
guarantee data integrity for cached blocks that have not yet made it
to the disk, although the drives start to flush their caches
immediately.  So up to (cache size / block size) blocks may be lost.
With the write cache turned off, the data loss is at most 1 block.

> I would think that as fast as these drives spin these days, they
> could finish a sector between the time the power fade is detected
> and the time the voltage is too low to have the correct write
> current and servo speed.

Well, I have never seen big capacitors on disks, so they just go park
and that's about it.  If DTLA drives corrupt their blocks in a way
that makes low-level formatting necessary, those drives must be phased
out at once, unless IBM update their firmware so that they can say
"this is a hard checksum error, but actually, we can safely overwrite
this block".

> OOC, do you think there is any real advantage to the 1m to 4m cache
> that drives have these days, considering the effective caching in
> the OS that all OSes these days have ... over adding that much
> memory to your system RAM?

Yes, these caches allow for bigger write requests or less latency (I
didn't check which), doubling throughput on linear writes, at least
with IBM DTLA and DJNA drives.  However, if it's really true that DTLA
drives and their successors corrupt blocks (generate bad blocks) on
power loss during block writes, these drives are crap.

HTH,
Matthias
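Matthias's worst-case bound is easy to make concrete. A small sketch
(the 2 MB on-drive cache size is an assumed figure for a drive of this
class, not taken from a datasheet):

```python
# Worst-case blocks lost on a power cut = write-cache size / block
# size, per the bound quoted above; with the cache disabled (hdparm
# -W0) at most the one block in flight is lost.
cache_bytes = 2 * 1024 * 1024    # assumed 2 MB on-drive write cache
block_bytes = 512                # classic ATA sector size

blocks_at_risk_cached = cache_bytes // block_bytes
blocks_at_risk_writethrough = 1

print(blocks_at_risk_cached)      # 4096 sectors, i.e. up to 2 MB of data
print(blocks_at_risk_writethrough)
```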
* Re: Journaling pointless with today's hard disks?
From: Florian Weimer @ 2001-11-24 19:20 UTC (permalink / raw)
To: linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> However, if it's really true that DTLA drives and their successor
> corrupt blocks (generate bad blocks) on power loss during block
> writes, these drives are crap.

They do, even IBM admits that (on

    http://www.cooling-solutions.de/dtla-faq

you find a quote from IBM confirming this).  IBM says it's okay, you
have to expect this to happen.  So much for their expertise in making
hard disks.  This makes me feel rather dizzy (lots of IBM drives in
use).

--
Florian Weimer 	        	  Florian.Weimer@RUS.Uni-Stuttgart.DE
University of Stuttgart 	  http://cert.uni-stuttgart.de/
RUS-CERT                          +49-711-685-5973/fax +49-711-685-5898
* Re: Journaling pointless with today's hard disks?
From: Rik van Riel @ 2001-11-24 19:29 UTC (permalink / raw)
To: Florian Weimer; +Cc: linux-kernel

On 24 Nov 2001, Florian Weimer wrote:

> They do, even IBM admits that (on
>
>     http://www.cooling-solutions.de/dtla-faq
>
> you find a quote from IBM confirming this).  IBM says it's okay,

That quote is priceless.  I know I'll be avoiding IBM disks from now
on ;)

Rik
--
Shortwave goes a long way:  irc.starchat.net #swl
http://www.surriel.com/     http://distro.conectiva.com/
* Re: Journaling pointless with today's hard disks?
From: John Alvord @ 2001-11-24 22:51 UTC (permalink / raw)
To: Rik van Riel; +Cc: Florian Weimer, linux-kernel

On Sat, 24 Nov 2001, Rik van Riel wrote:

> That quote is priceless.  I know I'll be avoiding IBM
> disks from now on ;)

It could be true for many disks and only IBM has admitted it...

john
* Re: Journaling pointless with today's hard disks?
From: Phil Howard @ 2001-11-24 23:41 UTC (permalink / raw)
To: linux-kernel

On Sat, Nov 24, 2001 at 02:51:38PM -0800, John Alvord wrote:
| It could be true for many disks and only IBM has admitted it...

Only the IBM drives are having the high return rates, and IBM seems to
be blaming this on powering off during writes.  But why would the
other brands not be having this problem?  Is it because they don't get
powered off?

It could be that other drives have the capability to detect and write
over sectors made bad by a power-off.  Or maybe they lock out the
sector and map in a spare.  They might even have enough spin left to
finish the sector correctly in more cases.  So I doubt the issue is
present in other drives, unless it is not really as big as we might
think, and the problems with IBM drives are something else.

I do worry that the lighter the platters are, the faster manufacturers
try to spin the drives with smaller motors, and the quicker they slow
down when power is lost.

--
-----------------------------------------------------------------
| Phil Howard - KA9WGN |   Dallas   | http://linuxhomepage.com/ |
| phil-nospam@ipal.net | Texas, USA | http://phil.ipal.org/     |
-----------------------------------------------------------------
* Re: Journaling pointless with today's hard disks?
From: Ian Stirling @ 2001-11-25 0:24 UTC (permalink / raw)
To: Phil Howard; +Cc: linux-kernel

<snip>
> I do worry that the lighter the platters are, the faster they try to
> make the drives spin with smaller motors, and the quicker they slow
> down when power is lost.

Utterly unimportant.  Let's say, for the sake of argument, that the
drive spins down to a stop in 1 second.  Now, the data rate for this
40G IDE drive I've got in my box is about 25 megabytes per second, or
about 50K sectors per second.  Slowing down isn't a problem.

Somewhere I've got a databook, ca. '85 I think, for a motor driver
chip for hard-disk spindle motors, with integrated diodes that rectify
the power coming from the spindle when the power fails, to give a
little grace.

If written by people with a clue, the drive does not need to do much
seeking to write the data from a write cache to disk: just one seek to
a journal track, and a write.  This needs maybe 3 revs to complete, at
most.
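Ian's figures check out with simple arithmetic (the 7200 rpm spindle
speed is an assumption for an IDE drive of that vintage):

```python
# Sanity-check the data-rate argument above.
rate_bytes = 25 * 1024 * 1024       # ~25 MB/s sustained, as quoted
sector = 512
sectors_per_sec = rate_bytes // sector
print(sectors_per_sec)              # 51200, i.e. "about 50K sectors/s"

rpm = 7200                          # assumed spindle speed
rev_time_ms = 60_000 / rpm          # one revolution in milliseconds
print(rev_time_ms)                  # ~8.33 ms per revolution

# Even three full revolutions for a journal-track flush take ~25 ms,
# well inside a 1-second spin-down.
print(3 * rev_time_ms)
```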
* Re: Journaling pointless with today's hard disks?
From: Phil Howard @ 2001-11-25 0:53 UTC (permalink / raw)
To: linux-kernel

On Sun, Nov 25, 2001 at 12:24:28AM +0000, Ian Stirling wrote:
| Utterly unimportant.
| Let's say for the sake of argument that the drive spins down to a stop
| in 1 second.
| Now, the datarate for this 40G IDE drive I've got in my box is about
| 25 megabytes per second, or about 50K sectors per second.
| Slowing down isn't a problem.

If it takes 1 second to spin down to a stop, then within 1 to 5
milliseconds it probably will have slowed to a point where the
serialization of a sector write can no longer be kept in sync.  Once
the platters _start_ slowing down, time is an extremely precious
resource: that data pattern has to be read back at full speed.

| Somewhere I've got a databook, ca 85 I think, for a motor driver chip,
| to drive spindle motors on hard disks, with integrated
| diodes that rectify the power coming from the disk when the power fails,
| to give a little grace.
|
| If written by people with a clue, the drive does not need to do much
| seeking to write the data from a write-cache to disk, just one seek
| to a journal track, and a write.
| This needs maybe 3 revs to complete, at most.

By the time the seek completes, the speed is probably too low to do a
good write.  Options to deal with this include special handling of the
emergency track, allowing it to be read back by intentionally slowing
the drive down for that recovery.  Another option is flash disk.

The apparent problem in the IBM DTLA is that the write doesn't have
enough time to complete while the platter is still spinning within
spec.  That means the sector gets compressed at the end, and the bit
density rises beyond readable levels (if it could reliably go higher,
they would just record everything that way).  On top of that, the end
of the sector doesn't fall off into the inter-sector gap, where there
is probably some low-level formatting.  So on readback, some bits are
in error because the clock rate rises with the compression, and the
trailing edge hits the previous sector occupant's un-erased end before
the gap.

--
-----------------------------------------------------------------
| Phil Howard - KA9WGN |   Dallas   | http://linuxhomepage.com/ |
| phil-nospam@ipal.net | Texas, USA | http://phil.ipal.org/     |
-----------------------------------------------------------------
* Re: Journaling pointless with today's hard disks?
From: H. Peter Anvin @ 2001-11-25 1:25 UTC (permalink / raw)
To: linux-kernel

Followup to: <20011124185321.C4372@vega.ipal.net>
By author: Phil Howard <phil-linux-kernel@ipal.net>
In newsgroup: linux.dev.kernel

> By the time the seek completes, the speed is probably too slow to do a
> good write.  Options to deal with this include special handling for the
> emergency track to allow reading it back by intentionally slowing down
> the drive for that recovery.  Another option is flash disk.

And yet another option is to dynamically adjust the data speed fed to
the head, to match the rotation speed of the platter.  This assumes
that the rotation speed can be measured, which should be trivial if
they use the rotation to power the drive electronics during shutdown.

-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
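The adjustment Peter describes amounts to scaling the write clock
linearly with the measured spindle speed, so the linear bit density on
the platter stays constant as it slows. A sketch of the idea (the
nominal rpm and channel clock are assumed figures, not from any
datasheet):

```python
# To keep bits-per-inch on the platter constant as it slows, the write
# clock must scale linearly with the measured angular velocity.
nominal_rpm = 7200              # assumed full spindle speed
nominal_clock_hz = 400e6        # assumed channel bit clock at full speed

def write_clock(current_rpm: float) -> float:
    """Bit clock needed at a reduced spindle speed (same bit density)."""
    return nominal_clock_hz * current_rpm / nominal_rpm

print(write_clock(7200))   # full speed: 400 MHz
print(write_clock(3600))   # half speed during spin-down: 200 MHz
```

A sector written this way at reduced speed could then be read back
normally at full speed, since the bit spacing on the medium is
unchanged.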
* Re: Journaling pointless with today's hard disks?
From: Sven.Riedel @ 2001-11-25 1:44 UTC (permalink / raw)
To: Phil Howard; +Cc: linux-kernel

On Sat, Nov 24, 2001 at 06:53:21PM -0600, Phil Howard wrote:
> If it takes 1 second to spin down to a stop, then within 1 to 5
> milliseconds it probably will have slowed to a point where the
> serialization of a sector write can no longer be kept in sync.

Makes you wonder why drive manufacturers don't use some kind of NVRAM
to simply remember the sector number that is being written as power
fails - a capacitor, or even a small rechargeable battery (for the
truly paranoid), could supply the writing voltage.  No (further)
writing to the sector would be needed during spindown.  And when the
drive initializes again at boot time, it could check whether the
contents of the NVRAM is an "all OK" pattern, and if not, simply
rewrite the CRC of the sector in question, unless that sector is
already present in the drive's bad-sector list.

Yes, this would be a bit more complex, and presents one more possible
point of failure, but the current situation seems rather abysmal...
And the data in that sector is as good as lost anyway.

Regs,
Sven
--
Sven Riedel                      sr@gimp.org
Osteroeder Str. 6 / App. 13      sven.riedel@tu-clausthal.de
38678 Clausthal
"Call me bored, but don't call me boring." - Larry Wall
* Re: Journaling pointless with today's hard disks?
From: H. Peter Anvin @ 2001-11-24 22:28 UTC (permalink / raw)
To: linux-kernel

Followup to: <tgy9kwf02c.fsf@mercury.rus.uni-stuttgart.de>
By author: Florian Weimer <Florian.Weimer@RUS.Uni-Stuttgart.DE>
In newsgroup: linux.dev.kernel

> They do, even IBM admits that (on
>
>     http://www.cooling-solutions.de/dtla-faq
>
> you find a quote from IBM confirming this).  IBM says it's okay, you
> have to expect this to happen.  So much for their expertise in making
> hard disks.  This makes me feel rather dizzy (lots of IBM drives in
> use).

No sh*t.  I have always favoured IBM drives, and I had just bought a
RAID system with these drives.  It will be a LONG time before I buy
another IBM drive, that's for sure.  I can't believe they don't even
have the decency to say "we fucked up".

-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
* Re: Journaling pointless with today's hard disks?
From: Andre Hedrick @ 2001-11-25 4:49 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-kernel

On 24 Nov 2001, H. Peter Anvin wrote:

> No sh*t.  I have always favoured IBM drives, and I had just bought a
> RAID system with these drives.  It will be a LONG time before I buy
> another IBM drive, that's for sure.  I can't believe they don't even
> have the decency to say "we fucked up".

Peter,

Remember my soon-to-be-famous quote: everything about storage is a
LIE, and that is the only truth I stand by.

Andre Hedrick
Linux Disk Certification Project
Linux ATA Development
* Re: Journaling pointless with today's hard disks?
From: Pedro M. Rodrigues @ 2001-11-24 23:04 UTC (permalink / raw)
To: linux-kernel, Florian Weimer

I've always favoured IBM disks in all my hardware, from enterprise
external SCSI RAID hardware to small IDE hardware RAID devices (3ware,
FYI).  At home all four of my disks are IBM (two DTLA).  But from your
information it seems I have been bitten by that problem twice at the
same time.

Several months ago a less zealous system administrator, while shutting
down a couple of servers for maintenance at night, made a mistake at
the console KVM switch and pushed the red button on a live server with
four DTLA IBM disks plugged into a 3ware RAID card.  After recovery,
and after some time, one of the volumes started complaining about
errors and went into degraded mode.  One of the disks was clearly
broken, we thought, so we exchanged it; but alas, a couple of hours
later another disk in another volume complained.  We exchanged that
one as well and rebuilt everything.  After checking the disks with
IBM's drive fitness software, both presented bad blocks that were
recovered with a low-level format.  I dismissed the events as
something weird, but with some logical explanation beyond my grasp.
Now it all makes sense.

/Pedro

On 24 Nov 2001 at 20:20, Florian Weimer wrote:

> They do, even IBM admits that (on
>
>     http://www.cooling-solutions.de/dtla-faq
>
> you find a quote from IBM confirming this).  IBM says it's okay, you
> have to expect this to happen.  So much for their expertise in making
> hard disks.  This makes me feel rather dizzy (lots of IBM drives in
> use).
* Re: Journaling pointless with today's hard disks? 2001-11-24 19:20 ` Florian Weimer ` (2 preceding siblings ...) 2001-11-24 23:04 ` Pedro M. Rodrigues @ 2001-11-24 23:23 ` Stephen Satchell 2001-11-24 23:29 ` H. Peter Anvin [not found] ` <mailman.1006644421.6553.linux-kernel2news@redhat.com> 2001-11-25 12:30 ` Matthias Andree 5 siblings, 1 reply; 81+ messages in thread From: Stephen Satchell @ 2001-11-24 23:23 UTC (permalink / raw) To: H. Peter Anvin, linux-kernel At 02:28 PM 11/24/01 -0800, H. Peter Anvin wrote: > > > However, if it's really true that DTLA drives and their successor > > > corrupt blocks (generate bad blocks) on power loss during block writes, > > > these drives are crap. > > > > They do, even IBM admits that (on > > > > http://www.cooling-solutions.de/dtla-faq > > > > you find a quote from IBM confirming this). IBM says it's okay, you > > have to expect this to happen. So much for their expertise in making > > hard disks. This makes me feel rather dizzy (lots of IBM drives in > > use). > > > >No sh*t. I have always been favouring IBM drives, and I had a RAID >system with these drives bought. It will be a LONG time before I buy >another IBM drive, that's for sure. I can't believe they don't even >have the decency of saying "we fucked". It is the responsibility of the power monitor to detect a power-fail event and tell the drive(s) that a power-fail event is occurring. If power goes out of specification before the drive completes a commanded write, what do you expect the poor drive to do? ANY glitch in the write current will corrupt the current block no matter what -- the final CRC isn't recorded. Most drives do have a panic-stop mode when they detect voltage going out of range so as to minimize the damage caused by an out-of-specification power-down event, and more importantly use the energy in the spinning platter to get the heads moved to a safe place before the drive completely spins down. 
The panic-stop mode is EXACTLY like a Linux OOPS -- it's a catastrophic event that SHOULD NOT OCCUR. Most power supplies are not designed to hold up for more than 30-60 ms at full load upon removal of mains power. Power-fail detect typically requires 12 ms (three-quarters cycle average at 60 Hz) or 15 ms (three-quarters cycle average at 50 Hz) to detect that mains power has failed, leaving your system a very short time to abort that long queue of disk write commands. It's very possible that by the time the system wakes up to the fact that its electron feeding tube is empty it has already started a write operation that cannot be completed before power goes out of specification. It's a race condition. Fix your system. If you don't have a UPS on that RAID, and some means of shutting down the RAID gracefully when mains power goes, you are sealing your own doom, regardless of the maker of the hard drive you use in that RAID. Even the original CDC disk drives, some of the best damn drives ever manufactured in the world, would corrupt data when power failed during a write. Satch ^ permalink raw reply [flat|nested] 81+ messages in thread
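The timing budget Satchell describes above can be put into numbers. A back-of-the-envelope sketch using only the figures quoted in the thread (30-60 ms supply hold-up, 12-15 ms power-fail detect); the sector media rate is an assumption for illustration, not a measured value:

```python
# Worst-case window between "power-fail detected" and "DC rails out of spec",
# using the figures quoted in the thread (not measured values).

def abort_window_ms(holdup_ms, detect_ms):
    """Time the host has left to stop issuing writes after mains power fails."""
    return holdup_ms - detect_ms

# 50 Hz mains: ~15 ms detect; cheap supply: 30 ms hold-up.
worst = abort_window_ms(30.0, 15.0)   # 15 ms
# 60 Hz mains: ~12.5 ms detect; good supply: 60 ms hold-up.
best = abort_window_ms(60.0, 12.5)    # 47.5 ms

# A single 512-byte sector at an assumed ~30 MB/s media rate takes well under
# a millisecond to write, so the drive can always finish the sector under the
# head -- the race is in the long queue of commands the host already issued.
sector_write_ms = 512 / (30e6 / 1000)  # ~0.017 ms (assumed media rate)

print(worst, best, round(sector_write_ms, 3))
```

The point the sketch makes is Satchell's: the drive has milliseconds of margin for the sector it is writing, but the host has only tens of milliseconds for an arbitrarily long write queue.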
* Re: Journaling pointless with today's hard disks? 2001-11-24 23:23 ` Stephen Satchell @ 2001-11-24 23:29 ` H. Peter Anvin 2001-11-26 18:05 ` Steve Brueggeman 0 siblings, 1 reply; 81+ messages in thread From: H. Peter Anvin @ 2001-11-24 23:29 UTC (permalink / raw) To: Stephen Satchell; +Cc: linux-kernel Stephen Satchell wrote: > > It is the responsibility of the power monitor to detect a power-fail > event and tell the drive(s) that a power-fail event is occurring. If > power goes out of specification before the drive completes a commanded > write, what do you expect the poor drive to do? ANY glitch in the write > current will corrupt the current block no matter what -- the final CRC > isn't recorded. Most drives do have a panic-stop mode when they detect > voltage going out of range so as to minimize the damage caused by an > out-of-specification power-down event, and more importantly use the > energy in the spinning platter to get the heads moved to a safe place > before the drive completely spins down. The panic-stop mode is EXACTLY > like a Linux OOPS -- it's a catastrophic event that SHOULD NOT OCCUR. > There is no "power monitor" in a PC system (at least not that is visible to the drive) -- if the drive needs it, it has to provide it itself. It's definitely the responsibility of the drive to recover gracefully from such an event, which means that it writes anything that it has committed to the host to write; anything it hasn't gotten committed to write (but has received) can be written or not written, but must not cause a failure of the drive. A drive is a PERSISTENT storage device, and as such has responsibilities the other devices don't. Anything else is brainless rationalization. -hpa ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-24 23:29 ` H. Peter Anvin @ 2001-11-26 18:05 ` Steve Brueggeman 2001-11-26 23:49 ` Martin Eriksson 0 siblings, 1 reply; 81+ messages in thread From: Steve Brueggeman @ 2001-11-26 18:05 UTC (permalink / raw) To: linux-kernel On Sat, 24 Nov 2001 15:29:05 -0800, you wrote: >Stephen Satchell wrote: > >> >> It is the responsibility of the power monitor to detect a power-fail >> event and tell the drive(s) that a power-fail event is occurring. If >> power goes out of specification before the drive completes a commanded >> write, what do you expect the poor drive to do? ANY glitch in the write >> current will corrupt the current block no matter what -- the final CRC >> isn't recorded. Most drives do have a panic-stop mode when they detect >> voltage going out of range so as to minimize the damage caused by an >> out-of-specification power-down event, and more importantly use the >> energy in the spinning platter to get the heads moved to a safe place >> before the drive completely spins down. The panic-stop mode is EXACTLY >> like a Linux OOPS -- it's a catastrophic event that SHOULD NOT OCCUR. >> > Correct, sort-of. The storage is not allowed to corrupt any data that is unrelated to the currently active operation, (ie adjacent tracks or sectors). Of course write-caching is asking for trouble. > >There is no "power monitor" in a PC system (at least not that is visible >to the drive) -- if the drive needs it, it has to provide it itself. > >It's definitely the responsibility of the drive to recover gracefully >from such an event, which means that it writes anything that it has >committed to the host to write; Correct. 
If a write gets interrupted in the middle of its operation, it has not yet returned any completion status (unless you've enabled write-caching, in which case you're already asking for trouble). A subsequent read of this half-written sector can return uncorrectable status though, which would be unfortunate if this sector was your allocation table, and the write was a read-modify-write. >anything it hasn't gotten committed to >write (but has received) can be written or not written, but must not >cause a failure of the drive. Reading a sector that was a partial write because of a power loss, and returning UNCORRECTABLE status, is not a failure of the drive. > >A drive is a PERSISTENT storage device, and as such has responsibilities >the other devices don't. > >Anything else is brainless rationalization. > > -hpa ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 18:05 ` Steve Brueggeman @ 2001-11-26 23:49 ` Martin Eriksson 2001-11-27 0:06 ` Andreas Dilger 2001-11-27 0:18 ` Jonathan Lundell 0 siblings, 2 replies; 81+ messages in thread From: Martin Eriksson @ 2001-11-26 23:49 UTC (permalink / raw) To: Steve Brueggeman, linux-kernel ----- Original Message ----- From: "Steve Brueggeman" <xioborg@yahoo.com> To: <linux-kernel@vger.kernel.org> Sent: Monday, November 26, 2001 7:05 PM Subject: Re: Journaling pointless with today's hard disks? <snip> > >There is no "power monitor" in a PC system (at least not that is visible > >to the drive) -- if the drive needs it, it has to provide it itself. > > > >It's definitely the responsibility of the drive to recover gracefully > >from such an event, which means that it writes anything that it has > >committed to the host to write; > Correct. If a write gets interrupted in the middle of it's operation, > it has not yet returned any completion status, (unless you've enabled > write-caching, in which case, you're already asking for trouble) A > subsequent read of this half-written sector can return uncorrectable > status though, which would be unfortunate if this sector was your > allocation table, and the write was a read-modify-write. > > >anything it hasn't gotten committed to > >write (but has received) can be written or not written, but must not > >cause a failure of the drive. > Reading a sector that was a partial-write because of a power-loss, and > returning UNCORRECTABLE status, is not a failure of the drive. I sure think the drives could afford the teeny-weeny cost of a power failure detection unit, that when a power loss/sway is detected, halts all operations to the platters except for the writing of the current sector. 
_____________________________________________________ | Martin Eriksson <nitrax@giron.wox.org> | MSc CSE student, department of Computing Science | Umeå University, Sweden ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 23:49 ` Martin Eriksson @ 2001-11-27 0:06 ` Andreas Dilger 2001-11-27 0:16 ` Andre Hedrick 2001-11-27 0:18 ` Jonathan Lundell 1 sibling, 1 reply; 81+ messages in thread From: Andreas Dilger @ 2001-11-27 0:06 UTC (permalink / raw) To: Martin Eriksson; +Cc: Steve Brueggeman, linux-kernel On Nov 27, 2001 00:49 +0100, Martin Eriksson wrote: > I sure think the drives could afford the teeny-weeny cost of a power failure > detection unit, that when a power loss/sway is detected, halts all > operations to the platters except for the writing of the current sector. What happens if you have a slightly bad power supply? Does it immediately go read only all the time? It would definitely need to be able to recover operations as soon as the power was "normal" again, even if this caused basically "sync" I/O to the disk. Maybe it would be able to report this to the user via SMART, I don't know. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:06 ` Andreas Dilger @ 2001-11-27 0:16 ` Andre Hedrick 2001-11-27 7:38 ` Andreas Dilger 0 siblings, 1 reply; 81+ messages in thread From: Andre Hedrick @ 2001-11-27 0:16 UTC (permalink / raw) To: Andreas Dilger; +Cc: Martin Eriksson, Steve Brueggeman, linux-kernel On Mon, 26 Nov 2001, Andreas Dilger wrote: > On Nov 27, 2001 00:49 +0100, Martin Eriksson wrote: > > I sure think the drives could afford the teeny-weeny cost of a power failure > > detection unit, that when a power loss/sway is detected, halts all > > operations to the platters except for the writing of the current sector. > > What happens if you have a slightly bad power supply? Does it immediately > go read only all the time? It would definitely need to be able to > recover operations as soon as the power was "normal" again, even if this > caused basically "sync" I/O to the disk. Maybe it would be able to > report this to the user via SMART, I don't know. ATA/SCSI SMART is already DONE! Too bad most people have not noticed. Regards, Andre Hedrick CEO/President, LAD Storage Consulting Group Linux ATA Development Linux Disk Certification Project ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:16 ` Andre Hedrick @ 2001-11-27 7:38 ` Andreas Dilger 2001-11-27 11:48 ` Ville Herva 0 siblings, 1 reply; 81+ messages in thread From: Andreas Dilger @ 2001-11-27 7:38 UTC (permalink / raw) To: Andre Hedrick; +Cc: Martin Eriksson, Steve Brueggeman, linux-kernel On Nov 26, 2001 16:16 -0800, Andre Hedrick wrote: > On Mon, 26 Nov 2001, Andreas Dilger wrote: > > What happens if you have a slightly bad power supply? Does it immediately > > go read only all the time? It would definitely need to be able to > > recover operations as soon as the power was "normal" again, even if this > > caused basically "sync" I/O to the disk. Maybe it would be able to > > report this to the user via SMART, I don't know. > > ATA/SCSI SMART is already DONE! > > To bad most people have not noticed. Oh, I know SMART is implemented, although I haven't actually seen/used a tool which takes advantage of it (do you have such a thing?). It would be nice if there were messages appearing in my syslog (just like the AIX days) which said "there were 10 temporary read errors at block M on drive X yesterday" and "1 permanent write error at block M, block remapped on drive X yesterday", so I would know _before_ my drive craps out after all of the remapping table is full, or the temporary read errors become permanent. (I have a special interest in that because my laptop hard drive sounds like a jet engine at times... ;-). What I was originally suggesting is that it have a field which can report to the user that "there were 800 sync/reset operations because of power drops that were later found not to be power failures". That is what I was suggesting SMART report in this case (actual power failures are not interesting). 
Note also, that this is purely hypothetical, based on only a vague understanding on what actually happens when the drive thinks it is losing power, and only ever having seen the hex output of /proc/ide/hda/smart_{values,thresholds}. Being able to get a number back from the hard drive that it is performing poorly (i.e. synchronous I/O + lots of resets) because of a bad power supply is exactly what SMART was designed to do - predictive failure analysis. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 81+ messages in thread
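Decoding the hex dump of /proc/ide/hda/smart_{values,thresholds} mentioned above is mostly a matter of walking fixed-size records. A rough sketch, assuming the 12-byte attribute-entry layout from the ATA SMART feature set (ID byte, two flag bytes, current normalized value, worst value, six raw bytes) and the matching 12-byte threshold entries; the sample bytes at the bottom are invented for illustration:

```python
import struct

def parse_attrs(data):
    """Yield (attr_id, current, worst) from a run of 12-byte SMART entries."""
    for off in range(0, len(data) - 11, 12):
        attr_id, flags, current, worst = struct.unpack_from("<BHBB", data, off)
        if attr_id:  # ID 0 marks an unused slot
            yield attr_id, current, worst

def parse_thresholds(data):
    """Yield (attr_id, threshold) from 12-byte threshold entries."""
    for off in range(0, len(data) - 11, 12):
        attr_id, thresh = struct.unpack_from("<BB", data, off)
        if attr_id:
            yield attr_id, thresh

def failing(values, thresholds):
    """Attributes whose normalized value has dropped to/below its threshold."""
    limit = dict(parse_thresholds(thresholds))
    return [a for a, cur, _ in parse_attrs(values)
            if a in limit and cur <= limit[a]]

# Invented sample: attribute 5 (reallocated sectors) at value 20, threshold 36.
values = struct.pack("<BHBB6xB", 5, 0x0033, 20, 18, 0)
thresholds = struct.pack("<BB10x", 5, 36)
print(failing(values, thresholds))  # [5]
```

This is exactly the comparison the smartsuite tools perform: a normalized value at or below its vendor threshold is a predicted failure.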
* Re: Journaling pointless with today's hard disks? 2001-11-27 7:38 ` Andreas Dilger @ 2001-11-27 11:48 ` Ville Herva 0 siblings, 0 replies; 81+ messages in thread From: Ville Herva @ 2001-11-27 11:48 UTC (permalink / raw) To: adilger; +Cc: linux-kernel On Tue, Nov 27, 2001 at 12:38:43AM -0700, you [Andreas Dilger] claimed: > > Oh, I know SMART is implemented, although I haven't actually seen/used a > tool which takes advantage of it (do you have such a thing?). It would > be nice if there were messages appearing in my syslog (just like the > AIX days) which said "there were 10 temporary read errors at block M on > drive X yesterday" and "1 permanent write error at block M, block remapped > on drive X yesterday", so I would know _before_ my drive craps out There are smartsuite and ide-smart packages at linux-ide.org. I think smartd from smartsuite does just that. At least smartctl does read the values in an understandable format. BTW: does anyone know if it is supposed to understand the temperature sensors supposedly present in newer IBM drives? -- v -- v@iki.fi ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 23:49 ` Martin Eriksson 2001-11-27 0:06 ` Andreas Dilger @ 2001-11-27 0:18 ` Jonathan Lundell 2001-11-27 1:01 ` Ian Stirling ` (2 more replies) 1 sibling, 3 replies; 81+ messages in thread From: Jonathan Lundell @ 2001-11-27 0:18 UTC (permalink / raw) To: Martin Eriksson, Steve Brueggeman, linux-kernel At 12:49 AM +0100 11/27/01, Martin Eriksson wrote: >I sure think the drives could afford the teeny-weeny cost of a power failure >detection unit, that when a power loss/sway is detected, halts all >operations to the platters except for the writing of the current sector. That's hard to do. You really need to do the power-fail detection on the AC line, or have some sort of energy storage and a dc-dc converter, which is expensive. If you simply detect a drop in dc power, there simply isn't enough margin to reliably write a block. Years (many years) back, Diablo had a short-lived model (400, IIRC) that had an interesting twist on this. On a power failure, the spinning disk (this was in the days of 14" platters, so plenty of energy) drove the spindle motor as a generator, providing power to the drive electronics for several seconds before it spun down to below operating speed. Of course, that was in the days of thousands of dollars for maybe 20MB of storage.... -- /Jonathan Lundell. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:18 ` Jonathan Lundell @ 2001-11-27 1:01 ` Ian Stirling 2001-11-27 1:33 ` H. Peter Anvin 2001-11-27 1:57 ` Steve Underwood 2 siblings, 1 reply; 81+ messages in thread From: Ian Stirling @ 2001-11-27 1:01 UTC (permalink / raw) To: Jonathan Lundell; +Cc: Martin Eriksson, Steve Brueggeman, linux-kernel > > At 12:49 AM +0100 11/27/01, Martin Eriksson wrote: > >I sure think the drives could afford the teeny-weeny cost of a power failure <snip> > converter, which is expensive. If you simply detect a drop in dc > power, there simply isn't enough margin to reliably write a block. > > Years (many years) back, Diablo had a short-lived model (400, IIRC) > that had an interesting twist on this. On a power failure, the > spinning disk (this was in the days of 14" platters, so plenty of > energy) drove the spindle motor as a generator, providing power to > the drive electronics for several seconds before it spun down to > below operating speed. I have an (IIRC) Elantec databook from 1985 or so, and I've found its chips in disks from the MFM/RLL PC era. These are motor driver chips aimed at PCs, which support using the motor as a generator. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 1:01 ` Ian Stirling @ 2001-11-27 1:33 ` H. Peter Anvin 0 siblings, 0 replies; 81+ messages in thread From: H. Peter Anvin @ 2001-11-27 1:33 UTC (permalink / raw) To: linux-kernel Followup to: <200111270101.BAA01290@mauve.demon.co.uk> By author: Ian Stirling <root@mauve.demon.co.uk> In newsgroup: linux.dev.kernel > > I have a (IIRC) elantec databook from 1985 or so, that I've found chips in > disks from the MFM/RLL PC era. > These are motor driver chips aimed at PCs, which support generation > using the motor. > This is still being done, AFAIK. There is quite some amount of energy in a 7200 rpm platter set. -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com> ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:18 ` Jonathan Lundell 2001-11-27 1:01 ` Ian Stirling @ 2001-11-27 1:57 ` Steve Underwood 2001-11-27 5:04 ` Stephen Satchell 2 siblings, 0 replies; 81+ messages in thread From: Steve Underwood @ 2001-11-27 1:57 UTC (permalink / raw) To: linux-kernel Jonathan Lundell wrote: > At 12:49 AM +0100 11/27/01, Martin Eriksson wrote: > >> I sure think the drives could afford the teeny-weeny cost of a power >> failure >> detection unit, that when a power loss/sway is detected, halts all >> operations to the platters except for the writing of the current sector. > > > That's hard to do. You really need to do the power-fail detection on the > AC line, or have some sort of energy storage and a dc-dc converter, > which is expensive. If you simply detect a drop in dc power, there > simply isn't enough margin to reliably write a block. > > Years (many years) back, Diablo had a short-lived model (400, IIRC) that > had an interesting twist on this. On a power failure, the spinning disk > (this was in the days of 14" platters, so plenty of energy) drove the > spindle motor as a generator, providing power to the drive electronics > for several seconds before it spun down to below operating speed. > > Of course, that was in the days of thousands of dollars for maybe 20MB > of storage.... Quite true. The drives really need to get an "oh heck, the power's about to die. Quick, tidy up" signal from the outside world (like down the ribbon). Cheap, at the limit, PSUs probably couldn't give enough notice to be very helpful. Server grade ones should - they can usually ride over brief hiccups in the power, so they should be able to give a few 10s of ms notice before the regulated power lines start to droop. Perhaps the ATA command set should include such a feature, so the OS could take instruction from the hardware on the power situation, and tell the drives what to do. 
Regards, Steve ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:18 ` Jonathan Lundell 2001-11-27 1:01 ` Ian Stirling 2001-11-27 1:57 ` Steve Underwood @ 2001-11-27 5:04 ` Stephen Satchell 2 siblings, 0 replies; 81+ messages in thread From: Stephen Satchell @ 2001-11-27 5:04 UTC (permalink / raw) To: Steve Underwood, linux-kernel At 09:57 AM 11/27/01 +0800, Steve Underwood wrote: >Quite true. The drives really need to get an "oh heck, the power's about >to die. Quick, tidy up" signal from the outside world (like down the >ribbon). Cheap, at the limit, PSUs probably couldn't give enough notice to >be very helpful. Server grade ones should - they can usually ride over >brief hiccups in the power, so they should be able to give a few 10s of ms >notice before the regulated power lines start to droop. Perhaps the ATA >command set should include such a feature, so the OS could take >instruction from the hardware on the power situation, and tell the drives >what to do. Looking at the various interface specifications, both SCSI and ATA have the ability to signal to the drive that the power is going, and do it in such a way that the drive would have at least 10 milliseconds from the time the hardware signal is received by the drive before +5 and +12 go out of specification. This time is based on the specifications for ATX power supplies, as I assume most modern boxes that are used for production applications would be using an ATX power supply or similar. Lest you think this lets older systems off the hook, the 1981 IBM PC Technical Reference describes (in looser language) a similar requirement. The question remains whether (1) modern motherboards and SCSI controllers pass through the POWER-OK signal to the RESET- line (IDE/ATA) and RST (SCSI), and (2) the hard drives respond intelligently to power-failure indications. 
Telling the difference between a bus-reset event and a panic reset would be easy: if the reset signal is asserted for more than a millisecond or two (such as when the POWER-OK signal from the power supply goes away) then the box is in a power panic situation. Preventing spurious power panics is the responsibility of the power supply designer, particularly if the supply uses a large energy-storage capacitor designed to let the system ride out power-switching events without hiccup. Suggestion to the people contributing to ATA-7: write some language that talks specifically about power-failure scenarios, and define a power-crisis state based on the signals available to the drives from ATA interfaces to determine that a power-crisis event has occurred. If the committee would sit still for it, make it a separate section that appears in the table of contents. Suggestion to the people contributing to SCSI standards: ditto. Satch ^ permalink raw reply [flat|nested] 81+ messages in thread
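The discrimination rule Satchell proposes above (reset asserted for more than a millisecond or two means a power panic rather than an ordinary bus reset) is simple enough to state as code. A sketch only; the 2 ms threshold is the figure from the message, not from any interface standard:

```python
def classify_reset(assert_ms, panic_threshold_ms=2.0):
    """Classify a reset assertion by its duration, per the rule above:
    short pulse -> ordinary bus reset; long assertion (POWER-OK gone) ->
    power panic."""
    return "power-panic" if assert_ms > panic_threshold_ms else "bus-reset"

print(classify_reset(0.05))   # bus-reset: ordinary short reset pulse
print(classify_reset(10.0))   # power-panic: POWER-OK has gone away
```

The virtue of the rule is that it needs no new signal: the drive only measures how long a line it already watches stays asserted.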
[parent not found: <mailman.1006644421.6553.linux-kernel2news@redhat.com>]
* Re: Journaling pointless with today's hard disks? [not found] ` <mailman.1006644421.6553.linux-kernel2news@redhat.com> @ 2001-11-25 4:20 ` Pete Zaitcev 2001-11-25 13:52 ` Pedro M. Rodrigues 1 sibling, 0 replies; 81+ messages in thread From: Pete Zaitcev @ 2001-11-25 4:20 UTC (permalink / raw) To: satch, linux-kernel >[...] > It is the responsibility of the power monitor to detect a power-fail event > and tell the drive(s) that a power-fail event is occurring. > Most power supplies are not designed to hold up for more than 30-60 ms at > full load upon removal of mains power. Power-fail detect typically > requires 12 ms (three-quarters cycle average at 60 Hz) or 15 ms > (three-quarters cycle average at 50 Hz) to detect that mains power has > failed, leaving your system a very short time to abort that long queue of > disk write commands. This is a total crap argument, because you invent an impossible request, pretend that your opponent made that request, then show that it's impossible to fulfill the impossible request. No shit, Sherlock! Of course it's "a very short time to abort that long queue of disk write commands". However, what is asked here is entirely different: disks must complete writes of sectors that they started writing, this is all. They do not need to report _anything_ to the host, in fact they may ignore the host interface completely the moment the power failure sequence is triggered. Nor do they need to do anything about queued commands: abort them, discard them in any way, or whatever. Just complete the sector, and start the head-landing sequence. IBM Deskstar is completely broken, and that's a fact. BTW, hpa went on how he was buying IBM drives, how good they were, and what a surprise it was that IBM fucked Deskstar. Hardly a surprise. The first time I heard of an IBM drive was a horror story. Our company was making RAID arrays, and we sourced new IBM SCSI disks. They were qualified through rigorous testing, as was the procedure in the company. 
So, after a while they started to fail. It turned out that bearings leaked grease onto the platters. Of course, we had shipped tens of thousands of those by the time IBM explained to us that every one of them would die within a year. We shipped Seagates ever after. -- Pete ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? [not found] ` <mailman.1006644421.6553.linux-kernel2news@redhat.com> 2001-11-25 4:20 ` Pete Zaitcev @ 2001-11-25 13:52 ` Pedro M. Rodrigues 1 sibling, 0 replies; 81+ messages in thread From: Pedro M. Rodrigues @ 2001-11-25 13:52 UTC (permalink / raw) To: satch, linux-kernel, Pete Zaitcev With those Seagates you probably just got yourself something else to worry about, maybe even more sneaky than disks failing completely after one year. I've had a $40,000+ external raid system (brand withheld), promising reliability and data security at all levels, and with enough bells and whistles to bore a "geek". It came with Hitachi disks and that surprised me, because that box was replacing a same-brand one that was sold with IBM disks - the best and the only thing they used, I was told. I thought maybe they knew something we don't, or maybe they were really special. Anyway, some time later we started having complete disk lockups in the device. Honest, the hardware would find bad blocks in one of the disks with parity that weren't remapped. And for some reason the hardware would just freeze after some time. After checking with support we were sent a new batch of disks to replace the current ones, with a different firmware level. It did the trick. After backing up and restoring 360GB of data, of course. But this raises some questions. And it really makes me worry about where the industry is going. Is it the increasing complexity of the technology? Are they cutting too many corners trying to reach the market sooner? Or just cost cutting with old-fashioned second-source suppliers? I am more and more worried about what passes as "enterprise level storage" these days. /Pedro On 24 Nov 2001 at 23:20, Pete Zaitcev wrote: > > IBM Deskstar is completely broken, and that's a fact. > > BTW, hpa went on how he was buying IBM drives, how good they were, and > what a surprise it was that IBM fucked Deskstar. Hardly a surprise. 
> The first time I heard of IBM drive was a horror story. Our company > was making RAID arrays, and we sourced new IBM SCSI disks. They were > qualified through a rigorous testing as it was the procedure in the > company. So, after a while they started to fail. It turned out that > bearings leaked grease to platters. Of course, we shipped tens of > thousands of those when IBM explained to us that every one of them > will die in a year. We shipped Seagates ever after. > > -- Pete ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-24 19:20 ` Florian Weimer ` (4 preceding siblings ...) [not found] ` <mailman.1006644421.6553.linux-kernel2news@redhat.com> @ 2001-11-25 12:30 ` Matthias Andree 2001-11-25 15:04 ` Barry K. Nathan 5 siblings, 1 reply; 81+ messages in thread From: Matthias Andree @ 2001-11-25 12:30 UTC (permalink / raw) To: linux-kernel On Sat, 24 Nov 2001, Florian Weimer wrote: > > However, if it's really true that DTLA drives and their successor > > corrupt blocks (generate bad blocks) on power loss during block writes, > > these drives are crap. > > They do, even IBM admits that (on > > http://www.cooling-solutions.de/dtla-faq > > you find a quote from IBM confirming this). IBM says it's okay, you > have to expect this to happen. So much for their expertise in making > hard disks. This makes me feel rather dizzy (lots of IBM drives in > use). Well, claiming the OS causes hard errors? Design fault. Claiming DC loss causes hard errors? Design fault. IBM had really better shed some real light on this issue, and if they spoiled their firmware (heck, there ARE firmware updates for OEM disks of the 75GXP series) or electronics, they'd better admit that so as to restore the trust people had before DTLA drives were sold. FUD works its way, so personally, I'm not buying IBM drives until this issue is FULLY resolved, so I presume, I won't buy any DTLA or IC35Lxx drives of the current series. This is not a recommendation, just what I'm doing. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-25 12:30 ` Matthias Andree @ 2001-11-25 15:04 ` Barry K. Nathan 2001-11-25 16:31 ` Matthias Andree 0 siblings, 1 reply; 81+ messages in thread From: Barry K. Nathan @ 2001-11-25 15:04 UTC (permalink / raw) To: Matthias Andree; +Cc: linux-kernel > Claiming DC loss to cause hard errors? Design fault. > > IBM would really better shed some real light on this issue, and if they > spoiled their firmware (heck, there ARE firmware updates for OEM disks > of the 75GXP series) or electronics, they'd better admit that so as to > reinstore the trust people had before DTLA drives were sold. "Power off during write operations may make an incomplete sector which will report hard data error when read. The sector can be recovered by a rewrite operation." http://www-3.ibm.com/storage/hdd/tech/techlib.nsf/techdocs/85256AB8006A31E587256A77006E0E91/$file/D60gxp_sp21.pdf Deskstar 60GXP specifications, section 6.0 The above quote and URL are IBM's official word, from their OEM specification manual. FWIW, I checked the OEM manual for the 73LZX as well (not that that drive is available anywhere, but I wanted to see what IBM did/is doing for that drive), and the corresponding section in that manual mentions nothing about incomplete sectors causing hard errors. I just checked the 36LZX OEM spec as well and that also omits the same clause. OTOH, A few hours ago I checked the specs for several TravelStars and they mentioned this incomplete sector thing. So, I guess IBM's position on this is that this failure mode is OK for IDE drives but not for SCSI. Here's a starting point for finding the IBM manuals: http://www-3.ibm.com/storage/hdd/tech/techlib.nsf/pages/main?OpenDocument (Just for my curiosity, I checked for the microdrives too. The phrasing is different there: "There is a possibility that power off during a write operation might make a maximum of 1 sector of data unreadable. 
This state can be recovered by a rewrite operation.") -Barry K. Nathan <barryn@pobox.com> ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-25 15:04 ` Barry K. Nathan @ 2001-11-25 16:31 ` Matthias Andree 2001-11-27 2:39 ` Pavel Machek 0 siblings, 1 reply; 81+ messages in thread From: Matthias Andree @ 2001-11-25 16:31 UTC (permalink / raw) To: linux-kernel > "Power off during write operations may make an incomplete sector which > will report hard data error when read. The sector can be recovered by a > rewrite operation." So the proper defect management would be to simply initialize the broken sector once a fsck hits it (still, I've never seen disks develop so many bad blocks so quickly as those failed DTLA-307045 drives I had). Note, the specifications say that the write cache setting is ignored when the drive runs out of spare blocks for reassignment after defects (so that the drive can return the error code right away when it cannot guarantee the write actually goes to disk). -- Matthias Andree "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." Benjamin Franklin ^ permalink raw reply [flat|nested] 81+ messages in thread
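Per the IBM quote above, such a sector is recovered by simply rewriting it. A minimal sketch of that repair (the moral equivalent of one sector of `dd ... conv=notrunc`); the device path and sector number in the example are placeholders, and on a real drive you would triple-check both before writing, since the operation destroys whatever the sector held:

```python
import os

SECTOR = 512

def rewrite_sector(path, lba, payload=None):
    """Overwrite one sector in place, forcing the drive to rewrite it with a
    good checksum (and reassign it if the medium is actually bad)."""
    buf = payload if payload is not None else b"\0" * SECTOR
    assert len(buf) == SECTOR
    fd = os.open(path, os.O_WRONLY)
    try:
        os.lseek(fd, lba * SECTOR, os.SEEK_SET)
        os.write(fd, buf)
        os.fsync(fd)
    finally:
        os.close(fd)

# Placeholders -- do not run blindly against a real device:
# rewrite_sector("/dev/hda", 123456)
```

In 2001 terms, this is what IBM's low-level-format tool did wholesale; the per-sector rewrite is the surgical version Matthias suggests fsck could perform.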
* Re: Journaling pointless with today's hard disks? 2001-11-25 16:31 ` Matthias Andree @ 2001-11-27 2:39 ` Pavel Machek 2001-12-03 10:23 ` Matthias Andree 0 siblings, 1 reply; 81+ messages in thread From: Pavel Machek @ 2001-11-27 2:39 UTC (permalink / raw) To: linux-kernel Hi! > > "Power off during write operations may make an incomplete sector which > > will report hard data error when read. The sector can be recovered by a > > rewrite operation." > > So the proper defect management would be to simply initialize the broken > sector once a fsck hits it (still, I've never seen disks develop so many > bad blocks so quickly as those failed DTLA-307045 drives I had). > > Note, the specifications say that the write cache setting is ignored > when the drive runs out of spare blocks for reassignment after defects > (so that the drive can return the error code right away when it cannot > guarantee the write actually goes to disk). They should turn off write-back after number-of-spare-block < cache-size, otherwise they are not safe. Pavel -- Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt, details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 2:39 ` Pavel Machek @ 2001-12-03 10:23 ` Matthias Andree 0 siblings, 0 replies; 81+ messages in thread From: Matthias Andree @ 2001-12-03 10:23 UTC (permalink / raw) To: linux-kernel On Tue, 27 Nov 2001, Pavel Machek wrote: > > Note, the specifications say that the write cache setting is ignored > when the drive runs out of spare blocks for reassignment after defects > (so that the drive can return the error code right away when it cannot > guarantee the write actually goes to disk). > > They should turn off write-back after number-of-spare-block < cache-size, > otherwise they are not safe. I don't know exactly what they're doing, but they also need to safeguard against defective spare blocks, so number-of-spare-blocks < cache-size is not sufficient. ^ permalink raw reply [flat|nested] 81+ messages in thread
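[Editorial note] The recovery path this subthread converges on (IBM's "the sector can be recovered by a rewrite operation", and Matthias's suggestion of reinitializing the broken sector once fsck hits it) can be sketched in a few lines. This is an illustrative sketch only: the function name is invented, a 512-byte sector is assumed, and on a real block device the write is destructive, so it would only ever be aimed at a sector already reported unreadable.

```python
import os

SECTOR_SIZE = 512  # assumed; the sector size of the drives under discussion

def rewrite_sector(dev_path, lba, data=None):
    """Overwrite one sector so the drive rewrites its on-disk checksum.

    WARNING: this destroys whatever was in the sector. On a real block
    device (e.g. a whole-disk node) run it only against a sector that
    reads have already reported as a hard error. dev_path and lba are
    caller-supplied; both names here are hypothetical.
    """
    if data is None:
        data = b"\x00" * SECTOR_SIZE
    assert len(data) == SECTOR_SIZE
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        os.lseek(fd, lba * SECTOR_SIZE, os.SEEK_SET)
        os.write(fd, data)
        os.fsync(fd)  # push it to the device, not just the page cache
    finally:
        os.close(fd)
```

Against a real disk, `dev_path` would be the block device node and `lba` the failing sector number reported in the kernel log.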
* Re: Journaling pointless with today's hard disks? 2001-11-24 13:03 Journaling pointless with today's hard disks? Florian Weimer 2001-11-24 13:40 ` Rik van Riel @ 2001-11-25 9:14 ` Chris Wedgwood 2001-11-25 22:55 ` Daniel Phillips 2001-11-26 16:59 ` Rob Landley 2001-11-26 17:14 ` Steve Brueggeman 2 siblings, 2 replies; 81+ messages in thread From: Chris Wedgwood @ 2001-11-25 9:14 UTC (permalink / raw) To: Florian Weimer; +Cc: linux-kernel On Sat, Nov 24, 2001 at 02:03:11PM +0100, Florian Weimer wrote: When the drive is powered down during a write operation, the sector which was being written has got an incorrect checksum stored on disk. So far, so good---but if the sector is read later, the drive returns a *permanent*, *hard* error, which can only be removed by a low-level format (IBM provides a tool for it). The drive does not automatically map out such sectors. AVOID SUCH DRIVES... I have both Seagate and IBM SCSI drives which are hot-swappable in a test machine that I used for testing various journalling filesystems a while back for reliability. Some (many) of those tests involved removing the disk during writes (literally) and checking the results afterwards. The drives were set not to write-cache (they don't by default, but all my IDE drives do, so maybe this is a SCSI thing?) At no point did I ever see a partial write or corrupted sector; nor have I seen any appear in the grown defect table, so as best as I can tell, even under removal with sustained writes there are SOME DRIVES WHERE THIS ISN'T A PROBLEM. Now, since EMC, NetApp, Sun, HP, Compaq, etc. all have products which presumably depend on this behavior, I don't think it's going to go away; it will perhaps just become important to know which drives are brain-damaged and list them so people can avoid them. As this will affect the Windows world too, consumer pressure will hopefully rectify this problem. --cw P.S. 
Write-caching in hard-drives is insanely dangerous for journalling filesystems and can result in all sorts of nasties. I recommend people turn this off in their init scripts (perhaps I will send a patch for the kernel to do this on boot, I just wonder if it will eat some drives). ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-25 9:14 ` Chris Wedgwood @ 2001-11-25 22:55 ` Daniel Phillips 2001-11-26 16:59 ` Rob Landley 1 sibling, 0 replies; 81+ messages in thread From: Daniel Phillips @ 2001-11-25 22:55 UTC (permalink / raw) To: Chris Wedgwood, Florian Weimer; +Cc: linux-kernel, Andre Hedrick On November 25, 2001 10:14 am, Chris Wedgwood wrote: > On Sat, Nov 24, 2001 at 02:03:11PM +0100, Florian Weimer wrote: > Now, since EMC, NetApp, Sun, HP, Compaq, etc. all have products which > presumable depend on this behavior, I don't think it's going to go > away, it perhaps will just become important to know which drives are > brain-damaged and list them so people can avoid them. > > As this will affect the Windows world too consumer pressure will > hopefully rectify this problem. Andre Hedrick has put together a site with exactly this intention, check out: http://linuxdiskcert.org/ Of course, there's a lot of hard work between here and having a useful database, but, hey, well begun and all that... According to Andre: "the requirements are they apply a patch run a series of tests and then I will submit to the OEM for rebutal and if there is no resolution the drive and the test procedure on how to reproduce the error will be posted" -- Daniel ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-25 9:14 ` Chris Wedgwood 2001-11-25 22:55 ` Daniel Phillips @ 2001-11-26 16:59 ` Rob Landley 2001-11-26 20:30 ` Andre Hedrick ` (2 more replies) 1 sibling, 3 replies; 81+ messages in thread From: Rob Landley @ 2001-11-26 16:59 UTC (permalink / raw) To: Chris Wedgwood; +Cc: linux-kernel On Sunday 25 November 2001 04:14, Chris Wedgwood wrote: > > P.S. Write-caching in hard-drives is insanely dangerous for > journalling filesystems and can result in all sorts of nasties. > I recommend people turn this off in their init scripts (perhaps I > will send a patch for the kernel to do this on boot, I just > wonder if it will eat some drives). Anybody remember back when hard drives didn't reliably park themselves when they cut power? This isn't something drive makers seem to pay much attention to until customers scream at them for a while... Having no write caching on the IDE side isn't a solution either. The problem is the largest block of data you can send to an ATA drive in a single command is smaller than modern track sizes (let alone all the tracks under the heads on a multi-head drive), so without any sort of cacheing in the drive at all you add rotational latency between each write request for the point you left off writing to come back under the head again. This will positively KILL write performance. (I suspect the situation's more or less the same for read too, but nobody's objecting to read cacheing.) The solution isn't to avoid write cacheing altogether (performance is 100% guaranteed to suck otherwise, for reasons unrelated to how well your hardware works but to legacy request size limits in the ATA specification), but to have a SMALL write buffer, the size of one or two tracks to allow linear ATA write requests to be assembled into single whole-track writes, and to make sure the disks' electronics has enough capacitance in it to flush this buffer to disk. (How much do capacitors cost? 
We're talking what, an extra 20 milliseconds? The buffer should be small enough you don't have to do that much seeking.) Just add an off-the-shelf capacitor to your circuit. The firmware already has to detect power failure in order to park the head sanely, so make it flush the buffers along the way. This isn't brain surgery, it just wasn't a design criterion on IBM's checklist of features approved in the meeting. (Maybe they ran out of donuts and adjourned the meeting early?) Rob ^ permalink raw reply [flat|nested] 81+ messages in thread
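[Editorial note] Rob's estimate of roughly 20 ms of extra power is easy to sanity-check. The numbers below are assumptions for illustration (not measured drive specs): a worst-case power draw, his 20 ms window, and a tolerable brown-out floor for the 12 V rail.

```python
# How big a capacitor keeps a drive alive long enough to flush the
# buffer and park? All figures are assumed, not drive specs.

P = 10.0               # W: assumed worst-case draw (spindle + seeks + logic)
t = 0.020              # s: the ~20 ms window estimated above
V0, Vmin = 12.0, 9.0   # V: supply voltage and assumed brown-out floor

E = P * t                        # energy needed during the flush: 0.2 J
C = 2 * E / (V0**2 - Vmin**2)    # usable energy of a cap discharging V0 -> Vmin
print(f"{C * 1e6:.0f} uF")       # lands in the thousands of microfarads
```

The result is a few thousand microfarads, i.e. an ordinary electrolytic capacitor, which is consistent with the "off-the-shelf capacitor" argument (ignoring regulator efficiency and the 5 V rail, which would scale the number but not the conclusion).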
* Re: Journaling pointless with today's hard disks? 2001-11-26 16:59 ` Rob Landley @ 2001-11-26 20:30 ` Andre Hedrick 2001-11-26 20:35 ` Rob Landley 2001-11-26 20:53 ` Richard B. Johnson 2001-11-27 16:39 ` Matthias Andree 2 siblings, 1 reply; 81+ messages in thread From: Andre Hedrick @ 2001-11-26 20:30 UTC (permalink / raw) To: Rob Landley; +Cc: Chris Wedgwood, linux-kernel On Mon, 26 Nov 2001, Rob Landley wrote: > On Sunday 25 November 2001 04:14, Chris Wedgwood wrote: > > > > > P.S. Write-caching in hard-drives is insanely dangerous for > > journalling filesystems and can result in all sorts of nasties. > > I recommend people turn this off in their init scripts (perhaps I > > will send a patch for the kernel to do this on boot, I just > > wonder if it will eat some drives). > > Anybody remember back when hard drives didn't reliably park themselves when > they cut power? This isn't something drive makers seem to pay much attention > to until customers scream at them for a while... > > Having no write caching on the IDE side isn't a solution either. The problem > is the largest block of data you can send to an ATA drive in a single command > is smaller than modern track sizes (let alone all the tracks under the heads > on a multi-head drive), so without any sort of cacheing in the drive at all > you add rotational latency between each write request for the point you left > off writing to come back under the head again. This will positively KILL > write performance. (I suspect the situation's more or less the same for read > too, but nobody's objecting to read cacheing.) 
> > The solution isn't to avoid write cacheing altogether (performance is 100% > guaranteed to suck otherwise, for reasons unrelated to how well your hardware > works but to legacy request size limits in the ATA specification), but to > have a SMALL write buffer, the size of one or two tracks to allow linear ATA > write requests to be assembled into single whole-track writes, and to make > sure the disks' electronics has enough capacitance in it to flush this buffer > to disk. (How much do capacitors cost? We're talking what, an extra 20 > milliseconds? The buffer should be small enough you don't have to do that > much seeking.) > > Just add an off-the-shelf capacitor to your circuit. The firmware already > has to detect power failure in order to park the head sanely, so make it > flush the buffers along the way. This isn't brain surgery, it just wasn't a > design criteria on IBM's checklist of features approved in the meeting. > (Maybe they ran out of donuts and adjourned the meeting early?) > Rob, Send me an outline/description and I will present it during the Dec T13 meeting for a proposal number for inclusion into ATA-7. Regards, Andre Hedrick CEO/President, LAD Storage Consulting Group Linux ATA Development Linux Disk Certification Project ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 20:30 ` Andre Hedrick @ 2001-11-26 20:35 ` Rob Landley 2001-11-26 23:59 ` Andreas Dilger ` (3 more replies) 0 siblings, 4 replies; 81+ messages in thread From: Rob Landley @ 2001-11-26 20:35 UTC (permalink / raw) To: Andre Hedrick; +Cc: Chris Wedgwood, linux-kernel On Monday 26 November 2001 15:30, Andre Hedrick wrote: > On Mon, 26 Nov 2001, Rob Landley wrote: > > Just add an off-the-shelf capacitor to your circuit. The firmware > > already has to detect power failure in order to park the head sanely, so > > make it flush the buffers along the way. This isn't brain surgery, it > > just wasn't a design criteria on IBM's checklist of features approved in > > the meeting. (Maybe they ran out of donuts and adjourned the meeting > > early?) > > Rob, > > Send me an outline/discription and I will present it during the Dec T13 > meeting for a proposal number for inclusion into ATA-7. What kind of write-up do you want? (How formal?) The trick here is limiting the scope of the problem. Your buffer can't be larger than you can reliably write back on a sudden power failure. (This should be obvious.) So the obvious answer is to make your writeback cache SMALL. The problems that go with flushing it are then correspondingly small. Your READ cache can be as large as you like, but when the disk accepts data written to it, a journaling FS assumes it will be committed to disk. Explicit flush requests are largely trying to get the filesystem to know about disk implementation issues: that's unnecessary complexity. (And something vendors hate to implement because it kills performance.) But either the drive can flush cache when the power goes out, or it's not reliable. So how big of a cache is useful? Well, what does the cache DO? 
A small 1-track cache helps write full tracks at a time, and if the cache can hold a second track it can start its seek immediately upon finishing the first track without worrying about latency of the OS getting back to it with more data. But more than 2 tracks gives you no benefit: cacheing beyond that is the operating system's job. No benefit, and the liability of more to flush on power failure. The answer is simple: Don't Do That Then. You need stored power to flush the stored data, and a capacitor's better than a battery for several reasons. It's cheaper, it lasts longer (no repeated charge/discharge fatigue), it can provide a LOT of power very quickly (we're actuating motors here), and we're only asking for a fraction of a second's extra power here which isn't what most batteries are designed for anyway. Capacitors are. The people talking about batteries are trying to do battery backed up cache, which is silly and overkill. We want the data to go to a disk which is ALREADY spinning at full speed when we lose power. Current designs already try to flush the cache as they're losing power (a write cache is always in the process of being flushed, barring contention with read requests), and sometimes they even manage to do it. We just need a little extra power to guarantee we can shut down gracefully. A capacitor can provide a few miliseconds worth of power to keep the platters spinning at full speed, power the logic, do a maximum of two seeks, and of course feed power to the write head. Conceptually, our volatile ram cache needs a power cache to flush it on power failure, and cacheing amperage is what a capacitor DOES. Now let's get back to cache size. You need 1 track of cache to get full track writes with ATA. Being able to feed it a second track might be a good idea to avoid latency at the OS between one track finishing and the next starting. (If nothing else, you tell it where to seek to next.) 
But more than 2 tracks serves no purpose if the OS has a backlog of work for the disk to do, and if it doesn't we're not optimizing anything anyway. Now a cache large enough to hold 2 full tracks could also hold dozens of individual sectors scattered around the disk, which could take a full second to write off and power down. This is a "doctor, it hurts when I do this" question. DON'T DO THAT. The drive should block when it's fed sectors living on more than 2 tracks. Don't bother having the drive implement an elevator algorithm: the OS already has one. Just don't cache sectors living on more than 2 tracks at a time: treat it as a "cache full" situation and BLOCK. And further, don't cache anything for a SECOND track until you've already seeked to the first track. This is to limit the number of potential seeks the capacitor has to power. Reads work into this too: any time you get a "seek request", for read or for write, finish with the track you're on before moving. Accept new write requests into the buffer for the track you're currently on, and for ONE other track. If you're not currently on a track you have anything to write to, you can buffer stuff for only one other track. Anything else blocks just as if the buffer was full. (Because it is.) That way, the power down problem is strictly limited: 1) write out the track you're over 2) seek to the second track 3) write that out too 4) park the head You're done. You can measure this in the lab, determine exactly how much power your capacitor needs to supply to guarantee that, and implement it. Your worst case scenario is a full track write next to where the head normally parks, a full track write at the far end of the disk, and then seeking back to the landing zone. This is two seeks including the park, which should still be easily measured in miliseconds. 
There's no elevator algorithm (that's the OS's job), no battery backed up cache (not needed, the platters are already persistent), just a cheap solution for a cheap ATA drive, arrived at by limiting the size of the problem you're handling. What new hardware is involved? Add a capacitor. Add a power level sensor. (Drives may already have this to know when to park the head.) Firmware to manage the cache (limiting its data intake, and flushing right before parking). I think that's it. Did I miss anything? Oh yeah, on power fail stop worrying about read requests. (They can theoretically starve the write requests on this capacitor-powered guaranteed seek thing, although if the power IS failing there shouldn't be too many more of them coming in, should there? But they may be queued.) But that's fairly obvious, and there has to be logic for this already or else the read head would run out of power and crash into the disk before it got a chance to park... Again, just unload the real write cacheing on the OS because the purpose of the drive's cacheing is to batch requests to the track level and to disguise seek latency a bit, and if that's ALL it does it's easy to reliably flush that on power down with just a capacitor. Any cacheing beyond 2 tracks worth (not 2 tracks worth of individual sectors scattered all over the disk but "the current track and 1 other track") just gets in the way of reliability. Yes the drive maker may be wasting DRAM by doing that. Tell them they can dedicate that other ram to a read cache, but writes need to block to maintain the implicit guarantee that if the drive accepted the write the data will still be there after a power off.. Rob ^ permalink raw reply [flat|nested] 81+ messages in thread
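[Editorial note] Rob's admission policy above (buffer writes for the track under the head plus at most one other; anything beyond that blocks as "cache full") can be captured in a toy model. Everything here is illustrative: the class and method names are invented, and tracks are plain integers.

```python
class TwoTrackWriteCache:
    """Toy model of the proposed cache: accept writes for the current
    track and at most ONE other track; a write to a third distinct
    track is refused, meaning the host must block. On power failure
    the flush is bounded: current track, at most one seek, the other
    track, then park."""

    def __init__(self, current_track=None):
        self.current = current_track   # track the head is over
        self.pending = {}              # track number -> buffered sectors

    def accept(self, track, sector):
        """Buffer the write and return True, or return False when the
        host must block (a third distinct track would enter the cache)."""
        others = set(self.pending) - {self.current}
        if track != self.current and track not in others and len(others) >= 1:
            return False
        self.pending.setdefault(track, []).append(sector)
        return True

    def flush_on_power_fail(self):
        """Worst case on power loss: write the track under the head,
        seek once, write the one other track. Returns (track order,
        number of seeks); seeks is always <= 1 by construction."""
        order = []
        if self.current in self.pending:
            order.append(self.current)
        seeks = 0
        for track in self.pending:
            if track != self.current:
                order.append(track)
                seeks += 1
        self.pending.clear()
        return order, seeks
```

The point of the model is the invariant, not the data structure: however the host mixes requests, the capacitor never has to fund more than one seek plus the park.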
* Re: Journaling pointless with today's hard disks? 2001-11-26 20:35 ` Rob Landley @ 2001-11-26 23:59 ` Andreas Dilger 2001-11-27 0:24 ` H. Peter Anvin 2001-11-27 1:23 ` Ian Stirling ` (2 subsequent siblings) 3 siblings, 1 reply; 81+ messages in thread From: Andreas Dilger @ 2001-11-26 23:59 UTC (permalink / raw) To: Rob Landley; +Cc: Andre Hedrick, Chris Wedgwood, linux-kernel On Nov 26, 2001 15:35 -0500, Rob Landley wrote: > The drive should block when it's fed sectors living on more than 2 tracks. > Don't bother having the drive implement an elevator algorithm: the OS already > has one. Just don't cache sectors living on more than 2 tracks at a time: > treat it as a "cache full" situation and BLOCK. The other thing that concerns a journaling fs is write ordering. If you can _guarantee_ that an entire track (or whatever) can be written to disk in _all_ cases, then it is OK to reorder write requests within that track AS LONG AS YOU DON'T REORDER WRITES WHERE YOU SKIP BLOCKS THAT ARE NOT GUARANTEED TO COMPLETE. Generally, in Linux, ext3 will wait on all of the journal transaction blocks to be written before it writes a commit record, which is its way of guaranteeing that everything before the commit is valid. If you start write cacheing the transaction blocks, return, and then write the commit record to disk before the other transaction blocks are written, this is SEVERELY BROKEN. If it was guaranteed that the commit record would hit the platters at the same time as the other journal transaction blocks, that would be the minimum acceptable behaviour. Obviously a working TCQ or write barrier would also allow you to optimize all writes before the commit block is written, but that should be an _enhancement_ above the basic write operations, only available if you start using this feature. 
Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 81+ messages in thread
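[Editorial note] The ordering invariant Andreas states (the drive may reorder within a batch, but the commit record must never reach the platter before the journal blocks it covers) can be modeled like this. It is a sketch, not ext3 code: the disk model and all names are invented, and `flush()` stands in for whatever barrier mechanism the drive provides.

```python
import random

class OrderedDisk:
    """Toy drive: cached writes may be reordered arbitrarily, but a
    flush acts as a barrier forcing everything cached so far onto the
    platter before any later write."""

    def __init__(self):
        self.cache = []     # accepted but not yet persistent
        self.platter = []   # persisted order

    def write(self, block):
        self.cache.append(block)

    def flush(self):
        random.shuffle(self.cache)       # the drive reorders within the batch...
        self.platter.extend(self.cache)  # ...but nothing later overtakes it
        self.cache.clear()

def commit_transaction(disk, journal_blocks, commit_record):
    """The minimum acceptable behaviour described above: all journal
    transaction blocks must be stable before the commit record is issued."""
    for block in journal_blocks:
        disk.write(block)
    disk.flush()               # barrier: transaction blocks are on the platter
    disk.write(commit_record)  # only now may the commit record go out
    disk.flush()
```

If the first flush is skipped, a reordering drive can persist the commit record ahead of its transaction blocks, which is exactly the "SEVERELY BROKEN" case in the message above.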
* Re: Journaling pointless with today's hard disks? 2001-11-26 23:59 ` Andreas Dilger @ 2001-11-27 0:24 ` H. Peter Anvin 2001-11-27 0:52 ` H. Peter Anvin 0 siblings, 1 reply; 81+ messages in thread From: H. Peter Anvin @ 2001-11-27 0:24 UTC (permalink / raw) To: linux-kernel Followup to: <20011126165920.N730@lynx.no> By author: Andreas Dilger <adilger@turbolabs.com> In newsgroup: linux.dev.kernel > > The other thing that concerns a journaling fs is write ordering. If you > can _guarantee_ that an entire track (or whatever) can be written to disk > in _all_ cases, then it is OK to reorder write requests within that track > AS LONG AS YOU DON'T REORDER WRITES WHERE YOU SKIP BLOCKS THAT ARE NOT > GUARANTEED TO COMPLETE. > > Generally, in Linux, ext3 will wait on all of the journal transaction > blocks to be written before it writes a commit record, which is its way > of guaranteeing that everything before the commit is valid. If you start > write cacheing the transaction blocks, return, and then write the commit > record to disk before the other transaction blocks are written, this is > SEVERELY BROKEN. If it was guaranteed that the commit record would hit > the platters at the same time as the other journal transaction blocks, > that would be the minimum acceptable behaviour. > > Obviously a working TCQ or write barrier would also allow you to optimize > all writes before the commit block is written, but that should be an > _enhancement_ above the basic write operations, only available if you > start using this feature. > Indeed; having explicit write barriers would be a very useful feature, but the drives MUST default to strict ordering unless reordering (with write barriers) have been enabled explicitly by the OS. Furthermore, I would like to add the following constraint to your writeup: ** For each individual sector, a write MUST either complete or not take place at all. In other words, writes are guaranteed to be atomic on a sector-by-sector basis. -hpa P.S. 
Thanks, Andre, for taking the initiative of getting an actual commit model into the standardized specification. Otherwise we'd be doomed to continue down the path where what operating systems need for sane operation and what disk drives provide are increasingly divergent. -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com> ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:24 ` H. Peter Anvin @ 2001-11-27 0:52 ` H. Peter Anvin 2001-11-27 1:11 ` Andrew Morton 2001-11-27 16:56 ` Matthias Andree 0 siblings, 2 replies; 81+ messages in thread From: H. Peter Anvin @ 2001-11-27 0:52 UTC (permalink / raw) To: linux-kernel Followup to: <9tumf0$dvr$1@cesium.transmeta.com> By author: "H. Peter Anvin" <hpa@zytor.com> In newsgroup: linux.dev.kernel > > Indeed; having explicit write barriers would be a very useful feature, > but the drives MUST default to strict ordering unless reordering (with > write barriers) have been enabled explicitly by the OS. > On the subject of write barriers... such a setup probably should have a serial number field for each write barrier command, and a "WAIT FOR WRITE BARRIER NUMBER #" command -- which will wait until all writes preceding the specified write barrier have been committed to stable storage. It might also be worthwhile to have the equivalent nonblocking operation -- QUERY LAST WRITE BARRIER COMMITTED. -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com> ^ permalink raw reply [flat|nested] 81+ messages in thread
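[Editorial note] The numbered-barrier proposal above can be modeled as a command queue. This is a sketch of the *proposed* semantics only; nothing like it existed in ATA at the time, and all names are invented.

```python
from collections import deque

class BarrierQueue:
    """Model of the proposal: every barrier gets a serial number;
    WAIT FOR WRITE BARRIER NUMBER #n blocks until barrier n has
    committed (here: services the queue until it has); QUERY LAST
    WRITE BARRIER COMMITTED reports progress without blocking."""

    def __init__(self):
        self.queue = deque()
        self.next_serial = 1
        self.committed = []               # persisted writes, in order
        self.last_barrier_committed = 0

    def submit_write(self, block):
        self.queue.append(("write", block))

    def submit_barrier(self):
        serial = self.next_serial
        self.next_serial += 1
        self.queue.append(("barrier", serial))
        return serial

    def _service_one(self):
        kind, arg = self.queue.popleft()
        if kind == "write":
            self.committed.append(arg)
        else:
            self.last_barrier_committed = arg

    def query_last_barrier(self):
        """Nonblocking: QUERY LAST WRITE BARRIER COMMITTED."""
        return self.last_barrier_committed

    def wait_for_barrier(self, serial):
        """Blocking: WAIT FOR WRITE BARRIER NUMBER #serial."""
        while self.last_barrier_committed < serial:
            self._service_one()
```

As the message following this one notes, waiting on a specific barrier number is exactly the primitive an efficient fsync() wants: the caller waits only for *its* barrier, not for every write queued behind it.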
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:52 ` H. Peter Anvin @ 2001-11-27 1:11 ` Andrew Morton 2001-11-27 1:15 ` H. Peter Anvin 2001-11-27 16:56 ` Matthias Andree 1 sibling, 1 reply; 81+ messages in thread From: Andrew Morton @ 2001-11-27 1:11 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-kernel "H. Peter Anvin" wrote: > > Followup to: <9tumf0$dvr$1@cesium.transmeta.com> > By author: "H. Peter Anvin" <hpa@zytor.com> > In newsgroup: linux.dev.kernel > > > > Indeed; having explicit write barriers would be a very useful feature, > > but the drives MUST default to strict ordering unless reordering (with > > write barriers) have been enabled explicitly by the OS. > > > > On the subject of write barriers... such a setup probably should have > a serial number field for each write barrier command, and a "WAIT FOR > WRITE BARRIER NUMBER #" command -- which will wait until all writes > preceeding the specified write barrier has been committed to stable > storage. It might also be worthwhile to have the equivalent > nonblocking operation -- QUERY LAST WRITE BARRIER COMMITTED. > For ext3 at least, all that is needed is a barrier which says "don't reorder writes across here". Asynchronous behaviour beyond that is OK - the disk is free to queue multiple transactions internally as long as the barriers are observed. If the power goes out we'll just recover up to and including the last-written commit block. - ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 1:11 ` Andrew Morton @ 2001-11-27 1:15 ` H. Peter Anvin 2001-11-27 16:59 ` Matthias Andree 0 siblings, 1 reply; 81+ messages in thread From: H. Peter Anvin @ 2001-11-27 1:15 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel Andrew Morton wrote: > "H. Peter Anvin" wrote: > >>Followup to: <9tumf0$dvr$1@cesium.transmeta.com> >>By author: "H. Peter Anvin" <hpa@zytor.com> >>In newsgroup: linux.dev.kernel >> >>>Indeed; having explicit write barriers would be a very useful feature, >>>but the drives MUST default to strict ordering unless reordering (with >>>write barriers) have been enabled explicitly by the OS. >>> >>> >>On the subject of write barriers... such a setup probably should have >>a serial number field for each write barrier command, and a "WAIT FOR >>WRITE BARRIER NUMBER #" command -- which will wait until all writes >>preceeding the specified write barrier has been committed to stable >>storage. It might also be worthwhile to have the equivalent >>nonblocking operation -- QUERY LAST WRITE BARRIER COMMITTED. >> >> > > For ext3 at least, all that is needed is a barrier which says > "don't reorder writes across here". Asynchronous behaviour > beyond that is OK - the disk is free to queue multiple transactions > internally as long as the barriers are observed. If the power > goes out we'll just recover up to and including the last-written > commit block. > Waiting for write barriers to clear is key to implementing fsync() efficiently and correctly. -hpa ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 1:15 ` H. Peter Anvin @ 2001-11-27 16:59 ` Matthias Andree 0 siblings, 0 replies; 81+ messages in thread From: Matthias Andree @ 2001-11-27 16:59 UTC (permalink / raw) To: linux-kernel On Mon, 26 Nov 2001, H. Peter Anvin wrote: > Waiting for write barriers to clear is key to implementing fsync() > efficiently and correctly. Well, all you want is a feature to write a set of blocks and be notified of the write's completion before you send more data, but OTOH you would not want to serialize fsync() operations, see the "groups" I described previously. That would probably involve tagging data blocks in the long run. Not sure if the current tag command API of ATA can already provide that; if so, all is there, and the barrier can be implemented in the driver rather than the drive firmware. -- Matthias Andree "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." Benjamin Franklin ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:52 ` H. Peter Anvin 2001-11-27 1:11 ` Andrew Morton @ 2001-11-27 16:56 ` Matthias Andree 1 sibling, 0 replies; 81+ messages in thread From: Matthias Andree @ 2001-11-27 16:56 UTC (permalink / raw) To: linux-kernel On Mon, 26 Nov 2001, H. Peter Anvin wrote: > On the subject of write barriers... such a setup probably should have > a serial number field for each write barrier command, and a "WAIT FOR > WRITE BARRIER NUMBER #" command -- which will wait until all writes > preceeding the specified write barrier has been committed to stable > storage. It might also be worthwhile to have the equivalent > nonblocking operation -- QUERY LAST WRITE BARRIER COMMITTED. A query model is not useful, because it involves polling, which is not what you want because it clogs up the CPU. Write barriers may be fun, however, they impose ordering constraints on the host side, which is not too useful. Real tagged commands and tagged completion will be really useful for performance, with write barriers, for example: data000 group A data001 group B data254 group A data253 group A data274 group B barrier group A data002 group B or something, and the drive could reorder anything, but it would only have to guarantee that all group-A data sent before the barrier would have made it to disk when the barrier command completed. ^ permalink raw reply [flat|nested] 81+ messages in thread
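[Editorial note] The group-tagged barrier sketched above can be modeled directly from its example: writes carry a group tag, the drive reorders freely, and a barrier for group A completes only once every group-A write submitted before it is stable, while group-B writes stay pending (so one fsync() does not serialize another). The class name and API are invented for illustration.

```python
class GroupBarrierDrive:
    """Toy model of group-tagged barriers: barrier(group) persists all
    outstanding writes for that group and leaves other groups' writes
    cached, free to be reordered and flushed later."""

    def __init__(self):
        self.pending = []   # (group, block) pairs not yet persistent
        self.platter = []   # persisted blocks

    def write(self, group, block):
        self.pending.append((group, block))

    def barrier(self, group):
        """Complete only the named group's writes; keep the rest pending."""
        keep = []
        for g, block in self.pending:
            if g == group:
                self.platter.append(block)
            else:
                keep.append((g, block))
        self.pending = keep
```

Replaying the example from the message above (data000/data254/data253 in group A, data001/data274/data002 in group B, with a group-A barrier before data002) leaves exactly the group-A blocks on the platter and the group-B blocks still cached.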
* Re: Journaling pointless with today's hard disks? 2001-11-26 20:35 ` Rob Landley 2001-11-26 23:59 ` Andreas Dilger @ 2001-11-27 1:23 ` Ian Stirling 2001-11-26 23:00 ` Rob Landley 2001-11-27 7:03 ` Ville Herva 2001-11-27 16:50 ` Matthias Andree 3 siblings, 1 reply; 81+ messages in thread From: Ian Stirling @ 2001-11-27 1:23 UTC (permalink / raw) To: landley; +Cc: Andre Hedrick, Chris Wedgwood, linux-kernel > > On Monday 26 November 2001 15:30, Andre Hedrick wrote: > > On Mon, 26 Nov 2001, Rob Landley wrote: > > > > Just add an off-the-shelf capacitor to your circuit. The firmware > > > already has to detect power failure in order to park the head sanely, so > > Send me an outline/discription and I will present it during the Dec T13 > > meeting for a proposal number for inclusion into ATA-7. > > What kind of write-up do you want? (How formal?) > > The trick here is limiting the scope of the problem. Your buffer can't be > larger than you can reliably write back on a sudden power failure. (This > should be obvious.) So the obvious answer is to make your writeback cache > SMALL. The problems that go with flushing it are then correspondingly small. <snip> > > Now a cache large enough to hold 2 full tracks could also hold dozens of > individual sectors scattered around the disk, which could take a full second > to write off and power down. This is a "doctor, it hurts when I do this" > question. DON'T DO THAT. Or, to seek to a journal track, and write the cache to it. Errors are a problem, writing twice may help. This avoids having to block on bad write patterns, for example, if you are writing mixed blocks that go to tracks 1 and 88, you can't start to write blocks that would go to track 44. Performance would rise if it can do the writes in elevator order. 
<snip> > That way, the power down problem is strictly limited: > > 1) write out the track you're over > 2) seek to the second track > 3) write that out too > 4) park the head Or 2) optionally seek to the journal track, and write the journal. > > What new hardware is involved? > > Add a capacitor. > > Add a power level sensor. (Drives may already have this to know when to park > the head.) Most drives I've taken apart recently seem to have passive means, a spring to move the head to the side, and a magnet to hold it there. <Snip>> > I think that's it. Did I miss anything? Oh yeah, on power fail stop It needs a power switch to stop back-feeding the computer. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 1:23 ` Ian Stirling @ 2001-11-26 23:00 ` Rob Landley 2001-11-27 2:41 ` H. Peter Anvin 2001-11-27 3:39 ` Ian Stirling 0 siblings, 2 replies; 81+ messages in thread From: Rob Landley @ 2001-11-26 23:00 UTC (permalink / raw) To: Ian Stirling; +Cc: Andre Hedrick, Chris Wedgwood, linux-kernel On Monday 26 November 2001 20:23, Ian Stirling wrote: > > Now a cache large enough to hold 2 full tracks could also hold dozens of > > individual sectors scattered around the disk, which could take a full > > second to write off and power down. This is a "doctor, it hurts when I > > do this" question. DON'T DO THAT. > > Or, to seek to a journal track, and write the cache to it. Except that at most you have one seek to write out all the pending cache data anyway, so what exactly does seeking to a journal track buy you? Now modern drives have this fun little thing where they remap bad sectors, so writing to one logical track can involve a seek, and the idea here is to cap seeks, so the drive has to keep track of where sectors ACTUALLY are and block based on their physical position rather than the logical position they present to the system. Which could be fairly evil. But oh well... (And in theory, if you're doing a linear write on a sector by sector basis, the discontinuous portions of a damaged track (the first half of the track, with one sector out of line, followed by the rest of the track) could still be written in one go, assuming the system unblocks when it physically seeks to the track in question, allowing the system to write the rest of the data to that track before it seeks away from it...) > Errors are a problem, writing twice may help. > This avoids having to block on bad write patterns, for example, if you > are writing mixed blocks that go to tracks 1 and 88, you can't start to > write blocks that would go to track 44. > Performance would rise if it can do the writes in elevator order. 
The elevator is the operating system's problem. To reliably write stuff back you can't have an unlimited number of different tracks in cache, or the seeks to write it all out will kill any reasonable finite power reserve you'd want to put in a disk. > <snip> > > > That way, the power down problem is strictly limited: > > > > 1) write out the track you're over > > 2) seek to the second track > > 3) write that out too > > 4) park the head > > Or 2) optionally seek to the journal track, and write the journal. Possibly. I still don't see what it gets you if you only have one track other than the one you're over to write to. (is the journal track near the area the head parks in? That could be a power saving method, I suppose. But it's also wasting disk space that would probably otherwise be used for storage or a block remapping, and how do you remap a bad sector out of the journal track if that happens?) > > What new hardware is involved? > > > > Add a capacitor. > > > > Add a power level sensor. (Drives may already have this to know when to > > park the head.) > > Most drives I've taken apart recently seem to have passive means, > a spring to move the head to the side, and a magnet to hold it there. Yeah, I'd heard that. That's why the word "may" was involved. :) (That and just trusting the inertia of the platter to aerodynamically keep the head airborne before it can snap back to the parking position.) You could still do this, by the way. It reduces the power requirements to only one seek. And with the journal track hack, that seek could be in the direction the spring pulls. Still not too thrilled about that, though... > <Snip>> > > > I think that's it. Did I miss anything? Oh yeah, on power fail stop > > It needs a power switch to stop back-feeding the computer. Yup. Rob ^ permalink raw reply [flat|nested] 81+ messages in thread
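The elevator ordering the thread keeps referring to can be sketched as a toy one-pass SCAN over pending write tracks (function and variable names here are hypothetical, for illustration only):

```python
def elevator_order(pending, head):
    """One-pass SCAN: service pending write tracks at or beyond the
    current head position in ascending order, then sweep back down
    through the remaining lower-numbered tracks."""
    up = sorted(t for t in pending if t >= head)
    down = sorted((t for t in pending if t < head), reverse=True)
    return up + down

# Ian's example: blocks for tracks 1 and 88 are pending and a block
# for track 44 arrives; with the head at track 40 the sweep visits
# 44, then 88, then 1.
print(elevator_order([1, 88, 44], head=40))  # [44, 88, 1]
```

This is the OS-side view; a drive doing the same thing internally could additionally account for remapped sectors, which is Ian's point about the drive knowing the real physical layout.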
* Re: Journaling pointless with today's hard disks? 2001-11-26 23:00 ` Rob Landley @ 2001-11-27 2:41 ` H. Peter Anvin 2001-11-27 0:19 ` Rob Landley 2001-11-27 3:39 ` Ian Stirling 1 sibling, 1 reply; 81+ messages in thread From: H. Peter Anvin @ 2001-11-27 2:41 UTC (permalink / raw) To: linux-kernel Followup to: <0111261800340R.02001@localhost.localdomain> By author: Rob Landley <landley@trommello.org> In newsgroup: linux.dev.kernel > > On Monday 26 November 2001 20:23, Ian Stirling wrote: > > > > Now a cache large enough to hold 2 full tracks could also hold dozens of > > > individual sectors scattered around the disk, which could take a full > > > second to write off and power down. This is a "doctor, it hurts when I > > > do this" question. DON'T DO THAT. > > > > Or, to seek to a journal track, and write the cache to it. > > Except that at most you have one seek to write out all the pending cache data > anyway, so what exactly does seeking to a journal track buy you? > It limits the amount you need to seek to exactly one seek. -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com> ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 2:41 ` H. Peter Anvin @ 2001-11-27 0:19 ` Rob Landley 2001-11-27 23:35 ` Andreas Bombe 0 siblings, 1 reply; 81+ messages in thread From: Rob Landley @ 2001-11-27 0:19 UTC (permalink / raw) To: H. Peter Anvin, linux-kernel On Monday 26 November 2001 21:41, H. Peter Anvin wrote: > Followup to: <0111261800340R.02001@localhost.localdomain> > By author: Rob Landley <landley@trommello.org> > In newsgroup: linux.dev.kernel > > > On Monday 26 November 2001 20:23, Ian Stirling wrote: > > > > Now a cache large enough to hold 2 full tracks could also hold dozens > > > > of individual sectors scattered around the disk, which could take a > > > > full second to write off and power down. This is a "doctor, it hurts > > > > when I do this" question. DON'T DO THAT. > > > > > > Or, to seek to a journal track, and write the cache to it. > > > > Except that at most you have one seek to write out all the pending cache > > data anyway, so what exactly does seeking to a journal track buy you? > > It limits the amount you need to seek to exactly one seek. > > -hpa But it's already exactly one seek in the scheme I proposed. Notice how of the two tracks you can be write-cacheing data for, one is the track you're currently over (no seek required, you're there). You flush to that track, there's one more seek to flush to the second track (which you were only cacheing data for to avoid latency, so the seek could start immediately without waiting for the OS to provide data), and then park. Now a journal track that's next to where the head parks could combine the "park" sweep with that one seek, and presumably be spring powered and hence save capacitor power. But I'm not 100% certain it would be worth it. (Are normal with-power-on seeks towards the park area powered by the spring, or the... I keep wanting to say "stepper motor" but I don't think those are what drives use anymore, are they? Sigh...) 
Rob ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 0:19 ` Rob Landley @ 2001-11-27 23:35 ` Andreas Bombe 2001-11-28 14:32 ` Rob Landley 0 siblings, 1 reply; 81+ messages in thread From: Andreas Bombe @ 2001-11-27 23:35 UTC (permalink / raw) To: linux-kernel On Mon, Nov 26, 2001 at 07:19:54PM -0500, Rob Landley wrote: > Now a journal track that's next to where the head parks could combine the > "park" sweep with that one seek, and presumably be spring powered and hence > save capacitor power. But I'm not 100% certain it would be worth it. When time is of the essence it should be worth it (drive makers will use the smallest possible capacitor, of course). Given that current 7200 RPM disks have marketed seek times of 8 or 9 ms, worst case seeks can be much longer. That 8ms is average and likely read seeks are weighted higher than write seeks. Writes have to be exact, but reads can be seeked sloppier (without waiting for the head to stop oscillating after braking) and error correction will take care of the rest. This would give us what in the worst case? 15ms (just a guess)? A journal track could be near the parking track and have directly adjacent tracks left free to allow for slightly sloppier/faster seeking. An expert could probably tell us whether this is complete BS or even feasible. > (Are > normal with-power-on seeks towards the park area powered by the spring, or > the... I keep wanting to say "stepper motor" but I don't think those are what > drives use anymore, are they? Sigh...) A simple spring is too slow, I guess. Also, it should not be so hard that it would slow down seeks against the spring. -- Andreas Bombe <bombe@informatik.tu-muenchen.de> DSA key 0x04880A44 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 23:35 ` Andreas Bombe @ 2001-11-28 14:32 ` Rob Landley 0 siblings, 0 replies; 81+ messages in thread From: Rob Landley @ 2001-11-28 14:32 UTC (permalink / raw) To: Andreas Bombe, linux-kernel On Tuesday 27 November 2001 18:35, Andreas Bombe wrote: > On Mon, Nov 26, 2001 at 07:19:54PM -0500, Rob Landley wrote: > > Now a journal track that's next to where the head parks could combine the > > "park" sweep with that one seek, and presumably be spring powered and > > hence save capacitor power. But I'm not 100% certain it would be worth > > it. > > When time if of essence it should be worth it (drive makers will use the > smallest possible capacitor, of course). Given that current 7200 RPM > disks have marketed seek times of 8 or 9 ms worst case seeks can be much > longer. > > That 8ms is average and likely read seeks are weighted higher than write Sure. The time to seek halfway across the disk, probably. > seeks. Writes have to be exact, but reads can be seeked sloppier > (without waiting for the head to stop oscillating after braking) and > error correction will take care of the rest. This would gives us what > in worst case? 15ms (just a guess)? I'd been thinking more like 20, but it really depends on the manufacturer. (And fun little detail, faster seeks can take MORE power, driving the coil thingy harder...) > A journal track could be near parking track and have directly adjacent > tracks left free to allow for slightly sloppier/faster seeking. An > expert could probably tell us whether this is complete BS or even > feasible. > > > (Are > > normal with-power-on seeks towards the park area powered by the spring, > > or the... I keep wanting to say "stepper motor" but I don't think those > > are what drives use anymore, are they? Sigh...) > > A simple spring is too slow, I guess. Also, it should not be so hard > that it would slow down seeks against the spring. I.E. 
they've already dealt with this problem in existing designs that use some variant of a spring to park, this is Not Our Problem. No, the "not worth it" above, in addition to the extra logic to unjournal the stuff on the next boot (and possibly lose power again during bootup and hopefully not wind up with a brick), is that the platter slows down if you don't keep it spinning. If the spring is seeking slowly, the capacitor has to keep the platter spinning longer, which could easily eat the power you're trying to avoid seeking with. Add in the extra complexity and it doesn't seem worth it, but that's for the lab guys to decide with measurements... Oh, and one other fun detail. One reason I don't like the "battery backed up SRAM cache", apart from being another way the disk dies of old age, is that it doesn't fix the "we lost power in the middle of writing a sector, so we just created a CRC error on the disk" problem, which is what started this thread. If you're going to fix THAT (which you seem to need a capacitor to do anyway), then you might as well do it right. Rob ^ permalink raw reply [flat|nested] 81+ messages in thread
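The "CRC error created by an interrupted write" failure mode that started the thread can be illustrated with a toy sector model; `zlib.crc32` stands in for the drive's real on-platter error-detection code, and all names are invented for the sketch:

```python
import zlib

def make_sector(data):
    """On-platter image of one sector: the data followed by its checksum."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def torn_write(old_img, new_data, cut):
    """Power fails after 'cut' bytes of the new image reach the platter:
    the sector now holds a front piece of the new write and the tail
    (including the old checksum) of whatever was there before."""
    new_img = make_sector(new_data)
    return new_img[:cut] + old_img[cut:]

def read_sector(img):
    data, crc = img[:-4], int.from_bytes(img[-4:], "big")
    if zlib.crc32(data) != crc:
        # What the DTLA reportedly returns: a hard read error, since the
        # stored checksum matches neither the old nor the new data.
        raise IOError("hard error: sector checksum mismatch")
    return data

old = make_sector(b"A" * 512)
assert read_sector(old) == b"A" * 512       # a completed write reads back fine
try:
    read_sector(torn_write(old, b"B" * 512, cut=100))
except IOError as e:
    print(e)                                 # the torn sector is unreadable
```

A sane firmware would remap or rewrite such a sector on the next write; the complaint in this thread is that the DTLA turns it into a permanent error until a low-level format.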
* Re: Journaling pointless with today's hard disks? 2001-11-26 23:00 ` Rob Landley 2001-11-27 2:41 ` H. Peter Anvin @ 2001-11-27 3:39 ` Ian Stirling 1 sibling, 0 replies; 81+ messages in thread From: Ian Stirling @ 2001-11-27 3:39 UTC (permalink / raw) To: landley; +Cc: Ian Stirling, Andre Hedrick, Chris Wedgwood, linux-kernel > > On Monday 26 November 2001 20:23, Ian Stirling wrote: > > > > Now a cache large enough to hold 2 full tracks could also hold dozens of > > > individual sectors scattered around the disk, which could take a full > > > second to write off and power down. This is a "doctor, it hurts when I > > > do this" question. DON'T DO THAT. > > > > Or, to seek to a journal track, and write the cache to it. > > Except that at most you have one seek to write out all the pending cache data > anyway, so what exactly does seeking to a journal track buy you? The ability to possibly dramatically improve performance by allowing more than one or two tracks to be write cached at once. Yes, in theory, the system should be able to elevator all seeks, but it may not know that track 400 has really been remapped to 200, the drive does. With write-caching on, the system doesn't know where the head is, the drive does. And, it's nearly free (an extra meg of space) <snip> > Possibly. I still don't see what it gets you if you only have one track > other than the one you're over to write to. (is the journal track near the > area the head parks in? That could be a power saving method, I suppose. But > it's also wasting disk space that would probably otherwise be used for > storage or a block remapping, and how do you remap a bad sector out of the > journal track if that happens?) You simply pick another track for the journal, the same as you would if an ordinary track goes bad. (it's tested on boot) The waste of disk space is utterly trivial. A meg in drives where the entry level is 40G? ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 20:35 ` Rob Landley 2001-11-26 23:59 ` Andreas Dilger 2001-11-27 1:23 ` Ian Stirling @ 2001-11-27 7:03 ` Ville Herva 2001-11-27 16:50 ` Matthias Andree 3 siblings, 0 replies; 81+ messages in thread From: Ville Herva @ 2001-11-27 7:03 UTC (permalink / raw) To: Rob Landley; +Cc: Andre Hedrick, linux-kernel On Mon, Nov 26, 2001 at 03:35:07PM -0500, you [Rob Landley] claimed: > > What kind of write-up do you want? (How formal?) > (...) > That way, the power down problem is strictly limited: > > 1) write out the track you're over > 2) seek to the second track > 3) write that out too > 4) park the head (...) A stupid question. Instead of adding these electric components and smart features to the drive logic, couldn't the problem simply be taken care of by adding an acknowledge message to the ATA protocol (unless it already has one)? So _after_ the data has been 100% committed to _disk_, the disk would acknowledge the OS. The OS wouldn't have to wait on the command (unless it wants to -- think of write ordering barrier!), and the disk could have as large a cache as it needs. It would simply accept the write command to its cache and send the ACKs even half a second later. The OS wouldn't consider anything as committed to disk before it gets the ACK. Again, I know nothing of ATA so this can be impossible to do (strict ordered command-reply protocol?), or already implemented but not enough. Please correct me. I must be missing something. -- v -- v@iki.fi ^ permalink raw reply [flat|nested] 81+ messages in thread
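The "don't consider it committed until the ACK" discipline Ville describes is roughly what the OS-facing interface already offers through `fsync()`: the write call returns once the kernel has the data, and the fsync is the acknowledgement. A minimal user-space sketch (with the caveat that fsync only guarantees the kernel has pushed the data to the drive; a drive write cache that lies about completion, the subject of this thread, is exactly the remaining gap):

```python
import os
import tempfile

# Write a "commit record" and refuse to consider it durable until the
# kernel-level ACK (fsync) returns.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"journal commit record\n")
    os.fsync(fd)  # blocks until the kernel has handed the data to the drive
    with open(path, "rb") as f:
        print(f.read())  # b'journal commit record\n'
finally:
    os.close(fd)
    os.unlink(path)
```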
* Re: Journaling pointless with today's hard disks? 2001-11-26 20:35 ` Rob Landley ` (2 preceding siblings ...) 2001-11-27 7:03 ` Ville Herva @ 2001-11-27 16:50 ` Matthias Andree 2001-11-27 20:31 ` Rob Landley 3 siblings, 1 reply; 81+ messages in thread From: Matthias Andree @ 2001-11-27 16:50 UTC (permalink / raw) To: linux-kernel Please fix your domain in your mailer, localhost.localdomain is prone to Message-ID collisions. Note, the power must RELIABLY last until all of the data has been written, which includes reassigning, seeking and the like, just don't do it if you cannot get a real solution. Battery-backed CMOS, NVRAM/Flash/whatever which lasts a couple of months should be fine though, as long as documents are publicly available that say how long this data lasts. Writing to disk will not work out unless you can keep the drive going for several seconds which will require BIG capacitors, so that's no option, you must go for NVRAM/Flash or something. OTOH, the OS must reliably know when something went wrong (even with good power it has a right to know), and preferably this scheme should not involve disabling the write cache, so TCQ or something mandatory would be useful (not sure if it's mandatory in current ATA standards). If a block has first been reported written OK and the disk later reports error, it must send the block back (incompatible with any current ATA draft I had my hands on), so I think tagged commands which are marked complete only after write+verify are the way to go. -- Matthias Andree "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." Benjamin Franklin ^ permalink raw reply [flat|nested] 81+ messages in thread
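Matthias's tagged-command idea, per-tag acknowledgements, possibly out of order, with a trivial write barrier built by draining the queue, can be sketched as a toy model (class and names are invented for illustration; real TCQ lives in the protocol, not in Python):

```python
import random

class ToyTaggedQueue:
    """Toy model of tagged command queueing: writes are acknowledged
    per-tag, possibly out of order; a barrier waits until the queue
    is idle before anything past it may be submitted."""
    def __init__(self):
        self.inflight = {}   # tag -> block
        self.done = []       # completion order as (tag, block)
        self.next_tag = 0

    def submit(self, block):
        tag = self.next_tag
        self.next_tag += 1
        self.inflight[tag] = block
        return tag

    def service_one(self):
        # The drive completes any outstanding tag it likes (its own
        # elevator decides which).
        tag = random.choice(list(self.inflight))
        self.done.append((tag, self.inflight.pop(tag)))

    def barrier(self):
        # Trivial write barrier as suggested in the thread: stop
        # scheduling until the disk is idle.
        while self.inflight:
            self.service_one()

q = ToyTaggedQueue()
for b in ("superblock", "data1", "data2"):
    q.submit(b)
q.barrier()               # everything before the barrier is on disk...
q.submit("journal tail")  # ...before the past-the-barrier block goes out
q.barrier()
```

The three pre-barrier blocks may complete in any order, but the journal tail is always last, which is the ordering guarantee a journaling filesystem actually needs.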
* Re: Journaling pointless with today's hard disks? 2001-11-27 16:50 ` Matthias Andree @ 2001-11-27 20:31 ` Rob Landley 2001-11-28 18:43 ` Matthias Andree 0 siblings, 1 reply; 81+ messages in thread From: Rob Landley @ 2001-11-27 20:31 UTC (permalink / raw) To: Matthias Andree, linux-kernel On Tuesday 27 November 2001 11:50, Matthias Andree wrote: > Please fix your domain in your mailer, localhost.localdomain is prone > for Message-ID collisions. I'm using Kmail talking to @home's mail server (to avoid the evil behavior sendmail has behind an IP masquerading firewall that triggers every spam filter in existence), so if either one of them cares about the hostname of my laptop ("driftwood", but apparently not being set right by Red Hat's scripts), then something's wrong anyway. But let's see... Ah fun, if you change the hostname of the box, either X or KDE can't pop up any more new applications until you exit X and restart it. Brilliant. Considering how many Konqueror windows I have open at present on my 6 desktops, I think I'll leave fixing this until later in the evening. But thanks for letting me know something's up... > > Note, the power must RELIABLY last until all of the data has been > writen, which includes reassigning, seeking and the like, just don't do > it if you cannot get a real solution. A) At most 1 seek to a track other than the one you're on. B) If sectors have been reassigned outside of this track to a "recovery" track, then that counts as a separate track. Tough. The point of the buffer is to let the OS feed data to the write head as fast as it can write it (which unbuffered ATA can't do because individual requests are smaller than individual tracks). You need a small buffer to avoid blocking between each and every ATA write while the platter rotates back into position. 
So you always let it have a little more data so it knows what to do next and can start work on it immediately (doing that next seek, writing that next sector as it passes under the head without having to wait for it to rotate around again.) That's it. No more buffer than does any good at the hardware level for request merging and minimizing seek latency. Any buffering over and above that is the operating system's job. Yes, the hardware can do a slightly better job with its own elevator algorithm using intimate knowledge of on-disk layout, but the OS can do a fairly decent job as long as logical linear sectors are linearly arranged on disk too. (Relocating bad sectors breaks this, but not fatally. It causes extra seeks in linear writes anyway where the elevator ISN'T involved, so you've already GOT a performance hit. And it just screws up the OS's elevator, not the rest of the scheme. You still have the current track written as one lump and an immediate seek to the other track, at which point the drive electronics can be accepting blocks destined for the track you seek back to.) The advantage of limiting the amount of data buffered to current track plus one other is you have a fixed amount of work to do on a loss of power. One seek, two track writes, and a spring-driven park. The amount of power this takes has a deterministic upper bound. THAT is why you block before accepting more data than that. > battery-backed CMOS, > NVRAM/Flash/whatever which lasts a couple of months should be fine > though, as long as documents are publicly available that say how long > this data lasts. Writing to disk will not work out unless you can keep > the drive going for several seconds which will require BIG capacitors, > so that's no option, you must go for NVRAM/Flash or something. You don't need several seconds. You need MILLISECONDS. Two track writes and one seek. This is why you don't accept more data than that before blocking. 
Your worst case scenario is a seek from near where the head parks to the other end of the disk, then the spring can pull it back. This should be well under 50 milliseconds. Your huge ram cache is there for reads. For writes you don't accept more than you can reliably flush if you want anything approaching reliability. If you're only going to spring for a capacitor as your power failure hedge, then the amount of write cache you can accept is small, but it turns out you only need a tiny amount of cache to get 90% of the benefit of write cacheing (merging writes into full tracks and seeking immediately to the next track). > OTOH, the OS must reliably know when something went wrong (even with > good power it has a right to know), and preferably this scheme should > not involve disabling the write cache, so TCQ or something mandatory > would be useful (not sure if it's mandatory in current ATA standards). We're talking about what happens to the drive on a catastrophic power failure. (Even with a UPS, this can happen if your case fan jams and your power supply catches fire and burns through a wire. Although most server side hosting facilities aren't that dusty, there's always worn bearings and other such fun things. And in a desktop environment, spilled sodas.) Currently, there are drives out there that stop writing a sector in the middle, leaving a bad CRC at the hardware level. This isn't exactly graceful. At the other end, drives with huge caches discard the contents of cache which a journaling filesystem thinks are already on disk. This isn't graceful either. > If a block has first been reported written OK and the disk later reports > error, it must send the block back (incompatible with any current ATA > draft I had my hands on), so I think tagged commands which are marked > complete only after write+verify are the way to go. If a block goes bad WHILE power is failing, you're screwed. This is just a touch unlikely. It will happen to somebody out there someday, sure. 
So will alpha particle decay corrupting a sector that was long ago written to the drive correctly. Designing for that is not practical. Recovering after the fact might be, but that doesn't mean you get your data back. Rob ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 20:31 ` Rob Landley @ 2001-11-28 18:43 ` Matthias Andree 2001-11-28 18:46 ` Rob Landley 0 siblings, 1 reply; 81+ messages in thread From: Matthias Andree @ 2001-11-28 18:43 UTC (permalink / raw) To: linux-kernel On Tue, 27 Nov 2001, Rob Landley wrote: > On Tuesday 27 November 2001 11:50, Matthias Andree wrote: > > Note, the power must RELIABLY last until all of the data has been > > writen, which includes reassigning, seeking and the like, just don't do > > it if you cannot get a real solution. > > A) At most 1 seek to a track other than the one you're on. Not really, assuming drives don't write to multiple heads concurrently, 2 MB hardly fit on a track. We can assume several hundred sectors, say 1,000, so we need four track writes, four verifies, and not a single block may be broken. We need even more time if we need to rewrite. > That's it. No more buffer than does good at the hardware level for request > merging and minimizing seek latency. Any buffering over and above that is > the operating system's job. Effectively, that's what tagged command queueing is all about, send a batch of requests that can be acknowledged individually and possibly out of order (which can lead to a trivial write barrier as suggested elsewhere, because all you do is wait with scheduling until the disk is idle, then send the past-the-barrier block). > (Relocating bad sectors breaks this, but not fatally. It causes extra seeks > in linear writes anyway where the elevator ISN'T involved, so you've already > GOT a performance hit. On modern drives, bad sectors are reassigned within the same track to evade seeks for a single bad block. If the spare block area within that track is exhausted, bad luck, you're going to seek. > The advantage of limiting the amount of data buffered to current track plus > one other is you have a fixed amount of work to do on a loss of power. One > seek, two track writes, and a spring-driven park. 
The amount of power this > takes has a deterministic upper bound. THAT is why you block before > accepting more data than that. It has not, you don't know in advance how many blocks on your journal track are bad. > You dont' need several seconds. You need MILISECONDS. Two track writes and > one seek. This is why you don't accept more data than that before blocking. No, you must verify the write, so that's one seek (say 35 ms, slow drive ;) and two revolutions per track at least, and, as shown, more than one track usually, so any bets of upper bounds are off. In the average case, say 70 ms should suffice, but in adverse conditions, that does not suffice at all. If writing the journal in the end fails because power is failing, the data is lost, so nothing is gained. > under 50 miliseconds. Your huge ram cache is there for reads. For writes > you don't accept more than you can reliably flush if you want anything > approaching reliability. Well, that's the point, you don't know in advance how reliable your journal track is. Worst case means: you need to consume every single spare block until the cache is flushed. Your point about write caching is valid, and IBM documentation for DTLA drives (minus their apparent other issues) declares that the write cache will be ignored when the spare block count is low. > such fun things. And in a desktop environment, spilled sodas.) Currently, > there are drives out there that stop writing a sector in the middle, leaving > a bad CRC at the hardware level. This isn't exactly graceful. At the other > end, drives with huge caches discard the contents of cache which a journaling > filesystem thinks are already on disk. This isn't graceful either. No-one said bad things cannot happen, but that is what actually happens. Where we started from, fsck would be able to "repair" a bad block by just zeroing and writing it, data that used to be there will be lost after short write anyhow. 
> If a block goes bad WHILE power is failing, you're screwed. This is just a > touch unlikely. It will happen to somebody out there someday, sure. So will > alpha particle decay corrupting a sector that was long ago written to the > drive correctly. Designing for that is not practical. Recovering after the > fact might be, but that doesn't mean you get your data back. Alpha particles still need to fight against inner (bit-wise) and outer (symbol- and blockwise) error correction codes, and Alpha particles don't usually move Bloch walls or get near the coercivity otherwise. We're talking about magnetic media, not E²PROMs or something. Assuming that write errors on an emergency cache flush just won't happen is just as wrong as assuming 640 kB will suffice or there's an upper bound of write time. You just don't know. -- Matthias Andree "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." Benjamin Franklin ^ permalink raw reply [flat|nested] 81+ messages in thread
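The in-track reassignment Matthias describes (spare slots on the same track, a seek only once they run out) can be sketched as a toy remap table; the layout numbers and names are invented:

```python
SECTORS_PER_TRACK = 500   # assumed layout
SPARES_PER_TRACK = 4      # assumed spare slots reserved on each track

def remap(table, spares_used, bad_lba):
    """Reassign a bad sector: prefer a spare slot on the same track
    (no seek); fall back to another track's spares once exhausted.
    Returns True if servicing this block now costs a seek."""
    track = bad_lba // SECTORS_PER_TRACK
    slot = spares_used.get(track, 0)
    if slot < SPARES_PER_TRACK:
        spares_used[track] = slot + 1
        table[bad_lba] = ("same-track spare", track, slot)
        return False                       # no seek needed
    table[bad_lba] = ("other-track spare", track + 1, 0)
    return True                            # bad luck, you're going to seek

table, used = {}, {}
seeks = [remap(table, used, lba) for lba in (1000, 1001, 1002, 1003, 1004)]
print(seeks)  # [False, False, False, False, True]: the 5th bad sector seeks
```

This is also why the worst-case flush time has no clean upper bound once a track's spare area is exhausted, which is Matthias's objection to Rob's budget.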
* Re: Journaling pointless with today's hard disks? 2001-11-28 18:43 ` Matthias Andree @ 2001-11-28 18:46 ` Rob Landley 2001-11-28 22:19 ` Matthias Andree 0 siblings, 1 reply; 81+ messages in thread From: Rob Landley @ 2001-11-28 18:46 UTC (permalink / raw) To: Matthias Andree, linux-kernel This is wandering far enough off topic that I'm not going to CC l-k after this message... On Wednesday 28 November 2001 13:43, Matthias Andree wrote: > On Tue, 27 Nov 2001, Rob Landley wrote: > > On Tuesday 27 November 2001 11:50, Matthias Andree wrote: > > > Note, the power must RELIABLY last until all of the data has been > > > writen, which includes reassigning, seeking and the like, just don't do > > > it if you cannot get a real solution. > > > > A) At most 1 seek to a track other than the one you're on. > > Not really, assuming drives don't write to multiple heads concurrently, Not my area of expertise. Depends how cheap they're being, I'd guess. Writing multiple tracks concurrently is probably more of a current drain than writing a single track at a time anyway, by the way. > 2 MB hardly fit on a track. We can assume several hundred sectors, say > 1,000, so we need four track writes, four verifies, and not a single > block may be broken. We need even more time if we need to rewrite. A 7200 RPM drive does 120 RPS, which means one revolution is 8.3 milliseconds. We're still talking a deterministic number of milliseconds with a double-digit total. And again, it depends on how you define "track". If you talk about the two tracks you can buffer as living on separate sides of platters you can't write to concurrently (not necessarily separated by a seek), then there is still no problem. (After the first track writes and it starts on the second track, the system still has 8.3 ms later to buffer another track before it drops below full writing speed.) It's all a question of limiting how much you buffer to what you can flush out. 
Artificial objections about "I could have 8 zillion platters I can only write to one at a time" just mean you're buffering too much to write out then. > > That's it. No more buffer than does good at the hardware level for > > request merging and minimizing seek latency. Any buffering over and > > above that is the operating system's job. > > Effectively, that's what tagged command queueing is all about, send a > batch of requests that can be acknowledged individually and possibly out > of order (which can lead to a trivial write barrier as suggested > elsewhere, because all you do is wait with scheduling until the disk is > idle, then send the past-the-barrier block). Doesn't stop the "die in the middle of a write=crc error" problem. And I'm not saying tagged command queueing is a bad idea, I'm just saying the idea's been out there forever and not everybody's done it yet, and this is a potentially simpler alternative focusing on the minimal duct-tape approach to reliability by reducing the level of guarantees you have to make. > > (Relocating bad sectors breaks this, but not fatally. It causes extra > > seeks in linear writes anyway where the elevator ISN'T involved, so > > you've already GOT a performance hit. > > On modern drives, bad sectors are reassigned within the same track to > evade seeks for a single bad block. If the spare block area within that > track is exhausted, bad luck, you're going to seek. Cool then. > > The advantage of limiting the amount of data buffered to current track > > plus one other is you have a fixed amount of work to do on a loss of > > power. One seek, two track writes, and a spring-driven park. The amount > > of power this takes has a deterministic upper bound. THAT is why you > > block before accepting more data than that. > > It has not, you don't know in advance how many blocks on your journal > track are bad. 
Another reason not to worry about an explicit dedicated journal track: just buffer one extra normal data track and budget in the power for a seek to it if necessary. There are circumstances where this will break down, sure. Any disk that has enough bad sectors on it will stop working. But that shouldn't be the normal case on a fresh drive, as is happening now with IBM. > > You don't need several seconds. You need MILLISECONDS. Two track writes > > and one seek. This is why you don't accept more data than that before > > blocking. > > No, you must verify the write, so that's one seek (say 35 ms, slow > drive ;) and two revolutions per track at least, and, as shown, more > than one track usually So don't buffer 4 tracks and call it one track. That's an artificial objection. An extra revolution is less than a seek, and noticeably less in power terms. >, so any bets of upper bounds are off. In the > average case, say 70 ms should suffice, but in adverse conditions, that > does not suffice at all. If writing the journal in the end fails because > power is failing, the data is lost, so nothing is gained. > > > under 50 milliseconds. Your huge ram cache is there for reads. For > > writes you don't accept more than you can reliably flush if you want > > anything approaching reliability. > > Well, that's the point, you don't know in advance how reliable your > journal track is. We don't know in advance that the drive won't fail completely due to excessive bad blocks. I'm trying to move the failure point, not pretending to eliminate it. Right now we've got something that could easily take out multiple drives in a RAID 5, and something that desktop users are likely to see noticeably more often than they upgrade their system. 
At the other end, drives with huge caches discard the contents > > of cache which a journaling filesystem thinks are already on disk. This > > isn't graceful either. > > No-one said bad things cannot happen, but that is what actually happens. > Where we started from, fsck would be able to "repair" a bad block by > just zeroing and writing it, data that used to be there will be lost > after short write anyhow. Assuming the drive's inherent bad-block detection mechanisms don't find it and remap it on a read first, rapidly consuming the spare block reserve. But that's a firmware problem... > Assuming that write errors on an emergency cache flush just won't happen > is just as wrong as assuming 640 kB will suffice or there's an upper > bound of write time. You just don't know. I don't assume they won't happen. They're actually more LIKELY to happen as the power level gradually drops as the capacitor discharges. I'm just saying there's a point beyond which any given system can't recover, and a point of diminishing returns trying to fix things. I'm proposing a cheap and easy improvement over the current system. I'm not proposing a system hardened to military specifications, just something that shouldn't fail noticeably for the majority of its users on a regular basis. (Powering down without flushing the cache is a bad thing. It shouldn't happen often. This is a last ditch deal-with-evil safety net system that has a fairly good chance of saving the data without extensively redesigning the whole system. Never said it was perfect. If a "1 in 2" failure rate drops to "1 in 100,000", it'll still hit people. But it's a distinct improvement. Maybe it can be improved beyond that. That would be nice. What's the effort, expense, and inconvenience involved?) Rob ^ permalink raw reply [flat|nested] 81+ messages in thread
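As a sanity check of the rotational-latency figures traded in the exchange above (120 RPS, 8.3 ms per revolution, a worst-case flush of one seek plus two verified track writes), here is a small back-of-the-envelope sketch; the 10 ms seek time is an assumed, illustrative figure, not a spec from the thread:

```python
# Sanity-check the rotational-latency arithmetic from the thread.
# Assumes a 7200 RPM drive and a 10 ms seek (illustrative figure only).

def revolution_ms(rpm):
    """Time for one full platter revolution, in milliseconds."""
    return 60_000 / rpm

def flush_budget_ms(rpm, seek_ms, track_writes, verify=True):
    """Worst-case time to flush the proposed small buffer: one seek plus
    one revolution per track write (two revolutions per track if each
    write is verified with a read pass)."""
    revs_per_track = 2 if verify else 1
    return seek_ms + track_writes * revs_per_track * revolution_ms(rpm)

print(f"one revolution: {revolution_ms(7200):.1f} ms")   # 8.3 ms, as stated
print(f"two tracks, verified: {flush_budget_ms(7200, 10, 2):.1f} ms")
```

Even with write-verify, the budget stays in the double-digit milliseconds Rob describes, which is the whole premise of the capacitor argument.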
* Re: Journaling pointless with today's hard disks? 2001-11-28 18:46 ` Rob Landley @ 2001-11-28 22:19 ` Matthias Andree 2001-11-29 22:21 ` Pavel Machek 0 siblings, 1 reply; 81+ messages in thread From: Matthias Andree @ 2001-11-28 22:19 UTC (permalink / raw) To: linux-kernel On Wed, 28 Nov 2001, Rob Landley wrote: > Not my area of expertise. Depends how cheap they're being, I'd guess. > Writing multiple tracks concurrently is probably more of a current drain than > writing a single track at a time anyway, by the way. Yes, and you need multiple write amplifiers and programmable filters (remember we do zoned recording nowadays) rather than just a set of switches. > > Effectively, that's what tagged command queueing is all about, send a > > batch of requests that can be acknowledged individually and possibly out > > of order (which can lead to a trivial write barrier as suggested > > elsewhere, because all you do is wait with scheduling until the disk is > > idle, then send the past-the-barrier block). > > Doesn't stop the "die in the middle of a write=crc error" problem. And I'm Not quite, but once you start journalling the buffer you can also write tag data -- or you know to discard the journal block when it has a CRC error and just rewrite it. > been out there forever and not everybody's done it yet, and this is a > potentially simpler alternative focusing on the minimal duct-tape approach to > reliability by reducing the level of guarantees you have to make. Yup. > > On modern drives, bad sectors are reassigned within the same track to > > evade seeks for a single bad block. If the spare block area within that > > track is exhausted, bad luck, you're going to seek. > > Cool then. I did a complete read-only benchmark of an old IBM DCAS which had like 300 grown defects and which I low-level formatted. Around the errors, it would seek, and the otherwise good performance would drop to the floor almost. 
Not sure whether that already had a strategy similar to that of the DTLAs or just too many blocks went boom. > Assuming the drive's inherent bad-block detection mechanisms don't find it > and remap it on a read first, rapidly consuming the spare block reserve. But > that's a firmware problem... Drives should never reassign blocks on read operations, because they'd take away the chance to try to read that block for say four hours. > I'm proposing a cheap and easy improvement over the current system. I'm not > proposing a system hardened to military specifications, just something that > shouldn't fail noticeably for the majority of its users on a regular basis. > (Powering down without flushing the cache is a bad thing. It shouldn't > happen often. This is a last ditch deal-with-evil safety net system that has > a fairly good chance of saving the data without extensively redesigning the > whole system. Never said it was perfect. If a "1 in 2" failure rate drops > to "1 in 100,000", it'll still hit people. But it's a distinct improvement. > Maybe it can be improved beyond that. That would be nice. What's the > effort, expense, and inconvenience involved?) As always, the first 90% to perfection consume 10% of the efforts, but the last 10% to perfection consume the other 90% of the efforts :-) I'm just proposing to make sure that the margin is not too narrow when you're writing your last blocks to the disk when you know power is failing. I'm still wondering if flash memory is really more effort than saving all the energy to keep this expensive mechanics going properly. -- Matthias Andree "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." Benjamin Franklin ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-28 22:19 ` Matthias Andree @ 2001-11-29 22:21 ` Pavel Machek 2001-12-01 10:55 ` Jeff V. Merkey 2001-12-02 0:08 ` Matthias Andree 0 siblings, 2 replies; 81+ messages in thread From: Pavel Machek @ 2001-11-29 22:21 UTC (permalink / raw) To: linux-kernel Hi! > > Assuming the drive's inherent bad-block detection mechanisms don't find it > > and remap it on a read first, rapidly consuming the spare block reserve. But > > that's a firmware problem... > > Drives should never reassign blocks on read operations, because they'd > take away the chance to try to read that block for say four hours. Why not? If drive gets ECC-correctable read error, it seems to me like good time to reassign. Pavel -- "I do not steal MS software. It is not worth it." -- Pavel Kankovsky ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-29 22:21 ` Pavel Machek @ 2001-12-01 10:55 ` Jeff V. Merkey 2001-12-02 0:08 ` Matthias Andree 1 sibling, 0 replies; 81+ messages in thread From: Jeff V. Merkey @ 2001-12-01 10:55 UTC (permalink / raw) To: Pavel Machek; +Cc: linux-kernel, jmerkey Check out the hotfixing code in NWFS. It handles exactly what this long and drawn out thread has discussed, and it's already in Linux. The code is contained in nwvp.c. I can tell you that in the past three years of running NWFS on Linux, and in all the time I worked at Novell from about 1996 on, I never once saw a server hotfix data after the newer "data guard" drive technologies came out. In fact, by default, I make the hotfix area on the drive about .1 % of the total space, since it's probably just wasted space these days. Still, it is a good idea to keep it around, just in case the "pointless" argument turns out not to be pointless and someone gets eaten by a shark (1 in 100,000,000) at the same instant they are struck by lightning (1 in 200,000,000). :-) Jeff On Thu, Nov 29, 2001 at 11:21:57PM +0100, Pavel Machek wrote: > Hi! > > > > Assuming the drive's inherent bad-block detection mechanisms don't find it > > > and remap it on a read first, rapidly consuming the spare block reserve. But > > > that's a firmware problem... > > > > Drives should never reassign blocks on read operations, because they'd > > take away the chance to try to read that block for say four hours. > > Why not? If drive gets ECC-correctable read error, it seems to me like > good time to reassign. > Pavel > -- > "I do not steal MS software. It is not worth it." 
> -- Pavel Kankovsky > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-29 22:21 ` Pavel Machek 2001-12-01 10:55 ` Jeff V. Merkey @ 2001-12-02 0:08 ` Matthias Andree 2001-12-03 20:04 ` Pavel Machek 1 sibling, 1 reply; 81+ messages in thread From: Matthias Andree @ 2001-12-02 0:08 UTC (permalink / raw) To: linux-kernel On Thu, 29 Nov 2001, Pavel Machek wrote: > > Drives should never reassign blocks on read operations, because they'd > > take away the chance to try to read that block for say four hours. > > Why not? If drive gets ECC-correctable read error, it seems to me like > good time to reassign. Because you don't know if it's just some slipped bits, a shutdown during write, or an actual fault. When that happens on a verify after write, that's indeed reasonable. Otherwise the drive should just mark that block as "watch closely on next write". ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-12-02 0:08 ` Matthias Andree @ 2001-12-03 20:04 ` Pavel Machek 0 siblings, 0 replies; 81+ messages in thread From: Pavel Machek @ 2001-12-03 20:04 UTC (permalink / raw) To: linux-kernel Hi! > > > Drives should never reassign blocks on read operations, because they'd > > > take away the chance to try to read that block for say four hours. > > > > Why not? If drive gets ECC-correctable read error, it seems to me like > > good time to reassign. > > Because you don't know if it's just some slipped bits, a shutdown during > write, or an actual fault. When that happens on a verify after write, > that's indeed reasonable. Otherwise the drive should just mark that > block as "watch closely on next write". Or better "write back and verify". You do not want even *ECC correctable* errors to be on your platters. Pavel -- "I do not steal MS software. It is not worth it." -- Pavel Kankovsky ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 16:59 ` Rob Landley 2001-11-26 20:30 ` Andre Hedrick @ 2001-11-26 20:53 ` Richard B. Johnson 2001-11-26 21:18 ` Journaling pointless with today's hard disks? [wandering OT] Rob Landley 2001-11-27 0:32 ` Journaling pointless with today's hard disks? H. Peter Anvin 2001-11-27 16:39 ` Matthias Andree 2 siblings, 2 replies; 81+ messages in thread From: Richard B. Johnson @ 2001-11-26 20:53 UTC (permalink / raw) To: Rob Landley; +Cc: Chris Wedgwood, linux-kernel On Mon, 26 Nov 2001, Rob Landley wrote: > On Sunday 25 November 2001 04:14, Chris Wedgwood wrote: > > > > > P.S. Write-caching in hard-drives is insanely dangerous for > > journalling filesystems and can result in all sorts of nasties. > > I recommend people turn this off in their init scripts (perhaps I > > will send a patch for the kernel to do this on boot, I just > > wonder if it will eat some drives). > > Anybody remember back when hard drives didn't reliably park themselves when > they cut power? This isn't something drive makers seem to pay much attention > to until customers scream at them for a while... > > Having no write caching on the IDE side isn't a solution either. The problem > is the largest block of data you can send to an ATA drive in a single command > is smaller than modern track sizes (let alone all the tracks under the heads > on a multi-head drive), so without any sort of cacheing in the drive at all > you add rotational latency between each write request for the point you left > off writing to come back under the head again. This will positively KILL > write performance. (I suspect the situation's more or less the same for read > too, but nobody's objecting to read cacheing.) 
> > The solution isn't to avoid write cacheing altogether (performance is 100% > guaranteed to suck otherwise, for reasons unrelated to how well your hardware > works but to legacy request size limits in the ATA specification), but to > have a SMALL write buffer, the size of one or two tracks to allow linear ATA > write requests to be assembled into single whole-track writes, and to make > sure the disks' electronics has enough capacitance in it to flush this buffer > to disk. (How much do capacitors cost? We're talking what, an extra 20 > miliseconds? The buffer should be small enough you don't have to do that > much seeking.) > > Just add an off-the-shelf capacitor to your circuit. The firmware already > has to detect power failure in order to park the head sanely, so make it > flush the buffers along the way. This isn't brain surgery, it just wasn't a > design criteria on IBM's checklist of features approved in the meeting. > (Maybe they ran out of donuts and adjourned the meeting early?) > > Rob It isn't that easy! Any kind of power storage within the drive would have to be isolated with diodes so that it doesn't try to run your motherboard as well as the drive. This means that +5 volt logic supply would now be 5.0 - 0.6 = 4.4 volts at the drive, well below the design voltage. Use of a Schottky diode (0.34 volts) would help somewhat, but you have now narrowed the normal power design-margin by 90 percent, not good. There is supposed to be a "power good" line out of your power supply which is supposed to tell equipment when the main power has failed or is about to fail. There isn't a "power good" line in SCSI so that doesn't help. Basically, when the power fails, all bets are off. A write in progress may not succeed any more than a seek in progress would. Seeks take a lot of power, usually from the +12 volt line. Typically, if a write is in progress, when low power is sensed by the drive, write current is terminated. 
At one time, there was an electromagnet that was released to move the heads to a landing zone. Now there is none. The center of radius of the head arm is slightly forward of the center of rotation of the disk so that when the heads "land", they skate to the inside of the platter, off the active media. The media is supposed to be able to take this abuse for quite some time. When a partially written sector is read with a bad CRC, the host (not the drive) can rewrite the sector. As long as the sector header, which is ahead of the write-splice, isn't destroyed, the disk doesn't need to be re-formatted. In the remote case where the sector header is destroyed, the bad sector may be re-mapped by the drive if there are any spare sectors still available. The first error returned to the host is the bad CRC. Subsequent reads will not return a bad CRC if the sector was re-mapped. However, the data is invalid! Therefore, the drivers can't retry reads expecting that a bad CRC got fixed so the data is okay. The driver needs to read all the sense data and try to figure it out. The solution is a UPS. When the UPS power gets low, shut down the computer, preferably automatically. Also, if your computer is on all day long as is typical at a workplace, never shut it off. Just turn off the monitor when you go home. Your disk drives will last until you decide to replace them because they are too small or too slow. And beware when you finally do turn off the computer. The disks may not spin up the next time you start the computer. It's a good idea to back up everything before shutting down a computer that has been running for a year or two. Of course you can re-boot as much as you want. Just don't kill the power! Cheers, Dick Johnson Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips). I was going to compile a list of innovations that could be attributed to Microsoft. Once I realized that Ctrl-Alt-Del was handled in the BIOS, I found that there aren't any. 
^ permalink raw reply [flat|nested] 81+ messages in thread
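The diode-drop arithmetic in Richard's message above is easy to verify numerically. A minimal sketch, assuming the usual ±5% tolerance on the 5 V logic rail (the tolerance figure is my assumption; the diode drops are the ones quoted):

```python
# Check the logic-rail margin after an isolation diode, per the figures
# in the post above. The +/-5% rail tolerance is an assumed (typical)
# value, not something stated in the thread.

NOMINAL_V = 5.0
MIN_RAIL_V = NOMINAL_V * 0.95   # 4.75 V lower design limit (assumed +/-5%)

for name, drop in (("silicon diode", 0.60), ("Schottky diode", 0.34)):
    at_drive = NOMINAL_V - drop
    status = "below" if at_drive < MIN_RAIL_V else "within"
    print(f"{name}: {at_drive:.2f} V at the drive, {status} the "
          f"{MIN_RAIL_V:.2f} V design limit")
```

Both drops land under the 4.75 V floor, which is why the post argues that even a Schottky diode only "helps somewhat" while eating nearly all of the design margin.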
* Re: Journaling pointless with today's hard disks? [wandering OT] 2001-11-26 20:53 ` Richard B. Johnson @ 2001-11-26 21:18 ` Rob Landley 2001-11-27 0:32 ` Journaling pointless with today's hard disks? H. Peter Anvin 1 sibling, 0 replies; 81+ messages in thread From: Rob Landley @ 2001-11-26 21:18 UTC (permalink / raw) To: Richard B. Johnson; +Cc: Chris Wedgwood, linux-kernel On Monday 26 November 2001 15:53, Richard B. Johnson wrote: > > > It isn't that easy! Any kind of power storage within the drive would > have to be isolated with diodes so that it doesn't try to run your > motherboard as well as the drive. This means that +5 volt logic supply > would now be 5.0 - 0.6 = 4.4 volts at the drive, well below the design > voltage. Use of a Schottky diode (0.34 volts) would help somewhat, but you > have now narrowed the normal power design-margin by 90 percent, not good. At this point I have to hand the conversation over to either my father (a professional electrical engineer), my grandfather (ditto for 50 years, including helping GE debug its early vacuum tube lines), or my friend chip (who got a 4.0 from a technical college and who modifies playstations with a soldering iron for fun). Me, I'm mostly a software person, but this strikes me as a fairly straightforward voltage regulation and switching problem. Must admit I was considering transistors sealing off the rest of the world's power supply when the sensor says it's going bye-bye, but I can't say I'm familiar with the kind of load you can hit one of them with. (I remember using one to drive a motor once, but that was smoke signals lab back in college and a significant number of the components I used gave up their magic smoke along the way. I ran an awful lot of current through the big evil black three-prong transistors, though. That's a problem they solved back in the 1960's, isn't it?) 
> There is supposed to be a "power good" line out of your power supply > which is supposed to tell equipment when the main power has failed or > is about to fail. There isn't a "power good" line in SCSI so that > doesn't help. Shouldn't be too hard to fake something up to detect a current fluctuation. Sheesh, in a way that's what the whole high/low logic gates reading the data bus do, isn't it? And the cache dump logic is more or less constant (you WANT it to go to disk), it's not so much triggering it as making sure you limit what it has to do to what you can guarantee it'll have time to do, and then adding a few milliseconds of extra power to guarantee it'll have time to do it. Maybe I'm oversimplifying. I'm a software person. We do that with hardware... > Basically, when the power fails, all bets are off. A write in progress > may not succeed any more than a seek in progress would. Currently, sure. But nobody said this was a GOOD thing. > Seeks take a > lot of power, usually from the +12 volt line. I've seen capacitors melt screws. (And in one instance, a screwdriver.) Admittedly those were the monster big ones (the screw melter was about 10 cubic centimeters, the screwdriver got melted by a friend poking around in the back of an unplugged television set; he lived), but saying a capacitor doesn't have enough power to do something without specifying the capacitor in question... My grandfather has capacitors that simulate lightning strikes to stress-test equipment against electromagnetic pulse interference during thunderstorms. (They're a little larger than a printer paper box, and he hooks a half-dozen of them up in series.) > Typically, if a write > is in progress, when low power is sensed by the drive, write current > is terminated. At one time, there was an electromagnet that was > released to move the heads to a landing zone. Now there is none. 
> The center of radius of the head arm is slightly forward of the > center of rotation of the disk so that when the heads "land", they > skate to the inside of the platter, off the active media. The media > is supposed to be able to take this abuse for quite some time. I'd heard the parking these days was sometimes done centrifugally, but didn't know it skipped in... > The solution is an UPS. When the UPS power gets low, shut down > the computer, preferably automatically. I admit that laptops are driving desktops into the "workstation" market, so we'll all have battery backup automatically anyway, but saying a piece of equipment that doesn't gracefully deal with a condition CAN'T gracefully deal with that condition... If current processors ate their microcode on an unclean loss of power, or flashable bioses glitched themselves on an unclean loss of power, would you consider this behavior justifiable because you should have been using a UPS? We're not talking server side hosted RAID systems here. (Although this could easily take out multiple drives from a raid simultaneously.) We're talking a college student's home desktop system went bye-bye because his roommate hit the light switch that the computer's outlet was plugged into, and his journaling FS did no good. You're arguing that there's no real world demand for journaling filesystems. You realise this, don't you? (If an unclean shutdown can create hard errors on your drive as well as eating who knows how much write-cached data that the journal thought was committed, what's the point of journaling?) > Also, if your computer is on all day long as is typical at a > workplace, never shut it off. I don't. > Just turn off the monitor when you > go home. Your disk drives will last until you decide to replace > then because they are too small or too slow. They do. However, I have power failures from time to time. 
Even with a UPS, the power cord has been knocked out of the back of the box (or the switch got hit by somebody's foot) on more than one occasion. And then there was the time an entire Dr. Pepper went flying all over the machine and a very quick power down was required before liquid could drip down onto the electronics. (Not a server room scenario, no. But more common than you'd think in desktops and workstations.) > And beware when you finally do turn off the computer. The disks > may not spin up the next time you start the computer. It's a good > idea to back up everything before shutting down a computer that > has been running for a year or two. Why wait until you shut the box down? http://content.techweb.com/wire/story/TWB20010409S0012 If you have 3 year old data you still care about and you haven't backed it up yet, something is wrong. Forget the drive going bad, I had lightning cause one of the chips in my modem to explode once. (Literally. Strangely, the rest of the system, an old 386, worked fine after a reboot, but there was no reason to expect that.) Or the power supply filling up with dust and doing all SORTS of fun things to the rest of the system. > Of course you can re-boot as much as you want. Just don't kill the power! Worst case scenario, this is what data recovery services are for. Assuming you can budget $10k for them to crack open your drive in their cleanroom. :) Also, sticking the drive in the freezer for a bit often works long enough to get the data off. Several theories on why (lower the resistance of stuff in the motor, contract and bring worn contacts closer together, stop the lubrication from acting like glue) but it's a good "the drive's hosed, what do we do" hail mary pass. Just don't think it's a fix longer than it takes the drive to warm up. (Oh yeah, put it in a plastic bag first. Condensation, you know. Bad for electronics.) 
In my personal experience the drive's bearings seem to go before the motor, but I know that's not a general rule... > Cheers, > Dick Johnson Rob ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 20:53 ` Richard B. Johnson 2001-11-26 21:18 ` Journaling pointless with today's hard disks? [wandering OT] Rob Landley @ 2001-11-27 0:32 ` H. Peter Anvin 1 sibling, 0 replies; 81+ messages in thread From: H. Peter Anvin @ 2001-11-27 0:32 UTC (permalink / raw) To: linux-kernel Followup to: <Pine.LNX.3.95.1011126151922.29433A-100000@chaos.analogic.com> By author: "Richard B. Johnson" <root@chaos.analogic.com> In newsgroup: linux.dev.kernel > > It isn't that easy! Any kind of power storage within the drive would > have to be isolated with diodes so that it doesn't try to run your > motherboard as well as the drive. This means that +5 volt logic supply > would now be 5.0 - 0.6 = 4.4 volts at the drive, well below the design > voltage. Use of a Schottky diode (0.34 volts) would help somewhat, but you > have now narrowed the normal power design-margin by 90 percent, not good. > Hardly a big deal since most logic is 3.3V these days (remember, you don't need to maintain VccIO since the bus is dead anyway). -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com> ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 16:59 ` Rob Landley 2001-11-26 20:30 ` Andre Hedrick 2001-11-26 20:53 ` Richard B. Johnson @ 2001-11-27 16:39 ` Matthias Andree 2001-11-27 17:42 ` Martin Eriksson 2 siblings, 1 reply; 81+ messages in thread From: Matthias Andree @ 2001-11-27 16:39 UTC (permalink / raw) To: linux-kernel On Mon, 26 Nov 2001, Rob Landley wrote: > Having no write caching on the IDE side isn't a solution either. The problem > is the largest block of data you can send to an ATA drive in a single command > is smaller than modern track sizes (let alone all the tracks under the heads > on a multi-head drive), so without any sort of cacheing in the drive at all > you add rotational latency between each write request for the point you left > off writing to come back under the head again. This will positively KILL > write performance. (I suspect the situation's more or less the same for read > too, but nobody's objecting to read cacheing.) > > The solution isn't to avoid write cacheing altogether (performance is 100% > guaranteed to suck otherwise, for reasons unrelated to how well your hardware > works but to legacy request size limits in the ATA specification), but to > have a SMALL write buffer, the size of one or two tracks to allow linear ATA > write requests to be assembled into single whole-track writes, and to make > sure the disks' electronics has enough capacitance in it to flush this buffer > to disk. (How much do capacitors cost? We're talking what, an extra 20 > miliseconds? The buffer should be small enough you don't have to do that > much seeking.) Two things: 1- power loss. Fixing things to write to disk is bound to fail in adverse conditions. If the drive suffers from write problems and the write takes longer than the charge of your capacitor lasts, your data is still toasted. nonvolatile memory with finite write time (like NVRAM/Flash) will help to save the Cache. I don't think vendors will do that soon. 
2- error handling with good power: with automatic remapping turned on, there's no problem, the drive can re-write a block it has taken responsibility of, IBM DTLA drives will automatically switch off the write cache when the number of spare block gets low. with automatic remapping turned off, write errors with enabled write cache get a real problem because the way it is now, when the drive reports the problem, the block has already expired from the write queue and is no longer available to be rescheduled. That may mean that although fsync() completed OK your block is gone. Tagged queueing may help, as would locking a block with write faults in the drive and sending it back along with the error condition to the host. (*) of course, journal data must be written in an ordered fashion to prevent trouble in case of power loss. -- Matthias Andree "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." Benjamin Franklin ^ permalink raw reply [flat|nested] 81+ messages in thread
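The "trivial write barrier" Matthias describes above (hold post-barrier requests back until the disk has drained its queue) can be sketched as a scheduling rule. This is a toy illustration of the ordering idea, not kernel code; all names here are made up:

```python
from collections import deque

# Toy model of a trivial write barrier: requests issued before the
# barrier may complete in any order, but nothing submitted after the
# barrier is issued to the disk until the queue has drained.

class BarrierQueue:
    def __init__(self):
        self.pending = deque()   # issued to the disk, may reorder freely
        self.held = deque()      # waiting behind an outstanding barrier
        self.barrier = False

    def submit(self, block, barrier=False):
        if self.barrier:
            self.held.append((block, barrier))
        else:
            self.pending.append(block)
            if barrier:
                self.barrier = True

    def complete_all(self):
        """Disk acknowledges everything currently issued; then any held
        requests are released up to (and including) the next barrier."""
        done = list(self.pending)
        self.pending.clear()
        self.barrier = False
        while self.held and not self.barrier:
            block, barrier = self.held.popleft()
            self.pending.append(block)
            self.barrier = barrier
        return done

q = BarrierQueue()
q.submit("data-1")
q.submit("data-2")
q.submit("commit", barrier=True)   # journal commit acts as the barrier
q.submit("data-3")                 # must not pass the commit
print(q.complete_all())            # data-3 was held back
```

The point of the sketch is the ordering guarantee: the journal commit cannot be overtaken by later data writes, which is exactly what a journaling filesystem needs from the drive.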
* Re: Journaling pointless with today's hard disks? 2001-11-27 16:39 ` Matthias Andree @ 2001-11-27 17:42 ` Martin Eriksson 2001-11-28 16:35 ` Ian Stirling 0 siblings, 1 reply; 81+ messages in thread From: Martin Eriksson @ 2001-11-27 17:42 UTC (permalink / raw) To: Matthias Andree, linux-kernel ----- Original Message ----- From: "Matthias Andree" <matthias.andree@stud.uni-dortmund.de> To: <linux-kernel@vger.kernel.org> Sent: Tuesday, November 27, 2001 5:39 PM Subject: Re: Journaling pointless with today's hard disks? <snip> > > Two things: > > 1- power loss. Fixing things to write to disk is bound to fail in > adverse conditions. If the drive suffers from write problems and the > write takes longer than the charge of your capacitor lasts, your > data is still toasted. nonvolatile memory with finite write time > (like NVRAM/Flash) will help to save the Cache. I don't think vendors > will do that soon. > > 2- error handling with good power: with automatic remapping turned on, > there's no problem, the drive can re-write a block it has taken > responsibility of, IBM DTLA drives will automatically switch off the > write cache when the number of spare block gets low. > > with automatic remapping turned off, write errors with enabled write > cache get a real problem because the way it is now, when the drive > reports the problem, the block has already expired from the write > queue and is no longer available to be rescheduled. That may mean > that although fsync() completed OK your block is gone. I think we have gotten away from the original subject. The problem (as I understood it) wasn't that we don't have time to write the whole cache... the problem is that the hard disk stops in the middle of a write, not updating the CRC of the sector, thus making it report as a bad sector when trying to recover from the failure. No? I think most people here are convinced that there is not time to write a several-MB (worst case) cache to the platters in case of a power failure. 
Special drives for this case could of course be manufactured, and here's a theory of mine: Wouldn't a battery backed-up SRAM cache do the thing? Anyway, maybe it is just me who has been thrown off-track? Are we discussing something else now maybe? <snap> _____________________________________________________ | Martin Eriksson <nitrax@giron.wox.org> | MSc CSE student, department of Computing Science | Umeå University, Sweden ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 17:42 ` Martin Eriksson @ 2001-11-28 16:35 ` Ian Stirling 0 siblings, 0 replies; 81+ messages in thread From: Ian Stirling @ 2001-11-28 16:35 UTC (permalink / raw) To: Martin Eriksson; +Cc: Matthias Andree, linux-kernel > > ----- Original Message ----- > From: "Matthias Andree" <matthias.andree@stud.uni-dortmund.de> > To: <linux-kernel@vger.kernel.org> > Sent: Tuesday, November 27, 2001 5:39 PM > Subject: Re: Journaling pointless with today's hard disks? > <snip> > I think most people here are convinced that there is not time to write a > several-MB (worst case) cache to the platters in case of a power failure. > Special drives for this case could of course be manufactured, and here's > a theory of mine: Wouldn't a battery backed-up SRAM cache do the thing? No. SRAM is expensive, as are batteries (they also tend to have poor cycle life, and mean that you only keep the data until the battery dies). Numbers... Taking again as an example something that's in my machine: the Fujitsu MPG3409AT, a bargain-basement 40G drive. 2 platters, 5400RPM. It has (at the high end) 798 sectors/track. Worst case, to write a journal track takes a full seek and at least one complete rev. Assuming that we want to write it over two tracks, this is 18ms + 2*11ms = 40ms. Now, how much power? 6.3W is needed, so that's .252J. Assuming that the 12V line can be allowed to sag to 10V, the sag releases 1 - (10/12)^2, about 30%, of the energy stored in the cap, so we need a cap that stores about .8J, or roughly an 11000uF cap. A 12V 11000uF aluminium electrolytic is rather large. There is space for this, in the overall package, but it would need a slight redesign. The cost of the component is still well under a dollar. Another 20-80 cents may be needed for the power switch. This assumes that no power can be used from the spindle motor, which may well be wrong. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-24 13:03 Journaling pointless with today's hard disks? Florian Weimer 2001-11-24 13:40 ` Rik van Riel 2001-11-25 9:14 ` Chris Wedgwood @ 2001-11-26 17:14 ` Steve Brueggeman 2001-11-26 20:36 ` Andre Hedrick 2 siblings, 1 reply; 81+ messages in thread From: Steve Brueggeman @ 2001-11-26 17:14 UTC (permalink / raw) To: linux-kernel; +Cc: Florian Weimer While I am not familiar with the IBM drives in particular, I am familiar with this particular problem. The problem is that half of a sector gets new data, then when power is dropped, the old data+CRC/ECC is left on the second part of that sector, and a subsequent read on the whole sector will detect the CRC/ECC mismatch, determine the error burst is larger than what it can correct with retries and ECC, and report it as a HARD ERROR (sense key 03h, ASC/ASCQ 11/00 in the SCSI world). Since the error is non-recoverable, the disk drive should not auto-reassign the sector, since it cannot succeed at moving good data to the newly assigned sector. This type of error does not require a low-level format. Just writing any data to the sector in error should give the sector a CRC/ECC field that matches the data in the sector, and you should not get hard errors when reading that sector anymore. This was more of a problem with older disk drives (8-inch platters, or older), because the time required to finish any given sector was more than the amount of time the electronics would run reliably. All that could be guaranteed on these older drives was that a power loss would not corrupt any adjacent data, i.e. the write gate must be crowbarred inactive before the heads start retracting, emergency-style, to the landing zone. I believe that the time to complete a sector is so short on current drives, that they should be able to complete writing their current sector, but I do not believe that there are any drive manufacturers out there that guarantee this. 
Thus, there is probably a window, on all disk drives out there, where a loss of power during an active write will end up causing a hard error when that sector is subsequently read (I haven't looked though, and could be wrong). Writing to the sector with the error should clear the hard-error when that sector is read. A low-level format should not be required to fix this, and if it is, the drive is definitely broken in design. This is basic power-economics, and one of the reasons for UPSes. Steve Brueggeman On 24 Nov 2001 14:03:11 +0100, you wrote: >In the German computer community, a statement from IBM[1] is >circulating which describes a rather peculiar behavior of certain IBM >IDE hard drivers (the DTLA series): > >When the drive is powered down during a write operation, the sector >which was being written has got an incorrect checksum stored on disk. >So far, so good---but if the sector is read later, the drive returns a >*permanent*, *hard* error, which can only be removed by a low-level >format (IBM provides a tool for it). The drive does not automatically >map out such sectors. > >IBM claims this isn't a firmware error, but thinks that this explains >the failures frequently observed with DTLA drivers (which might >reflect reality or not, I don't know, but that's not the point >anyway). > >Now my question: Obviously, journaling file systems do not work >correctly on drivers with such behavior. In contrast, a vital data >structure is frequently written to (the journal), so such file systems >*increase* the probability of complete failure (with a bad sector in >the journal, the file system is probably unusable; for non-journaling >file systems, only a part of the data becomes unavailable). Is the >DTLA hard disk behavior regarding aborted writes more common among >contemporary hard drives? Wouldn't this make journaling pretty >pointless? > > >1. 
http://www.cooling-solutions.de/dtla-faq (German) _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 17:14 ` Steve Brueggeman @ 2001-11-26 20:36 ` Andre Hedrick 2001-11-26 21:14 ` Steve Brueggeman 0 siblings, 1 reply; 81+ messages in thread From: Andre Hedrick @ 2001-11-26 20:36 UTC (permalink / raw) To: Steve Brueggeman; +Cc: linux-kernel, Florian Weimer Steve, Dream on fellow, it is SOP that upon media failure the device logs the failure and does an internal re-allocation in the slip-sector stream. If the media is out of slip-sectors then it does an out-of-bounds re-allocation. Once the total number of out-of-bounds sectors is gone, you need to deal with getting new media or execute a seek and purge operation; however, if the badblock list is full you are toast. That is what is done - knowledge is first hand. Regards, Andre Hedrick CEO/President, LAD Storage Consulting Group Linux ATA Development Linux Disk Certification Project On Mon, 26 Nov 2001, Steve Brueggeman wrote: > While I am not familiar with the IBM drives in particular, I am > familar with this particular problem. > > The problem is that half of a sector gets new data, then when power is > dropped, the old data+CRC/ECC is left on second part of that sector, > and a subsequent read on the whole sector will detect the CRC/ECC > mismatch, and determine the error burst is larger than what it can > correct with retries, and ECC, and report it as a HARD ERROR. (03-1100 > in the SCSI World) > > Since the error is non-recoverable, the disk drive should not > auto-reassign the sector, since it cannot succeed at moving good data > to the newly assigned sector. > > This type of error does not require a low-level format. Just writing > any data to the sector in error should give the sector a CRC/ECC field > that matches the data in the sector, and you should not get hard > errors when reading that sector anymore. 
> > This was more of a problem with older disk drives (8-Inch platters, or > older), because the time required to finish any given sector was more > than the amount of time the electronics would run reliably. All that > could be guranteed on these older drives was that a power loss would > not corrupt any adjacent data, ie write gate must be crow-bared > inactive before the heads start retracting, emergency-style, to the > landing zone. > > I believe that the time to complete a sector is so short on current > drives, that they should be able to complete writing their current > sector, but I do not believe that there are any drive manufacturers > out there that gurrantee this. Thus, there is probably a window, on > all disk drives out there, where a loss of power durring an active > write will end up causing a hard error when that sector is > subsequently read (I haven't looked though, and could be wrong). > Writing to the sector with the error should clear the hard-error when > that sector is read. A low-level format should not be required to fix > this, and if it is, the drive is definitely broken in design. > > This is basic power-economics, and one of the reasons for UPS's > > Steve Brueggeman > > > > On 24 Nov 2001 14:03:11 +0100, you wrote: > > >In the German computer community, a statement from IBM[1] is > >circulating which describes a rather peculiar behavior of certain IBM > >IDE hard drivers (the DTLA series): > > > >When the drive is powered down during a write operation, the sector > >which was being written has got an incorrect checksum stored on disk. > >So far, so good---but if the sector is read later, the drive returns a > >*permanent*, *hard* error, which can only be removed by a low-level > >format (IBM provides a tool for it). The drive does not automatically > >map out such sectors. 
> > > >IBM claims this isn't a firmware error, but thinks that this explains > >the failures frequently observed with DTLA drivers (which might > >reflect reality or not, I don't know, but that's not the point > >anyway). > > > >Now my question: Obviously, journaling file systems do not work > >correctly on drivers with such behavior. In contrast, a vital data > >structure is frequently written to (the journal), so such file systems > >*increase* the probability of complete failure (with a bad sector in > >the journal, the file system is probably unusable; for non-journaling > >file systems, only a part of the data becomes unavailable). Is the > >DTLA hard disk behavior regarding aborted writes more common among > >contemporary hard drives? Wouldn't this make journaling pretty > >pointless? > > > > > >1. http://www.cooling-solutions.de/dtla-faq (German) ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 20:36 ` Andre Hedrick @ 2001-11-26 21:14 ` Steve Brueggeman 2001-11-26 21:36 ` Andre Hedrick 0 siblings, 1 reply; 81+ messages in thread From: Steve Brueggeman @ 2001-11-26 21:14 UTC (permalink / raw) To: Andre Hedrick; +Cc: linux-kernel Well, since you don't clarify what part you object to, I'll have to assume that you object to my statement that the disk drive will not auto-reallocate when it cannot recover the data. If you think that a disk drive should auto-reallocate a sector (ARRE enabled in the mode pages) that it cannot recover the original data from, then you can dream on. I seriously hope this is not what you're recommending for ATA. If a disk drive were to auto-reallocate a sector that it couldn't get valid data from, you'd have serious corruption problems!!! Tell me, what data should exist in the sector that gets reallocated if it cannot retrieve the data the system believes to be there??? If the reallocated sector has random data, and the next read to it doesn't return an error, then the system will get no indication that it should not be using that data. If the unrecoverable error happens during a write, the disk drive still has the data in the buffer, so auto-reallocation on writes (AWRE enabled in the mode pages) is usually OK. That said, it'd be my bet that most disk drives still have a window of opportunity during the reallocation operation, where if the drive lost power, they'd end up doing bad things. You can force a reallocation, but the data you get when you first read that unreadable reallocated sector is usually undefined, and often is the data pattern written when the drive was low-level formatted. That IS what is done, my knowledge is also first hand. I have no disagreement with your description of how spare sectors are doled out. 
Steve Brueggeman On Mon, 26 Nov 2001 12:36:02 -0800 (PST), you wrote: > > >Steve, > >Dream on fellow, it is SOP that upon media failure the device logs the >failure and does an internal re-allocation in the slip-sector stream. >If the media is out of slip-sectors then it does an out-of-bounds >re-allocation. Once the total number of out-of-bounds sectors are gone >you need to deal with getting new media or exectute a seek and purge >operation; however, if the badblock list is full you are toast. > >That is what is done - knowledge is first hand. > >Regards, > >Andre Hedrick >CEO/President, LAD Storage Consulting Group >Linux ATA Development >Linux Disk Certification Project > >On Mon, 26 Nov 2001, Steve Brueggeman wrote: ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 21:14 ` Steve Brueggeman @ 2001-11-26 21:36 ` Andre Hedrick 2001-11-27 16:36 ` Steve Brueggeman 2001-11-27 21:28 ` Wayne Whitney 0 siblings, 2 replies; 81+ messages in thread From: Andre Hedrick @ 2001-11-26 21:36 UTC (permalink / raw) To: Steve Brueggeman; +Cc: linux-kernel On Mon, 26 Nov 2001, Steve Brueggeman wrote: > Well, since you don't clearify what part you object to, I'll have to > assume that you object to my statement that the disk drive will not > auto-reallocate when it cannot recover the data. > > If you think that a disk drive should auto-reallocate a sector (ARRE > enabled in the mode pages) that it cannot recover the original data > from, than you can dream on. I seriously hope this is not what you're One has to go read the general purpose error logs to determine the location of the original platter assigned sector of the relocated LBA. Reallocation generally occurs on write to media, not read, and you should know that point. > recommending for ATA. If a disk drive were to auto-reallocate a > sector that it couldn't get valid data from, you'd have serious > corruptions probelms!!! Tell me, what data should exist in the sector > that gets reallocated if it cannot retrieve the data the system > believes to be there??? If the reallocated sector has random data, > and the next read to it doesn't return an error, than the system will > get no indication that it should not be using that data. > > If the unrecoverable error happens durring a write, the disk drive > still has the data in the buffer, so auto-reallocation on writes (AWRE > enabled in the mode pages), is usually OK By the time an ATA device gets to generating this message, either the bad block list is full or all reallocation sectors are used. Unlike SCSI which has to be hand held, 90% of all errors are handled by the device. Good or Bad -- that is how it does it. 
Well, there is an additional problem in all of storage: drives do reorder and do not always obey the host driver. Thus if the device is suffering from performance and you have disabled WB-Cache, it may elect to self-enable. Now you have the device returning ack to platter that may not be true. Most host-drivers (all of Linux, mine included) release and dequeue the request once the ack has been presented. This is dead wrong. If a flush cache fails I get back the starting lba of the write request, and if the request is dequeued -- well you know -- bye bye data! SCSI will do the same, even with TCQ. Once the sense is cleared to platter and the request is dequeued, and a hiccup happens -- bye bye data! > That said, it'd be my bet that most disk drives still have a window of > opportunity durring the reallocation operation, where if the drive > lost power, they'd end up doing bad things. That is a given. > You can force a reallocation, but the data you get when you first read > that unreadable reallocated sector is usually undefined, and often is > the data pattern written when the drive was low-level formatted. > > That IS what is done, my knowledge is also first hand. Excellent to see another Storage Industry person here. > I have no descrepency with your description of how spare sectors are > dolled out. Cool. Question -- are you up to fixing the low-level drivers and all the stuff above? > Steve Brueggeman > > > On Mon, 26 Nov 2001 12:36:02 -0800 (PST), you wrote: > > > > >Steve, > > > >Dream on fellow, it is SOP that upon media failure the device logs the > >failure and does an internal re-allocation in the slip-sector stream. > >If the media is out of slip-sectors then it does an out-of-bounds > >re-allocation. Once the total number of out-of-bounds sectors are gone > >you need to deal with getting new media or exectute a seek and purge > >operation; however, if the badblock list is full you are toast. > > > >That is what is done - knowledge is first hand. 
Regards, Andre Hedrick CEO/President, LAD Storage Consulting Group Linux ATA Development Linux Disk Certification Project ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 21:36 ` Andre Hedrick @ 2001-11-27 16:36 ` Steve Brueggeman 2001-11-27 20:04 ` Bill Davidsen 2001-11-27 21:28 ` Wayne Whitney 1 sibling, 1 reply; 81+ messages in thread From: Steve Brueggeman @ 2001-11-27 16:36 UTC (permalink / raw) To: linux-kernel; +Cc: Andre Hedrick My experience is with SCSI not ATA, so adjust accordingly... On Mon, 26 Nov 2001 13:36:06 -0800 (PST), you wrote: >On Mon, 26 Nov 2001, Steve Brueggeman wrote: > Snip-out my stuff > >One has to go read the general purpose error logs to determine the >location of the original platter assigned sector of the relocated LBA. > >Reallocation generally occurs on write to media not read, and you should >know that point. Actually, it has been my experience that most reallocations occur on reads. The reasons for this are twofold. 1) most systems out there do an order of magnitude more reads than writes, 2) the amount of data read (servo/headers, sync...) for a write operation is an order of magnitude less than it is for a read operation (same servo/header/sync plus the whole data-field and CRC/ECC). Note: no media-related errors can be detected while write-gate is active. Only servo positioning errors, and even that's not likely with the current drives using embedded servo. > Snip some more of my stuff > >By the time an ATA device gets to generating this message, either the bad >block list is full or all reallocation sectors are used. Unlike SCSI >which has to be hand held, 90% of all errors are handle by the device. >Good or Bad -- that is how it does it. I think what you meant is, 90% of all errors are handled silently by the device. I don't like silent errors. > >Well there is an additional problem in all of storage, that drives do >reorder and do not always obey the host-driver. Thus if the device is >suffering from performance and you have disabled WB-Cache, it may elect to >self enable. 
Now you have the device returning ack to platter that may >not be true. Most host-drivers (all of Linux, mine include) release and >dequeue the request once the ack has been presented. This is dead wrong. >If a flush cache fails I get back the starting lba of the write request, >and if the request is dequeued -- well you know -- bye bye data! SCSI >will do the same, even with TCQ. Once the sense is cleared to platter and >the request is dequeued, and a hiccup happens -- bye bye data! > It is my firm opinion that any device that automatically enables write-caching is broken and anyone who enables write caching probably doesn't know what they're doing. A system simply cannot get reliable error reporting with write caching enabled. Without write-caching, if you get good completion status, the data is GUARANTEED to be on the platter. `With` write-caching, the best you can hope for are deferred errors, but I have yet to see a system that can properly cope with deferred errors, so at best, they're informational only. I once had to write some drive test software that ran with write-caching enabled, on a drive in degraded mode. The only option I could come up with was to maintain a history of the last 2 X queue depth commands sent to each device, and do a look-up for the LBA in the deferred error for all commands in the history that had a range that covered the LBA in error. Unfortunately, this was under DOS and this was not an option because the memory was too tight. What I ended up with was better than nothing, but still could not catch 100% of the deferred errors. (More snippage) > >Question -- are you up to fixing the low-level drivers and all the stuff >above ? > Probably not, as my plate's pretty full. Though, I would like to understand more specifically what you're talking about... I see the following opportunities: 1) Read returns an unrecoverable error. 
write to the bad sector and re-read; if the re-read returns an unrecoverable error, manually reallocate. This should not be done automatically, since there is no easy way to determine whether that sector is in a free list, and we are only allowed to write to sectors in a free list. This would best be done by the badblocks utility, in combination with fsck. Maybe it already does this, I haven't looked. 2) At device initialization, and after device resets, force write-cache to be disabled. (not very nice to those that would rather have write cache enabled... poor souls) 3) Set the Force Unit Access bit for all write commands. (again, not very nice to those poor souls that would rather have write cache enabled) 4) Reordering of commands is rather unrelated to the problem at hand, but it is a concern for anything that needs ordered transactions. The Linux SCSI layers only inject an ordered command every so often to prevent command starvation, but for ordered transactions, the SCSI layer should probably be forcing the sending of Ordered command queue messages with the CDB. I'd rather hate to see every SCSI request become an ordered command queue, since the disk drive really does know best how to reorder its queue of commands. The SCSI block layer really needs some clues from the upper layers in my opinion, about whether a given request needs to be ordered or not. But I digress. This is a whole other topic. Steve Brueggeman ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 16:36 ` Steve Brueggeman @ 2001-11-27 20:04 ` Bill Davidsen 0 siblings, 0 replies; 81+ messages in thread From: Bill Davidsen @ 2001-11-27 20:04 UTC (permalink / raw) To: Linux Kernel Mailing List On Tue, 27 Nov 2001, Steve Brueggeman wrote: > 2) At device initialization, and after device resets, force > write-cache to be disabled. (not very nice to those that would rather > have write cache enabled... poor souls) > > 3) Set the Force Unit Access bit for all write commands. (again, not > very nice to those poor souls that would rather have write cache > enabled) I don't have a problem with setting things to "most likely to succeed" values, and (2) fits that. Those who really want w/c can enable in rc.local. However, practice (3) is something I would associate with other operating systems which believe that the computer knows best. You may personally believe that you will trade any amount of performance for a slight increase in reliability, but others may want to take the risk of losing data and have the computer fast enough to do their work. I don't think it's remotely Linux policy to do things like that, and I certainly hope it doesn't happen. Both decent disk drives and UPS systems are available, and having been in the position of having systems which can't quite keep up with the load, I want the option of doing what seems best. We have gotten along for years without doing something to force bypass of w/c, it seems that hdparm is up to continuing to allow people to make their own choices. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-26 21:36 ` Andre Hedrick 2001-11-27 16:36 ` Steve Brueggeman @ 2001-11-27 21:28 ` Wayne Whitney 2001-11-27 21:52 ` Andre Hedrick 1 sibling, 1 reply; 81+ messages in thread From: Wayne Whitney @ 2001-11-27 21:28 UTC (permalink / raw) To: Andre Hedrick; +Cc: LKML In mailing-lists.linux-kernel, Andre Hedrick wrote: > By the time an ATA device gets to generating this message, either the bad > block list is full or all reallocation sectors are used. Unlike SCSI > which has to be hand held, 90% of all errors are handle by the device. Perhaps you or one of the other gurus could explain the following observations, which I am sure that I and many other readers would find very enlightening: I have an IBM-DTLA-307045 drive connected to a PDC20265 controller on an ia32 machine running 2.4.16. After reading this discussion and hearing about the problems that others have had with the IBM 75GXP series, I thought that I should test out my drive to see if it is OK. So I ran 'dd if=/dev/hde of=/dev/null bs=128k'. Everything went fine, except for about five seconds in the middle, when the disk made a lot of grinding sounds and the system was unresponsive. That generated the log messages appended below. However, running the dd command again (after a reboot) produced no errors. So the drive remapped some bad sectors the first time through? The common wisdom here is that once you get to the point where the drive reports a bad sector, you are in trouble. If so, why did the second dd command work OK? I have had no other problems with this drive. 
Thanks, Wayne hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939804 end_request: I/O error, dev 21:00 (hde), sector 12939804 hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939806 end_request: I/O error, dev 21:00 (hde), sector 12939806 hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939808 end_request: I/O error, dev 21:00 (hde), sector 12939808 hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939810 end_request: I/O error, dev 21:00 (hde), sector 12939810 hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939812 end_request: I/O error, dev 21:00 (hde), sector 12939812 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 21:28 ` Wayne Whitney @ 2001-11-27 21:52 ` Andre Hedrick 2001-11-28 11:53 ` Pedro M. Rodrigues 0 siblings, 1 reply; 81+ messages in thread From: Andre Hedrick @ 2001-11-27 21:52 UTC (permalink / raw) To: Wayne Whitney; +Cc: LKML I strongly suggest you execute the extended tests in the smart-suite authored by a friend of mine and listed on my site, www.linux-ide.org. What you have done is trigger a process to have the device go into a selftest mode to perform a block test. I would tell you more but I may have exposed myself already. Regardless, you need to execute an extended SMART offline test. Also be sure to query the SMART logs. Respectfully, Andre Hedrick CEO/President, LAD Storage Consulting Group Linux ATA Development Linux Disk Certification Project On Tue, 27 Nov 2001, Wayne Whitney wrote: > In mailing-lists.linux-kernel, Andre Hedrick wrote: > > > By the time an ATA device gets to generating this message, either the bad > > block list is full or all reallocation sectors are used. Unlike SCSI > > which has to be hand held, 90% of all errors are handle by the device. > > Perhaps you or one of the other gurus could explain the following > observations, which I am sure that I and many other readers would find > very enlightening: > > I have an IBM-DTLA-307045 drive connected to a PDC20265 controller on > an ia32 machine running 2.4.16. After reading this discussion and > hearing about the problems that others have had with the IBM 75GXP > series, I thought that I should test out my drive to see if it is OK. > So I ran 'dd if=/dev/hde of=/dev/null bs=128k'. Every thing went > fine, except for about five seconds in the middle, when the disk made > a lot of grinding sounds and the system was unresponsive. That > generated the log messages messages appended below. > > However, running the dd command again (after a reboot) produced no > errors. So the drive remapped some bad sectors the first time > through? 
The common wisdom here is that once you get to the point > where the drive reports a bad sector, you are in trouble. If so, why > did the second dd command work OK? I have had no other problems with > this drive. > > Thanks, Wayne > > hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } > hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939804 > end_request: I/O error, dev 21:00 (hde), sector 12939804 > hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } > hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939806 > end_request: I/O error, dev 21:00 (hde), sector 12939806 > hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } > hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939808 > end_request: I/O error, dev 21:00 (hde), sector 12939808 > hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } > hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939810 > end_request: I/O error, dev 21:00 (hde), sector 12939810 > hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } > hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939812 > end_request: I/O error, dev 21:00 (hde), sector 12939812 > ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Journaling pointless with today's hard disks? 2001-11-27 21:52 ` Andre Hedrick @ 2001-11-28 11:53 ` Pedro M. Rodrigues 0 siblings, 0 replies; 81+ messages in thread From: Pedro M. Rodrigues @ 2001-11-28 11:53 UTC (permalink / raw) To: Wayne Whitney, Andre Hedrick; +Cc: LKML Just curious, but what can a self-test mode and consequent block test do to inspire such worry? Are we dealing with the mob or something of the sort when we buy an IBM 75GXP disk? /Pedro On 27 Nov 2001 at 13:52, Andre Hedrick wrote: > > > What you have done is trigger a process to have the device go into a > selftest mode to perform a block test. I would tell you more but I > may have exposed myself already. > ^ permalink raw reply [flat|nested] 81+ messages in thread
end of thread, other threads:[~2001-12-04 3:48 UTC | newest]
Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-11-24 13:03 Journaling pointless with today's hard disks? Florian Weimer
2001-11-24 13:40 ` Rik van Riel
2001-11-24 16:36 ` Phil Howard
2001-11-24 17:19 ` Charles Marslett
2001-11-24 17:31 ` Florian Weimer
2001-11-24 17:41 ` Matthias Andree
2001-11-24 19:20 ` Florian Weimer
2001-11-24 19:29 ` Rik van Riel
2001-11-24 22:51 ` John Alvord
2001-11-24 23:41 ` Phil Howard
2001-11-25 0:24 ` Ian Stirling
2001-11-25 0:53 ` Phil Howard
2001-11-25 1:25 ` H. Peter Anvin
2001-11-25 1:44 ` Sven.Riedel
2001-11-24 22:28 ` H. Peter Anvin
2001-11-25 4:49 ` Andre Hedrick
2001-11-24 23:04 ` Pedro M. Rodrigues
2001-11-24 23:23 ` Stephen Satchell
2001-11-24 23:29 ` H. Peter Anvin
2001-11-26 18:05 ` Steve Brueggeman
2001-11-26 23:49 ` Martin Eriksson
2001-11-27 0:06 ` Andreas Dilger
2001-11-27 0:16 ` Andre Hedrick
2001-11-27 7:38 ` Andreas Dilger
2001-11-27 11:48 ` Ville Herva
2001-11-27 0:18 ` Jonathan Lundell
2001-11-27 1:01 ` Ian Stirling
2001-11-27 1:33 ` H. Peter Anvin
2001-11-27 1:57 ` Steve Underwood
2001-11-27 5:04 ` Stephen Satchell
[not found] ` <mailman.1006644421.6553.linux-kernel2news@redhat.com>
2001-11-25 4:20 ` Pete Zaitcev
2001-11-25 13:52 ` Pedro M. Rodrigues
2001-11-25 12:30 ` Matthias Andree
2001-11-25 15:04 ` Barry K. Nathan
2001-11-25 16:31 ` Matthias Andree
2001-11-27 2:39 ` Pavel Machek
2001-12-03 10:23 ` Matthias Andree
2001-11-25 9:14 ` Chris Wedgwood
2001-11-25 22:55 ` Daniel Phillips
2001-11-26 16:59 ` Rob Landley
2001-11-26 20:30 ` Andre Hedrick
2001-11-26 20:35 ` Rob Landley
2001-11-26 23:59 ` Andreas Dilger
2001-11-27 0:24 ` H. Peter Anvin
2001-11-27 0:52 ` H. Peter Anvin
2001-11-27 1:11 ` Andrew Morton
2001-11-27 1:15 ` H. Peter Anvin
2001-11-27 16:59 ` Matthias Andree
2001-11-27 16:56 ` Matthias Andree
2001-11-27 1:23 ` Ian Stirling
2001-11-26 23:00 ` Rob Landley
2001-11-27 2:41 ` H. Peter Anvin
2001-11-27 0:19 ` Rob Landley
2001-11-27 23:35 ` Andreas Bombe
2001-11-28 14:32 ` Rob Landley
2001-11-27 3:39 ` Ian Stirling
2001-11-27 7:03 ` Ville Herva
2001-11-27 16:50 ` Matthias Andree
2001-11-27 20:31 ` Rob Landley
2001-11-28 18:43 ` Matthias Andree
2001-11-28 18:46 ` Rob Landley
2001-11-28 22:19 ` Matthias Andree
2001-11-29 22:21 ` Pavel Machek
2001-12-01 10:55 ` Jeff V. Merkey
2001-12-02 0:08 ` Matthias Andree
2001-12-03 20:04 ` Pavel Machek
2001-11-26 20:53 ` Richard B. Johnson
2001-11-26 21:18 ` Journaling pointless with today's hard disks? [wandering OT] Rob Landley
2001-11-27 0:32 ` Journaling pointless with today's hard disks? H. Peter Anvin
2001-11-27 16:39 ` Matthias Andree
2001-11-27 17:42 ` Martin Eriksson
2001-11-28 16:35 ` Ian Stirling
2001-11-26 17:14 ` Steve Brueggeman
2001-11-26 20:36 ` Andre Hedrick
2001-11-26 21:14 ` Steve Brueggeman
2001-11-26 21:36 ` Andre Hedrick
2001-11-27 16:36 ` Steve Brueggeman
2001-11-27 20:04 ` Bill Davidsen
2001-11-27 21:28 ` Wayne Whitney
2001-11-27 21:52 ` Andre Hedrick
2001-11-28 11:53 ` Pedro M. Rodrigues