From: Rob Landley <landley@trommello.org>
To: Matthias Andree <matthias.andree@stud.uni-dortmund.de>,
linux-kernel@vger.kernel.org
Subject: Re: Journaling pointless with today's hard disks?
Date: Tue, 27 Nov 2001 15:31:21 -0500
Message-ID: <01112715312104.01486@localhost>
In-Reply-To: <Pine.LNX.4.10.10111261229190.8817-100000@master.linux-ide.org> <0111261535070J.02001@localhost.localdomain> <20011127175016.D13416@emma1.emma.line.org>
In-Reply-To: <20011127175016.D13416@emma1.emma.line.org>
On Tuesday 27 November 2001 11:50, Matthias Andree wrote:
> Please fix your domain in your mailer, localhost.localdomain is prone
> to Message-ID collisions.
I'm using Kmail talking to @home's mail server (to avoid the evil behavior
sendmail has behind an IP masquerading firewall that triggers every spam
filter in existence), so if either one of them cares about the hostname of my
laptop ("driftwood", but apparently not being set right by Red Hat's
scripts), then something's wrong anyway.
But let's see...
Ah fun, if you change the hostname of the box, either X or KDE can't launch
any new applications until you exit X and restart it. Brilliant.
Considering how many Konqueror windows I have open at present on my 6
desktops, I think I'll leave fixing this until later in the evening. But
thanks for letting me know something's up...
>
> Note, the power must RELIABLY last until all of the data has been
> written, which includes reassigning, seeking and the like, just don't do
> it if you cannot get a real solution.
A) At most 1 seek to a track other than the one you're on.
B) If sectors have been reassigned outside of this track to a "recovery"
track, then that counts as a separate track. Tough.
The point of the buffer is to let the OS feed data to the write head as fast
as it can write it (which unbuffered ATA can't do because individual requests
are smaller than individual tracks). You need a small buffer to avoid
blocking between each and every ATA write while the platter rotates back into
position. So you always let it have a little more data so it knows what to
do next and can start work on it immediately (doing that next seek, writing
that next sector as it passes under the head without having to wait for it to
rotate around again.)
That's it. No more buffer than does good at the hardware level for request
merging and minimizing seek latency. Any buffering over and above that is
the operating system's job.
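To put rough numbers on the rotational cost described above, here is a
deliberately simplified model in Python. It assumes that without a buffer
every request misses its rotational window and costs one full revolution,
and that with a buffer the whole track is written in a single pass. The
function name and the sector counts are hypothetical, chosen only for
illustration.

```python
import math


def revolutions_to_write_track(sectors_per_track, sectors_per_request, buffered):
    """Estimate platter revolutions needed to write one full track.

    Simplified model (an assumption, not a measured drive behavior):
    - buffered: the drive already holds the next data, so the whole
      track is written in one revolution.
    - unbuffered: each request finishes before the next one arrives,
      the following sector has already passed the head, and every
      request therefore costs one revolution.
    """
    if buffered:
        return 1
    return math.ceil(sectors_per_track / sectors_per_request)


# Hypothetical geometry: 400 sectors per track, 16 sectors per request.
unbuffered = revolutions_to_write_track(400, 16, buffered=False)
buffered = revolutions_to_write_track(400, 16, buffered=True)
print(unbuffered, buffered)
```

Under these assumptions the gap comes entirely from request size versus
track size, which is the case the small buffer is meant to cover.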
Yes the hardware can do a slightly better job with its own elevator algorithm
using intimate knowledge of on-disk layout, but the OS can do a fairly decent
job as long as logical linear sectors are linearly arranged on disk too.
(Relocating bad sectors breaks this, but not fatally. It causes extra seeks
in linear writes anyway where the elevator ISN'T involved, so you've already
GOT a performance hit. And it just screws up the OS's elevator, not the rest
of the scheme. You still have the current track written as one lump and an
immediate seek to the other track, at which point the drive electronics can
be accepting blocks destined for the track you seek back to.)
The advantage of limiting the amount of data buffered to current track plus
one other is you have a fixed amount of work to do on a loss of power. One
seek, two track writes, and a spring-driven park. The amount of power this
takes has a deterministic upper bound. THAT is why you block before
accepting more data than that.
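The admission rule above (current track plus one other, block on anything
else) can be sketched as a small Python class. This is an illustration of
the idea only, not firmware from any real drive; the class name, method
names, and the track and sector numbers are all hypothetical.

```python
class BoundedWriteCache:
    """Accept writes only for the current track plus one other track.

    Hypothetical sketch: a refused write stands for the drive blocking
    the host until the current track has been written out.
    """

    def __init__(self, current_track):
        self.current_track = current_track
        self.pending = {}  # track number -> {sector number: data}

    def try_accept(self, track, sector, data):
        """Return True if buffered, False if the host has to block."""
        other_tracks = [t for t in self.pending if t != self.current_track]
        is_new_track = track != self.current_track and track not in other_tracks
        if is_new_track and other_tracks:
            # A third track would make the power-loss work unbounded.
            return False
        self.pending.setdefault(track, {})[sector] = data
        return True

    def flush_current(self):
        """Write out the current track, then move to the other one, if any."""
        written = self.pending.pop(self.current_track, {})
        if self.pending:
            self.current_track = next(iter(self.pending))
        return written


cache = BoundedWriteCache(current_track=10)
print(cache.try_accept(10, 0, b"a"))  # same track: accepted
print(cache.try_accept(42, 3, b"b"))  # one other track: accepted
print(cache.try_accept(99, 1, b"c"))  # third track: refused
```

With at most two tracks buffered, the work left on power loss is one seek
and two track writes no matter what the host sent.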
> battery-backed CMOS,
> NVRAM/Flash/whatever which lasts a couple of months should be fine
> though, as long as documents are publicly available that say how long
> this data lasts. Writing to disk will not work out unless you can keep
> the drive going for several seconds which will require BIG capacitors,
> so that's no option, you must go for NVRAM/Flash or something.
You don't need several seconds. You need MILLISECONDS. Two track writes and
one seek. This is why you don't accept more data than that before blocking.
Your worst case scenario is a seek from near where the head parks to the
other end of the disk, then the spring can pull it back. This should be well
under 50 milliseconds. Your huge RAM cache is there for reads. For writes
you don't accept more than you can reliably flush if you want anything
approaching reliability. If you're only going to spring for a capacitor as
your power failure hedge, then the amount of write cache you can accept is
small, but it turns out you only need a tiny amount of cache to get 90% of
the benefit of write caching (merging writes into full tracks and seeking
immediately to the next track).
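As a back-of-the-envelope check on the "one seek, two track writes" budget,
the arithmetic can be written out in Python. The 7200 RPM spindle speed and
the 20 ms full-stroke seek are assumptions picked for illustration, not
figures from any datasheet, and the function name is hypothetical. The
model also assumes a full-track write can start at whatever sector is under
the head, so each track costs exactly one revolution.

```python
def worst_case_flush_ms(rpm, full_stroke_seek_ms, tracks_to_write=2,
                        revolutions_per_track=1.0):
    """Upper bound, in milliseconds, on flushing the buffered tracks.

    Budget: write the current track, do one worst-case seek, write the
    other track. The spring-driven head park needs no drive power and
    is left out.
    """
    revolution_ms = 60_000.0 / rpm  # one platter revolution
    write_ms = tracks_to_write * revolutions_per_track * revolution_ms
    return write_ms + full_stroke_seek_ms


# Assumed figures: 7200 RPM, 20 ms full-stroke seek.
budget = worst_case_flush_ms(rpm=7200, full_stroke_seek_ms=20.0)
print(f"{budget:.1f} ms")
```

With those assumed numbers the bound is about 36.7 ms. A slower spindle or
a partial-track write that has to wait for its first sector pushes the
figure up, which can be explored by raising revolutions_per_track.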
> OTOH, the OS must reliably know when something went wrong (even with
> good power it has a right to know), and preferably this scheme should
> not involve disabling the write cache, so TCQ or something mandatory
> would be useful (not sure if it's mandatory in current ATA standards).
We're talking about what happens to the drive on a catastrophic power
failure. (Even with a UPS, this can happen if your case fan jams and your
power supply catches fire and burns through a wire. Although most server-side
hosting facilities aren't that dusty, there's always worn bearings and other
such fun things. And in a desktop environment, spilled sodas.) Currently,
there are drives out there that stop writing a sector in the middle, leaving
a bad CRC at the hardware level. This isn't exactly graceful. At the other
end, drives with huge caches discard the contents of cache which a journaling
filesystem thinks are already on disk. This isn't graceful either.
> If a block has first been reported written OK and the disk later reports
> error, it must send the block back (incompatible with any current ATA
> draft I had my hands on), so I think tagged commands which are marked
> complete only after write+verify are the way to go.
If a block goes bad WHILE power is failing, you're screwed. This is just a
touch unlikely. It will happen to somebody out there someday, sure. So will
alpha particle decay corrupting a sector that was long ago written to the
drive correctly. Designing for that is not practical. Recovering after the
fact might be, but that doesn't mean you get your data back.
Rob