Re: Journaling pointless with today's hard disks?

From: Rob Landley <landley@trommello.org>
To: Matthias Andree <matthias.andree@stud.uni-dortmund.de>,
	linux-kernel@vger.kernel.org
Subject: Re: Journaling pointless with today's hard disks?
Date: Tue, 27 Nov 2001 15:31:21 -0500
Message-ID: <01112715312104.01486@localhost>
In-Reply-To: <Pine.LNX.4.10.10111261229190.8817-100000@master.linux-ide.org> <0111261535070J.02001@localhost.localdomain> <20011127175016.D13416@emma1.emma.line.org>
In-Reply-To: <20011127175016.D13416@emma1.emma.line.org>

On Tuesday 27 November 2001 11:50, Matthias Andree wrote:
> Please fix your domain in your mailer, localhost.localdomain is prone
> to Message-ID collisions.

I'm using Kmail talking to @home's mail server (to avoid the evil behavior 
sendmail has behind an IP masquerading firewall that triggers every spam 
filter in existence), so if either one of them cares about the hostname of my 
laptop ("driftwood", but apparently not being set right by Red Hat's 
scripts), then something's wrong anyway.

But let's see... 

Ah fun, if you change the hostname of the box, either X or KDE can't pop up 
any new applications until you exit X and restart it.  Brilliant.  
Considering how many Konqueror windows I have open at present on my 6 
desktops, I think I'll leave fixing this until later in the evening.  But 
thanks for letting me know something's up...

>
> Note, the power must RELIABLY last until all of the data has been
> written, which includes reassigning, seeking and the like, just don't do
> it if you cannot get a real solution.

A) At most 1 seek to a track other than the one you're on.

B) If sectors have been reassigned outside of this track to a "recovery" 
track, then that counts as a separate track.  Tough.

The point of the buffer is to let the OS feed data to the write head as fast 
as it can write it (which unbuffered ATA can't do because individual requests 
are smaller than individual tracks).  You need a small buffer to avoid 
blocking between each and every ATA write while the platter rotates back into 
position.  So you always let it have a little more data so it knows what to 
do next and can start work on it immediately (doing that next seek, writing 
that next sector as it passes under the head without having to wait for it to 
rotate around again.)

That's it.  No more buffer than does good at the hardware level for request 
merging and minimizing seek latency.  Any buffering over and above that is 
the operating system's job.

Yes the hardware can do a slightly better job with its own elevator algorithm 
using intimate knowledge of on-disk layout, but the OS can do a fairly decent 
job as long as logical linear sectors are linearly arranged on disk too.  
(Relocating bad sectors breaks this, but not fatally.  It causes extra seeks 
in linear writes anyway where the elevator ISN'T involved, so you've already 
GOT a performance hit.  And it just screws up the OS's elevator, not the rest 
of the scheme.  You still have the current track written as one lump and an 
immediate seek to the other track, at which point the drive electronics can 
be accepting blocks destined for the track you seek back to.)
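
As a rough sketch of the OS-side half of this (not code from any kernel; the 
function names and sector numbers are made up for illustration), a one-way 
elevator plus request merging over logical sector numbers can look like the 
following.  It assumes logical sector order matches physical order, which is 
exactly the assumption that relocated sectors break.

```python
def elevator_order(pending, head_pos):
    """One-way elevator (C-SCAN style): service requests at or beyond the
    head position in ascending order, then wrap around to the lowest ones."""
    ahead = sorted(s for s in pending if s >= head_pos)
    behind = sorted(s for s in pending if s < head_pos)
    return ahead + behind


def merge_runs(sectors):
    """Coalesce an ordered list of sector numbers into (start, length) runs,
    so adjacent sectors go out as one larger write."""
    runs = []
    for s in sectors:
        if runs and runs[-1][0] + runs[-1][1] == s:
            # This sector directly follows the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((s, 1))
    return runs


# Hypothetical queue of logical sector numbers, head currently at sector 50.
order = elevator_order([90, 10, 55, 56, 57, 11], head_pos=50)
runs = merge_runs(order)
```

Here the three adjacent sectors 55 to 57 collapse into a single run, which is 
the kind of merging the drive would otherwise need its own buffer to do.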

The advantage of limiting the amount of data buffered to current track plus 
one other is you have a fixed amount of work to do on a loss of power.  One 
seek, two track writes, and a spring-driven park.  The amount of power this 
takes has a deterministic upper bound.  THAT is why you block before 
accepting more data than that.
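
To make the "block before accepting more" rule concrete, here is a minimal 
sketch of such a buffer.  It is a toy model, not drive firmware: the class 
name, the uniform sectors-per-track figure (real drives vary this by zone) 
and the return-False-instead-of-blocking convention are all assumptions for 
illustration.

```python
class BoundedWriteBuffer:
    """Holds pending writes for at most `max_tracks` distinct tracks, so the
    work left to do on power loss is bounded (here: two track writes)."""

    def __init__(self, sectors_per_track, max_tracks=2):
        self.sectors_per_track = sectors_per_track
        self.max_tracks = max_tracks
        self.pending = {}  # track number -> {sector number: data}

    def try_accept(self, lba, data):
        """Return True if the write was buffered, False if the host must
        wait until a buffered track has been flushed."""
        track = lba // self.sectors_per_track
        if track not in self.pending and len(self.pending) >= self.max_tracks:
            return False  # a third track would break the fixed bound
        self.pending.setdefault(track, {})[lba] = data
        return True

    def flush_track(self, track):
        """Simulate writing one track out; returns its sectors in order."""
        return sorted(self.pending.pop(track, {}).items())


buf = BoundedWriteBuffer(sectors_per_track=100)
buf.try_accept(5, b"a")               # track 0, the current track
buf.try_accept(250, b"b")             # track 2, the one other track
refused = not buf.try_accept(900, b"d")   # track 9 would be a third track
```

Once one of the two buffered tracks has been flushed, the refused write is 
accepted on retry, so the host only ever stalls for about one track write.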

> battery-backed CMOS,
> NVRAM/Flash/whatever which lasts a couple of months should be fine
> though, as long as documents are publicly available that say how long
> this data lasts. Writing to disk will not work out unless you can keep
> the drive going for several seconds which will require BIG capacitors,
> so that's no option, you must go for NVRAM/Flash or something.

You don't need several seconds.  You need MILLISECONDS.  Two track writes and 
one seek.  This is why you don't accept more data than that before blocking.  
Your worst case scenario is a seek from near where the head parks to the 
other end of the disk, then the spring can pull it back.  This should be well 
under 50 milliseconds.  Your huge RAM cache is there for reads.  For writes 
you don't accept more than you can reliably flush if you want anything 
approaching reliability.  If you're only going to spring for a capacitor as 
your power failure hedge, then the amount of write cache you can accept is 
small, but it turns out you only need a tiny amount of cache to get 90% of 
the benefit of write caching (merging writes into full tracks and seeking 
immediately to the next track).
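
A back-of-envelope version of that arithmetic, as a sketch only.  The 20 ms 
full-stroke seek and the spindle speeds are assumed round numbers, not 
figures from any datasheet, and the model assumes a full-track write takes 
one rotation and can start at whatever sector is under the head.

```python
def worst_case_flush_ms(rpm, full_stroke_seek_ms, tracks_to_write=2):
    """Upper bound on the time to flush the buffer after power loss:
    one worst-case seek plus one full rotation per buffered track."""
    rotation_ms = 60_000.0 / rpm  # milliseconds per platter revolution
    return full_stroke_seek_ms + tracks_to_write * rotation_ms


# Assumed figures: 20 ms full-stroke seek, common spindle speeds.
for rpm in (5400, 7200):
    print(rpm, "rpm:", round(worst_case_flush_ms(rpm, 20.0), 1), "ms")
```

With those assumptions both cases land under 50 ms, and the bound depends 
only on spindle speed and seek time, not on how much the host wants to write.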

> OTOH, the OS must reliably know when something went wrong (even with
> good power it has a right to know), and preferably this scheme should
> not involve disabling the write cache, so TCQ or something mandatory
> would be useful (not sure if it's mandatory in current ATA standards).

We're talking about what happens to the drive on a catastrophic power 
failure.  (Even with a UPS, this can happen if your case fan jams and your 
power supply catches fire and burns through a wire.  Although most server side 
hosting facilities aren't that dusty, there's always worn bearings and other 
such fun things.  And in a desktop environment, spilled sodas.)  Currently, 
there are drives out there that stop writing a sector in the middle, leaving 
a bad CRC at the hardware level.  This isn't exactly graceful.  At the other 
end, drives with huge caches discard the contents of cache which a journaling 
filesystem thinks are already on disk.  This isn't graceful either.

> If a block has first been reported written OK and the disk later reports
> error, it must send the block back (incompatible with any current ATA
> draft I had my hands on), so I think tagged commands which are marked
> complete only after write+verify are the way to go.

If a block goes bad WHILE power is failing, you're screwed.  This is just a 
touch unlikely.  It will happen to somebody out there someday, sure.  So will 
alpha particle decay corrupting a sector that was long ago written to the 
drive correctly.  Designing for that is not practical.  Recovering after the 
fact might be, but that doesn't mean you get your data back.

Rob