public inbox for linux-kernel@vger.kernel.org
From: Rob Landley <landley@trommello.org>
To: Matthias Andree <matthias.andree@stud.uni-dortmund.de>,
	linux-kernel@vger.kernel.org
Subject: Re: Journaling pointless with today's hard disks?
Date: Wed, 28 Nov 2001 13:46:24 -0500	[thread overview]
Message-ID: <01112813462404.01163@driftwood> (raw)
In-Reply-To: <Pine.LNX.4.10.10111261229190.8817-100000@master.linux-ide.org> <01112715312104.01486@localhost> <20011128194302.A29500@emma1.emma.line.org>
In-Reply-To: <20011128194302.A29500@emma1.emma.line.org>

This is wandering far enough off topic that I'm not going to CC l-k after 
this message...


On Wednesday 28 November 2001 13:43, Matthias Andree wrote:
> On Tue, 27 Nov 2001, Rob Landley wrote:
> > On Tuesday 27 November 2001 11:50, Matthias Andree wrote:
> > > Note, the power must RELIABLY last until all of the data has been
> > > written, which includes reassigning, seeking and the like, just don't do
> > > it if you cannot get a real solution.
> >
> > A) At most 1 seek to a track other than the one you're on.
>
> Not really, assuming drives don't write to multiple heads concurrently,

Not my area of expertise.  Depends how cheap they're being, I'd guess.  
Writing multiple tracks concurrently is probably more of a current drain than 
writing a single track at a time anyway, by the way.

> 2 MB hardly fit on a track. We can assume several hundred sectors, say
> 1,000, so we need four track writes, four verifies, and not a single
> block may be broken. We need even more time if we need to rewrite.

A 7200 RPM drive does 120 RPS, which means one revolution takes 8.3 
milliseconds.  We're still talking about a deterministic number of 
milliseconds with a double-digit total.
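
The arithmetic can be checked with a minimal sketch.  It assumes one full 
revolution per write or verify pass and no seeks in between; the four writes 
plus four verifies are the figures from the quoted objection, and the function 
name is only illustrative.

```python
# Rotation arithmetic for the 7200 RPM figure above.
# Assumption: each write or verify pass costs exactly one revolution.

def revolution_ms(rpm: float) -> float:
    """Time for one platter revolution, in milliseconds."""
    return 60_000.0 / rpm

rev = revolution_ms(7200)   # 7200 RPM is 120 revolutions per second
passes = 4 + 4              # four track writes plus four verify passes
total = passes * rev        # no seek or rewrite time included

print(f"one revolution: {rev:.1f} ms")
print(f"{passes} passes: {total:.1f} ms")
```

Even the pessimistic eight-pass case stays under 100 ms, before any seek or 
rewrite time is added.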

And again, it depends on how you define "track".  If the two tracks you can 
buffer live on separate sides of platters that you can't write to 
concurrently (not necessarily separated by a seek), then there is still no 
problem.  (After the first track is written and the drive starts on the 
second, the system still has 8.3 ms to buffer another track before the drive 
drops below full writing speed.)

It's all a question of limiting how much you buffer to what you can flush 
out.  Artificial objections about "I could have 8 zillion platters I can only 
write to one at a time" just mean you're buffering more than you can write 
out.
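
As a rough illustration of "block before accepting more than you can flush", 
here is a hypothetical sketch, not any real drive firmware.  The class and 
method names are made up, the 1,000 sectors per track is the figure from the 
quoted text, and 512-byte sectors are an assumption.

```python
class BoundedWriteCache:
    """Accept writes only while everything buffered can still be flushed."""

    def __init__(self, track_bytes: int, max_tracks: int = 2):
        # Capacity is a fixed number of tracks: current track plus one other.
        self.capacity = track_bytes * max_tracks
        self.pending = 0

    def try_write(self, nbytes: int) -> bool:
        """Return True if accepted; False means the host has to wait."""
        if self.pending + nbytes > self.capacity:
            return False
        self.pending += nbytes
        return True

    def flushed(self, nbytes: int) -> None:
        """Record that nbytes have reached the platter."""
        self.pending = max(0, self.pending - nbytes)


TRACK_BYTES = 1000 * 512                 # assumed: 1,000 sectors of 512 bytes
cache = BoundedWriteCache(TRACK_BYTES)
print(cache.try_write(TRACK_BYTES))      # first track fits
print(cache.try_write(TRACK_BYTES))      # second track fits
print(cache.try_write(512))              # over the limit, host must block
cache.flushed(TRACK_BYTES)               # one track has been written out
print(cache.try_write(512))              # room again
```

The point of the fixed capacity is that the work left to do on power loss 
never exceeds what was budgeted for, whatever the host tries to send.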

> > That's it.  No more buffer than does good at the hardware level for
> > request merging and minimizing seek latency.  Any buffering over and
> > above that is the operating system's job.
>
> Effectively, that's what tagged command queueing is all about, send a
> batch of requests that can be acknowledged individually and possibly out
> of order (which can lead to a trivial write barrier as suggested
> elsewhere, because all you do is wait with scheduling until the disk is
> idle, then send the past-the-barrier block).

Doesn't stop the "die in the middle of a write=crc error" problem.  And I'm 
not saying tagged command queueing is a bad idea, I'm just saying the idea's 
been out there forever and not everybody's done it yet, and this is a 
potentially simpler alternative focusing on the minimal duct-tape approach to 
reliability by reducing the level of guarantees you have to make.

> > (Relocating bad sectors breaks this, but not fatally.  It causes extra
> > seeks in linear writes anyway where the elevator ISN'T involved, so
> > you've already GOT a performance hit.
>
> On modern drives, bad sectors are reassigned within the same track to
> evade seeks for a single bad block. If the spare block area within that
> track is exhausted, bad luck, you're going to seek.

Cool then.

> > The advantage of limiting the amount of data buffered to current track
> > plus one other is you have a fixed amount of work to do on a loss of
> > power.  One seek, two track writes, and a spring-driven park.  The amount
> > of power this takes has a deterministic upper bound.  THAT is why you
> > block before accepting more data than that.
>
> It has not, you don't know in advance how many blocks on your journal
> track are bad.

Another reason not to worry about an explicit dedicated journal track and 
just buffer one extra normal data track and budget in the power for a seek to 
it if necessary.

There are circumstances where this will break down, sure.  Any disk that has 
enough bad sectors on it will stop working.  But that shouldn't be the normal 
case on a fresh drive, as is happening now with IBM.

> > You don't need several seconds.  You need MILLISECONDS.  Two track writes
> > and one seek.  This is why you don't accept more data than that before
> > blocking.
>
> No, you must verify the write, so that's one seek (say 35 ms, slow
> drive ;) and two revolutions per track at least, and, as shown, more
> than one track usually

So don't buffer 4 tracks and call it one track.  That's an artificial 
objection.

An extra revolution is less than a seek, and noticeably less in power terms.
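
Putting the quoted numbers together, as a sketch only: one 35 ms seek, two 
buffered tracks, and two revolutions per track (write plus verify) at 7200 
RPM.  The function name is hypothetical, and the model ignores rotational 
latency before each track starts, rewrites of bad blocks, and the park.

```python
def flush_budget_ms(rpm: float, seek_ms: float,
                    tracks: int, revs_per_track: int) -> float:
    """Time for one seek plus the given number of track passes, in ms."""
    rev_ms = 60_000.0 / rpm
    return seek_ms + tracks * revs_per_track * rev_ms

# One seek, two tracks, each written and then verified.
budget = flush_budget_ms(rpm=7200, seek_ms=35.0, tracks=2, revs_per_track=2)
print(f"flush budget: {budget:.1f} ms")
```

With these inputs the seek alone costs more than the four revolutions 
combined, which is why the verify pass doesn't change the picture much.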

>, so any bets of upper bounds are off. In the
> average case, say 70 ms should suffice, but in adverse conditions, that
> does not suffice at all. If writing the journal in the end fails because
> power is failing, the data is lost, so nothing is gained.
>
> > under 50 milliseconds.  Your huge ram cache is there for reads.  For
> > writes you don't accept more than you can reliably flush if you want
> > anything approaching reliability.
>
> Well, that's the point, you don't know in advance how reliable your
> journal track is.

We don't know in advance that the drive won't fail completely due to 
excessive bad blocks.  I'm trying to move the failure point, not pretending 
to eliminate it.  Right now we've got something that could easily take out 
multiple drives in a RAID 5, and something that desktop users are likely to 
see noticeably more often than they upgrade their system.

> > such fun things.  And in a desktop environment, spilled sodas.) 
> > Currently, there are drives out there that stop writing a sector in the
> > middle, leaving a bad CRC at the hardware level.  This isn't exactly
> > graceful.  At the other end, drives with huge caches discard the contents
> > of cache which a journaling filesystem thinks are already on disk.  This
> > isn't graceful either.
>
> No-one said bad things cannot happen, but that is what actually happens.
> Where we started from, fsck would be able to "repair" a bad block by
> just zeroing and writing it, data that used to be there will be lost
> after short write anyhow.

Assuming the drive's inherent bad-block detection mechanisms don't find it 
and remap it on a read first, rapidly consuming the spare block reserve.  But 
that's a firmware problem...

> Assuming that write errors on an emergency cache flush just won't happen
> is just as wrong as assuming 640 kB will suffice or there's an upper
> bound of write time. You just don't know.

I don't assume they won't happen.  They're actually more LIKELY to happen as 
the power level gradually drops as the capacitor discharges.  I'm just saying 
there's a point beyond which any given system can't recover, and a point of 
diminishing returns trying to fix things.

I'm proposing a cheap and easy improvement over the current system.  I'm not 
proposing a system hardened to military specifications, just something that 
shouldn't fail noticeably for the majority of its users on a regular basis.  
(Powering down without flushing the cache is a bad thing.  It shouldn't 
happen often.  This is a last ditch deal-with-evil safety net system that has 
a fairly good chance of saving the data without extensively redesigning the 
whole system.  Never said it was perfect.  If a "1 in 2" failure rate drops 
to "1 in 100,000", it'll still hit people.  But it's a distinct improvement.  
Maybe it can be improved beyond that.  That would be nice.  What's the 
effort, expense, and inconvenience involved?)

Rob
