[Lustre-devel] Integrity and corruption - can file systems be scalable?

From: Nicolas Williams <Nicolas.Williams@oracle.com>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] Integrity and corruption - can file systems be scalable?
Date: Fri, 2 Jul 2010 17:21:51 -0500
Message-ID: <20100702222151.GG15407@oracle.com>
In-Reply-To: <AANLkTin1hOwVV2EzBvnPSGoHQr7kPYo_ulncZP77KL5H@mail.gmail.com>

On Fri, Jul 02, 2010 at 03:39:42PM -0600, Peter Braam wrote:
> On Fri, Jul 2, 2010 at 3:18 PM, Dmitry Zogin <dmitry.zoguine@oracle.com> wrote:
> The post also mentions copy on write checkpoints, and their usefulness has
> not been proven.  There has been no study about this, and certainly in many
> cases they are implemented in such a way that bugs in the software can
> corrupt them.  For example, most volume level copy on write schemes actually
> copy the old data instead of leaving it in place, which is a vulnerability.
>  Shadow copies are vulnerable to software bugs, things would get better if
> there was something similar to page protection for disk blocks.

Well-delineated transactions are certainly useful.  The reason: you can
fsck each transaction discretely and incrementally.  That means that you
know exactly how much work must be done to fsck a priori.  Sure, you
still have to be confident that N correct transactions == correct
filesystem, but that's much easier to be confident of than software
correctness.  (It'd be interesting to apply theorem provers to theorems
related to on-disk data formats!) 
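
To make the "known a priori" point concrete, here is a minimal sketch, not
taken from any real filesystem. It assumes each transaction records the
blocks it wrote together with their checksums, and that writes are
copy-on-write so a later transaction never overwrites an earlier one's
blocks. The names Transaction, make_transaction, check_transaction and
fsck_work are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    """SHA-256 of a block, as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Transaction:
    """Hypothetical transaction record: which blocks it wrote, and their checksums."""
    txg: int
    writes: dict  # block number -> expected checksum


def make_transaction(txg: int, disk: dict, updates: dict) -> Transaction:
    """Write the updated blocks to `disk` and record a checksum for each one."""
    writes = {}
    for blk, data in updates.items():
        disk[blk] = data
        writes[blk] = checksum(data)
    return Transaction(txg, writes)


def fsck_work(txn: Transaction) -> int:
    """Number of blocks a check of this transaction must read, known up front."""
    return len(txn.writes)


def check_transaction(disk: dict, txn: Transaction) -> list:
    """Verify only the blocks this transaction wrote; return the bad block numbers."""
    return sorted(
        blk
        for blk, expected in txn.writes.items()
        if checksum(disk.get(blk, b"")) != expected
    )
```

The cost of checking one transaction depends only on the size of that
transaction, not on the size of the filesystem. Whether N such checks add
up to a correct filesystem is the separate question raised above.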

Another problem, incidentally, is software correctness on the read side.
It's nice to know that no bugs on the write side will corrupt your
filesystem, but read-side bugs that cause your data to be unavailable
are not good either.  The distinction between bugs in the write vs. read
sides is subtle: recovery from the latter is just a patch away, while
recovery from the former might require long fscks, or even more manual
intervention (e.g., writing a better fsck).

> I wrote this post because I'm unconvinced with the barrage of by now
> endlessly repeated ideas like checkpoints, checksums etc, and the falsehood
> of the claim that advanced file systems address these issues - they only
> address some, and leave critical vulnerability.

I do believe COW transactions + Merkle hash trees are _the_ key aspect
of the solution.  Because only by making fscks incremental and discrete
can we get a handle on the amount of time that must be spent waiting for
fscks to complete.  Without incremental fscks there'd be no hope as
storage capacity outstrips storage and compute bandwidth.
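
As an illustration only (a toy in-memory tree, assuming a power-of-two
number of blocks and SHA-256; Node, build, cow_update and changed_nodes
are hypothetical names, not any filesystem's API), the following sketch
shows why COW plus a Merkle tree keeps verification incremental: an
update creates new nodes only along the path from the changed leaf to the
root, and the old root remains intact.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Node:
    """Merkle tree node; a leaf holds data, an interior node holds two children."""
    __slots__ = ("left", "right", "data", "digest")

    def __init__(self, left=None, right=None, data=None):
        self.left, self.right, self.data = left, right, data
        if data is not None:
            self.digest = h(b"leaf:" + data)
        else:
            self.digest = h(b"node:" + left.digest + right.digest)


def build(blocks):
    """Build a tree bottom-up over a power-of-two number of blocks."""
    level = [Node(data=b) for b in blocks]
    while len(level) > 1:
        level = [Node(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def cow_update(root, index, size, data):
    """Return a new root with leaf `index` replaced.

    Nothing is modified in place: untouched subtrees are shared with the
    old tree, so the old root still describes the old state.
    """
    if size == 1:
        return Node(data=data)
    half = size // 2
    if index < half:
        return Node(cow_update(root.left, index, half, data), root.right)
    return Node(root.left, cow_update(root.right, index - half, half, data))


def changed_nodes(old, new):
    """Count nodes in `new` that are not shared with `old` (same tree shape assumed)."""
    if old is new:
        return 0
    if new.data is not None:
        return 1
    return 1 + changed_nodes(old.left, new.left) + changed_nodes(old.right, new.right)
```

For a tree over 8 blocks, a single-block update creates 4 new nodes (the
leaf plus its three ancestors), so a check of that transaction visits a
number of nodes logarithmic in the number of blocks rather than the whole
tree.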

If you believe that COW, transactional, Merkle trees are an
anti-solution, or if you believe that they are only a tiny part of the
solution, please argue that view.  Otherwise I think your use of
"barrage" here is a bit over the top (nay, a lot over the top).  It's
one thing to be missing a part of the solution, and it's another to be
on the wrong track, or missing the largest part of the solution.
Extraordinary claims and all that...

(And no, manually partitioning storage into discrete "filesystems",
"filesets", "datasets", whatever, is not a solution; at most it's a
bandaid.)

Nico
--