From: Lars Marowsky-Bree <lmb@suse.de>
To: "Peter T. Breuer" <ptb@it.uc3m.es>
Cc: linux kernel <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] mount flag "direct"
Date: Sat, 7 Sep 2002 23:14:53 +0200 [thread overview]
Message-ID: <20020907211452.GA24476@marowsky-bree.de> (raw)
In-Reply-To: <200209071959.g87JxKN17732@oboe.it.uc3m.es>
On 2002-09-07T21:59:20,
"Peter T. Breuer" <ptb@it.uc3m.es> said:
> > Yes, use a distributed filesystem. There are _many_ out there; GFS, OCFS,
> > OpenGFS, Compaq has one as part of their SSI, Inter-Mezzo (sort of), Lustre,
> > PvFS etc.
> Eh, I thought I saw this - didn't I reply?
No, you didn't.
> > No one appreciates reinventing the wheel another time, especially if - for
> > simplification - it starts out as a square.
> But what I suggest is finding a simple way to turn an existing FS into a
> distributed one. I.e. NOT reinventing the wheel. All those other people
> are reinventing a wheel, for some reason :-).
Well, actually, they aren't exactly. The hard part of a "distributed
filesystem" isn't the filesystem itself, necessary though it is. The locking,
synchronization and cluster infrastructure is where the real difficulty tends
to arise.
Yes, it can be argued whether it is in fact easier to create a filesystem from
scratch with clustering in mind (so it is "optimised" for being able to do
fine-grained locking etc), or to prop a generic clustering layer on top of
existing ones.
The guesstimates of those involved in the past seem to suggest that the first
is the case. I also tend to think so, but I've been wrong before.
That would - indeed - be very helpful research to do. I would start by
comparing the places where those specialized fs's actually are doing cluster
related stuff and checking whether it can be abstracted, generalized and
improved. In any case, trying to pick apart OpenGFS, for example, will give
you more insight into the problem area than a discussion on l-k.
If you want to look into "turn a local fs into a cluster fs", SGI has a
"clustered XFS"; however, I'm not too sure how public that extension is. The
hooks might be in the common XFS core, though.
Now, going on with the gedankenexperiment, given a distributed lock manager
(IBM open-sourced one of theirs, though it is not currently perfectly working
;), the locking primitives in the filesystems could "simply" be changed from
local-node SMP spinlocks to cluster-wide locks.
That _should_ to a large degree take care of the locking.
What remains is the invalidation of cache pages; I would expect similar
problems to have arisen in NC-NUMA-style systems, so looking there should
provide hints.
> > You fail to give a convincing reason why that must be made to work with
> > "all" conventional filesystems, especially given the constraints this
> > implies.
> Because that's the simplest thing to do.
Why? I disagree.
You will have to modify existing file systems quite a bit to work
_efficiently_ in a cluster environment; not even the on-disk layout is
guaranteed to stay consistent as soon as you add per-node journals etc. The
real complexity is in the distributed nature, in particular the recovery (see
below).
"Simplest thing to do" might be to take your funding and give it to the
OpenGFS group or have someone fix the Oracle Cluster FS.
> > In particular, they make them useless for the requirements you seem to
> > have. A petabyte filesystem without journaling? A petabyte filesystem with
> > a single write lock? Gimme a break.
> Journalling? Well, now you mention it, that would seem to be nice.
"Nice" ? ;-) You gotta be kidding. If you don't have journaling, distributed
recovery becomes near impossible - at least I don't have a good idea on how to
do it if you don't know what the node had been working on prior to its
failure.
If "take down the entire filesystem on all nodes, run fsck" is your answer to
that, I will start laughing in your face. Because then your requirements are
kind of from outer space and will certainly not reflect a large part of the
user base.
> > Please, do the research and tell us what features you desire to have which
> > are currently missing, and why implementing them essentially from scratch
> > is
> No features.
So they implement what you need, but you don't like them because there are
just so few of them to choose from? Interesting.
> Just take any FS that currently works, and see if you can distribute it.
> Get rid of all fancy features along the way. The projects involved are
> huge, and they need to minimize risk, and maximize flexibility. This is
> CERN, by the way.
Well, you are taking quite a risk trying to run a
not-aimed-at-distributed-environments fs and trying to make it distributed by
force. I _believe_ that you are missing where the real trouble lurks.
You maximize flexibility for mediocre solutions; little caching, no journaling
etc.
What does this supposed "flexibility" buy you? Is there any real value in it
or is it a "because!" ?
> You mean "what's wrong with X"? Well, it won't be mainstream, for a start,
> and that's surely enough.
I have pulled these two sentences out because I don't get them. What "X" are
you referring to?
> of some kind. I need to explore as much as I can and get as much as I
> can back without "doing it first", because I need the insight you can
> offer.
The insight I can offer you is look at OpenGFS, see and understand what it
does, why and how. Then try to come up with a generic approach on how to put
this on top of a generic filesystem, without making it useless.
Then I shall be amazed.
> There is no difficulty with that - there are no distributed locks. All locks
> are held on the server of the disk (I decided not to be complicated to
> begin with as a matter of principle early in life ;-).
Maybe you and I have a different idea of "distributed fs". I thought you had a
central pool of disks.
You want there to be local disks at each server, and other nodes can read
locally and have it appear as a big, single filesystem? You'll still have to
deal with node failure though.
Interesting.
One might consider peeling apart meta-data (which always goes through the
"home" node) and data (which goes directly to disk via the SAN); if necessary,
the reply to the meta-data request to the home node could tell the node where
to write/read. This smells a lot like cXFS and co with a central metadata
server.
> > recovery. ("Transaction processing" is an exceptionally good book on that
> > I believe)
> Thanks but I don't feel like rolling it out and rolling it back!
Please explain how you'll recover anywhere close to "fast", or even
"acceptably", without transactions. Even if you don't have to fsck the
petabyte filesystem completely, do a benchmark on how long e2fsck takes on,
oh, 50 GB alone.
> Thanks for the input. I don't know what I was supposed to take away
> from it though!
I apologize and am sorry if you didn't notice.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister