* Re: GFS, what's remaining
2005-09-02 21:17 ` Andi Kleen
@ 2005-09-02 23:03 ` Bryan Henderson
2005-09-03 0:16 ` Mark Fasheh
` (3 subsequent siblings)
4 siblings, 0 replies; 106+ messages in thread
From: Bryan Henderson @ 2005-09-02 23:03 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux clustering, linux-fsdevel, linux-kernel
I have to correct an error in perspective, or at least in wording, in the
following, because it affects how people see the big picture when deciding
how the filesystem types in question fit into the world:
>Shared storage can be more efficient than network file
>systems like NFS because the storage access is often more efficient
>than network access
The shared storage access _is_ network access. In most cases, it's a
Fibre Channel/FCP network. Nowadays, it's more and more common for it to
be a TCP/IP network just like the one folks use for NFS (but carrying
iSCSI instead of NFS). It's also been done with a handful of other
TCP/IP-based block storage protocols.
The reason the storage access is expected to be more efficient than the
NFS access is that the block access network protocols are supposed to
be more efficient than the file access network protocols.
In reality, I'm not sure there really is such a difference in efficiency
between the protocols. The demonstrated differences in efficiency, or at
least in speed, are due to other things that differ between a given
new shared-block implementation and a given old shared-file
implementation.
But there's another advantage to shared block over shared file that hasn't
been mentioned yet: some people find it easier to manage a pool of blocks
than a pool of filesystems.
>it is more reliable because it doesn't have a
>single point of failure in form of the NFS server.
This advantage isn't because it's shared (block) storage, but because it's
a distributed filesystem. There are shared storage filesystems (e.g. IBM
SANFS, ADIC StorNext) that have a centralized metadata or locking server
that makes them unreliable (or unscalable) in the same ways as an NFS
server.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: GFS, what's remaining
2005-09-02 21:17 ` Andi Kleen
2005-09-02 23:03 ` Bryan Henderson
@ 2005-09-03 0:16 ` Mark Fasheh
2005-09-03 6:42 ` Daniel Phillips
2005-09-03 5:57 ` Daniel Phillips
` (2 subsequent siblings)
4 siblings, 1 reply; 106+ messages in thread
From: Mark Fasheh @ 2005-09-03 0:16 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-fsdevel, linux clustering, linux-kernel
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.
As far as userspace dlm apis go, dlmfs already abstracts away a large part
of the dlm interaction, so writing a module against another dlm looks like
it wouldn't be too bad (startup of a lockspace is probably the most
difficult part there).
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: GFS, what's remaining
2005-09-03 0:16 ` Mark Fasheh
@ 2005-09-03 6:42 ` Daniel Phillips
2005-09-03 6:46 ` Wim Coekaerts
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-03 6:42 UTC (permalink / raw)
To: Mark Fasheh
Cc: akpm, linux-fsdevel, linux clustering, Andi Kleen, linux-kernel
On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> As far as userspace dlm apis go, dlmfs already abstracts away a large part
> of the dlm interaction...
Dumb question, why can't you use sysfs for this instead of rolling your own?
Side note: you seem to have deleted all the 2.6.12-rc4 patches. Perhaps you
forgot that there are dozens of lkml archives pointing at them?
Regards,
Daniel
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-03 6:42 ` Daniel Phillips
@ 2005-09-03 6:46 ` Wim Coekaerts
2005-09-03 22:21 ` Daniel Phillips
0 siblings, 1 reply; 106+ messages in thread
From: Wim Coekaerts @ 2005-09-03 6:46 UTC (permalink / raw)
To: linux clustering; +Cc: akpm, linux-fsdevel, Andi Kleen, linux-kernel
On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > As far as userspace dlm apis go, dlmfs already abstracts away a large part
> > of the dlm interaction...
>
> Dumb question, why can't you use sysfs for this instead of rolling your own?
because it's totally different. Have a look at what it does.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-03 6:46 ` Wim Coekaerts
@ 2005-09-03 22:21 ` Daniel Phillips
2005-09-04 1:09 ` [Linux-cluster] " Joel Becker
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-03 22:21 UTC (permalink / raw)
To: Wim Coekaerts
Cc: akpm, linux clustering, linux-fsdevel, Andi Kleen, linux-kernel
On Saturday 03 September 2005 02:46, Wim Coekaerts wrote:
> On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> > On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > > As far as userspace dlm apis go, dlmfs already abstracts away a large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling your
> > own?
>
> because it's totally different. have a look at what it does.
You create a dlm domain when a directory is created. You create a lock
resource when a file of that name is opened. You lock the resource when the
file is opened. You access the lvb by reading/writing the file. Why doesn't
that fit the configfs-nee-sysfs model? If it does, the payoff will be about
500 lines saved.
This little dlm fs is very slick, but grossly inefficient. Maybe efficiency
doesn't matter here since it is just your slow-path userspace tools taking
these locks. Please do not even think of proposing this as a way to export a
kernel-based dlm for general purpose use!
Your userdlm.c file has some hidden gold in it. You have factored the dlm
calls far more attractively than the bad old bazillion-parameter Vaxcluster
legacy. You are almost in system call zone there. (But note my earlier
comment on dlms in general: until there are dlm-based applications, merging a
general-purpose dlm API is pointless and has nothing to do with getting your
filesystem merged.)
Regards,
Daniel
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-03 22:21 ` Daniel Phillips
@ 2005-09-04 1:09 ` Joel Becker
2005-09-04 1:32 ` Andrew Morton
0 siblings, 1 reply; 106+ messages in thread
From: Joel Becker @ 2005-09-04 1:09 UTC (permalink / raw)
To: linux clustering
Cc: Wim Coekaerts, akpm, linux-fsdevel, Andi Kleen, linux-kernel
On Sat, Sep 03, 2005 at 06:21:26PM -0400, Daniel Phillips wrote:
> that fit the configfs-nee-sysfs model? If it does, the payoff will be about
> 500 lines saved.
I'm still awaiting your merge of ext3 and reiserfs, because you
can probably save 500 lines by having a filesystem that can create reiser
and ext3 files at the same time.
Joel
--
Life's Little Instruction Book #267
"Lie on your back and look at the stars."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-04 1:09 ` [Linux-cluster] " Joel Becker
@ 2005-09-04 1:32 ` Andrew Morton
2005-09-04 3:06 ` Joel Becker
0 siblings, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2005-09-04 1:32 UTC (permalink / raw)
To: Joel Becker; +Cc: linux-fsdevel, linux-kernel, linux-cluster, ak
Joel Becker <Joel.Becker@oracle.com> wrote:
>
> On Sat, Sep 03, 2005 at 06:21:26PM -0400, Daniel Phillips wrote:
> > that fit the configfs-nee-sysfs model? If it does, the payoff will be about
> > 500 lines saved.
>
> I'm still awaiting your merge of ext3 and reiserfs, because you
> can save probably 500 lines having a filesystem that can create reiser
> and ext3 files at the same time.
oy. Daniel is asking a legitimate question.
If there's duplicated code in there then we should seek to either make the
code multi-purpose or place the common or reusable parts into a library
somewhere.
If neither approach is applicable or practical for *every single function*
then fine, please explain why. AFAIR that has not been done.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-04 1:32 ` Andrew Morton
@ 2005-09-04 3:06 ` Joel Becker
2005-09-04 4:22 ` [Linux-cluster] " Daniel Phillips
0 siblings, 1 reply; 106+ messages in thread
From: Joel Becker @ 2005-09-04 3:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-kernel, linux-cluster, ak
On Sat, Sep 03, 2005 at 06:32:41PM -0700, Andrew Morton wrote:
> If there's duplicated code in there then we should seek to either make the
> code multi-purpose or place the common or reusable parts into a library
> somewhere.
Regarding sysfs and configfs, that's a whole 'nother
conversation. I've not yet come up with a function involved that is
identical, but that's a response for another email.
Understanding that Daniel is talking about dlmfs, dlmfs is far
more similar to devptsfs, tmpfs, and even sockfs and pipefs than it is
to sysfs. I don't see him proposing that sockfs and devptsfs be folded
into sysfs.
dlmfs is *tiny*. The VFS interface is less than his claimed 500
lines of savings. The few VFS callbacks do nothing but call DLM
functions. You'd have to replace this VFS glue with sysfs glue, and
probably save very few lines of code.
In addition, sysfs cannot support the dlmfs model. In dlmfs,
mkdir(2) creates a directory representing a DLM domain and mknod(2)
creates the user representation of a lock. sysfs doesn't support
mkdir(2) or mknod(2) at all.
More than mkdir() and mknod(), however, dlmfs uses open(2) to
acquire locks from userspace. O_RDONLY acquires a shared read lock (PR
in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a
trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock
is released via close(2). If a process dies, close(2) happens. In
other words, ->release() handles all the cleanup for normal and abnormal
termination.
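In user-space terms, that model is roughly the following (a minimal
sketch; the mount point, domain name, and failure details are
illustrative assumptions, not OCFS2's actual layout):

/* dlmfs model sketch: the open(2) mode selects the lock level. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* O_RDWR asks for an exclusive (EX) lock; O_RDONLY would ask for
         * a shared read (PR) lock.  O_NONBLOCK turns this into a trylock. */
        int fd = open("/dlm/mydomain/mylock", O_RDWR | O_NONBLOCK);

        if (fd < 0) {
                perror("trylock");      /* held elsewhere, or no dlmfs mounted */
                return 1;
        }
        /* ... critical section: the lock is held while the fd is open ... */
        close(fd);      /* releases the lock; process death releases it too */
        return 0;
}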
sysfs does not allow hooking into ->open() or ->release(). So
this model, and the inherent lifetiming that comes with it, cannot be
used. If dlmfs was changed to use a less intuitive model that fits
sysfs, all the handling of lifetimes and cleanup would have to be added.
This would make it more complex, not less complex. It would give it a
larger code size, not a smaller one. In the end, it would be harder to
maintain, less intuitive to use, and larger.
Joel
--
"Anything that is too stupid to be spoken is sung."
- Voltaire
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 3:06 ` Joel Becker
@ 2005-09-04 4:22 ` Daniel Phillips
2005-09-04 4:30 ` Joel Becker
2005-09-04 4:46 ` Andrew Morton
0 siblings, 2 replies; 106+ messages in thread
From: Daniel Phillips @ 2005-09-04 4:22 UTC (permalink / raw)
To: Joel Becker
Cc: Andrew Morton, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
On Saturday 03 September 2005 23:06, Joel Becker wrote:
> dlmfs is *tiny*. The VFS interface is less than his claimed 500
> lines of savings.
It is 640 lines.
> The few VFS callbacks do nothing but call DLM
> functions. You'd have to replace this VFS glue with sysfs glue, and
> probably save very few lines of code.
> In addition, sysfs cannot support the dlmfs model. In dlmfs,
> mkdir(2) creates a directory representing a DLM domain and mknod(2)
> creates the user representation of a lock. sysfs doesn't support
> mkdir(2) or mknod(2) at all.
I said "configfs" in the email to which you are replying.
> More than mkdir() and mknod(), however, dlmfs uses open(2) to
> acquire locks from userspace. O_RDONLY acquires a shared read lock (PR
> in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a
> trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock
> is released via close(2). If a process dies, close(2) happens. In
> other words, ->release() handles all the cleanup for normal and abnormal
> termination.
>
> sysfs does not allow hooking into ->open() or ->release(). So
> this model, and the inherent lifetiming that comes with it, cannot be
> used.
Configfs has a per-item release method. Configfs has a group open method.
What is it that configfs can't do, or can't be made to do trivially?
> If dlmfs was changed to use a less intuitive model that fits
> sysfs, all the handling of lifetimes and cleanup would have to be added.
The model you came up with for dlmfs is beyond cute, it's downright clever.
Why mar that achievement by then failing to capitalize on the framework you
already have in configfs?
By the way, do you agree that dlmfs is too inefficient to be an effective way
of exporting your dlm api to user space, except for slow-path applications
like you have here?
Regards,
Daniel
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 4:22 ` [Linux-cluster] " Daniel Phillips
@ 2005-09-04 4:30 ` Joel Becker
2005-09-04 4:51 ` Daniel Phillips
2005-09-04 4:46 ` Andrew Morton
1 sibling, 1 reply; 106+ messages in thread
From: Joel Becker @ 2005-09-04 4:30 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrew Morton, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
On Sun, Sep 04, 2005 at 12:22:36AM -0400, Daniel Phillips wrote:
> It is 640 lines.
It's 450 without comments and blank lines. Please, don't tell
me that comments to help understanding are bloat.
> I said "configfs" in the email to which you are replying.
To wit:
> Daniel Phillips said:
> > Mark Fasheh said:
> > > as far as userspace dlm apis go, dlmfs already abstracts away a
> > > large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling
> > your
> > own?
You asked why dlmfs can't go into sysfs, and I responded.
Joel
--
"I don't want to achieve immortality through my work; I want to
achieve immortality through not dying."
- Woody Allen
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-04 4:30 ` Joel Becker
@ 2005-09-04 4:51 ` Daniel Phillips
2005-09-04 5:00 ` Joel Becker
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-04 4:51 UTC (permalink / raw)
To: Joel Becker; +Cc: Andrew Morton, ak, linux-cluster, linux-fsdevel, linux-kernel
On Sunday 04 September 2005 00:30, Joel Becker wrote:
> You asked why dlmfs can't go into sysfs, and I responded.
And you got me! In the heat of the moment I overlooked the fact that you and
Greg haven't agreed to the merge yet ;-)
Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the
same paradigm: drive the kernel logic from user-initiated vfs methods. You
already have nearly all the right methods in nearly all the right places.
Regards,
Daniel
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-04 4:51 ` Daniel Phillips
@ 2005-09-04 5:00 ` Joel Becker
2005-09-04 5:52 ` [Linux-cluster] " Daniel Phillips
0 siblings, 1 reply; 106+ messages in thread
From: Joel Becker @ 2005-09-04 5:00 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrew Morton, ak, linux-cluster, linux-fsdevel, linux-kernel
On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the
> same paradigm: drive the kernel logic from user-initiated vfs methods. You
> already have nearly all the right methods in nearly all the right places.
configfs, like sysfs, does not support ->open() or ->release()
callbacks. And it shouldn't. The point is to hide the complexity and
make it easier to plug into.
A client object should not ever have to know or care that it is
being controlled by a filesystem. It only knows that it has a tree of
items with attributes that can be set or shown.
Joel
--
"In a crisis, don't hide behind anything or anybody. They're going
to find you anyway."
- Paul "Bear" Bryant
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 5:00 ` Joel Becker
@ 2005-09-04 5:52 ` Daniel Phillips
2005-09-04 5:56 ` Joel Becker
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-04 5:52 UTC (permalink / raw)
To: Joel Becker
Cc: Andrew Morton, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
On Sunday 04 September 2005 01:00, Joel Becker wrote:
> On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> > Clearly, I ought to have asked why dlmfs can't be done by configfs. It
> > is the same paradigm: drive the kernel logic from user-initiated vfs
> > methods. You already have nearly all the right methods in nearly all the
> > right places.
>
> configfs, like sysfs, does not support ->open() or ->release()
> callbacks.
struct configfs_item_operations {
        void (*release)(struct config_item *);
        ssize_t (*show)(struct config_item *, struct attribute *, char *);
        ssize_t (*store)(struct config_item *, struct attribute *, const char *, size_t);
        int (*allow_link)(struct config_item *src, struct config_item *target);
        int (*drop_link)(struct config_item *src, struct config_item *target);
};

struct configfs_group_operations {
        struct config_item *(*make_item)(struct config_group *group, const char *name);
        struct config_group *(*make_group)(struct config_group *group, const char *name);
        int (*commit_item)(struct config_item *item);
        void (*drop_item)(struct config_group *group, struct config_item *item);
};
You do have ->release and ->make_item/group.
If I may hand you a more substantive argument: you don't support user-driven
creation of files in configfs, only directories. Dlmfs supports user-created
files. But you know, there isn't actually a good reason not to support
user-created files in configfs, as dlmfs demonstrates.
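For illustration, a lock domain hung off those group operations might
look like the sketch below. This is purely hypothetical glue, not OCFS2
code: the dlm_domain structure, domain_type, and the teardown hook are
invented names, and error handling is elided.

/* Hypothetical sketch using only the configfs_group_operations quoted
 * above.  mkdir(2) of "name" in the configfs tree would arrive at
 * make_group(); rmdir(2) would arrive at drop_item(). */
#include <linux/configfs.h>
#include <linux/slab.h>

struct dlm_domain {
        struct config_group group;
        /* lockspace state would follow */
};

static struct config_item_type domain_type;             /* definition elided */
static void domain_drop_item(struct config_group *group,
                             struct config_item *item); /* teardown, elided */

static struct config_group *domain_make_group(struct config_group *parent,
                                              const char *name)
{
        struct dlm_domain *dom = kzalloc(sizeof(*dom), GFP_KERNEL);

        if (!dom)
                return NULL;
        config_group_init_type_name(&dom->group, name, &domain_type);
        return &dom->group;
}

static struct configfs_group_operations domain_group_ops = {
        .make_group = domain_make_group,
        .drop_item  = domain_drop_item,
};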
Anyway, goodnight.
Regards,
Daniel
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 5:52 ` [Linux-cluster] " Daniel Phillips
@ 2005-09-04 5:56 ` Joel Becker
0 siblings, 0 replies; 106+ messages in thread
From: Joel Becker @ 2005-09-04 5:56 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrew Morton, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
On Sun, Sep 04, 2005 at 01:52:29AM -0400, Daniel Phillips wrote:
> You do have ->release and ->make_item/group.
->release is like kobject release. It's a free callback, not a
callback from close.
> If I may hand you a more substantive argument: you don't support user-driven
> creation of files in configfs, only directories. Dlmfs supports user-created
> files. But you know, there isn't actually a good reason not to support
> user-created files in configfs, as dlmfs demonstrates.
It is outside the domain of configfs. Just because it can be
done does not mean it should be. configfs isn't a "thing to create
files". It's an interface for creating kernel items. The actual
filesystem representation isn't the end, it's just the means.
Joel
--
"In the room the women come and go
Talking of Michaelangelo."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 4:22 ` [Linux-cluster] " Daniel Phillips
2005-09-04 4:30 ` Joel Becker
@ 2005-09-04 4:46 ` Andrew Morton
2005-09-04 4:58 ` Joel Becker
` (4 more replies)
1 sibling, 5 replies; 106+ messages in thread
From: Andrew Morton @ 2005-09-04 4:46 UTC (permalink / raw)
To: Daniel Phillips
Cc: Joel.Becker, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
Daniel Phillips <phillips@istop.com> wrote:
>
> The model you came up with for dlmfs is beyond cute, it's downright clever.
Actually I think it's rather sick. Taking O_NONBLOCK and making it a
lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
acquire a clustered filesystem lock". Not even close.
It would be much better to do something which explicitly and directly
expresses what you're trying to do rather than this strange "let's do this
because the names sound the same" thing.
What happens when we want to add some new primitive which has no posix-file
analog?
Waaaay too cute. Oh well, whatever.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 4:46 ` Andrew Morton
@ 2005-09-04 4:58 ` Joel Becker
2005-09-04 5:41 ` Andrew Morton
2005-09-04 6:10 ` Mark Fasheh
` (3 subsequent siblings)
4 siblings, 1 reply; 106+ messages in thread
From: Joel Becker @ 2005-09-04 4:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Daniel Phillips, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
So, you'd like a new flag name? That can be done.
> What happens when we want to add some new primitive which has no posix-file
> analog?
The point of dlmfs is not to express every primitive that the
DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS
locking scheme. Nor should it. The point isn't to use a filesystem
interface for programs that need all the flexibility and power of the
VMS DLM. The point is a simple system that programs needing the basic
operations can use. Even shell scripts.
Joel
--
"You must remember this:
A kiss is just a kiss,
A sigh is just a sigh.
The fundamental rules apply
As time goes by."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-04 4:58 ` Joel Becker
@ 2005-09-04 5:41 ` Andrew Morton
2005-09-04 5:49 ` Joel Becker
2005-09-05 4:30 ` David Teigland
0 siblings, 2 replies; 106+ messages in thread
From: Andrew Morton @ 2005-09-04 5:41 UTC (permalink / raw)
To: Joel Becker; +Cc: phillips, linux-cluster, linux-fsdevel, ak, linux-kernel
Joel Becker <Joel.Becker@oracle.com> wrote:
>
> > What happens when we want to add some new primitive which has no posix-file
> > analog?
>
> The point of dlmfs is not to express every primitive that the
> DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS
> locking scheme. Nor should it. The point isn't to use a filesystem
> interface for programs that need all the flexibility and power of the
> VMS DLM. The point is a simple system that programs needing the basic
> operations can use. Even shell scripts.
Are you saying that the posix-file lookalike interface provides access to
part of the functionality, but there are other APIs which are used to
access the rest of the functionality? If so, what is that interface, and
why cannot that interface offer access to 100% of the functionality, thus
making the posix-file tricks unnecessary?
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-04 5:41 ` Andrew Morton
@ 2005-09-04 5:49 ` Joel Becker
2005-09-05 4:30 ` David Teigland
1 sibling, 0 replies; 106+ messages in thread
From: Joel Becker @ 2005-09-04 5:49 UTC (permalink / raw)
To: linux clustering; +Cc: linux-fsdevel, phillips, ak, linux-kernel
On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote:
> Are you saying that the posix-file lookalike interface provides access to
> part of the functionality, but there are other APIs which are used to
> access the rest of the functionality? If so, what is that interface, and
> why cannot that interface offer access to 100% of the functionality, thus
> making the posix-file tricks unnecessary?
Currently, this is all the interface that the OCFS2 DLM
provides. But yes, if you wanted to provide the rest of the VMS
functionality (something that GFS2's DLM does), you'd need to use a more
concrete interface.
IMHO, it's worthwhile to have a simple interface, one already
used by mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc. This is an interface
that can be and is used even by shell scripts (we do this to test the DLM).
If you make it a C-library-only interface, you've just restricted the
subset of folks that can use it, while adding programming complexity.
I think that a simple fs-based interface can coexist with a more
complex one. FILE* doesn't give you the flexibility of read()/write(),
but I wouldn't remove it :-)
Joel
--
"In the beginning, the universe was created. This has made a lot
of people very angry, and is generally considered to have been a
bad move."
- Douglas Adams
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-04 5:41 ` Andrew Morton
2005-09-04 5:49 ` Joel Becker
@ 2005-09-05 4:30 ` David Teigland
2005-09-05 8:54 ` [Linux-cluster] " Andrew Morton
1 sibling, 1 reply; 106+ messages in thread
From: David Teigland @ 2005-09-05 4:30 UTC (permalink / raw)
To: akpm, Joel.Becker, ak; +Cc: linux-fsdevel, linux-cluster, linux-kernel
On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote:
> Joel Becker <Joel.Becker@oracle.com> wrote:
> >
> > > What happens when we want to add some new primitive which has no
> > > posix-file analog?
> >
> > The point of dlmfs is not to express every primitive that the
> > DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS
> > locking scheme. Nor should it. The point isn't to use a filesystem
> > interface for programs that need all the flexibility and power of the
> > VMS DLM. The point is a simple system that programs needing the basic
> > operations can use. Even shell scripts.
>
> Are you saying that the posix-file lookalike interface provides access to
> part of the functionality, but there are other APIs which are used to
> access the rest of the functionality? If so, what is that interface, and
> why cannot that interface offer access to 100% of the functionality, thus
> making the posix-file tricks unnecessary?
We're using our dlm quite a bit in user space and require the full dlm
API. It's difficult to export the full API through a pseudo fs like
dlmfs, so we've not found it a very practical approach. That said, it's a
nice idea and I'd be happy if someone could map a more complete dlm API
onto it.
We export our full dlm API through read/write/poll on a misc device. All
user space apps use the dlm through a library as you'd expect. The
library communicates with the dlm_device kernel module through
read/write/poll, and the dlm_device module talks with the actual dlm
(linux/drivers/dlm/device.c). If there's a better way to do this, via a
pseudo fs or not, we'd be pleased to try it.
Dave
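The shape of that read/write/poll scheme, as a heavily hedged sketch
(the request layout, opcode values, and completion record below are
invented for illustration; the real definitions live in the dlm_device
module):

#include <poll.h>
#include <string.h>
#include <unistd.h>

struct dlm_user_request {       /* hypothetical wire format */
        int  op;                /* e.g. 1 = lock, 2 = unlock */
        int  mode;              /* requested lock mode */
        char name[64];          /* resource name */
};

int lock_and_wait(int ls_fd, const char *resource, int mode)
{
        struct dlm_user_request req = { .op = 1, .mode = mode };
        struct pollfd pfd = { .fd = ls_fd, .events = POLLIN };
        char result[128];       /* hypothetical completion record */

        strncpy(req.name, resource, sizeof(req.name) - 1);
        if (write(ls_fd, &req, sizeof(req)) != sizeof(req))
                return -1;      /* request rejected */
        poll(&pfd, 1, -1);      /* wait for the completion callback */
        return read(ls_fd, result, sizeof(result)) > 0 ? 0 : -1;
}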
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-05 4:30 ` David Teigland
@ 2005-09-05 8:54 ` Andrew Morton
2005-09-05 9:24 ` David Teigland
0 siblings, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2005-09-05 8:54 UTC (permalink / raw)
To: David Teigland
Cc: Joel.Becker, ak, linux-cluster, linux-fsdevel, linux-kernel
David Teigland <teigland@redhat.com> wrote:
>
> We export our full dlm API through read/write/poll on a misc device.
>
inotify did that for a while, but we ended up going with a straight syscall
interface.
How fat is the dlm interface? ie: how many syscalls would it take?
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-05 8:54 ` [Linux-cluster] " Andrew Morton
@ 2005-09-05 9:24 ` David Teigland
2005-09-05 9:19 ` [Linux-cluster] " Andrew Morton
2005-09-05 19:11 ` kurt.hackel
0 siblings, 2 replies; 106+ messages in thread
From: David Teigland @ 2005-09-05 9:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-kernel, linux-cluster, Joel.Becker, ak
On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> David Teigland <teigland@redhat.com> wrote:
> >
> > We export our full dlm API through read/write/poll on a misc device.
> >
>
> inotify did that for a while, but we ended up going with a straight syscall
> interface.
>
> How fat is the dlm interface? ie: how many syscalls would it take?
Four functions:
create_lockspace()
release_lockspace()
lock()
unlock()
Dave
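As a rough sketch, those four calls might surface in a userspace
library as below. Every signature here is a placeholder for
illustration, not a proposed ABI; the actual parameter lists are
exactly the open question raised later in this thread.

/* Hypothetical wrappers only. */
int create_lockspace(const char *name, unsigned int flags);  /* -> handle/fd */
int release_lockspace(int lockspace, unsigned int flags);
int lock(int lockspace, const char *resource, int mode);     /* -> lock id */
int unlock(int lockspace, int lockid);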
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-05 9:24 ` David Teigland
@ 2005-09-05 9:19 ` Andrew Morton
2005-09-05 9:30 ` Daniel Phillips
` (2 more replies)
2005-09-05 19:11 ` kurt.hackel
1 sibling, 3 replies; 106+ messages in thread
From: Andrew Morton @ 2005-09-05 9:19 UTC (permalink / raw)
To: David Teigland
Cc: Joel.Becker, ak, linux-cluster, linux-fsdevel, linux-kernel
David Teigland <teigland@redhat.com> wrote:
>
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <teigland@redhat.com> wrote:
> > >
> > > We export our full dlm API through read/write/poll on a misc device.
> > >
> >
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> >
> > How fat is the dlm interface? ie: how many syscalls would it take?
>
> Four functions:
> create_lockspace()
> release_lockspace()
> lock()
> unlock()
Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
is likely to object if we reserve those slots.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-05 9:19 ` [Linux-cluster] " Andrew Morton
@ 2005-09-05 9:30 ` Daniel Phillips
2005-09-05 9:48 ` David Teigland
2005-09-05 12:21 ` Alan Cox
2 siblings, 0 replies; 106+ messages in thread
From: Daniel Phillips @ 2005-09-05 9:30 UTC (permalink / raw)
To: Andrew Morton
Cc: David Teigland, Joel.Becker, ak, linux-cluster, linux-fsdevel,
linux-kernel
On Monday 05 September 2005 05:19, Andrew Morton wrote:
> David Teigland <teigland@redhat.com> wrote:
> > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > > David Teigland <teigland@redhat.com> wrote:
> > > > We export our full dlm API through read/write/poll on a misc device.
> > >
> > > inotify did that for a while, but we ended up going with a straight
> > > syscall interface.
> > >
> > > How fat is the dlm interface? ie: how many syscalls would it take?
> >
> > Four functions:
> > create_lockspace()
> > release_lockspace()
> > lock()
> > unlock()
>
> Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
> is likely to object if we reserve those slots.
Better take a look at the actual parameter lists to those calls before jumping
to conclusions...
Regards,
Daniel
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-05 9:19 ` [Linux-cluster] " Andrew Morton
2005-09-05 9:30 ` Daniel Phillips
@ 2005-09-05 9:48 ` David Teigland
2005-09-05 12:21 ` Alan Cox
2 siblings, 0 replies; 106+ messages in thread
From: David Teigland @ 2005-09-05 9:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: Joel.Becker, ak, linux-cluster, linux-fsdevel, linux-kernel
On Mon, Sep 05, 2005 at 02:19:48AM -0700, Andrew Morton wrote:
> David Teigland <teigland@redhat.com> wrote:
> > Four functions:
> > create_lockspace()
> > release_lockspace()
> > lock()
> > unlock()
>
> Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
> is likely to object if we reserve those slots.
Patrick is really the expert in this area and he's off this week, but
based on what he's done with the misc device I don't see why there'd be
more than two or three parameters for any of these.
Dave
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-05 9:19 ` [Linux-cluster] " Andrew Morton
2005-09-05 9:30 ` Daniel Phillips
2005-09-05 9:48 ` David Teigland
@ 2005-09-05 12:21 ` Alan Cox
2005-09-05 19:53 ` [Linux-cluster] " Andrew Morton
2005-09-14 9:01 ` [Linux-cluster] " Patrick Caulfield
2 siblings, 2 replies; 106+ messages in thread
From: Alan Cox @ 2005-09-05 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: ak, linux-cluster, linux-fsdevel, Joel.Becker, linux-kernel
On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
> > create_lockspace()
> > release_lockspace()
> > lock()
> > unlock()
>
> Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
> is likely to object if we reserve those slots.
If the locks are not file descriptors then answer the following:
- How are they ref counted
- What are the cleanup semantics
- How do I pass a lock between processes (AF_UNIX sockets wont work now)
- How do I poll on a lock coming free.
- What are the semantics of lock ownership
- What rules apply for inheritance
- How do I access a lock across threads.
- What is the permission model.
- How do I attach audit to it
- How do I write SELinux rules for it
- How do I use mount to make namespaces appear in multiple vservers
and that's for starters...
Every so often someone decides that a deeply un-Unix interface with new
syscalls is a good idea. Every time, history proves them totally bonkers.
There are cases for new system calls, but this doesn't seem to be one of them.
Look at System V shared memory, look at System V IPC, and so on. You
can't use common interfaces on them, you can't select on them, you can't
sanely pass them by fd passing.
All our existing locking uses the following behaviour
fd = open(namespace, options)
fcntl(.. lock ...)
blah
flush
fcntl(.. unlock ...)
close
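Made concrete with POSIX byte-range locks, that pattern is just the
following (the file name is illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open("/var/lock/myresource", O_RDWR | O_CREAT, 0600);

        if (fd < 0)
                return 1;
        fcntl(fd, F_SETLKW, &fl);   /* blocking lock; F_SETLK = trylock */
        /* ... critical section ... */
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);    /* explicit unlock... */
        close(fd);                  /* ...though close() releases it too */
        return 0;
}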
Unfortunately some people here seem to have forgotten WHY we do things
this way.
1. The semantics of file descriptors are well understood by users and by
programs. That makes programming easier and keeps code size down
2. Everyone knows how close() works, including across fork
3. FD passing is an obscure art but understood and just works
4. poll() is a standard, understood interface
5. Ownership of files is a standard model
6. FD passing across fork/exec is controlled in a standard way
7. The semantics for threaded applications are defined
8. Permissions are a standard model
9. Audit just works with the same tools
10. SELinux just works with the same tools
11. I don't need specialist applications to see the system state (the
whole point of sysfs, yet someone wants to break it all again)
12. fcntl fd locking is a POSIX standard interface with precisely
defined semantics. Our extensions, including leases, are very powerful
13. And yes - fcntl fd locking supports mandatory locking too. That also
is standards-based with precise semantics.
Everyone understands how to use the existing locking operations. So if
you use the existing interfaces with some small extensions if neccessary
everyone understands how to use cluster locks. Isn't that neat....
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-05 12:21 ` Alan Cox
@ 2005-09-05 19:53 ` Andrew Morton
2005-09-05 23:20 ` Alan Cox
2005-09-14 9:01 ` [Linux-cluster] " Patrick Caulfield
1 sibling, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2005-09-05 19:53 UTC (permalink / raw)
To: Alan Cox
Cc: teigland, Joel.Becker, ak, linux-cluster, linux-fsdevel,
linux-kernel
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
> > > create_lockspace()
> > > release_lockspace()
> > > lock()
> > > unlock()
> >
> > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
> > is likely to object if we reserve those slots.
>
> If the locks are not file descriptors then answer the following:
>
> - How are they ref counted
> - What are the cleanup semantics
> - How do I pass a lock between processes (AF_UNIX sockets wont work now)
> - How do I poll on a lock coming free.
> - What are the semantics of lock ownership
> - What rules apply for inheritance
> - How do I access a lock across threads.
> - What is the permission model.
> - How do I attach audit to it
> - How do I write SELinux rules for it
> - How do I use mount to make namespaces appear in multiple vservers
>
> and thats for starters...
Return an fd from create_lockspace().
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-05 19:53 ` [Linux-cluster] " Andrew Morton
@ 2005-09-05 23:20 ` Alan Cox
2005-09-05 23:06 ` Andrew Morton
0 siblings, 1 reply; 106+ messages in thread
From: Alan Cox @ 2005-09-05 23:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: ak, linux-cluster, linux-fsdevel, Joel.Becker, linux-kernel
On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
> > - How are they ref counted
> > - What are the cleanup semantics
> > - How do I pass a lock between processes (AF_UNIX sockets wont work now)
> > - How do I poll on a lock coming free.
> > - What are the semantics of lock ownership
> > - What rules apply for inheritance
> > - How do I access a lock across threads.
> > - What is the permission model.
> > - How do I attach audit to it
> > - How do I write SELinux rules for it
> > - How do I use mount to make namespaces appear in multiple vservers
> >
> > and thats for starters...
>
> Return an fd from create_lockspace().
That only answers about four of the questions. The rest only come out if
create_lockspace behaves like a file system - in other words
create_lockspace is better known as either mkdir or mount.
It's certainly viable to make the lock/unlock functions take an fd; it's
just not clear why the current lock/unlock functions we have won't do
the job. Being able to extend the functionality to leases later on may
be very powerful indeed and will fit the existing API.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-05 23:20 ` Alan Cox
@ 2005-09-05 23:06 ` Andrew Morton
0 siblings, 0 replies; 106+ messages in thread
From: Andrew Morton @ 2005-09-05 23:06 UTC (permalink / raw)
To: Alan Cox; +Cc: ak, linux-cluster, linux-fsdevel, Joel.Becker, linux-kernel
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
> > > - How are they ref counted
> > > - What are the cleanup semantics
> > > - How do I pass a lock between processes (AF_UNIX sockets wont work now)
> > > - How do I poll on a lock coming free.
> > > - What are the semantics of lock ownership
> > > - What rules apply for inheritance
> > > - How do I access a lock across threads.
> > > - What is the permission model.
> > > - How do I attach audit to it
> > > - How do I write SELinux rules for it
> > > - How do I use mount to make namespaces appear in multiple vservers
> > >
> > > and thats for starters...
> >
> > Return an fd from create_lockspace().
>
> That only answers about four of the questions. The rest only come out if
> create_lockspace behaves like a file system - in other words
> create_lockspace is better known as either mkdir or mount.
But David said that "We export our full dlm API through read/write/poll on
a misc device.". That miscdevice will simply give us an fd. Hence my
suggestion that the miscdevice be done away with in favour of a dedicated
syscall which returns an fd.
What does a filesystem have to do with this?
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-05 12:21 ` Alan Cox
2005-09-05 19:53 ` [Linux-cluster] " Andrew Morton
@ 2005-09-14 9:01 ` Patrick Caulfield
1 sibling, 0 replies; 106+ messages in thread
From: Patrick Caulfield @ 2005-09-14 9:01 UTC (permalink / raw)
To: linux clustering
Cc: Andrew Morton, ak, linux-fsdevel, Joel.Becker, linux-kernel
I've just returned from holiday so I'm late to this discussion, so let me tell
you what we do now and why, and let's see what's wrong with it.
Currently the library create_lockspace() call returns an FD upon which all lock
operations happen. The FD is onto a misc device, one per lockspace, so if you
want lockspace protection it can happen at that level. There is no protection
applied to locks within a lockspace, nor do I think it's helpful to do so, to be
honest. Using a misc device limits you to <255 lockspaces depending on the other
uses of misc, but this is just for userland-visible lockspaces - it does not
affect GFS filesystems, for instance.
Lock/convert/unlock operations are done using write calls on that lockspace FD.
Callbacks are implemented using poll and read on the FD, read will return data
blocks (one per callback) as long as there are active callbacks to process. The
current read functionality behaves more like a SOCK_PACKET than a data stream
which some may not like but then you're going to need to know what you're
reading from the device anyway.
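A sketch of that poll/read callback loop (the record size and the
dispatch step are assumptions for illustration; the fd is assumed to
have been opened O_NONBLOCK so the drain loop terminates):

#include <poll.h>
#include <unistd.h>

#define CB_RECORD_SIZE 256      /* hypothetical fixed callback record */

static void dispatch(const char *record)
{
        /* decode lock status / value block, run the ast or bast */
}

void pump_callbacks(int ls_fd)
{
        struct pollfd pfd = { .fd = ls_fd, .events = POLLIN };
        char record[CB_RECORD_SIZE];

        while (poll(&pfd, 1, -1) > 0) {
                /* SOCK_PACKET-like: each read returns one whole callback */
                while (read(ls_fd, record, sizeof(record)) > 0)
                        dispatch(record);
        }
}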
ioctl/fcntl isn't really useful for DLM locks because you can't do asynchronous
operations on them - the lock has to succeed or fail in the one operation - if
you want a callback for completion (or blocking notification) you have to poll
the lockspace FD anyway and then you might as well go back to using read and
write because at least they are something of a matched pair. Something similar
applies, I think, to a syscall interface.
Another reason the existing fcntl interface isn't appropriate is that it's not
locking the same kind of thing. Current Unix fcntl calls lock byte ranges. DLM
locks arbitrary names and has a much richer list of lock modes. Adding another
fcntl just runs into the problems mentioned above.
The other reason we use read for callbacks is that there is information to be
passed back: lock status, value block and (possibly) query information.
While having an FD per lock sounds like a nice Unixy idea, I don't think it would
work very well in practice. Applications with hundreds or thousands of locks
(such as databases) would end up with huge pollfd structs to manage, and
while it helps the refcounting (currently the nastiest bit of the
dlm_device code), it removes the possibility of having persistent locks that exist
after the process exits - a handy feature that some people do use, though I
don't think it's in the currently submitted DLM code. One FD per lock also gives
each lock two handles: the lock ID used internally by the DLM and the FD used
externally by the application, which I think is a little confusing.
I don't think a dlmfs is useful, personally. The features you can export from it
are either minimal compared to the full DLM functionality (so you have to export
the rest by some other means anyway) or are going to be so un-filesystemlike as
to be very awkward to use. Doing lock operations in shell scripts is all very
cool but how often do you /really/ need to do that?
I'm not saying that what we have is perfect - far from it - but we have thought
about how this works, and what we came up with seems like a good compromise
for providing full DLM functionality to userspace using Unix features. But
we're very happy to listen to other ideas - and have been doing, I hope.
--
patrick
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-05 9:24 ` David Teigland
2005-09-05 9:19 ` [Linux-cluster] " Andrew Morton
@ 2005-09-05 19:11 ` kurt.hackel
1 sibling, 0 replies; 106+ messages in thread
From: kurt.hackel @ 2005-09-05 19:11 UTC (permalink / raw)
To: David Teigland
Cc: Andrew Morton, Joel.Becker, ak, linux-cluster, linux-fsdevel,
linux-kernel
On Mon, Sep 05, 2005 at 05:24:33PM +0800, David Teigland wrote:
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <teigland@redhat.com> wrote:
> > >
> > > We export our full dlm API through read/write/poll on a misc device.
> > >
> >
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> >
> > How fat is the dlm interface? ie: how many syscalls would it take?
>
> Four functions:
> create_lockspace()
> release_lockspace()
> lock()
> unlock()
FWIW, it looks like we can agree on the core interface. ocfs2_dlm
exports essentially the same functions:
dlm_register_domain()
dlm_unregister_domain()
dlmlock()
dlmunlock()
I also implemented dlm_migrate_lockres() to explicitly remaster a lock
on another node, but this isn't used by any callers today (except for
debugging purposes). There is also some wiring between the fs and the
dlm (eviction callbacks) to deal with some ordering issues between the
two layers, but these could go if we get stronger membership.
There are quite a few other functions in the "full" spec(1) that we
didn't even attempt, either because we didn't require direct
user<->kernel access or we just didn't need the function. As for the
rather thick set of parameters expected in dlm calls, we managed to get
dlmlock down to *ahem* eight, and the rest are fairly slim.
Looking at the misc device that gfs uses, it seems like there is a pretty
much complete interface to the same calls you have in kernel, validated
on the write() calls to the misc device. With dlmfs, we were seeking to
lock down and simplify user access by using standard ast/bast/unlockast
calls, using a file descriptor as an opaque token for a single lock,
letting the vfs lifetime on this fd help with abnormal termination, etc.
I think both the misc device and dlmfs are helpful and not necessarily
mutually exclusive, and probably both are better approaches than
exporting everything via loads of syscalls (which seems to be the
VMS/opendlm model).
-kurt
1. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
Kurt C. Hackel
Oracle
kurt.hackel@oracle.com
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 4:46 ` Andrew Morton
2005-09-04 4:58 ` Joel Becker
@ 2005-09-04 6:10 ` Mark Fasheh
2005-09-04 7:23 ` Andrew Morton
2005-09-04 6:40 ` [Linux-cluster] " Daniel Phillips
` (2 subsequent siblings)
4 siblings, 1 reply; 106+ messages in thread
From: Mark Fasheh @ 2005-09-04 6:10 UTC (permalink / raw)
To: Andrew Morton
Cc: Daniel Phillips, Joel.Becker, linux-cluster, wim.coekaerts,
linux-fsdevel, ak, linux-kernel
On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock". Not even close.
What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
is a bit unfortunate, but really it just needs a bit to express that -
nobody over here cares what it's called.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 6:10 ` Mark Fasheh
@ 2005-09-04 7:23 ` Andrew Morton
2005-09-04 8:17 ` Mark Fasheh
0 siblings, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2005-09-04 7:23 UTC (permalink / raw)
To: Mark Fasheh
Cc: phillips, Joel.Becker, linux-cluster, wim.coekaerts,
linux-fsdevel, ak, linux-kernel
Mark Fasheh <mark.fasheh@oracle.com> wrote:
>
> On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> > Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> > lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> > me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> > acquire a clustered filesystem lock". Not even close.
>
> What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> is a bit unfortunate, but really it just needs a bit to express that -
> nobody over here cares what it's called.
The whole idea of reinterpreting file operations to mean something utterly
different just seems inappropriate to me.
You get a lot of goodies when using a filesystem - the ability for
unrelated processes to look things up, resource release on exit(), etc. If
those features are valuable in the ocfs2 context then fine. But I'd have
thought that it would be saner and more extensible to add new syscalls
(perhaps taking fd's) rather than overloading the open() mode in this
manner.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 7:23 ` Andrew Morton
@ 2005-09-04 8:17 ` Mark Fasheh
2005-09-04 8:37 ` Andrew Morton
0 siblings, 1 reply; 106+ messages in thread
From: Mark Fasheh @ 2005-09-04 8:17 UTC (permalink / raw)
To: Andrew Morton
Cc: phillips, Joel.Becker, linux-cluster, wim.coekaerts,
linux-fsdevel, ak, linux-kernel
On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote:
> > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> > is a bit unfortunate, but really it just needs a bit to express that -
> > nobody over here cares what it's called.
>
> The whole idea of reinterpreting file operations to mean something utterly
> different just seems inappropriate to me.
Putting aside trylock for a minute, I'm not sure how utterly different the
operations are. You create a lock resource by creating a file named after
it. You get a lock (fd) at read or write level on the resource by calling
open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR).
Now that we've got an fd, lock value blocks are naturally represented as
file data which can be read(2) or written(2).
Close(2) drops the lock.
A really trivial usage example from shell:
node1$ echo "hello world" > mylock
node2$ cat mylock
hello world
I could always give a more useful one after I get some sleep :)
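The same exchange in C, under the semantics described above (the mount
point and lock name are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char lvb[64];
        ssize_t n;
        int fd = open("/dlm/mydomain/mylock", O_RDWR);  /* exclusive lock */

        if (fd < 0)
                return 1;
        write(fd, "hello world\n", 12);     /* store the lock value block */
        close(fd);                          /* drop the lock */

        fd = open("/dlm/mydomain/mylock", O_RDONLY);    /* shared lock */
        if (fd < 0)
                return 1;
        n = read(fd, lvb, sizeof(lvb));     /* read the LVB back */
        if (n > 0)
                fwrite(lvb, 1, n, stdout);
        close(fd);
        return 0;
}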
> You get a lot of goodies when using a filesystem - the ability for
> unrelated processes to look things up, resource release on exit(), etc. If
> those features are valuable in the ocfs2 context then fine.
Right, they certainly are and I think Joel, in another e-mail on this
thread, explained well the advantages of using a filesystem.
> But I'd have thought that it would be saner and more extensible to add new
> syscalls (perhaps taking fd's) rather than overloading the open() mode in
> this manner.
The idea behind dlmfs was to very simply export a small set of cluster dlm
operations to userspace. Given that goal, I felt that a whole set of system
calls would have been overkill. That said, I think perhaps I should clarify
that I don't intend dlmfs to become _the_ userspace dlm api, just a simple
and (imho) intuitive one which could be trivially accessed from any software
which just knows how to read and write files.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: Re: GFS, what's remaining
2005-09-04 8:17 ` Mark Fasheh
@ 2005-09-04 8:37 ` Andrew Morton
0 siblings, 0 replies; 106+ messages in thread
From: Andrew Morton @ 2005-09-04 8:37 UTC (permalink / raw)
To: Mark Fasheh
Cc: phillips, linux-cluster, linux-fsdevel, linux-kernel, ak,
Joel.Becker
Mark Fasheh <mark.fasheh@oracle.com> wrote:
>
> On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote:
> > > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> > > is a bit unfortunate, but really it just needs a bit to express that -
> > > nobody over here cares what it's called.
> >
> > The whole idea of reinterpreting file operations to mean something utterly
> > different just seems inappropriate to me.
> Putting aside trylock for a minute, I'm not sure how utterly different the
> operations are. You create a lock resource by creating a file named after
> it. You get a lock (fd) at read or write level on the resource by calling
> open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR).
> Now that we've got an fd, lock value blocks are naturally represented as
> file data which can be read(2) or written(2).
> Close(2) drops the lock.
>
> A really trivial usage example from shell:
>
> node1$ echo "hello world" > mylock
> node2$ cat mylock
> hello world
>
> I could always give a more useful one after I get some sleep :)
It isn't extensible though. One couldn't retain this approach while adding
(random cfs ignorance exposure) upgrade-read, downgrade-write,
query-for-various-runtime-stats, priority modification, whatever.
> > You get a lot of goodies when using a filesystem - the ability for
> > unrelated processes to look things up, resource release on exit(), etc. If
> > those features are valuable in the ocfs2 context then fine.
> Right, they certainly are and I think Joel, in another e-mail on this
> thread, explained well the advantages of using a filesystem.
>
> > But I'd have thought that it would be saner and more extensible to add new
> > syscalls (perhaps taking fd's) rather than overloading the open() mode in
> > this manner.
> The idea behind dlmfs was to very simply export a small set of cluster dlm
> operations to userspace. Given that goal, I felt that a whole set of system
> calls would have been overkill. That said, I think perhaps I should clarify
> that I don't intend dlmfs to become _the_ userspace dlm api, just a simple
> and (imho) intuitive one which could be trivially accessed from any software
> which just knows how to read and write files.
Well, as I say: making it a filesystem is superficially attractive, but
once you've built a super-dooper enterprise-grade infrastructure on top of
it all, nobody's going to touch the fs interface by hand, and you end up
wondering why it's there, adding baggage.
Not that I'm questioning the fs interface! It has useful permission
management, monitoring and resource releasing characteristics. I'm
questioning the open() tricks. I guess from Joel's tiny description, the
filesystem's interpretation of mknod and mkdir look sensible enough.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 4:46 ` Andrew Morton
2005-09-04 4:58 ` Joel Becker
2005-09-04 6:10 ` Mark Fasheh
@ 2005-09-04 6:40 ` Daniel Phillips
2005-09-04 7:28 ` Andrew Morton
2005-09-04 7:12 ` Hua Zhong
2005-09-04 8:37 ` Alan Cox
4 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-04 6:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Joel.Becker, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
On Sunday 04 September 2005 00:46, Andrew Morton wrote:
> Daniel Phillips <phillips@istop.com> wrote:
> > The model you came up with for dlmfs is beyond cute, it's downright
> > clever.
>
> Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock". Not even close.
Now, I see the ocfs2 guys are all ready to back down on this one, but I will
at least argue weakly in favor.
Sick is a nice word for it, but it is actually not that far off. Normally,
this fs will acquire a lock whenever the user creates a virtual file and the
create will block until the global lock arrives. With O_NONBLOCK, it will
return, erm... ETXTBSY (!) immediately. Is that not what O_NONBLOCK is
supposed to accomplish?
> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
>
> What happens when we want to add some new primitive which has no posix-file
> analog?
>
> Waaaay too cute. Oh well, whatever.
The explicit way is syscalls or a set of ioctls, which he already has the
makings of. If there is going to be a userspace api, I would hope it looks
more like the contents of userdlm.c than the traditional Vaxcluster API,
which sucks beyond belief.
Another explicit way is to do it with a whole set of virtual attributes
instead of just a single file trying to capture the whole model. That is
really unappealing, but I am afraid that is exactly what a whole lot of
sysfs/configfs usage is going to end up looking like.
But more to the point: we have no urgent need for a userspace dlm api at the
moment. Nothing will break if we just put that issue off for a few months,
quite the contrary.
If the only user is their tools I would say let it go ahead and be cute, even
sickeningly so. It is not supposed to be a general dlm api, at least that is
my understanding. It is just supposed to be an interface for their tools.
Of course it would help to know exactly how those tools use it. Too sleepy
to find out tonight...
Regards,
Daniel
* Re: Re: GFS, what's remaining
2005-09-04 6:40 ` [Linux-cluster] " Daniel Phillips
@ 2005-09-04 7:28 ` Andrew Morton
2005-09-04 8:01 ` [Linux-cluster] " Joel Becker
2005-09-04 19:51 ` Daniel Phillips
0 siblings, 2 replies; 106+ messages in thread
From: Andrew Morton @ 2005-09-04 7:28 UTC (permalink / raw)
To: Daniel Phillips
Cc: Joel.Becker, linux-cluster, linux-fsdevel, linux-kernel, ak
Daniel Phillips <phillips@istop.com> wrote:
>
> If the only user is their tools I would say let it go ahead and be cute, even
> sickeningly so. It is not supposed to be a general dlm api, at least that is
> my understanding. It is just supposed to be an interface for their tools.
> Of course it would help to know exactly how those tools use it.
Well I'm not saying "don't do this". I'm saying "eww" and "why?".
If there is already a richer interface into all this code (such as a
syscall one) and it's feasible to migrate the open() tricksies to that API
in the future if it all comes unstuck then OK. That's why I asked (thus
far unsuccessfully):
Are you saying that the posix-file lookalike interface provides
access to part of the functionality, but there are other APIs which are
used to access the rest of the functionality? If so, what is that
interface, and why cannot that interface offer access to 100% of the
functionality, thus making the posix-file tricks unnecessary?
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 7:28 ` Andrew Morton
@ 2005-09-04 8:01 ` Joel Becker
2005-09-04 8:18 ` Andrew Morton
2005-09-04 19:51 ` Daniel Phillips
1 sibling, 1 reply; 106+ messages in thread
From: Joel Becker @ 2005-09-04 8:01 UTC (permalink / raw)
To: Andrew Morton
Cc: Daniel Phillips, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.
> That's why I asked (thus far unsuccessfully):
I personally was under the impression that "syscalls are not
to be added". I'm also wary of the effort required to hook into process
exit. Not to mention all the lifetiming that has to be written again.
On top of that, we lose our cute ability to shell script it. We
find this very useful in testing, and think others would in practice.
> Are you saying that the posix-file lookalike interface provides
> access to part of the functionality, but there are other APIs which are
> used to access the rest of the functionality? If so, what is that
> interface, and why cannot that interface offer access to 100% of the
> functionality, thus making the posix-file tricks unnecessary?
I thought I stated this in my other email. We're not intending
to extend dlmfs. It pretty much covers the simple DLM usage required of
a simple interface. The OCFS2 DLM does not provide any other
functionality.
If the OCFS2 DLM grew more functionality, or you consider the
GFS2 DLM that already has it (and a less intuitive interface via sysfs
IIRC), I would contend that dlmfs still has a place. It's simple to use
and understand, and it's usable from shell scripts and other simple
code.
Joel
--
"The first thing we do, let's kill all the lawyers."
-Henry VI, IV:ii
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 8:01 ` [Linux-cluster] " Joel Becker
@ 2005-09-04 8:18 ` Andrew Morton
2005-09-04 9:11 ` Joel Becker
0 siblings, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2005-09-04 8:18 UTC (permalink / raw)
To: Joel Becker
Cc: phillips, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
Joel Becker <Joel.Becker@oracle.com> wrote:
>
> On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote:
> > If there is already a richer interface into all this code (such as a
> > syscall one) and it's feasible to migrate the open() tricksies to that API
> > in the future if it all comes unstuck then OK.
> > That's why I asked (thus far unsuccessfully):
>
> I personally was under the impression that "syscalls are not
> to be added".
We add syscalls all the time. Whichever user<->kernel API is considered to
be most appropriate, use it.
> I'm also wary of the effort required to hook into process
> exit.
I'm not questioning the use of a filesystem. I'm questioning this
overloading of normal filesystem system calls. For example (and this is
just an example! there's also mknod, mkdir, O_RDWR, O_EXCL...) it would be
more usual to do
fd = open("/sys/whatever", ...);
err = sys_dlm_trylock(fd);
I guess your current implementation prevents /sys/whatever from ever
appearing if the trylock failed. Dunno if that's valuable.
> Not to mention all the lifetiming that has to be written again.
> On top of that, we lose our cute ability to shell script it. We
> find this very useful in testing, and think others would in practice.
>
> > Are you saying that the posix-file lookalike interface provides
> > access to part of the functionality, but there are other APIs which are
> > used to access the rest of the functionality? If so, what is that
> > interface, and why cannot that interface offer access to 100% of the
> > functionality, thus making the posix-file tricks unnecessary?
>
> I thought I stated this in my other email. We're not intending
> to extend dlmfs.
Famous last words ;)
> It pretty much covers the simple DLM usage required of
> a simple interface. The OCFS2 DLM does not provide any other
> functionality.
> If the OCFS2 DLM grew more functionality, or you consider the
> GFS2 DLM that already has it (and a less intuitive interface via sysfs
> IIRC), I would contend that dlmfs still has a place. It's simple to use
> and understand, and it's usable from shell scripts and other simple
> code.
(wonders how to do O_NONBLOCK from a script)
I don't buy the general "fs is nice because we can script it" argument,
really. You can just write a few simple applications which provide access
to the syscalls (or the fs!) and then write scripts around those.
Yes, you suddenly need to get a little tarball into users' hands and that's
a hassle. And I sometimes think we let this hassle guide kernel interfaces
(mutters something about /sbin/hotplug), and that's sad.
* Re: Re: GFS, what's remaining
2005-09-04 8:18 ` Andrew Morton
@ 2005-09-04 9:11 ` Joel Becker
2005-09-04 9:18 ` [Linux-cluster] " Andrew Morton
2005-09-04 18:03 ` [Linux-cluster] " Hua Zhong
0 siblings, 2 replies; 106+ messages in thread
From: Joel Becker @ 2005-09-04 9:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: phillips, linux-cluster, linux-fsdevel, ak, linux-kernel
On Sun, Sep 04, 2005 at 01:18:05AM -0700, Andrew Morton wrote:
> > I thought I stated this in my other email. We're not intending
> > to extend dlmfs.
>
> Famous last words ;)
Heh, of course :-)
> I don't buy the general "fs is nice because we can script it" argument,
> really. You can just write a few simple applications which provide access
> to the syscalls (or the fs!) and then write scripts around those.
I can't see how that works easily. I'm not worried about a
tarball (eventually Red Hat and SuSE and Debian would have it). I'm
thinking about this shell:
exec 7</dlm/domainxxxx/lock1
do stuff
exec 7</dev/null
If someone kills the shell while stuff is running, the lock is unlocked
because fd 7 is closed. However, if you have an application to do the
locking:
takelock domainxxx lock1
do stuff
droplock domainxxx lock1
When someone kills the shell, the lock is leaked, because droplock isn't
called. And SEGV/QUIT/-9 (especially -9, folks love it too much) are
handled by the first example but not by the second.
Joel
--
"Same dancers in the same old shoes.
You get too careful with the steps you choose.
You don't care about winning but you don't want to lose
After the thrill is gone."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 9:11 ` Joel Becker
@ 2005-09-04 9:18 ` Andrew Morton
2005-09-04 9:39 ` Joel Becker
2005-09-04 18:03 ` [Linux-cluster] " Hua Zhong
1 sibling, 1 reply; 106+ messages in thread
From: Andrew Morton @ 2005-09-04 9:18 UTC (permalink / raw)
To: Joel Becker
Cc: phillips, linux-cluster, wim.coekaerts, linux-fsdevel, ak,
linux-kernel
Joel Becker <Joel.Becker@oracle.com> wrote:
>
> I can't see how that works easily. I'm not worried about a
> tarball (eventually Red Hat and SuSE and Debian would have it). I'm
> thinking about this shell:
>
> exec 7</dlm/domainxxxx/lock1
> do stuff
> exec 7</dev/null
>
> If someone kills the shell while stuff is running, the lock is unlocked
> because fd 7 is closed. However, if you have an application to do the
> locking:
>
> takelock domainxxx lock1
> do stuff
> droplock domainxxx lock1
>
> When someone kills the shell, the lock is leaked, because droplock isn't
> called. And SEGV/QUIT/-9 (especially -9, folks love it too much) are
> handled by the first example but not by the second.
take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"
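Such a wrapper is only a page of C if the lock really is a dlmfs file. A
hypothetical sketch (argument convention simplified from the one-liner
above), assuming the dlmfs fd semantics from Joel's example, i.e. the
kernel drops the lock when the fd goes away, even on kill -9:
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* usage: take-and-drop-lock /dlm/domainxxx/lock1 cmd [args...] */
    int main(int argc, char **argv)
    {
        int fd, status;
        pid_t pid;

        if (argc < 3) {
            fprintf(stderr, "usage: %s lockfile cmd [args...]\n", argv[0]);
            return 2;
        }
        fd = open(argv[1], O_CREAT | O_RDWR, 0600); /* blocks until granted */
        if (fd < 0) {
            perror("open");
            return 1;
        }
        pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            execvp(argv[2], argv + 2);  /* run "do stuff" holding the lock */
            perror("execvp");
            _exit(127);
        }
        waitpid(pid, &status, 0);
        close(fd);  /* explicit, but exit (or kill -9) drops it too */
        return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
    }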
* Re: Re: GFS, what's remaining
2005-09-04 9:18 ` [Linux-cluster] " Andrew Morton
@ 2005-09-04 9:39 ` Joel Becker
0 siblings, 0 replies; 106+ messages in thread
From: Joel Becker @ 2005-09-04 9:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: phillips, linux-cluster, linux-fsdevel, ak, linux-kernel
On Sun, Sep 04, 2005 at 02:18:36AM -0700, Andrew Morton wrote:
> take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"
Ahh, but then you have to have lots of scripts somewhere in
the path, or do massive inline scripts, especially if you want to take
another lock in there somewhere.
It's doable, but it's nowhere near as easy. :-)
Joel
--
"I always thought the hardest questions were those I could not answer.
Now I know they are the ones I can never ask."
- Charlie Watkins
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: [Linux-cluster] Re: GFS, what's remaining
2005-09-04 9:11 ` Joel Becker
2005-09-04 9:18 ` [Linux-cluster] " Andrew Morton
@ 2005-09-04 18:03 ` Hua Zhong
1 sibling, 0 replies; 106+ messages in thread
From: Hua Zhong @ 2005-09-04 18:03 UTC (permalink / raw)
To: linux clustering, Andrew Morton, phillips, wim.coekaerts,
linux-fsdevel, ak, linux-kernel
> takelock domainxxx lock1
> do stuff
> droplock domainxxx lock1
>
> When someone kills the shell, the lock is leaked, because droplock isn't
> called.
Why not open the lock resource (or the lock space) as the file, instead of
individual locks? It then looks like this:
open lock space file
takelock lockresource lock1
do stuff
droplock lockresource lock1
close lock space file
Then if you are killed, the ->release of the lock space file should take
care of cleaning up all the locks.
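No such interface exists today, but the shape would be something like this
sketch; the /dev/dlm path and the DLM_IOC_* ioctls are entirely
hypothetical, the point being only that every lock hangs off one fd whose
->release cleans up:
    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Hypothetical ioctl ABI -- not a real kernel interface. */
    struct dlm_lock_req {
        char name[64];  /* lock resource name */
        int  mode;      /* e.g. exclusive */
    };
    #define DLM_IOC_LOCK    _IOW('D', 1, struct dlm_lock_req)
    #define DLM_IOC_UNLOCK  _IOW('D', 2, struct dlm_lock_req)

    int main(void)
    {
        struct dlm_lock_req req = { .mode = 1 };
        int ls = open("/dev/dlm/domainxxx", O_RDWR); /* the lock space */

        strcpy(req.name, "lock1");
        ioctl(ls, DLM_IOC_LOCK, &req);
        /* ... do stuff ... */
        ioctl(ls, DLM_IOC_UNLOCK, &req);
        close(ls);  /* ->release drops anything still held */
        return 0;
    }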
* Re: Re: GFS, what's remaining
2005-09-04 7:28 ` Andrew Morton
2005-09-04 8:01 ` [Linux-cluster] " Joel Becker
@ 2005-09-04 19:51 ` Daniel Phillips
1 sibling, 0 replies; 106+ messages in thread
From: Daniel Phillips @ 2005-09-04 19:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: Joel.Becker, linux-cluster, linux-fsdevel, linux-kernel, ak
On Sunday 04 September 2005 03:28, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK. That's why I asked (thus
> far unsuccessfully):
>
> Are you saying that the posix-file lookalike interface provides
> access to part of the functionality, but there are other APIs which are
> used to access the rest of the functionality? If so, what is that
> interface, and why cannot that interface offer access to 100% of the
> functionality, thus making the posix-file tricks unnecessary?
There is no such interface at the moment, nor is one needed in the immediate
future. Let's look at the arguments for exporting a dlm to userspace:
1) Since we already have a dlm in kernel, why not just export that and save
100K of userspace library? Answer: because we don't want userspace-only
dlm features bulking up the kernel. Answer #2: the extra syscalls and
interface baggage serve no useful purpose.
2) But we need to take locks in the same lockspaces as the kernel dlm(s)!
Answer: only support tools need to do that. A cut-down locking api is
entirely appropriate for this.
3) But the kernel dlm is the only one we have! Answer: easily fixed, a
simple matter of coding. But please bear in mind that dlm-style
synchronization is probably a bad idea for most cluster applications,
particularly ones that already do their synchronization via sockets.
In other words, exporting the full dlm api is a red herring. It has nothing
to do with getting cluster filesystems up and running. It is really just
marketing: it sounds like a great thing for userspace to get a dlm "for
free", but it isn't free, it contributes to kernel bloat and it isn't even
the most efficient way to do it.
If after considering that, we _still_ want to export a dlm api from kernel,
then can we please take the necessary time and get it right? The full api
requires not only syscall-style elements, but asynchronous events as well,
similar to aio. I do not think anybody has a good answer to this today, nor
do we even need it to begin porting applications to cluster filesystems.
Oracle guys: what is the distributed locking API for RAC? Is the RAC team
waiting with bated breath to adopt your kernel-based dlm? If not, why not?
Regards,
Daniel
* Re: Re: GFS, what's remaining
2005-09-04 4:46 ` Andrew Morton
` (2 preceding siblings ...)
2005-09-04 6:40 ` [Linux-cluster] " Daniel Phillips
@ 2005-09-04 7:12 ` Hua Zhong
2005-09-04 8:37 ` Alan Cox
4 siblings, 0 replies; 106+ messages in thread
From: Hua Zhong @ 2005-09-04 7:12 UTC (permalink / raw)
To: linux clustering
Cc: linux-fsdevel, ak, Daniel Phillips, Joel.Becker, linux-kernel
On 9/3/05, Andrew Morton <akpm@osdl.org> wrote:
>
> Daniel Phillips <phillips@istop.com> wrote:
> >
> > The model you came up with for dlmfs is beyond cute, it's downright
> clever.
>
> Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock". Not even close.
No, it's "open this file in nonblocking mode" vs "attempt to acquire a lock
in nonblocking mode". I think it makes perfect sense to use this flag.
Of course, whether or not using open as a means to acquire a lock (in
either blocking or nonblocking mode) is efficient is another matter.
* Re: Re: GFS, what's remaining
2005-09-04 4:46 ` Andrew Morton
` (3 preceding siblings ...)
2005-09-04 7:12 ` Hua Zhong
@ 2005-09-04 8:37 ` Alan Cox
2005-09-05 23:32 ` Joel Becker
4 siblings, 1 reply; 106+ messages in thread
From: Alan Cox @ 2005-09-04 8:37 UTC (permalink / raw)
To: Andrew Morton
Cc: Daniel Phillips, linux-cluster, linux-fsdevel, linux-kernel, ak,
Joel.Becker
On Sad, 2005-09-03 at 21:46 -0700, Andrew Morton wrote:
> Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock". Not even close.
The semantics of O_NONBLOCK on many other devices are "trylock"
semantics. OSS audio has those semantics for example, as do regular
files in the presence of SYS5 mandatory locks. While the latter is "try
lock, do operation and then drop lock" the drivers using O_NDELAY are
very definitely providing trylock semantics.
I am curious why a lock manager uses open to implement its locking
semantics rather than using the locking API (POSIX locks etc) however.
Alan
* Re: Re: GFS, what's remaining
2005-09-04 8:37 ` Alan Cox
@ 2005-09-05 23:32 ` Joel Becker
0 siblings, 0 replies; 106+ messages in thread
From: Joel Becker @ 2005-09-05 23:32 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Daniel Phillips, linux-cluster, linux-fsdevel,
linux-kernel, ak
On Sun, Sep 04, 2005 at 09:37:15AM +0100, Alan Cox wrote:
> I am curious why a lock manager uses open to implement its locking
> semantics rather than using the locking API (POSIX locks etc) however.
Because it is simple (how do you fcntl(2) from a shell fd?), has no
ranges (what do you do with ranges passed in to fcntl(2) when you don't
support them?), and has a well-known fork(2)/exec(2) pattern. fcntl(2)
has a known but less intuitive fork(2) pattern.
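For comparison, the fcntl(2) pattern under discussion is roughly the
following; a minimal sketch against an ordinary file, with the caveats in
the comments being the less intuitive part:
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,  /* 0 = whole file; ranges exist anyway */
        };
        int fd = open("/tmp/lockfile", O_CREAT | O_RDWR, 0600);

        if (fd < 0)
            return 1;
        if (fcntl(fd, F_SETLK, &fl) < 0) {  /* trylock; F_SETLKW blocks */
            perror("fcntl");                /* EACCES/EAGAIN if held */
            return 1;
        }
        /* The lock is owned by the process, not the fd: it is not
         * inherited across fork(), and it is dropped as soon as the
         * process closes *any* fd referring to the file. */
        close(fd);
        return 0;
    }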
The real reason, though, is that we never considered fcntl(2).
We could never think of a case when a process wanted a lock fd open but
not locked. At least, that's my recollection. Mark might have more to
comment.
Joel
--
"In the room the women come and go
Talking of Michelangelo."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: GFS, what's remaining
2005-09-02 21:17 ` Andi Kleen
2005-09-02 23:03 ` Bryan Henderson
2005-09-03 0:16 ` Mark Fasheh
@ 2005-09-03 5:57 ` Daniel Phillips
2005-09-05 14:14 ` Lars Marowsky-Bree
2005-09-03 7:06 ` GFS, what's remaining Wim Coekaerts
2005-09-06 12:55 ` Suparna Bhattacharya
4 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-03 5:57 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-fsdevel, linux clustering, linux-kernel
On Friday 02 September 2005 17:17, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.
The only current users of dlms are cluster filesystems. There are zero users
of the userspace dlm api. Therefore, the (g)dlm userspace interface actually
has nothing to do with the needs of gfs. It should be taken out of the gfs
patch and merged later, when or if user space applications emerge that need
it. Maybe in the meantime it will be possible to come up with a userspace
dlm api that isn't completely repulsive.
Also, note that the only reason the two current dlms are in-kernel is because
it supposedly cuts down on userspace-kernel communication with the cluster
filesystems. Then why should a userspace application bother with an
awkward interface to an in-kernel dlm? This is obviously suboptimal. Why
not have a userspace dlm for userspace apps, if indeed there are any
userspace apps that would need to use dlm-style synchronization instead of
more typical socket-based synchronization, or Posix locking, which is already
exposed via a standard api?
There is actually nothing wrong with having multiple, completely different
dlms active at the same time. There is no urgent need to merge them into the
one true dlm. It would be a lot better to let them evolve separately and
pick the winner a year or two from now. Just think of the dlm as part of the
cfs until then.
What does have to be resolved is a common API for node management. It is not
just cluster filesystems and their lock managers that have to interface to
node management. Below the filesystem layer, cluster block devices and
cluster volume management need to be coordinated by the same system, and
above the filesystem layer, applications also need to be hooked into it.
This work is, in a word, incomplete.
Regards,
Daniel
* Re: GFS, what's remaining
2005-09-03 5:57 ` Daniel Phillips
@ 2005-09-05 14:14 ` Lars Marowsky-Bree
2005-09-05 15:49 ` Daniel Phillips
0 siblings, 1 reply; 106+ messages in thread
From: Lars Marowsky-Bree @ 2005-09-05 14:14 UTC (permalink / raw)
To: Daniel Phillips, Andi Kleen
Cc: akpm, linux-fsdevel, linux clustering, linux-kernel
On 2005-09-03T01:57:31, Daniel Phillips <phillips@istop.com> wrote:
> The only current users of dlms are cluster filesystems. There are zero users
> of the userspace dlm api.
That is incorrect, and you're contradicting yourself here:
> What does have to be resolved is a common API for node management. It is not
> just cluster filesystems and their lock managers that have to interface to
> node management. Below the filesystem layer, cluster block devices and
> cluster volume management need to be coordinated by the same system, and
> above the filesystem layer, applications also need to be hooked into it.
> This work is, in a word, incomplete.
The Cluster Volume Management of LVM2 for example _does_ use simple
cluster-wide locks, and some OCFS2 scripts, I seem to recall, do too.
(EVMS2 in cluster-mode uses a verrry simple locking scheme which is
basically operated by the failover software and thus uses a different
model.)
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
-- Charles Darwin
* Re: GFS, what's remaining
2005-09-05 14:14 ` Lars Marowsky-Bree
@ 2005-09-05 15:49 ` Daniel Phillips
2005-09-05 16:18 ` Dmitry Torokhov
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-05 15:49 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: Andi Kleen, linux clustering, akpm, linux-fsdevel, linux-kernel
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> On 2005-09-03T01:57:31, Daniel Phillips <phillips@istop.com> wrote:
> > The only current users of dlms are cluster filesystems. There are zero
> > users of the userspace dlm api.
>
> That is incorrect...
Application users Lars, sorry if I did not make that clear. The issue is
whether we need to export an all-singing-all-dancing dlm api from kernel to
userspace today, or whether we can afford to take the necessary time to get
it right while application writers take their time to have a good think about
whether they even need it.
> ...and you're contradicting yourself here:
How so? Above talks about dlm, below talks about cluster membership.
> > What does have to be resolved is a common API for node management. It is
> > not just cluster filesystems and their lock managers that have to
> > interface to node management. Below the filesystem layer, cluster block
> > devices and cluster volume management need to be coordinated by the same
> > system, and above the filesystem layer, applications also need to be
> > hooked into it. This work is, in a word, incomplete.
Regards,
Daniel
* Re: GFS, what's remaining
2005-09-05 15:49 ` Daniel Phillips
@ 2005-09-05 16:18 ` Dmitry Torokhov
2005-09-06 0:57 ` Daniel Phillips
0 siblings, 1 reply; 106+ messages in thread
From: Dmitry Torokhov @ 2005-09-05 16:18 UTC (permalink / raw)
To: linux-kernel
Cc: Daniel Phillips, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > On 2005-09-03T01:57:31, Daniel Phillips <phillips@istop.com> wrote:
> > > The only current users of dlms are cluster filesystems. There are zero
> > > users of the userspace dlm api.
> >
> > That is incorrect...
>
> Application users Lars, sorry if I did not make that clear. The issue is
> whether we need to export an all-singing-all-dancing dlm api from kernel to
> userspace today, or whether we can afford to take the necessary time to get
> it right while application writers take their time to have a good think about
> whether they even need it.
>
If Linux fully supported OpenVMS DLM semantics we could start thinking about
moving our application onto a Linux box because our alpha server is aging.
That's just my user application writer $0.02.
--
Dmitry
* Re: GFS, what's remaining
2005-09-05 16:18 ` Dmitry Torokhov
@ 2005-09-06 0:57 ` Daniel Phillips
2005-09-06 2:03 ` Dmitry Torokhov
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-06 0:57 UTC (permalink / raw)
To: Dmitry Torokhov
Cc: linux-kernel, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > On 2005-09-03T01:57:31, Daniel Phillips <phillips@istop.com> wrote:
> > > > The only current users of dlms are cluster filesystems. There are
> > > > zero users of the userspace dlm api.
> > >
> > > That is incorrect...
> >
> > Application users Lars, sorry if I did not make that clear. The issue is
> > whether we need to export an all-singing-all-dancing dlm api from kernel
> > to userspace today, or whether we can afford to take the necessary time
> > to get it right while application writers take their time to have a good
> > think about whether they even need it.
>
> If Linux fully supported OpenVMS DLM semantics we could start thinking
> about moving our application onto a Linux box because our alpha server is
> aging.
>
> That's just my user application writer $0.02.
What stops you from trying it with the patch? That kind of feedback would be
worth way more than $0.02.
Regards,
Daniel
* Re: GFS, what's remaining
2005-09-06 0:57 ` Daniel Phillips
@ 2005-09-06 2:03 ` Dmitry Torokhov
2005-09-06 4:02 ` Daniel Phillips
0 siblings, 1 reply; 106+ messages in thread
From: Dmitry Torokhov @ 2005-09-06 2:03 UTC (permalink / raw)
To: Daniel Phillips
Cc: linux-kernel, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > On 2005-09-03T01:57:31, Daniel Phillips <phillips@istop.com> wrote:
> > > > > The only current users of dlms are cluster filesystems. There are
> > > > > zero users of the userspace dlm api.
> > > >
> > > > That is incorrect...
> > >
> > > Application users Lars, sorry if I did not make that clear. The issue is
> > > whether we need to export an all-singing-all-dancing dlm api from kernel
> > > to userspace today, or whether we can afford to take the necessary time
> > > to get it right while application writers take their time to have a good
> > > think about whether they even need it.
> >
> > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > about moving our application onto a Linux box because our alpha server
> > aging.
> >
> > That's just my user application writer $0.02.
>
> What stops you from trying it with the patch? That kind of feedback would be
> worth way more than $0.02.
>
We do not have such plans at the moment and I prefer spending my free
time on tinkering with the kernel, not rewriting some in-house application.
Besides, DLM is not the only thing that does not have a drop-in
replacement in Linux.
You just said you did not know if there are any potential users for the
full DLM and I said there are some.
--
Dmitry
* Re: GFS, what's remaining
2005-09-06 2:03 ` Dmitry Torokhov
@ 2005-09-06 4:02 ` Daniel Phillips
2005-09-06 4:07 ` GFS, what's remaining Dmitry Torokhov
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-06 4:02 UTC (permalink / raw)
To: Dmitry Torokhov
Cc: linux-kernel, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote:
> On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > > On 2005-09-03T01:57:31, Daniel Phillips <phillips@istop.com> wrote:
> > > > > > The only current users of dlms are cluster filesystems. There
> > > > > > are zero users of the userspace dlm api.
> > > > >
> > > > > That is incorrect...
> > > >
> > > > Application users Lars, sorry if I did not make that clear. The
> > > > issue is whether we need to export an all-singing-all-dancing dlm api
> > > > from kernel to userspace today, or whether we can afford to take the
> > > > necessary time to get it right while application writers take their
> > > > time to have a good think about whether they even need it.
> > >
> > > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > > about moving our application onto a Linux box because our alpha server
> > > is aging.
> > >
> > > That's just my user application writer $0.02.
> >
> > What stops you from trying it with the patch? That kind of feedback
> > would be worth way more than $0.02.
>
> We do not have such plans at the moment and I prefer spending my free
> time on tinkering with kernel, not rewriting some in-house application.
> Besides, DLM is not the only thing that does not have a drop-in
> replacement in Linux.
>
> You just said you did not know if there are any potential users for the
> full DLM and I said there are some.
I did not say "potential", I said there are zero dlm applications at the
moment. Nobody has picked up the prototype (g)dlm api, used it in an
application and said "gee this works great, look what it does".
I also claim that most developers who think that using a dlm for application
synchronization would be really cool are probably wrong. Use sockets for
synchronization exactly as for a single-node, multi-tasking application and
you will end up with less code, more obviously correct code, probably more
efficient code and... you get an optimal, single-node version for free.
And I also claim that there is precious little reason to have a full-featured
dlm in-kernel. Being in-kernel has no benefit for a userspace application.
But being in-kernel does add kernel bloat, because there will be extra
features lathered on that are not needed by the only in-kernel user, the
cluster filesystem.
In the case of your port, you'd be better off hacking up a userspace library
to provide OpenVMS dlm semantics exactly, not almost.
By the way, you said "alpha server" not "alpha servers", was that just a slip?
Because if you don't have a cluster then why are you using a dlm?
Regards,
Daniel
* Re: GFS, what's remaining
2005-09-06 4:02 ` Daniel Phillips
@ 2005-09-06 4:07 ` Dmitry Torokhov
2005-09-06 4:58 ` Daniel Phillips
0 siblings, 1 reply; 106+ messages in thread
From: Dmitry Torokhov @ 2005-09-06 4:07 UTC (permalink / raw)
To: Daniel Phillips
Cc: linux-kernel, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On Monday 05 September 2005 23:02, Daniel Phillips wrote:
>
> By the way, you said "alpha server" not "alpha servers", was that just a slip?
> Because if you don't have a cluster then why are you using a dlm?
>
No, it is not a slip. The application is running on just one node, so we
do not really use the "distributed" part. However we make heavy use of the
rest of the lock manager's features, especially lock value blocks.
--
Dmitry
* Re: GFS, what's remaining
2005-09-06 4:07 ` GFS, what's remaining Dmitry Torokhov
@ 2005-09-06 4:58 ` Daniel Phillips
2005-09-06 5:05 ` Dmitry Torokhov
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-06 4:58 UTC (permalink / raw)
To: Dmitry Torokhov
Cc: akpm, linux clustering, linux-fsdevel, Andi Kleen, linux-kernel
On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > By the way, you said "alpha server" not "alpha servers", was that just a
> > slip? Because if you don't have a cluster then why are you using a dlm?
>
> No, it is not a slip. The application is running on just one node, so we
> do not really use "distributed" part. However we make heavy use of the
> rest of lock manager features, especially lock value blocks.
Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature
without even having the excuse you were forced to use it. Why don't you just
have a daemon that sends your values over a socket? That should be all of a
day's coding.
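Client-side, that suggestion amounts to something like the following; a
minimal sketch over a Unix domain socket, with the socket path and one-line
protocol invented for illustration:
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        char val[32];
        int s = socket(AF_UNIX, SOCK_STREAM, 0);

        strcpy(sa.sun_path, "/var/run/valued.sock"); /* invented path */
        if (connect(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            perror("connect");  /* daemon gone: handle it here */
            return 1;
        }
        write(s, "get mykey\n", 10);  /* invented one-line protocol */
        read(s, val, sizeof(val));    /* the "value block" comes back */
        close(s);
        return 0;
    }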
Anyway, thanks for sticking your head up, and sorry if it sounds aggressive.
But you nicely supported my claim that most who think they should be using a
dlm, really shouldn't.
Regards,
Daniel
* Re: GFS, what's remaining
2005-09-06 4:58 ` Daniel Phillips
@ 2005-09-06 5:05 ` Dmitry Torokhov
2005-09-06 6:48 ` Daniel Phillips
0 siblings, 1 reply; 106+ messages in thread
From: Dmitry Torokhov @ 2005-09-06 5:05 UTC (permalink / raw)
To: Daniel Phillips
Cc: linux-kernel, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On Monday 05 September 2005 23:58, Daniel Phillips wrote:
> On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > > By the way, you said "alpha server" not "alpha servers", was that just a
> > > slip? Because if you don't have a cluster then why are you using a dlm?
> >
> > No, it is not a slip. The application is running on just one node, so we
> > do not really use "distributed" part. However we make heavy use of the
> > rest of lock manager features, especially lock value blocks.
>
> Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature
> without even having the excuse you were forced to use it. Why don't you just
> have a daemon that sends your values over a socket? That should be all of a
> day's coding.
>
Umm, because when most of the code was written, TCP and the rest of the
network stack were the clunkiest code out there? Plus, having a daemon
introduces problems with cleanup (say the process dies for one reason or
another), whereas having it in the OS takes care of that.
> Anyway, thanks for sticking your head up, and sorry if it sounds aggressive.
> But you nicely supported my claim that most who think they should be using a
> dlm, really shouldn't.
Heh, do you think it is a bit premature to dismiss something even without
ever seeing the code?
--
Dmitry
* Re: GFS, what's remaining
2005-09-06 5:05 ` Dmitry Torokhov
@ 2005-09-06 6:48 ` Daniel Phillips
2005-09-06 6:55 ` Dmitry Torokhov
2005-09-06 13:42 ` Alan Cox
0 siblings, 2 replies; 106+ messages in thread
From: Daniel Phillips @ 2005-09-06 6:48 UTC (permalink / raw)
To: Dmitry Torokhov
Cc: linux-kernel, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> do you think it is a bit premature to dismiss something even without
> ever seeing the code?
You told me you are using a dlm for a single-node application, is there
anything more I need to know?
Regards,
Daniel
* Re: GFS, what's remaining
2005-09-06 6:48 ` Daniel Phillips
@ 2005-09-06 6:55 ` Dmitry Torokhov
2005-09-06 7:18 ` Daniel Phillips
2005-09-06 13:42 ` Alan Cox
1 sibling, 1 reply; 106+ messages in thread
From: Dmitry Torokhov @ 2005-09-06 6:55 UTC (permalink / raw)
To: Daniel Phillips
Cc: linux-kernel, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without
> > ever seeing the code?
>
> You told me you are using a dlm for a single-node application, is there
> anything more I need to know?
>
I would still like to know why you consider it a "sin". On OpenVMS it is
fast, provides a way of cleaning up, and does not introduce a single point
of failure as is the case with a daemon. And if we ever want to spread
the load between 2 boxes we can easily do it. Why would I not want to use
it?
--
Dmitry
* Re: GFS, what's remaining
2005-09-06 6:55 ` Dmitry Torokhov
@ 2005-09-06 7:18 ` Daniel Phillips
2005-09-06 14:31 ` Dmitry Torokhov
0 siblings, 1 reply; 106+ messages in thread
From: Daniel Phillips @ 2005-09-06 7:18 UTC (permalink / raw)
To: Dmitry Torokhov
Cc: akpm, linux clustering, linux-fsdevel, Andi Kleen, linux-kernel
On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > do you think it is a bit premature to dismiss something even without
> > > ever seeing the code?
> >
> > You told me you are using a dlm for a single-node application, is there
> > anything more I need to know?
>
> I would still like to know why you consider it a "sin". On OpenVMS it is
> fast, provides a way of cleaning up...
There is something hard about handling EPIPE?
> and does not introduce single point
> of failure as it is the case with a daemon. And if we ever want to spread
> the load between 2 boxes we easily can do it.
But you said it runs on an aging Alpha, surely you do not intend to expand it
to two aging Alphas? And what makes you think that socket-based
synchronization keeps you from spreading out the load over multiple boxes?
> Why would I not want to use it?
It is not the right tool for the job from what you have told me. You want to
get a few bytes of information from one task to another? Use a socket, as
God intended.
Regards,
Daniel
* Re: GFS, what's remaining
2005-09-06 7:18 ` Daniel Phillips
@ 2005-09-06 14:31 ` Dmitry Torokhov
0 siblings, 0 replies; 106+ messages in thread
From: Dmitry Torokhov @ 2005-09-06 14:31 UTC (permalink / raw)
To: Daniel Phillips
Cc: linux-kernel, Lars Marowsky-Bree, Andi Kleen, linux clustering,
akpm, linux-fsdevel
On 9/6/05, Daniel Phillips <phillips@istop.com> wrote:
> On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > > do you think it is a bit premature to dismiss something even without
> > > > ever seeing the code?
> > >
> > > You told me you are using a dlm for a single-node application, is there
> > > anything more I need to know?
> >
> > I would still like to know why you consider it a "sin". On OpenVMS it is
> > fast, provides a way of cleaning up...
>
> There is something hard about handling EPIPE?
>
Just the fact that you want me to handle it ;)
> > and does not introduce single point
> > of failure as it is the case with a daemon. And if we ever want to spread
> > the load between 2 boxes we easily can do it.
>
> But you said it runs on an aging Alpha, surely you do not intend to expand it
> to two aging Alphas?
You would be right if I were designing this right now. Now roll 10 - 12
years back, when I had a shiny new alpha. Would you criticize me
then for using a mechanism that allowed me to easily spread the application
across several nodes with minimal changes if needed?
What you fail to realize is that there are applications that run and will
continue to run for a long time.
> And what makes you think that socket-based
> synchronization keeps you from spreading out the load over multiple boxes?
>
> > Why would I not want to use it?
>
> It is not the right tool for the job from what you have told me. You want to
> get a few bytes of information from one task to another? Use a socket, as
> God intended.
>
Again, when TCP/IP is not the native network stack and libc socket
routines are not readily available, a DLM starts looking much more
viable.
--
Dmitry
* Re: GFS, what's remaining
2005-09-06 6:48 ` Daniel Phillips
2005-09-06 6:55 ` Dmitry Torokhov
@ 2005-09-06 13:42 ` Alan Cox
1 sibling, 0 replies; 106+ messages in thread
From: Alan Cox @ 2005-09-06 13:42 UTC (permalink / raw)
To: Daniel Phillips
Cc: akpm, linux clustering, linux-fsdevel, Dmitry Torokhov,
Andi Kleen, linux-kernel
On Maw, 2005-09-06 at 02:48 -0400, Daniel Phillips wrote:
> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without
> > ever seeing the code?
>
> You told me you are using a dlm for a single-node application, is there
> anything more I need to know?
That's standard practice for many non-Unix operating systems. It means
your code supports failover without much additional work, and it provides
all the functionality for locks on a single node too.
* Re: GFS, what's remaining
2005-09-02 21:17 ` Andi Kleen
` (2 preceding siblings ...)
2005-09-03 5:57 ` Daniel Phillips
@ 2005-09-03 7:06 ` Wim Coekaerts
2005-09-06 12:55 ` Suparna Bhattacharya
4 siblings, 0 replies; 106+ messages in thread
From: Wim Coekaerts @ 2005-09-03 7:06 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-fsdevel, linux clustering, linux-kernel
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <akpm@osdl.org> writes:
>
> >
> > Again, that's not a technical reason. It's _a_ reason, sure. But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
cluster filesystems are very common, there are companies that had/have a
whole business around it, veritas, polyserve, ex-sistina, thus now
redhat, ibm, tons of companies out there sell this, big bucks. as
someone said, it's different than nfs because for certain things there
is less overhead but there are many other reasons, it makes it a lot
easier to create a clustered nfs server so you create a cfs on a set of
disks with a number of nodes and export that fs from all those, you can
easily do loadbalancing for applications, you have a lot of
infrastructure where people have invested in that allows for shared
storage...
for ocfs we have tons of production customers running many terabyte
databases on a cfs. why? because dealing with the raw disk from a number
of nodes sucks. because nfs is pretty broken for a lot of stuff, there
is no consistency across nodes when each machine nfs mounts a server
partition. yes nfs can be used for things but cfs's are very useful for
many things nfs just can't do. want a list ?
companies building failover for services like to use things like this,
it creates a non single point of failure kind of setup much more easily.
andso on and so on, yes there are alternatives out there but fact is
that a lot of folks like to use it, have been using it for ages, and
want to be using it.
from an implementation point of view, as folks here have already said,
we've tried our best to implement things as a real linux filesystem, no
abstractions to have something generic, it's clean and as tight as can
be for a lot of stuff. and compared to other cfs's it's pretty darned
nice, however I think it's silly to have competition between ocfs2 and
gfs2. they are different just like the ton of local filesystems are
different and people like to use one or/over the other. david said gfs
is popular and has been around, well, I can list you tons of folks that
have been using our stuff 24/7 for years (for free) just as well. it's
different. that's that.
it'd be really nice if the mainline kernel had it/them included. it would be
a good start to get more folks involved and, instead of years of talk on
mailing lists that end up in nothing, actually end up with folks
participating and contributing.
* Re: GFS, what's remaining
2005-09-02 21:17 ` Andi Kleen
` (3 preceding siblings ...)
2005-09-03 7:06 ` GFS, what's remaining Wim Coekaerts
@ 2005-09-06 12:55 ` Suparna Bhattacharya
4 siblings, 0 replies; 106+ messages in thread
From: Suparna Bhattacharya @ 2005-09-06 12:55 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux clustering, akpm, linux-fsdevel, linux-kernel
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <akpm@osdl.org> writes:
>
> >
> > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > > possibly gain (or vice versa)
> > > >
> > > > - Relative merits of the two offerings
> > >
> > > You missed the important one - people actively use it and have been for
> > > some years. Same reason with have NTFS, HPFS, and all the others. On
> > > that alone it makes sense to include.
> >
> > Again, that's not a technical reason. It's _a_ reason, sure. But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
>
> There seems to be clearly a need for a shared-storage fs of some sort
> for HA clusters and virtualized usage (multiple guests sharing a
> partition). Shared storage can be more efficient than network file
> systems like NFS because the storage access is often more efficient
> than network access and it is more reliable because it doesn't have a
> single point of failure in form of the NFS server.
>
> It's also a logical extension of the "failover on failure" clusters
> many people run now - instead of only failing over the shared fs at
> failure and keeping one machine idle the load can be balanced between
> multiple machines at any time.
>
> One argument to merge both might be that nobody really knows yet which
> shared-storage file system (GFS or OCFS2) is better. The only way to
> find out would be to let the user base try out both, and that's most
> practical when they're merged.
>
> Personally I think ocfs2 has nicer&cleaner code than GFS.
> It seems to be more or less a 64bit ext3 with cluster support, while
The "more or less" is what bothers me here - the first time I heard this,
it sounded a little misleading, as I expected to find some kind of a
patch to ext3 to make it 64 bit with extents and cluster support.
Now I understand it a little better (thanks to Joel and Mark).
And herein lies the issue on which I tend to agree with Andrew
-- it's really nice to have multiple filesystems innovating freely in
their niches and eventually proving themselves in practice, without
being bogged down by legacy etc. But at the same time, is there enough
thought and discussion about where the fragmentation/diversification is really
warranted, vs improving what is already there, or say incorporating
the best of one into another, maybe over a period of time?
The number of filesystems seems to just keep growing, and supporting
all of them isn't easy -- for users it isn't really easy to switch from
one to another, and the justifications for choosing between them are
sometimes confusing and burdensome from an administrator standpoint
- one filesystem is good in certain conditions, another in others,
stability levels may vary etc, and it's not always possible to predict
which aspect to prioritize.
Now, with filesystems that have been around in production for a long
time, the on-disk format becomes a major constraining factor, and the
reason for having various legacy support around. Likewise, for some
special purpose filesystems there really is a niche usage. But for new
and sufficiently general purpose filesystems, with new on-disk structure,
isn't it worth thinking this through and trying to get it right ?
Yeah, it is a lot of work upfront ... but with double the people working
on something, it just might get much better than what they individually
could manage. Sometimes.
BTW, I don't know if it is worth it in this particular case, but just
something that worries me in general.
> GFS seems to reinvent a lot more things and has somewhat uglier code.
> On the other hand GFS' cluster support seems to be more aimed
> at being a universal cluster service open for other usages too,
> which might be a good thing. OCFS2s cluster seems to be more
> aimed at only serving the file system.
>
> But which one works better in practice is really an open question.
True, but what usually ends up happening is that this question can
never quite be answered in black and white. So both just continue
to exist and apps need to support both ... convergence becomes impossible
and long term duplication inevitable.
So at least having a clear demarcation/guideline of what situations
each is suitable for upfront would be a good thing. That might also
get some cross ocfs-gfs and ocfs-ext3 reviews in the process :)
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India