* [Ocfs2-devel] OCFS2 features RFC
@ 2006-04-25 18:35 Mark Fasheh
  2006-04-25 21:55 ` Christoph Hellwig
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Mark Fasheh @ 2006-04-25 18:35 UTC (permalink / raw)
  To: ocfs2-devel

The OCFS2 team is in the preliminary stages of planning major features for
our next cycle of development. The goal of this e-mail then is to stimulate
some discussion as to how features should be prioritized going forward. Some
disclaimers apply:

* The following list is very preliminary and is sure to change.

* I've probably missed some things.

* Development priorities within Oracle can be influenced but are ultimately
  up to management. That's not stopping anyone from contributing though, and
  patches are always welcome.

So I'll start with changes that can be completely contained within the file
system (no cluster stack changes needed):

-Sparse file support: Self explanatory. We need this for various reasons
 including performance, correctness and space usage.

-Htree support

-Extended attributes: This might be another area where we
 steal^H^H^H^H^Hcopy some good code from Ext3 :) On top of this one can
 trivially implement posix acls. We're not likely to support EA block
 sharing though as it becomes difficult to manage across the cluster.

-Removal of the vote mechanism: The most trivial dentry type network votes
 can go quite easily by replacing them with a cluster lock. This is critical
 in speeding up unlink and rename operations in the cluster. The remaining
 votes (mount, unmount, delete_inode) look like they'll require cluster
 stack adjustments.

-Data in inode blocks: Should speed up local node data operations with small
 files significantly.

-Shared writeable mmap: This looks like it might require changes to the
 kernel (outside of OCFS2). We need to investigate further...
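To make the sparse file item in the list above concrete, here is a minimal userspace sketch (Python, not OCFS2 code) of what a sparse file looks like from an application: a file whose logical size is much larger than the space allocated for it. The helper name `make_file_with_hole` and the 64 MiB hole size are made up for illustration, and the allocated figure depends entirely on whether the underlying file system supports holes.

```python
import os
import tempfile

def make_file_with_hole(path, hole_bytes, tail=b"end"):
    """Seek past `hole_bytes` of never-written space, then write `tail`.

    Returns (logical_size, allocated_bytes) as reported by stat().
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        os.lseek(fd, hole_bytes, os.SEEK_SET)  # skip over the hole
        os.write(fd, tail)                     # only this part holds data
    finally:
        os.close(fd)
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512

with tempfile.TemporaryDirectory() as d:
    logical, allocated = make_file_with_hole(
        os.path.join(d, "sparse.img"), 64 * 1024 * 1024)
    print(f"logical size: {logical} bytes, allocated: {allocated} bytes")
```

On a file system with sparse support the allocated figure should stay small; without it, the whole hole has to be backed by real (zeroed) blocks, which is where the performance and space concerns come from.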

Now on to file system features which require cluster stack changes. I'll
have a lot more to say about the cluster stack in a bit, but it's worth
listing these out here for completeness.

-Cluster consistent Flock / Lockf

-Online file system resize

-Removal of remaining FS votes: If we can get rid of the delete_inode vote,
 I don't believe we'll need the mount / umount ones anymore (and if we still
 do, then a proper group services implementation could handle that)

-Allow the file system to go "hard read only" when it loses its connection
 to the disk, rather than the kernel panic we have today. This allows
 applications using the file system to gracefully shut down. Other
 applications on the system continue unharmed. "Hard read only" in the OCFS2
 context means that the RO node does not look mounted to the other nodes on
 that file system. Absolutely no disk writes are allowed. File data and
 metadata can be stale or otherwise invalid. We never want to return
 invalid data to userspace, so file reads return -EIO.
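As a rough sketch of what "gracefully shut down" could look like from the application side under the "hard read only" behaviour described above: reads fail with EIO, and the application treats that as a signal to stop using the file system instead of retrying. This is Python with a simulated failure; `read_or_shutdown` and `failing_read` are hypothetical names, not part of any OCFS2 interface.

```python
import errno
import os

def read_or_shutdown(read_fn, on_shutdown):
    """Call read_fn(); if it fails with EIO, run on_shutdown() and return None.

    Any other OSError is re-raised unchanged.
    """
    try:
        return read_fn()
    except OSError as e:
        if e.errno != errno.EIO:
            raise
        on_shutdown()
        return None

def failing_read():
    # Stand-in for os.read() on a node that has gone hard read-only.
    raise OSError(errno.EIO, os.strerror(errno.EIO))

events = []
result = read_or_shutdown(failing_read,
                          lambda: events.append("flushed state, exiting"))
print(result, events)
```

The point of returning an error rather than stale data is that the caller gets an unambiguous signal it can act on; the rest of the system keeps running.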

As far as the existing cluster stack goes, currently most of the OCFS2 team
feels that the code has gone as far as it can and should go. It would
therefore be prudent to allow pluggable cluster stacks. Jeff Mahoney at
Novell has already done some integration work implementing a userspace
clustering interface. We probably want to do more in that area though.

There are several good reasons why we might want to integrate with external
cluster stacks. The most obvious is code reuse. The list of cluster stack
features we require for our next phase of development is very large (some
are listed below). There is no reason to implement those features unless
we're certain existing software doesn't provide them and can't be extended.
This will also allow a greater amount of choice for the end user. What stack
works well for one environment might not work as well for another. There's
also the fact that current resources are limited. It's enough work designing
and implementing a file system. If we can get out of the business of
maintaining a cluster stack, we should do so.

So the question then becomes, "What is it that we require of our cluster
stack going forward?"

- We'd like as much of it to be user space code as is possible and
  practical.

- The node manager should support dynamic cluster topology updates,
  including removing nodes from the cluster, propagating new configurations to
  existing nodes, etc.

- A pluggable fencing mechanism is a priority.

- We'd like some group services implementation to handle things like
  membership of a mount point, dlm domain/lockspace, etc.

- On the DLM side, we'd like things like directory based mastery, a range
  locking API, and some extra LVB recovery bits.

So that's it for now. Hopefully this will spur some interesting discussion.
Please keep in mind that any of this is subject to change - cluster stack
requirements especially are things we've only recently begun discussing.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com


* [Ocfs2-devel] OCFS2 features RFC
  2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
@ 2006-04-25 21:55 ` Christoph Hellwig
  2006-04-25 22:24   ` Mark Fasheh
  2006-04-26 16:50   ` Daniel Phillips
  2006-04-26  4:11 ` Andi Kleen
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Christoph Hellwig @ 2006-04-25 21:55 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Apr 25, 2006 at 11:35:53AM -0700, Mark Fasheh wrote:
> -Htree support

Please not.  htree is just the worst possible directory format around.
Do some nice hashed or btree directories, but don't try this odd hack
again. Especially as the only reason it was developed for in ext2/3
doesn't work very well in a cluster filesystem anyway - to access the
new htree all nodes would have to support the format anyway, so the
whole easy up/downgrade thing doesn't matter at all.

> -Extended attributes: This might be another area where we
>  steal^H^H^H^H^Hcopy some good code from Ext3 :) On top of this one can
>  trivially implement posix acls. We're not likely to support EA block
>  sharing though as it becomes difficult to manage across the cluster.

Again, the ext3 implementation might not be the best.  I'd say look at
jfs or xfs (in the latter case of course with a less monstrous btree
implementation).


* [Ocfs2-devel] OCFS2 features RFC
  2006-04-25 21:55 ` Christoph Hellwig
@ 2006-04-25 22:24   ` Mark Fasheh
  2006-04-26 16:50   ` Daniel Phillips
  1 sibling, 0 replies; 38+ messages in thread
From: Mark Fasheh @ 2006-04-25 22:24 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Apr 25, 2006 at 11:55:48PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 25, 2006 at 11:35:53AM -0700, Mark Fasheh wrote:
> > -Htree support
> 
> Please not.  htree is just the worst possible directory format around.
> Do some nice hashed or btree directories, but don't try this odd hack
> again. Especially as the only reason it was developed for in ext2/3
> doesn't work very well in a cluster filesystem anyway - to access the
> new htree all nodes would have to support the format anyway, so the
> whole easy up/downgrade thing doesn't matter at all.
Interesting. You make a good point about the up/downgrade code - we
certainly couldn't use that (at least not without jumping through some hoops). I
have to admit that I haven't looked very deeply into htree yet but if it's
that bad and we won't be compatible in any case it certainly makes sense to
try something new. Would you mind pointing out a few of the htree issues
that make it so poor?

> 
> > -Extended attributes: This might be another area where we
> >  steal^H^H^H^H^Hcopy some good code from Ext3 :) On top of this one can
> >  trivially implement posix acls. We're not likely to support EA block
> >  sharing though as it becomes difficult to manage across the cluster.
> 
> again the ext3 implementation might not be the best.  I'd say look at
> jfs or xfs (in the latter case of course with a less monstrous btree
> implementation)
I agree the XFS implementation seems a bit overboard... The problem I'm
having is that I can't seem to determine what size the average set of
extended attributes will be. Basically, as far as I can tell, ext3 will
allow about 1 block plus whatever will fit in the inode, minus overhead.
We'd like to have inlined EA but want to be able to move them out to a block
in the case that the number of extents we need grows to the end of the inode
block - this is to avoid having to create an allocation btree. So then if we
take the one-block-attached-to-the-inode approach, we'd have a capacity a
little less than ext3.

I've also noticed that, while the ext3 EA entries are stored in sorted
order, the search for them is linear. I wonder if that could be improved
upon (or if it even matters if you're just limited to one block).

If one block is insufficient, then certainly we need to look at some other
format. My first inclination would be to have a single level tree with
pointers to leaf nodes stored in hashed order to speed up lookups.
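To illustrate the linear-versus-sorted point above: if the entries in a block are already kept in sorted order, a lookup can bisect instead of scanning. A minimal sketch in Python (not file system code), assuming entries are sorted by name alone; the real ext3 sort key and on-disk layout are not modelled here, and the attribute names are just examples.

```python
from bisect import bisect_left

def linear_find(entries, name):
    """Scan every (name, value) pair until a match is found: O(n)."""
    for entry_name, value in entries:
        if entry_name == name:
            return value
    return None

def binary_find(names, entries, name):
    """Bisect the sorted name list: O(log n). `names` must mirror `entries`."""
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return entries[i][1]
    return None

# Entries kept in sorted order by name (an assumption for this sketch).
entries = sorted([
    ("user.comment", b"hello"),
    ("security.selinux", b"ctx"),
    ("user.mime_type", b"text/plain"),
    ("trusted.overlay", b"x"),
])
names = [n for n, _ in entries]

print(linear_find(entries, "user.comment"))
print(binary_find(names, entries, "user.comment"))
```

With only a handful of attributes in a single block the difference is unlikely to be measurable, which is the "if it even matters" caveat above.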
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com


* [Ocfs2-devel] OCFS2 features RFC
  2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
  2006-04-25 21:55 ` Christoph Hellwig
@ 2006-04-26  4:11 ` Andi Kleen
  2006-04-26 18:06   ` Mark Fasheh
  2006-04-27 20:25 ` Paul Taysom
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2006-04-26  4:11 UTC (permalink / raw)
  To: ocfs2-devel

Mark Fasheh <mark.fasheh@oracle.com> writes:
> 
> - We'd like as much of it to be user space code as is possible and
>   practical.

Won't you get into deadlocks then when the system is low on memory?
(freeing memory might require write outs on OCFS2 and the user space
cluster might be stuck already)

Or rather if you rely on user space you would need to make sure
that the basic block write out path works without such possible
deadlocks.

-Andi


* [Ocfs2-devel] OCFS2 features RFC
  2006-04-25 21:55 ` Christoph Hellwig
  2006-04-25 22:24   ` Mark Fasheh
@ 2006-04-26 16:50   ` Daniel Phillips
  1 sibling, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-04-26 16:50 UTC (permalink / raw)
  To: ocfs2-devel

Christoph Hellwig wrote:
> On Tue, Apr 25, 2006 at 11:35:53AM -0700, Mark Fasheh wrote:
>>-Htree support
> 
> Please not.  htree is just the worst possible directory format around.
> Do some nice hashed or btree directories, but don't try this odd hack
> again.

Could you be specific about what you think is odd about it?

> Especially as the only reason it was developed for in ext2/3
> doesn't work very well in a cluster filesystem anyway

In what way?

> to access the
> new htree all nodes would have to support the format anyway, so the
> whole easy up/downgrade thing doesn't matter at all.

Good point, and this only affects the leaf node format.

Regards,

Daniel


* [Ocfs2-devel] OCFS2 features RFC
  2006-04-26  4:11 ` Andi Kleen
@ 2006-04-26 18:06   ` Mark Fasheh
  2006-04-26 18:08     ` Andi Kleen
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Fasheh @ 2006-04-26 18:06 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Apr 26, 2006 at 06:11:04AM +0200, Andi Kleen wrote:
> Won't you get into deadlocks then when the system is low on memory?
> (freeing memory might require write outs on OCFS2 and the user space
> cluster might be stuck already)
> 
> Or rather if you rely on user space you would need to make sure
> that the basic block write out path works without such possible
> deadlocks.
The DLM certainly wouldn't be in userspace - there's also a convincing
performance argument for it being in kernel.

Primarily then I think we're worried about that in the context of something
like heartbeat. In that case, we probably want something that can do its
work within some preallocated, mlock'd area. I'm not sure (yet) how the
various stacks handle this problem, or even if they do. I need to think
about membership software. I want to say that I don't think this would be an
issue there, but I have a feeling I could concoct a case during node
recovery.
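For reference, a minimal sketch of the "preallocated, mlock'd area" idea from userspace, written in Python with ctypes purely for illustration (a real heartbeat daemon would more likely be C). It assumes Linux and a libc that exports `mlock`; the name `lock_preallocated` and the four-page size are made up. Note that this only pins the process's own pages; it says nothing about memory the kernel may need to allocate on the process's behalf.

```python
import ctypes
import ctypes.util
import mmap
import os

def lock_preallocated(size):
    """Map `size` bytes of anonymous memory, touch it, and try to mlock() it.

    Returns (buffer, locked). `locked` is False if mlock() was refused,
    for example because RLIMIT_MEMLOCK is too low.
    """
    buf = mmap.mmap(-1, size)      # anonymous mapping, allocated up front
    buf.write(b"\0" * size)        # touch every page so it is really backed
    buf.seek(0)

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int

    view = (ctypes.c_char * size).from_buffer(buf)
    rc = libc.mlock(ctypes.addressof(view), size)
    if rc != 0:
        print("mlock failed:", os.strerror(ctypes.get_errno()))
    return buf, rc == 0

buf, locked = lock_preallocated(4 * mmap.PAGESIZE)
print("locked:", locked)
```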
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com


* [Ocfs2-devel] OCFS2 features RFC
  2006-04-26 18:06   ` Mark Fasheh
@ 2006-04-26 18:08     ` Andi Kleen
  2006-04-26 18:34       ` Daniel Phillips
  0 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2006-04-26 18:08 UTC (permalink / raw)
  To: ocfs2-devel

On Wednesday 26 April 2006 20:06, Mark Fasheh wrote:
> On Wed, Apr 26, 2006 at 06:11:04AM +0200, Andi Kleen wrote:
> > Won't you get into deadlocks then when the system is low on memory?
> > (freeing memory might require write outs on OCFS2 and the user space
> > cluster might be stuck already)
> > 
> > Or rather if you rely on user space you would need to make sure
> > that the basic block write out path works without such possible
> > deadlocks.
> The DLM certainly wouldn't be in userspace - there's also a convincing
> performance argument for it being in kernel.
> 
> Primarily then I think we're worried about that in the context of something
> like heartbeat. In that case, we probably want something that can do its
> work within some preallocated, mlock'd area. 

That's not enough - it wouldn't be able to do anything that requires
memory allocation in the critical path. This includes most system calls.

> I'm not sure (yet) how the 
> various stacks handle this problem

I suspect they don't.


-Andi


* [Ocfs2-devel] OCFS2 features RFC
  2006-04-26 18:08     ` Andi Kleen
@ 2006-04-26 18:34       ` Daniel Phillips
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-04-26 18:34 UTC (permalink / raw)
  To: ocfs2-devel

Andi Kleen wrote:
> On Wednesday 26 April 2006 20:06, Mark Fasheh wrote:
>>On Wed, Apr 26, 2006 at 06:11:04AM +0200, Andi Kleen wrote:
>>
>>>Won't you get into deadlocks then when the system is low on memory?
>>>(freeing memory might require write outs on OCFS2 and the user space
>>>cluster might be stuck already)
>>>
>>>Or rather if you rely on user space you would need to make sure
>>>that the basic block write out path works without such possible
>>>deadlocks.
>>
>>The DLM certainly wouldn't be in userspace - there's also a convincing
>>performance argument for it being in kernel.
>>
>>Primarily then I think we're worried about that in the context of something
>>like heartbeat. In that case, we probably want something that can do its
>>work within some preallocated, mlock'd area. 
> 
> That's not enough - it wouldn't be able to do anything that requires
> memory allocation in the critical path. This includes most system calls.

Indeed.  In general, what we have to do is give such a userspace process
access to the PF_MEMALLOC reserve, simply by setting that flag.  This
introduces a requirement to audit such tasks' memory usage, but this isn't
different from what we have to do in kernel anyway.

So we can do this if we want to, but it isn't clear to me why we want
heartbeat in userspace.

Advantages for heartbeat in kernel:

   * Easier to manage reserve memory
   * No memlock requirement
   * Can act on heartbeat timeout with higher precision, possibly
     hard realtime precision

Disadvantages:

   * Handling heartbeat timeout looks a lot like policy
   * Need to invent a mechanism for communicating with userspace
     helpers

I am biased towards heartbeat in kernel, but the issues really need to be
talked out in detail.  The ground rule is that *everything* that can
execute in the block writeout path has to have access to reserve memory.
This includes everything in the failover path, fencing for example.

Regards,

Daniel


* [Ocfs2-devel] OCFS2 features RFC
  2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
  2006-04-25 21:55 ` Christoph Hellwig
  2006-04-26  4:11 ` Andi Kleen
@ 2006-04-27 20:25 ` Paul Taysom
  2006-05-03 23:04 ` [Ocfs2-devel] OCFS2 features RFC - separate journal? Daniel Phillips
  2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
  4 siblings, 0 replies; 38+ messages in thread
From: Paul Taysom @ 2006-04-27 20:25 UTC (permalink / raw)
  To: ocfs2-devel

I've done some experiments with h-trees on ext3 and have found one case
where h-trees get confused.  If I create several thousand files in a
single directory and then try to remove the directory (rm -r), I get an
error that one of the files has not been removed but when I check the
directory, the file is not there.  I repeat the command and the
directory is removed.  I suspect the h-tree code is using the hash for
the cookie for readdir and I'm getting a hash collision.  ReiserFS
solves this problem by having 24 bits of hash and 8 bits of uniqueness
to resolve hash collisions.
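A small sketch of the hash-plus-uniqueness idea, using the 24/8 bit split as described above (this is an illustration only, not checked against the actual ReiserFS on-disk format). The hash here is deliberately weak so that two names collide; `weak_hash` and `assign_cookies` are hypothetical names.

```python
HASH_BITS = 24
GEN_BITS = 8

def weak_hash(name):
    """Deliberately poor hash (byte sum) so that collisions are easy to see."""
    return sum(name.encode()) & ((1 << HASH_BITS) - 1)

def assign_cookies(names):
    """Give each name a cookie of (hash << GEN_BITS) | generation.

    The generation counter distinguishes names that share a hash value.
    """
    next_gen = {}
    cookies = {}
    for name in names:
        h = weak_hash(name)
        gen = next_gen.get(h, 0)
        if gen >= (1 << GEN_BITS):
            raise OverflowError("too many collisions for one hash value")
        next_gen[h] = gen + 1
        cookies[name] = (h << GEN_BITS) | gen
    return cookies

# "ab" and "ba" have the same byte sum, so they collide on the hash.
cookies = assign_cookies(["ab", "ba", "cd"])
for name, cookie in cookies.items():
    print(f"{name}: hash={cookie >> GEN_BITS} gen={cookie & 0xFF} cookie={cookie}")
```

With a cookie made of the hash alone, "ab" and "ba" would be indistinguishable to readdir; the low bits keep the cookies unique at the cost of a hard limit on collisions per hash value.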

Paul Taysom
 


* [Ocfs2-devel] OCFS2 Features RFC
@ 2006-05-02 18:22 Brian Long
  2006-05-02 20:29 ` Sunil Mushran
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Long @ 2006-05-02 18:22 UTC (permalink / raw)
  To: ocfs2-devel

Hello,

I just subscribed to this list because I saw this posting in the
archives:
http://oss.oracle.com/pipermail/ocfs2-devel/2006-April/000931.html

Is there any reason you wouldn't ask the ocfs2-users community for
feedback on features as well?  I hadn't subscribed to -devel because I
figured it was solely for folks actually developing the OCFS2 code  :)

In my opinion, the proposed feature about "hard read only" is the most-
wanted.  My team is in the middle of testing 10gR2 RAC on OCFS2 for
production deployments on RHEL 4 (hopefully your x86_64 certification is
coming soon).  I assume Oracle RAC would like the "hard read only" more
than the current panic.

Also, while I saw one end user complain about your ideas of implementing
ext3 code inside OCFS2, please remember the rest of us that survive just
fine with ext3 in Red Hat's Enterprise Linux.  :)

Third, are there any thoughts on integrating LVM support or using
something like Red Hat's CLVM to allow OCFS2 to layer on top of LVs
instead of just individual disks?

The biggest drawback I see in my environment is that my storage team
provides 34GB and 68GB metas from the EMC Frames.  I'd rather not have a
ton of 68GB OCFS2 filesystems but rather a larger, host-controlled LV.
Trying to get the storage team to provide a 200+GB LUN and grow it on
the fly in the future is a tough task.  If I could control the LV on the
host _and_ grow OCFS2 into larger LVs, that would rock.

Thanks.

/Brian/
-- 
       Brian Long                      |         |           |
       IT Data Center Systems          |       .|||.       .|||.
       Cisco Linux Developer           |   ..:|||||||:...:|||||||:..
       Phone: (919) 392-7363           |   C i s c o   S y s t e m s


* [Ocfs2-devel] OCFS2 Features RFC
  2006-05-02 18:22 [Ocfs2-devel] OCFS2 Features RFC Brian Long
@ 2006-05-02 20:29 ` Sunil Mushran
  0 siblings, 0 replies; 38+ messages in thread
From: Sunil Mushran @ 2006-05-02 20:29 UTC (permalink / raw)
  To: ocfs2-devel

Brian Long wrote:
> Is there any reason you wouldn't ask the ocfs2-users community for
> feedback on features as well?  I hadn't subscribed to -devel because I
> figured it was solely for folks actually developing the OCFS2 code  :)
>   
-devel is for all discussion regarding ocfs2 development. It is not limited
to developers. We could have posted this to -users too, but I guess one is
trying not to cross the "spam" line.

> In my opinion, the proposed feature about "hard read only" is the most-
> wanted.  My team is in the middle of testing 10gR2 RAC on OCFS2 for
> production deployments on RHEL 4 (hopefully your x86_64 certification is
> coming soon).  I assume Oracle RAC would like the "hard read only" more
> than the current panic.
>
> Also, while I saw one end user complain about your ideas of implementing
> ext3 code inside OCFS2, please remember the rest of us that survive just
> fine with ext3 in Red Hat's Enterprise Linux.  :)
>   
:)

> Third, are there any thoughts on integrating LVM support or using
> something like Red Hat's CLVM to allow OCFS2 to layer on top of LVs
> instead of just individual disks?
>
> The biggest drawback I see in my environment is that my storage team
> provides 34GB and 68GB metas from the EMC Frames.  I'd rather not have a
> ton of 68GB OCFS2 filesystems but rather a larger, host-controlled LV.
> Trying to get the storage team to provide a 200+GB LUN and grow it on
> the fly in the future is a tough task.  If I could control the LV on the
> host _and_ grow OCFS2 into larger LVs, that would rock.
>   
We are looking into this.


* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
                   ` (2 preceding siblings ...)
  2006-04-27 20:25 ` Paul Taysom
@ 2006-05-03 23:04 ` Daniel Phillips
  2006-05-04  0:29   ` Zach Brown
  2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
  4 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2006-05-03 23:04 UTC (permalink / raw)
  To: ocfs2-devel

Mark Fasheh wrote:
> The OCFS2 team is in the preliminary stages of planning major features for
> our next cycle of development. The goal of this e-mail then is to stimulate
> some discussion as to how features should be prioritized going forward. Some
> disclaimers apply:

Hi guys,

Sorry about the lag.  Here's an easy feature nobody has mentioned so far, and
from my reading isn't supported: separate journal, like Ext3.  The journals
stay per-node, but they can be on a different (shared) volume than the
filesystem proper.  This should be dead simple to do and it can make a huge
difference to write latency, by putting the journals on separate spindles or
(what I actually have in mind) in NVRAM.

Regards,

Daniel


* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-03 23:04 ` [Ocfs2-devel] OCFS2 features RFC - separate journal? Daniel Phillips
@ 2006-05-04  0:29   ` Zach Brown
  2006-05-04  0:46     ` Daniel Phillips
  0 siblings, 1 reply; 38+ messages in thread
From: Zach Brown @ 2006-05-04  0:29 UTC (permalink / raw)
  To: ocfs2-devel

Daniel Phillips wrote:

> Sorry about the lag.  Here's an easy feature nobody has mentioned so far, and
> from my reading isn't supported: separate journal, like Ext3.

Yeah, I think this would be a fine piece to have some day.

I'm not sure it's a high priority, though, given that the vast majority
of deployments are already using hardware that has either some form of
write caching or so many spindles that external journals just aren't
worth the time they take to configure.

I'd be interested in seeing more careful write ordering in JBD before
worrying about external journals, personally.

- z


* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04  0:29   ` Zach Brown
@ 2006-05-04  0:46     ` Daniel Phillips
  2006-05-04 20:56       ` Zach Brown
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2006-05-04  0:46 UTC (permalink / raw)
  To: ocfs2-devel

Zach Brown wrote:
> Daniel Phillips wrote:
>>Sorry about the lag.  Here's an easy feature nobody has mentioned so far, and
>>from my reading isn't supported: separate journal, like Ext3.
> 
> Yeah, I think this would be a fine piece to have some day.

Ext3 has it today.

> I'm not sure it's a high priority, though, given that the vast majority
> of deployments are already using hardware that has either some form of
> write caching or so many spindles that external journals just aren't
> worth the time they take to configure.

The journal has different, less demanding mirroring requirements than the
filesystem proper.  It is unnecessary and redundant to have a dirty map for
the journal mirror.  It is also unnecessary and stupid to snapshot the
journal.  These two things add up to a _huge_ performance boost for the
journal, if it can be separated.

It is worth remembering that not every OCFS2 user will be running it on a
big expensive SAN.  Probably not even the majority.

> I'd be interested in seeing more careful write ordering in JBD before
> worrying about external journals, personally.

IMHO, the separate journal on NVRAM will yield a much bigger gain and be
much less work besides.  Agreed that improvements to JBD are good.  They
are also scary.

Regards,

Daniel


* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04  0:46     ` Daniel Phillips
@ 2006-05-04 20:56       ` Zach Brown
  2006-05-04 20:59         ` Wim Coekaerts
  2006-05-04 22:23         ` Daniel Phillips
  0 siblings, 2 replies; 38+ messages in thread
From: Zach Brown @ 2006-05-04 20:56 UTC (permalink / raw)
  To: ocfs2-devel


> journal.  These two things add up to a _huge_ performance boost for the
> journal, if it can be separated.

Sure, I don't doubt the high level theory.  Does anyone have numbers to
show its relative effect in practice?  That'd be interesting.

> It is worth remembering that not every OCFS2 user will be running it on a
> big expensive SAN.  Probably not even the majority.

Well, that's debatable.  My only point, though, is that there are higher
priority things that we should get to first because they affect *everyone*.

If the lack of external journals makes you sad, well, I'm sorry to hear
that.  We certainly wouldn't turn away patches if someone got to it
before us.

> IMHO, the separate journal on NVRAM will yield a much bigger gain and be
> much less work besides.

So noted.  I'm curious, though.  What sort of NVRAM hardware do you have
in mind?

- z


* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04 20:56       ` Zach Brown
@ 2006-05-04 20:59         ` Wim Coekaerts
  2006-05-04 22:23         ` Daniel Phillips
  1 sibling, 0 replies; 38+ messages in thread
From: Wim Coekaerts @ 2006-05-04 20:59 UTC (permalink / raw)
  To: ocfs2-devel


>
> So noted.  I'm curious, though.  What sort of NVRAM hardware do you have
> in mind?
>   
and can be shared across nodes so that you can do recovery ;-) I think 
that's a reasonable requirement to have ;)


* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04 20:56       ` Zach Brown
  2006-05-04 20:59         ` Wim Coekaerts
@ 2006-05-04 22:23         ` Daniel Phillips
  2006-05-04 22:30           ` Mark Fasheh
  1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2006-05-04 22:23 UTC (permalink / raw)
  To: ocfs2-devel

Zach Brown wrote:
>>journal.  These two things add up to a _huge_ performance boost for the
>>journal, if it can be separated.
> 
> Sure, I don't doubt the high level theory.  Does anyone have numbers to
> show its relative effect in practice?  That'd be interesting.

I will have Ext3 numbers pretty soon.

>>It is worth remembering that not every OCFS2 user will be running it on a
>>big expensive SAN.  Probably not even the majority.
> 
> Well, that's debatable.  My only point, though, is that there are higher
> priority things that we should get to first because they affect *everyone*.

By all means, prioritize them.  Did separate journals make the list, even
if well towards the end?

> If the lack of external journals makes you sad, well, I'm sorry to hear
> that.  We certainly wouldn't turn away patches if someone got to it
> before us.

By proposing a feature I do not also implicitly propose that Oracle
employees have to do the work.  This is a quick hack after all; I would be
happy to contribute.

>>IMHO, the separate journal on NVRAM will yield a much bigger gain and be
>>much less work besides.
> 
> So noted.  I'm curious, though.  What sort of NVRAM hardware do you have
> in mind?

For the moment, iRAM cards.  Yes I know they suck for throughput, but
there are faster SATA NVRAM cards coming down the pipe, they are still
much faster than IDE disks, and there is no seeking.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04 22:23         ` Daniel Phillips
@ 2006-05-04 22:30           ` Mark Fasheh
  2006-05-05  3:05             ` Daniel Phillips
                               ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Mark Fasheh @ 2006-05-04 22:30 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, May 04, 2006 at 03:23:47PM -0700, Daniel Phillips wrote:
> >>It is worth remembering that not every OCFS2 user will be running it on a
> >>big expensive SAN.  Probably not even the majority.
> >
> >Well, that's debatable.  My only point, though, is that there are higher
> >priority things that we should get to first because they affect *everyone*.
> 
> By all means, prioritize them.  Did separate journals make the list, even
> if well towards the end?
It's on the list now.

> 
> >If the lack of external journals makes you sad, well, I'm sorry to hear
> >that.  We certainly wouldn't turn away patches if someone got to it
> >before us.
> 
> By proposing a feature I do not also implicitly propose that Oracle
> employees have to do the work.  This is a quick hack after all; I would be
> happy to contribute.
By all means. It should be a fairly straightforward change. Out of
curiosity, are we talking about a single journal device (all slot journals
on one disk) or one device per journal?
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04 22:30           ` Mark Fasheh
@ 2006-05-05  3:05             ` Daniel Phillips
  2006-05-05 18:25               ` Mark Fasheh
  2006-05-05 17:12             ` Paul Taysom
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2006-05-05  3:05 UTC (permalink / raw)
  To: ocfs2-devel

Mark Fasheh wrote:
> Out of curiosity, are we talking about a single journal device (all slot
> journals on one disk) or one device per journal?

Hi Mark,

For me, all journals on one disk, but that is just what I want for my one
particular project.  The user should be able to specify slot by slot which
device the journal is on, if it is not on the main volume.  This is just
the logical extension of the Ext3 scheme.

I don't see that there is anything to be gained by requiring the user to
specify a different device for each journal since the user tools already
have to handle the case where all the journals are on the same device.

The configuration I am most interested in at the moment has two nodes, each
of which exports one NVRAM disk and one normal disk to the other.  The
NVRAM disks form a mirror with two journals on it.  The normal disks
likewise form a mirror with the OCFS2 fs proper on it.  The latter
volume needs to be snapshotted and its mirror needs a dirty map.  The
dirty map will live on the (NVRAM) journal volume.  See how big a deal
it is to be able to factor out the journals like that?  As I mentioned
earlier, the journals don't need to be snapshotted and the mirror
doesn't need a dirty map, which is a really big help considering that
typical write latency is determined by the journal, and the latency of
a snapshotted, mirrored device with a persistent dirty map can get really
high.

A picture:

                Node0  <---- GigE cable  ---->  Node1
   NVRAM:   Slot0 Journal               Mirror of Slot0 Journal
            Slot1 Journal               Mirror of Slot1 Journal
            HDISK Dirty Map             Mirror of HDISK Dirty Map

   HDISK:   OCFS2 FS proper             Mirror of OCFS2 FS proper
            OCFS2 FS Snapshot Store     Mirror of OCFS2 FS Snapshot Store

As a side note, separate journals will allow the user to be much less
conservative about setting the number of slots.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04 22:30           ` Mark Fasheh
  2006-05-05  3:05             ` Daniel Phillips
@ 2006-05-05 17:12             ` Paul Taysom
  2006-05-05 18:06               ` Daniel Phillips
  2006-05-05 18:57               ` Sunil Mushran
  2006-05-08 14:28             ` Paul Taysom
  2006-05-08 18:00             ` Paul Taysom
  3 siblings, 2 replies; 38+ messages in thread
From: Paul Taysom @ 2006-05-05 17:12 UTC (permalink / raw)
  To: ocfs2-devel

 The performance you might gain from a separate journaling device will
be very dependent on exactly how the journal is done.  On NSS, the
Netware journaled file system, we ran experiments with the journal
turned off (just didn't do the write) and found it had little impact on
benchmarks like NetBench.  Part of the reason for this is that the
journal writes were asynchronous to the main flow of the system.  Normal
operations would normally not ever wait for journal writes.

Paul
 
_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-05 17:12             ` Paul Taysom
@ 2006-05-05 18:06               ` Daniel Phillips
  2006-05-05 18:57               ` Sunil Mushran
  1 sibling, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-05 18:06 UTC (permalink / raw)
  To: ocfs2-devel

Paul Taysom wrote:
>  The performance you might gain from a separate journaling device will
> be very dependent on exactly how the journal is done.  On NSS, the
> Netware journaled file system, we ran experiments with the journal
> turned off (just didn't do the write) and found it had little impact on
> benchmarks like NetBench.  Part of the reason for this is that the
> journal writes were asynchronous to the main flow of the system.  Normal
> operations would normally not ever wait for journal writes.

That is one load.  Did you try NFS with synchronous mount?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-05  3:05             ` Daniel Phillips
@ 2006-05-05 18:25               ` Mark Fasheh
  2006-05-06  3:09                 ` Daniel Phillips
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Fasheh @ 2006-05-05 18:25 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

On Thu, May 04, 2006 at 08:05:16PM -0700, Daniel Phillips wrote:
> The user should be able to specify slot by slot which
> device the journal is on, if it is not on the main volume.  This is just
> the logical extension of the Ext3 scheme.
To be honest, that sounds a little bit like overkill to me.

For example, I was imagining that the user could create a separate, rootless
file system on the journal device - similar to how we do heartbeat only file
systems. The normal file system would have the journal file system UUID
stored in its superblock. This way mount.ocfs2 could find the proper disk
on the system and pass it along to the file system. If we had multiple
possible journal devices, it would at least mean a much larger set of UUIDs
to store, necessitating a separate area on disk for them. I'm sure there are
other implications as well.
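
To make the difference between the two schemes concrete, here is a minimal
Python sketch of the lookup mount.ocfs2 would have to do in each case. It is
only an illustration under stated assumptions: the field names journal_uuid
and slot_journal_uuids, the device paths and the UUID strings are all
invented for the example and are not part of any OCFS2 on-disk format or
tool.

```python
from typing import Dict, List, Optional


def find_device(devices: Dict[str, str], wanted_uuid: str) -> Optional[str]:
    """Return the path of the device carrying `wanted_uuid`, or None."""
    for path, uuid in devices.items():
        if uuid.lower() == wanted_uuid.lower():
            return path
    return None


def resolve_journals(devices: Dict[str, str], superblock: dict,
                     slots: int) -> List[Optional[str]]:
    """Return one journal device path per slot (None where lookup fails)."""
    if "slot_journal_uuids" in superblock:
        # Per-slot scheme: one UUID per slot has to be stored and resolved.
        uuids = superblock["slot_journal_uuids"]
    else:
        # Single-device scheme: every slot journal lives on the same device,
        # so one UUID in the superblock is enough.
        uuids = [superblock["journal_uuid"]] * slots
    return [find_device(devices, u) for u in uuids]


# Hypothetical result of scanning the local block devices (path -> UUID).
DEVICES = {"/dev/sdb1": "aaaa-0001", "/dev/sdc1": "bbbb-0002"}

single = resolve_journals(DEVICES, {"journal_uuid": "AAAA-0001"}, slots=2)
per_slot = resolve_journals(
    DEVICES, {"slot_journal_uuids": ["aaaa-0001", "bbbb-0002"]}, slots=2)
```

In the single-device case the superblock carries one fixed-size value; in
the per-slot case the list grows with the slot count, which is the extra
on-disk area mentioned above.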

> The configuration I am most interested in at the moment has two nodes, each
> of which exports one NVRAM disk and one normal disk to the other.  The
> NVRAM disks form a mirror with two journals on it.  The normal disks
> likewise form a mirror with the OCFS2 fs proper on it.  The latter
> volume needs to be snapshotted and its mirror needs a dirty map.  The
> dirty map will live on the (NVRAM) journal volume.  See how big a deal
> it is to be able to factor out the journals like that?  As I mentioned
> earlier, the journals don't need to be snapshotted and the mirror
> doesn't need a dirty map, which is a really big help considering that
> typical write latency is determined by the journal, and the latency of
> a snapshotted, mirrored device with a persistent dirty map can get really
> high.
Thanks for explaining your proposed setup. What are you using to mirror the
devices?
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-05 17:12             ` Paul Taysom
  2006-05-05 18:06               ` Daniel Phillips
@ 2006-05-05 18:57               ` Sunil Mushran
  1 sibling, 0 replies; 38+ messages in thread
From: Sunil Mushran @ 2006-05-05 18:57 UTC (permalink / raw)
  To: ocfs2-devel

jbd is also asynchronous. That's not the issue. The issue is more the size
of the journal. The larger the journal, the less often it needs to be
flushed.

In ocfs2, as each slot has a separate journal, there is a desire to limit
the journal size so as to make more space available to actual data. Also, as
the fs is clustered, flushes could be triggered by other nodes.

So, having a separate device makes sense. It adds complexity to the
configuration, but, that is to be expected. ;)
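
The relationship between journal size and flush frequency can be shown with
a toy model. The Python sketch below is not how jbd works internally; it
simply assumes that the journal must be flushed whenever the next
transaction no longer fits, and counts how often that happens. The function
name and the block counts are made up for the example.

```python
from typing import Iterable


def count_forced_flushes(transaction_sizes: Iterable[int],
                         journal_size: int) -> int:
    """Count flushes forced because the next transaction does not fit."""
    used = 0
    flushes = 0
    for size in transaction_sizes:
        if size > journal_size:
            raise ValueError("transaction larger than the whole journal")
        if used + size > journal_size:
            # Journal is full: flush it and start again from empty.
            flushes += 1
            used = 0
        used += size
    return flushes


# The same workload (1000 transactions of 4 blocks each) against a small
# and a large journal, both sizes in blocks.
workload = [4] * 1000
small = count_forced_flushes(workload, journal_size=64)
large = count_forced_flushes(workload, journal_size=256)
```

This model ignores flushes triggered by other nodes, which in a cluster come
on top of the ones counted here.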

Paul Taysom wrote:
>  The performance you might gain from a separate journaling device will
> be very dependent on exactly how the journal is done.  On NSS, the
> Netware journaled file system, we ran experiments with the journal
> turned off (just didn't do the write) and found it had little impact on
> benchmarks like NetBench.  Part of the reason for this is that the
> journal writes were asynchronous to the main flow of the system.  Normal
> operations would normally not ever wait for journal writes.
>
> Paul

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-05 18:25               ` Mark Fasheh
@ 2006-05-06  3:09                 ` Daniel Phillips
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-06  3:09 UTC (permalink / raw)
  To: ocfs2-devel

Mark Fasheh wrote:
> Hi Daniel,
> On Thu, May 04, 2006 at 08:05:16PM -0700, Daniel Phillips wrote:
>>The user should be able to specify slot by slot which
>>device the journal is on, if it is not on the main volume.  This is just
>>the logical extension of the Ext3 scheme.
> 
> To be honest, that sounds a little bit like overkill to me.
> 
> For example, I was imagining that the user could create a separate, rootless
> file system on the journal device - similar to how we do heartbeat only file
> systems. The normal file system would have the journal file system UUID
> stored in its superblock. This way mount.ocfs2 could find the proper disk
> on the system and pass it along to the file system. If we had multiple
> possible journal devices, it would at least mean a much larger set of UUIDs
> to store, necessitating a separate area on disk for them. I'm sure there are
> other implications as well.

Hi Mark,

Why do you want to wrap the separate journals in a filesystem instead of just
being devices?

> Thanks for explaining your proposed setup. What are you using to mirror the
> devices?

DDRaid over NBD or iSCSI, probably NBD (which leads the performance race at
the moment).

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04 22:30           ` Mark Fasheh
  2006-05-05  3:05             ` Daniel Phillips
  2006-05-05 17:12             ` Paul Taysom
@ 2006-05-08 14:28             ` Paul Taysom
  2006-05-08 17:43               ` Daniel Phillips
  2006-05-08 18:00             ` Paul Taysom
  3 siblings, 1 reply; 38+ messages in thread
From: Paul Taysom @ 2006-05-08 14:28 UTC (permalink / raw)
  To: ocfs2-devel

If I was worried about NFS performance, I'd rather use NVRAM as an
immediate reply disk drive.

Paul
 
>That is one load.  Did you try NFS with synchronous mount?

>Regards,

>Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-08 14:28             ` Paul Taysom
@ 2006-05-08 17:43               ` Daniel Phillips
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-08 17:43 UTC (permalink / raw)
  To: ocfs2-devel

Paul Taysom wrote:
> If I was worried about NFS performance, I'd rather use NVRAM as an
> immediate reply disk drive.

What makes you think that that is any faster than just having a fast
journal on the filesystem?  It is certainly messier and adds two more
data copies.  Plus it only helps NFS, what if there are other servers
on the node?  And how do you maintain cache consistency with the data
written to the NFS reply journal when it has been acknowledged but is
not actually in the filesystem?

On a snapshot, the NFS reply journal would be one more thing that
needs to be flushed, this is one more thing needing administration
attention.

How much latency do you think is saved by a dedicated reply journal vs
a fast filesystem journal?  I doubt it is as much as you suppose, it
is on the order of microseconds per write and the reply journal will
eventually have to pay double for that anyway.

Also, somebody has to implement your NFS reply journal, further messing
up knfsd.  I am having a hard time seeing what is good about a
dedicated NFS reply journal.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-04 22:30           ` Mark Fasheh
                               ` (2 preceding siblings ...)
  2006-05-08 14:28             ` Paul Taysom
@ 2006-05-08 18:00             ` Paul Taysom
  2006-05-08 18:22               ` Daniel Phillips
  3 siblings, 1 reply; 38+ messages in thread
From: Paul Taysom @ 2006-05-08 18:00 UTC (permalink / raw)
  To: ocfs2-devel

Network Appliance has been very successful with exactly this
architecture.
Paul 
 
>>> Daniel Phillips <phillips@google.com> 05/08/06 11:43 am >>> 
Paul Taysom wrote:
> If I was worried about NFS performance, I'd rather use NVRAM as an
> immediate reply disk drive.

What makes you think that that is any faster than just having a fast
journal on the filesystem?  It is certainly messier and adds two more
data copies.  Plus it only helps NFS, what if there are other servers
on the node?  And how do you maintain cache consistency with the data
written to the NFS reply journal when it has been acknowledged but is
not actually in the filesystem?

On a snapshot, the NFS reply journal would be one more thing that
needs to be flushed, this is one more thing needing administration
attention.

How much latency do you think is saved by a dedicated reply journal vs
a fast filesystem journal?  I doubt it is as much as you suppose, it
is on the order of microseconds per write and the reply journal will
eventually have to pay double for that anyway.

Also, somebody has to implement your NFS reply journal, further messing
up knfsd.  I am having a hard time seeing what is good about a
dedicated NFS reply journal.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC - separate journal?
  2006-05-08 18:00             ` Paul Taysom
@ 2006-05-08 18:22               ` Daniel Phillips
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-08 18:22 UTC (permalink / raw)
  To: ocfs2-devel

Paul Taysom wrote:
> Network Appliance has been very successful with exactly this
> architecture.
> Paul 

Perhaps alternative architectures exist that are just as good, if not
better?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC
  2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
                   ` (3 preceding siblings ...)
  2006-05-03 23:04 ` [Ocfs2-devel] OCFS2 features RFC - separate journal? Daniel Phillips
@ 2006-05-11 20:04 ` Jeff Mahoney
  2006-05-11 20:40   ` Paul Taysom
                     ` (2 more replies)
  4 siblings, 3 replies; 38+ messages in thread
From: Jeff Mahoney @ 2006-05-11 20:04 UTC (permalink / raw)
  To: ocfs2-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark Fasheh wrote:
> The OCFS2 team is in the preliminary stages of planning major features for
> our next cycle of development. The goal of this e-mail then is to stimulate
> some discussion as to how features should be prioritized going forward. Some
> disclaimers apply:
> 
> * The following list is very preliminary and is sure to change.
> 
> * I've probably missed some things.
> 
> * Development priorities within Oracle can be influenced but are ultimately
>   up to management. That's not stopping anyone from contributing though, and
>   patches are always welcome.
> 

While performance enhancements are always welcome, the two big features
we'd like to see in future OCFS2 releases are features that will make
using OCFS2 more transparent and more like a "local" file system. The
features we want are cluster wide lockf/flock and shared writable mmap.

From a data integrity perspective, it shouldn't make a difference to an
application whether competing reader/writers are on the same node or a
different node. If standard locking primitives are already in use by the
application, they should "just work" if the competing process is on
another node.

> So I'll start with changes that can be completely contained within the file
> system (no cluster stack changes needed):
> 
> -Sparse file support: Self explanatory. We need this for various reasons
>  including performance, correctness and space usage.

I think we all want this one. Once upon a time, ReiserFS didn't support
sparse files and it made doing things that expected sparse files an
exercise in torture.

> -Htree support

Hashed directories in some form, but I think the comments against ext3
style h-trees are valid.

> Now on to file system features which require cluster stack changes. I'll
> have a lot more to say about the cluster stack in a bit, but it's worth
> listing these out here for completeness.

> -Online file system resize

This would be nice, and I think easily done in the same manner ext3
does. Anything outside the file system's current view of the block
device can be initialized in userspace, and the last block group,
bitmaps, and superblock would be adjusted by an ioctl in kernelspace.

> -Allow the file system to go "hard read only" when it loses its connection
>  to the disk, rather than the kernel panic we have today. This allows
>  applications using the file system to gracefully shut down. Other
>  applications on the system continue unharmed. "Hard read only" in the OCFS2
>  context means that the RO node does not look mounted to the other nodes on
>  that file system. Absolutely no disk writes are allowed.  File data and
>  meta data can be stale or otherwise invalid. We never want to return
>  invalid data to userspace, so file reads return -EIO.

This is a big one as well. If a node knows to fence itself, it can put
itself in an error state as well. fence={panic,ro} would be a decent start.

> As far as the existing cluster stack goes, currently most of the OCFS2 team
> feels that the code has gone as far as it can and should go. It would
> therefore be prudent to allow pluggable cluster stacks. Jeff Mahoney at
> Novell has already done some integration work implementing a userspace
> clustering interface. We probably want to do more in that area though.
> 
> There are several good reasons why we might want to integrate with external
> cluster stacks. The most obvious is code reuse. The list of cluster stack
> features we require for our next phase of development is very large (some
> are listed below). There is no reason to implement those features unless
> we're certain existing software doesn't provide them and can't be extended.
> This will also allow a greater amount of choice for the end user. What stack
> works well for one environment might not work as well for another. There's
> also the fact that current resources are limited. It's enough work designing
> and implementing a file system. If we can get out of the business of
> maintaining a cluster stack, we should do so.
> 
> So the question then becomes, "What is it that we require of our cluster
> stack going forward?"
> 
> - We'd like as much of it to be user space code as is possible and
>   practical.

The heartbeat project does a pretty good job on the userspace end, but
as Andi pointed out, it has the usual shortcomings of anything in
userspace involved with writing data inside the kernel. It is prone to
deadlocks and we could miss node topology events.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFEY5jTLPWxlyuTD7IRAmsMAKCTZpN5rb+6jr6K0TvMJVq6LxNrwgCggFvT
uLovIf8rbp1GhF2LVg1i6Cw=
=SkZi
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC
  2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
@ 2006-05-11 20:40   ` Paul Taysom
  2006-05-11 20:55     ` Joel Becker
  2006-05-11 21:16   ` Daniel Phillips
  2006-05-17  1:44   ` Mark Fasheh
  2 siblings, 1 reply; 38+ messages in thread
From: Paul Taysom @ 2006-05-11 20:40 UTC (permalink / raw)
  To: ocfs2-devel

What makes online file system resize tricky is updating all the
allocation chains.  The last block of each of the existing chains needs
to be updated to point to the new blocks in the chains.

Would it be possible to get rid of chains and just compute the next
block in the chain?

Paul

>>Online file system resize

>This would be nice, and I think easily done in the same manner ext3
>does. Anything outside the file system's current view of the block
>device can be initialized in userspace, and the last block group,
>bitmaps, and superblock would be adjusted by an ioctl in kernelspace.

 
>>> Jeff Mahoney <jeffm@suse.com> 05/11/06 2:04 pm >>> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC
  2006-05-11 20:40   ` Paul Taysom
@ 2006-05-11 20:55     ` Joel Becker
  0 siblings, 0 replies; 38+ messages in thread
From: Joel Becker @ 2006-05-11 20:55 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, May 11, 2006 at 02:40:46PM -0600, Paul Taysom wrote:
> What makes online file system resize tricky is updating all the
> allocation chains.  The last block of each of the existing chains needs
> to be updated to point to the new blocks in the chains.

	Nope, not tricky at all.  As clean new allocation groups, we
just insert them at the front of the chains.  The new chain has a
pointer to the existing chains initialized by userspace.  The chain
allocator inode has its chain pointers moved to the new group as a
single write during the in-kernel update.  Only one block write to
update all chains in the inode.

> Would it be possible to get rid of chains and just compute the next
> block in the chain?

	You can't mathematically compute the next group for anything
other than the cluster allocator.  In addition, the chain-reorder logic
allows fewer reads to find a relatively empty chain, an optimization
we'd lose.
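
The front-insertion described above can be modelled in a few lines of
Python. This is a toy in-memory model, not the OCFS2 on-disk layout: the
class and field names (Group, ChainAllocator, next_group, inode_writes) are
invented for the illustration, and object references stand in for block
pointers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Group:
    name: str
    free_bits: int
    next_group: Optional["Group"] = None  # stands in for a block pointer


class ChainAllocator:
    def __init__(self, chain_count: int) -> None:
        # One head pointer per chain, all kept together in the allocator inode.
        self.heads: List[Optional[Group]] = [None] * chain_count
        self.inode_writes = 0

    def grow(self, new_groups: Dict[int, Group]) -> None:
        # Userspace step: point each new group at the current head of its
        # chain. Nothing that is already linked in is modified.
        for chain, group in new_groups.items():
            group.next_group = self.heads[chain]
        # Kernel step: switch all affected heads in one update of the inode.
        heads = list(self.heads)
        for chain, group in new_groups.items():
            heads[chain] = group
        self.heads = heads
        self.inode_writes += 1

    def walk(self, chain: int) -> List[str]:
        """List the group names on one chain, starting from the head."""
        names: List[str] = []
        group = self.heads[chain]
        while group is not None:
            names.append(group.name)
            group = group.next_group
        return names


alloc = ChainAllocator(chain_count=2)
alloc.grow({0: Group("g0", 10), 1: Group("g1", 10)})
alloc.grow({0: Group("g2", 100)})  # resize: one clean group added to chain 0
```

The point of the model is that grow() never touches the tail of a chain, so
no existing group block has to be rewritten when the file system is
extended.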

Joel

-- 

"One of the symptoms of an approaching nervous breakdown is the
 belief that one's work is terribly important."
         - Bertrand Russell 

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC
  2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
  2006-05-11 20:40   ` Paul Taysom
@ 2006-05-11 21:16   ` Daniel Phillips
  2006-05-17  1:44   ` Mark Fasheh
  2 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-11 21:16 UTC (permalink / raw)
  To: ocfs2-devel

Jeff Mahoney wrote:
> While performance enhancements are always welcome, the two big features
> we'd like to see in future OCFS2 releases are features that will make
> using OCFS2 more transparent and more like a "local" file system. The
> features we want are cluster wide lockf/flock and shared writable mmap.

These are both already on the list, so I suppose you are just voting for
priority?  I agree re priority: these two items stand in the way of full
local Posix semantics.  They should be number one and two on the list.

> Hashed directories in some form, but I think the comments against ext3
> style h-trees are valid.

I do not know which "comments against" you are referring to.  I only saw
an unsupported, non-technical assertion from Christoph.  Perhaps
Christoph would be kind enough to share with us the technical details
of how XFS deals with the 31 bit telldir cookie problem.

Hash directories or btrees of any form all have the same telldir issue
as Htree, so if you advocate hashed directories, you also advocate
coming up with some scheme to try to reduce the severity of the telldir
problem.

The only schemes that make the telldir problem actually go away are ones
that stick with a directory scheme modelled on UFS.  I only know of one
of those, the FSF hashing scheme, which has a major problem: the hash
index is not persistent.  It has to be recreated on initial access to
the directory and kept around in memory, competing with other hashed
objects.  This does not scale well.  Another problem is, since the holes
in this scheme are so obvious there is not a lot of incentive to put
time into it, knowing it will eventually be tossed out in favor of
something else.  But feel free :-)

The reason people like HTree is, it is really, really fast and minimizes
disk accesses.  It is also mostly debugged, though we still tend to see
a new issue every now and then.  It's been more than a year since I saw
the last one, and that was an outright bug.

One thing that we tried to do with HTree is work within a 31 bit cookie
limitation to accommodate NFSv2.  I am thinking that maybe we should have
just made NFSv2 fall back to not using the index, which is easy to do
with HTree, and thereby give ourselves the 62 bits of cookie we really
need.  I will float this idea on ext2-devel.
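
For readers unfamiliar with the telldir problem, the Python sketch below
shows why a narrow cookie is awkward for any hash-ordered directory. It is
an illustration only: CRC32 is used as a stand-in hash (it is not the HTree
hash), the function names are invented, and the 8-bit width is chosen purely
so that collisions are guaranteed with 1000 names.

```python
import zlib
from typing import Dict, List


def cookie(name: str, bits: int) -> int:
    """Toy telldir cookie: a stand-in hash truncated to `bits` bits."""
    return zlib.crc32(name.encode("utf-8")) & ((1 << bits) - 1)


def collisions(names: List[str], bits: int) -> Dict[int, List[str]]:
    """Return the cookies that are shared by more than one name."""
    buckets: Dict[int, List[str]] = {}
    for name in names:
        buckets.setdefault(cookie(name, bits), []).append(name)
    return {c: ns for c, ns in buckets.items() if len(ns) > 1}


def readdir_from(names: List[str], start_cookie: int, bits: int) -> List[str]:
    """Resume a hash-ordered directory walk at `start_cookie`."""
    ordered = sorted(names, key=lambda n: (cookie(n, bits), n))
    return [n for n in ordered if cookie(n, bits) >= start_cookie]


NAMES = [f"file{i}" for i in range(1000)]
narrow = collisions(NAMES, 8)   # 1000 names into 256 values must collide
wide = collisions(NAMES, 31)    # the cookie width discussed above
```

A cookie names a hash value, not an entry. When several entries share it,
resuming at that cookie returns all of them again, and resuming just past it
skips any that had not been returned yet. More cookie bits make such
collisions rarer but do not change the nature of the problem.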

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Ocfs2-devel] OCFS2 features RFC
  2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
  2006-05-11 20:40   ` Paul Taysom
  2006-05-11 21:16   ` Daniel Phillips
@ 2006-05-17  1:44   ` Mark Fasheh
       [not found]     ` <446BBCF5.7040903@google.com>
  2006-05-22 17:01     ` Paul Taysom
  2 siblings, 2 replies; 38+ messages in thread
From: Mark Fasheh @ 2006-05-17  1:44 UTC (permalink / raw)
  To: ocfs2-devel

Hi Jeff,

On Thu, May 11, 2006 at 04:04:35PM -0400, Jeff Mahoney wrote:
> While performance enhancements are always welcome, the two big features
> we'd like to see in future OCFS2 releases are features that will make
> using OCFS2 more transparent and more like a "local" file system. The
> features we want are cluster wide lockf/flock and shared writable mmap.
I'm trying to do more research on the user locking stuff right now actually.
My aim is twofold: first, to nail down exactly what each type of locking
entails, and second, to find out what impact a lack of cluster-aware
locking has on at least one existing application.

Just to break it down, lockf() seems to be a (POSIX compliant?) library
wrapper around fcntl() locking, which is range based, optionally mandatory,
and provides deadlock detection. Ranges can encompass any part of the file,
with a special case that allows locking all possible (present and future)
bytes from a given offset. Along with the usual blocking / nonblocking
variants on read / exclusive locks, fcntl supports the F_GETLK operation
which allows userspace to query information about a range, including the
pids of processes holding incompatible locks.

flock() on the other hand is always advisory and does not support ranges. No
explicit deadlock detection seems to be done, though deadlocks can be broken
by the user sending a signal (including kill -9) to one of the waiting
processes. It also supports shared, exclusive and trylock type operations.

And finally, quoting from the fcntl() man page: "Since kernel 2.0, there is
no interaction between the types of lock placed by flock(2) and fcntl(2)."
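
The differences described above can be tried from Python, whose fcntl module
wraps both interfaces. The sketch below is a minimal, single-process
illustration; it assumes Linux and a local (non-NFS) temporary directory,
and the helper names try_flock and demo are made up for the example.

```python
import fcntl
import os
import tempfile
from typing import Tuple


def try_flock(fd: int) -> bool:
    """Attempt a non-blocking exclusive flock(); return True on success."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def demo() -> Tuple[bool, bool]:
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 4096)
        tmp.flush()
        # Two independent open file descriptions for the same file.
        fd_a = os.open(tmp.name, os.O_RDWR)
        fd_b = os.open(tmp.name, os.O_RDWR)
        try:
            # fcntl()-style range lock covering bytes 0..99 only.
            fcntl.lockf(fd_a, fcntl.LOCK_EX | fcntl.LOCK_NB,
                        100, 0, os.SEEK_SET)
            # Length 0 is the special case: lock from offset 1024 to the
            # end of the file, however far the file grows later.
            fcntl.lockf(fd_a, fcntl.LOCK_EX | fcntl.LOCK_NB,
                        0, 1024, os.SEEK_SET)
            # flock() covers the whole file and is unaffected by the
            # fcntl() locks taken above.
            first = try_flock(fd_a)
            # A second open of the same file is refused the flock(), even
            # though it is the same process.
            second = try_flock(fd_b)
            return first, second
        finally:
            os.close(fd_a)
            os.close(fd_b)


first, second = demo()
```

Under the stated assumptions, the first flock() attempt should succeed and
the second should be refused. Neither call says anything about processes on
other nodes, which is the gap cluster-aware locking would have to fill.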

Now, to get to an actual example of application usage, I took a look at the
apache 2.2.2 source. It seems that they do file locking in the apr functions
apr_file_lock() / apr_file_unlock() (located in
srclib/apr/file_io/unix/flock.c). On Linux, these use fcntl().

The only consumers of those functions I could find in the httpd tarball are
the "sdbm" routines in and around srclib/apr-util/dbm/sdbm/

And that's about where my apache expertise ends :/

There are many more apps to look at for information though - sendmail
immediately comes to mind. An strace on my machine here reveals that rpm
uses fcntl() locking.

So the question for the current OCFS2 code base is what impact the lack of
cluster-aware fcntl() locking has on the particular set of software which
we're going to worry about right now. Whenever we choose to do it (and we
_will_ do it), it will take a long time to develop - fcntl() locking alone
encompasses about two thirds of our non-trivial dlm feature wishlist.

> - From a data integrity perspective, it shouldn't make a difference to an
> application whether competing reader/writers are on the same node or a
> different node. If standard locking primitives are already in use by the
> application, they should "just work" if the competing process is on
> another node.
I agree 100%.

> > -Online file system resize
> 
> This would be nice, and I think easily done in the same manner ext3
> does. Anything outside the file system's current view of the block
> device can be initialized in userspace, and the last block group,
> bitmaps, and superblock would be adjusted by an ioctl in kernelspace.
Yep, that's basically how we plan to approach it.

> > - We'd like as much of it to be user space code as is possible and
> >   practical.
> 
> The heartbeat project does a pretty good job on the userspace end, but
> as Andi pointed out, it has the usual shortcomings of anything in
> userspace involved with writing data inside the kernel. It is prone to
> deadlocks and we could miss node topology events.
Ahh ok, so that explains how heartbeat handles it (aka, it doesn't right
now). It really seems to me that we're going to need to find a solution to
this sort of problem sooner or later (why do I get the feeling that I'm
being naive?). Speaking from experience, having cluster stack components in
kernel means a much longer development time, even for small, focused ones
like the OCFS2 stack.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com

* [Ocfs2-devel] OCFS2 features RFC
       [not found]       ` <20060518024638.GY21588@ca-server1.us.oracle.com>
@ 2006-05-19  0:35         ` Daniel Phillips
  2006-05-19 15:16           ` J. Bruce Fields
  2006-05-20  6:11           ` Mark Fasheh
  0 siblings, 2 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-19  0:35 UTC (permalink / raw)
  To: ocfs2-devel

(a dialog between Mark and me that inadvertently became private)

Mark Fasheh wrote:
> On Wed, May 17, 2006 at 05:16:53PM -0700, Daniel Phillips wrote:
>>Does clustered NFS count as software we're going to worry about right now?
>>The impact is, if OCFS2 does provide cluster-aware fcntl locking then the
>>cluster locking hacks lockd needs can possibly be smaller.  Otherwise,
>>lockd must do the job itself, and as a consequence, any applications running
>>on the (clustered) NFS server nodes will not see locks held by NFS clients.
> 
> Clustered NFS is definitely something we care about. We have people using it
> today, with the caveat that file locking won't be cluster aware. It's
> actually pretty interesting how far people get with that. We'd love to
> support the whole thing of course. As far as NFS with file locking though, I
> have to admit that we've heard many more requests from people wanting to do
> things like apache, sendmail, etc on OCFS2.

Ok, I just figured out how to be really lazy and do cluster-consistent
NFS locking across clustered NFS servers without doing much work.  In the
duh category, only one node will actually run lockd and all other NFS
server nodes will just port-forward the NLM traffic to/from it.  Sure,
you can bottleneck this scheme with a little effort, but to be honest we
aren't that interested in NFS locking performance, we are more interested
in actual file operations.

So strike NFS serving off the list of applications that care about cluster
fcntl locking.

>>Unless I have missed something major, fcntl locking does not have any
>>overlap with your existing DLM, so you can implement it with a separate
>>mechanism.  Does that help?
> 
> Eh, unfortunately not that much... It's still a large amount of work :/
> Doing it outside a dlm would just mean one has to reproduce existing
> mechanisms (such as determining lock mastery for instance).

You don't have to distribute the fcntl locking, you can instead manage it
with a single server active on just one node at a time.  So go ahead and
distribute it if you really enjoy futzing with the DLM, but going for the
server approach should reduce your stress considerably.  As a fringe
benefit, you are then forced to consider how to accommodate classic server
failover within the cluster manager framework, which should not be very
hard and is absolutely necessary.
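
As a rough illustration of the single-server idea, the lock table on that one
node can be little more than a list of byte ranges with an overlap check. This
is a sketch under stated assumptions, not OCFS2 code: the class names, the
owner strings and the release_owner() hook are all hypothetical, and it leaves
out blocking waiters, deadlock detection and the merging or splitting of
ranges held by the same owner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeLock:
    owner: str        # hypothetical owner id, e.g. "node2:pid4711"
    start: int        # first byte covered
    end: int          # last byte covered (inclusive)
    exclusive: bool   # True for a write lock, False for a read lock


class RangeLockServer:
    """Hypothetical lock table for one file, kept on a single node."""

    def __init__(self):
        self.held = []

    def conflicts(self, req):
        """Return a held lock that blocks req, or None if req can be granted."""
        for lock in self.held:
            overlap = lock.start <= req.end and req.start <= lock.end
            incompatible = lock.exclusive or req.exclusive
            if overlap and incompatible and lock.owner != req.owner:
                return lock
        return None

    def try_acquire(self, req):
        """Non-blocking acquire: grant and record req unless it conflicts."""
        if self.conflicts(req) is not None:
            return False
        self.held.append(req)
        return True

    def release_owner(self, owner):
        """Drop every lock held by owner, e.g. once its node has been fenced."""
        self.held = [lock for lock in self.held if lock.owner != owner]
```

Because all requests pass through one table, there is no lock mastery to
determine; the cost is that the table has to be rebuilt or reclaimed when the
server node fails over.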

>>Starting with one obvious requirement, the cluster stack needs to be able
>>to handle different kinds of fencing methods or even mixed fencing methods.
>>If the stack stays in kernel, what is the instancing framework?  Modules?
>>I do believe we can make that work.
> 
> call_usermodehelper()?

Bad idea, this gets you back into memory deadlock zone.  Avoiding memory
deadlock is considerably easier in kernel and is nigh on impossible with
call_usermodehelper.

> Sure, it's totally possible to do all that in kernel.
> 
> But we're getting ahead of ourselves - I don't want to implement yet another
> cluster stack - I'd rather fit the file system into an existing framework -
> one which already has all the fencing methods worked out for instance.

Like the Red Hat framework?  Ahem.  Maybe not.  For one thing, they never
even got close to figuring out how to avoid memory deadlock.  For another,
it's a rambling bloated pig with lots of bogus factoring.  Honestly, what
you have now is a much better starting point, you should be thinking about
how to evolve it in the direction it needs to go rather than cutting over
to an existing framework, that was designed with the mindset of usermode
cluster apps, not the more stringent requirements of a cluster filesystem.

>>Consider this: if we define the fencing interface entirely in terms of
>>messages over sockets then the cluster stack does not need to know or care
>>whether the other end lives in kernel or userland.  Comments?
> 
> Interesting, and I'll have to think about whether I can poke holes in that
> or not. Of course, I'm not sure the file system ever has to call out to
> fencing directly, so maybe it's something it never has to worry about.

No, the filesystem never calls fencing, only the cluster manager does.
As I understand it, what happens is:

    1) Somebody (heartbeat) reports a dead node to cluster manager
    2) Cluster manager issues a fence request for the dead node
    3) Cluster manager receives confirmation that the node was fenced
    4) Cluster manager sends out dead node messages to cluster managers
       on other nodes
    5) Some cluster manager receives dead node message, notifies DLM
    6) DLM receives dead node message, initiates lock recovery
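
The ordering in steps 1 to 6 can be sketched as a toy simulation. Everything
here is hypothetical (class names, log strings, the fence callable), and the
asynchronous fence confirmation is collapsed into a synchronous return value
for brevity. The point it shows is that no dead-node message reaches any DLM
until fencing has been confirmed.

```python
class DLM:
    def __init__(self, log):
        self.log = log

    def node_down(self, node):
        # Step 6: lock recovery starts only after a dead-node message.
        self.log.append(f"dlm: recovering locks of {node}")


class ClusterManager:
    def __init__(self, name, fence, log):
        self.name = name
        self.fence = fence    # callable(node) -> True once fencing is confirmed
        self.log = log
        self.peers = []       # cluster managers on other nodes
        self.dlm = DLM(log)

    def report_dead(self, node):
        # Step 1: heartbeat reports a dead node.
        self.log.append(f"{self.name}: heartbeat reports {node} dead")
        # Step 2: issue the fence request.
        self.log.append(f"{self.name}: fence request for {node}")
        if not self.fence(node):
            # No confirmation: nobody may recover the node's locks yet.
            self.log.append(f"{self.name}: fencing of {node} unconfirmed")
            return False
        # Step 3: confirmation received.
        self.log.append(f"{self.name}: {node} confirmed fenced")
        # Step 4: tell the other cluster managers (and handle it locally).
        for peer in self.peers:
            peer.dead_node_message(node)
        self.dead_node_message(node)
        return True

    def dead_node_message(self, node):
        # Step 5: a cluster manager receives the message and notifies its DLM.
        self.log.append(f"{self.name}: dead-node message for {node}")
        self.dlm.node_down(node)
```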

Step (2) is where we need plugins, where each plugin registers a fencing
method and somehow each node becomes associated with a particular fencing method
(setting up this association is an excellent example of a component that
can and should be in userspace because this part never executes in the
block IO path).  The right interface to initiate fencing is probably a
direct (kernel-to-kernel) call, there is actually no good reason to use
a socket interface here.  However, the fencing confirmation is an
asynchronous event and might as well come in over a socket.  There are
alternatives (e.g., linked list event queue) but the socket is most
natural because the cluster manager already needs one to receive events
from other sources.

Actually, fencing has no divine right to be a separate subsystem and is
properly part of the cluster manager.  It's better to think of it that
way.  As such, the cluster manager <=> fencing api is internal, there is
no need to get into interminable discussions of how to standardize it.  So
let's just do something really minimal that gives us a plugin interface
and move on to harder problems.  If you do eventually figure out how to
move the whole cluster manager to userspace then you replace the module
scheme in favor of a dso scheme.

Anyway, assuming both bits are in-kernel then initiating fencing should
just be a method on the (in-kernel) node object and confirmation of
fencing is just an event sent to the node manager's event pipe.  Simple,
no?
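
A userspace analogue of that shape, as a sketch only: fencing methods register
themselves in a table, initiating fencing is a plain method call on the node
object, and the confirmation arrives as a message on a socket that the manager
already reads events from. The registry, the "noop" method and the message
format are all invented for illustration; a real in-kernel version would use
modules rather than a Python dict.

```python
import socket

FENCE_METHODS = {}  # hypothetical registry: method name -> callable


def register_fence_method(name):
    """Decorator that registers a fencing method under a name."""
    def decorator(fn):
        FENCE_METHODS[name] = fn
        return fn
    return decorator


@register_fence_method("noop")
def fence_noop(node, confirm):
    # Stand-in for a real method (power switch, SAN fabric, ...).
    # The confirmation is reported as an event on the socket.
    confirm.sendall(f"fenced {node.name}\n".encode())


class Node:
    def __init__(self, name, method):
        self.name = name
        self.method = method  # node-to-method association, set up by config

    def fence(self, confirm):
        """Initiate fencing with a direct call, not a socket message."""
        FENCE_METHODS[self.method](self, confirm)


def fence_demo():
    # events_rx plays the role of the manager's event socket.
    events_rx, events_tx = socket.socketpair()
    with events_rx, events_tx:
        Node("node3", "noop").fence(events_tx)
        return events_rx.recv(64).decode().strip()
```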

In summary, I retract my point about using the socket to abstract away
the question of whether fencing lives in kernel or userspace and
instead assert that the fencing harness should live wherever the cluster
manager lives, which is in kernel right now and ought to stay there for
the time being.  Socket is still the right way to receive messages from
a fencing module, but a method call is a better way to initiate fencing.

Regards,

Daniel

* [Ocfs2-devel] OCFS2 features RFC
  2006-05-19  0:35         ` Daniel Phillips
@ 2006-05-19 15:16           ` J. Bruce Fields
  2006-05-20  6:11           ` Mark Fasheh
  1 sibling, 0 replies; 38+ messages in thread
From: J. Bruce Fields @ 2006-05-19 15:16 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, May 18, 2006 at 05:35:27PM -0700, Daniel Phillips wrote:
> Ok, I just figured out how to be really lazy and do cluster-consistent
> NFS locking across clustered NFS servers without doing much work.  In the
> duh category, only one node will actually run lockd and all other NFS
> server nodes will just port-forward the NLM traffic to/from it.

Yeah, I can't see why that wouldn't work with v2/v3.  The same trick
won't work with NFSv4 since it has the locking integrated into the
protocol.

It shouldn't be that much work to make lockd/nfsd use whatever locking
the filesystem provides--see

http://linux-nfs.org/cgi-bin/gitweb.cgi?p=bfields-2.6.git;a=shortlog;h=server-cluster-locking-api

for one attempt.  Of course the hard part is providing the locking
support in the filesystem in the first place!  And the main obstacle to
our work has been the lack of an in-kernel filesystem that does this....
(The only testing has been done with GPFS.)

--b.

* [Ocfs2-devel] OCFS2 features RFC
  2006-05-19  0:35         ` Daniel Phillips
  2006-05-19 15:16           ` J. Bruce Fields
@ 2006-05-20  6:11           ` Mark Fasheh
  2006-05-22 19:18             ` Daniel Phillips
  1 sibling, 1 reply; 38+ messages in thread
From: Mark Fasheh @ 2006-05-20  6:11 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, May 18, 2006 at 05:35:27PM -0700, Daniel Phillips wrote:
> Ok, I just figured out how to be really lazy and do cluster-consistent
> NFS locking across clustered NFS servers without doing much work.  In the
> duh category, only one node will actually run lockd and all other NFS
> server nodes will just port-forward the NLM traffic to/from it.  Sure,
> you can bottleneck this scheme with a little effort, but to be honest we
> aren't that interested in NFS locking performance, we are more interested
> in actual file operations.
Out of curiosity, how will a failure on the lockd node be handled? Or is
this something that you're not worried about?

> >call_usermodehelper()?
> 
> Bad idea, this gets you back into memory deadlock zone.  Avoiding memory
> deadlock is considerably easier in kernel and is nigh on impossible with
> call_usermodehelper.
Good catch, I threw that out without fully evaluating the implications :)
 
> Like the Red Hat framework?  Ahem.  Maybe not.  For one thing, they never
> even got close to figuring out how to avoid memory deadlock.  For another,
> it's a rambling bloated pig with lots of bogus factoring.  Honestly, what
> you have now is a much better starting point,
Well, I should've said "multiple existing frameworks" - so people could run
whatever fits their needs best and pick the feature sets that suit them.
Besides, I think you're being somewhat unfair to the Red Hat framework. It
does _a lot_ more than the OCFS2 stack can even dream of handling right now.
And we haven't even talked about Linux-HA yet.

> you should be thinking about how to evolve it in the direction it needs to
> go rather than cutting over to an existing framework, that was designed
> with the mindset of usermode cluster apps, not the more stringent
> requirements of a cluster filesystem.
I hear they have this thing called "GFS" ;) 

What we are thinking about right now is how we can reuse code - building on
other people's bug fixes, feature patches, etc. What we have today just
bootstraps our file system into the world of the cluster. Deciding to go the
full-blown, home-grown cluster stack route isn't a decision we make based
on one (admittedly difficult) bug or design issue. Nor is it something that
we will undertake without having fully explored all other alternatives.
 
> No, the filesystem never calls fencing, only the cluster manager does.
> As I understand it, what happens is:
> 
>    1) Somebody (heartbeat) reports a dead node to cluster manager
>    2) Cluster manager issues a fence request for the dead node
>    3) Cluster manager receives confirmation that the node was fenced
>    4) Cluster manager sends out dead node messages to cluster managers
>       on other nodes
>    5) Some cluster manager receives dead node message, notifies DLM
>    6) DLM receives dead node message, initiates lock recovery
That sounds a lot closer to how it should happen, IMHO.

> Step (2) is where we need plugins, where each plugin registers a fencing
> method and somehow each node becomes associated with a particular fencing method
> (setting up this association is an excellent example of a component that
> can and should be in userspace because this part never executes in the
> block IO path).  The right interface to initiate fencing is probably a
> direct (kernel-to-kernel) call, there is actually no good reason to use
> a socket interface here.
Fencing plugins, by the way, tend to do a variety of things, ranging from
direct device access, to being able to telnet or ssh into a switch. The
plugin system therefore needs to be fairly generic, to the level of
running a binary that could be written in perl, C, etc.

> However, the fencing confirmation is an asynchronous event and might as
> well come in over a socket. There are alternatives (e.g., linked list
> event queue) but the socket is most natural because the cluster manager
> already needs one to receive events from other sources.
> 
> Actually, fencing has no divine right to be a separate subsystem and is
> properly part of the cluster manager.  It's better to think of it that
> way.  As such, the cluster manager <=> fencing api is internal, there is
> no need to get into interminable discussions of how to standardize it.
Sure.

> So let's just do something really minimal that gives us a plugin
> interface and move on to harder problems. If you do eventually figure out
> how to move the whole cluster manager to userspace then you replace the
> module scheme in favor of a dso scheme.
Well, I'm wondering how we're going to support all the different fencing
methods using kernel modules ;)
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com

* [Ocfs2-devel] OCFS2 features RFC
  2006-05-17  1:44   ` Mark Fasheh
       [not found]     ` <446BBCF5.7040903@google.com>
@ 2006-05-22 17:01     ` Paul Taysom
  1 sibling, 0 replies; 38+ messages in thread
From: Paul Taysom @ 2006-05-22 17:01 UTC (permalink / raw)
  To: ocfs2-devel

 
Two major applications that use byte range locks are Open Office and
Microsoft Office (Word, Excel, ...).  They use them to coordinate
sharing a document when more than one person opens the file.  These
applications typically get a byte range lock on a single byte at a
predetermined offset, then write data into the file about who has the
file open.  This way, when someone else opens the file, they can find
out who else has the file open.  Word is of course going through Samba
to access the file system.

Paul Taysom

* [Ocfs2-devel] OCFS2 features RFC
  2006-05-20  6:11           ` Mark Fasheh
@ 2006-05-22 19:18             ` Daniel Phillips
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-22 19:18 UTC (permalink / raw)
  To: ocfs2-devel

Mark Fasheh wrote:
> On Thu, May 18, 2006 at 05:35:27PM -0700, Daniel Phillips wrote:
>>Ok, I just figured out how to be really lazy and do cluster-consistent
>>NFS locking across clustered NFS servers without doing much work.  In the
>>duh category, only one node will actually run lockd and all other NFS
>>server nodes will just port-forward the NLM traffic to/from it.  Sure,
>>you can bottleneck this scheme with a little effort, but to be honest we
>>aren't that interested in NFS locking performance, we are more interested
>>in actual file operations.
> 
> Out of curiosity, how will a failure on the lockd node be handled? Or is
> this something that you're not worried about?

Of course I'm worried about it!  Luckily, normal NFS reboot semantics can
be repurposed to provide failover.  Client lockds are notified of a server
failure via NSM/statd.  Our cluster manager invokes a failover method (this
harness yet to be designed) that activates a new lockd on some other node
and updates the NLM port forward addresses on all other nodes.  When all is
ready, the new server announces via NLM that it is up and clients retake
their locks as they would for a server reboot.
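
The bookkeeping side of that failover is small. A sketch, assuming the cluster
manager keeps a per-node table of where NLM traffic is forwarded; the function
name, the table layout and the "first survivor" election are placeholders, not
a designed harness.

```python
def fail_over_lockd(nodes, failed, forward_table):
    """Pick a new lockd host and repoint every surviving node at it.

    nodes:          ordered list of NFS server node names
    failed:         the node that was running lockd
    forward_table:  dict mapping node -> node receiving its NLM traffic
    """
    survivors = [n for n in nodes if n != failed]
    if not survivors:
        raise RuntimeError("no surviving node can host lockd")
    new_lockd = survivors[0]           # placeholder election policy
    forward_table.pop(failed, None)    # the dead node forwards nothing
    for node in survivors:
        forward_table[node] = new_lockd
    return new_lockd
```

Once the table is updated, the new lockd announces itself as if the server had
rebooted, and clients reclaim their locks through the normal recovery path.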

I don't think this part of it is new; anybody who has attempted NFS serving
from a cluster must have noticed it.  The port forwarding idea may be new;
I have not seen anybody mention it out there.

>>>call_usermodehelper()?
>>Like the Red Hat framework?  Ahem.  Maybe not.  For one thing, they never
>>even got close to figuring out how to avoid memory deadlock.  For another,
>>it's a rambling bloated pig with lots of bogus factoring.  Honestly, what
>>you have now is a much better starting point,
> 
> Well, I should've said "multiple existing frameworks" - so people could run
> whatever fits their needs best and pick the feature sets that suit them.
> Besides, I think you're being somewhat unfair to the Red Hat framework. It
> does _a lot_ more than the OCFS2 stack can even dream of handling right now.

In a reasonable way?  I think not.  The only bit you might lust after is the
range locking, and that was never tested to any great extent.  I still think
you have a better, more sensible base to work from, and what's more, it's
attached to a relatively stable, in-tree cluster filesystem.

Curious... have you tried the Red Hat cluster stack?  Which version(s)?

> And we haven't even talked about Linux-HA yet.

And we should, briefly.  Linux-HA looks great to me but it can't be directly
used by OCFS2 because it is in userspace with no thought at all invested
in dealing with memory deadlock.  You might be able to interface with
Linux-HA one day in order to unify the handling of membership and failover,
however I doubt that the easiest path there is to try to fix up the Linux-HA
internals to avoid memory pitfalls.  Much better to fix your much smaller
in-kernel framework, and then evolve it in the direction of interfacing to
Linux-HA.  Note that fencing, membership, heartbeat and failover all lie in
the block IO path, so they all have to obey rigorous rules that Linux-HA
knows nothing about.  What has to be done here is adapt Linux-HA's structure
to expose the OCFS2 implementation, so for example Linux-HA would not
directly send heartbeats, but would receive your stack's up/down messages.

But this is getting way ahead of things.  First, OCFS2 needs to establish
itself as a filesystem, before projects like Linux-HA can look at how to do
the grand unification.

>>No, the filesystem never calls fencing, only the cluster manager does.
>>As I understand it, what happens is:
>>
>>   1) Somebody (heartbeat) reports a dead node to cluster manager
>>   2) Cluster manager issues a fence request for the dead node
>>   3) Cluster manager receives confirmation that the node was fenced
>>   4) Cluster manager sends out dead node messages to cluster managers
>>      on other nodes
>>   5) Some cluster manager receives dead node message, notifies DLM
>>   6) DLM receives dead node message, initiates lock recovery
> 
> That sounds a lot closer to how it should happen, IMHO.
> 
> Fencing plugins, by the way, tend to do a variety of things, ranging from
> direct device access, to being able to telnet or ssh into a switch. The
> plugin system therefore needs to be fairly generic, to the level of
> running a binary that could be written in perl, C, etc.

Then you would implement a kernel fencing method that interfaces to user
space, and cross your fingers.  Fencing lies in the block IO path so it
has to obey anti-memory deadlock rules.  Perl and bash certainly will not,
so if somebody insists on writing their fence scripts that way, then they
will need to run them on a separate node that does not mount the OCFS2
filesystem, or inside a resource sandbox, for example a UML instance that
has all its resources pre-allocated.  By the time you have done all the
setup required for that, you would have gotten the job done faster and
better by rewriting the script in C.  Then you still have to do memlocking,
and run syscalls like connect in PF_MEMALLOC mode, but you would need that
for the UML sandbox anyway, with rather more work to do to audit all the
call paths.

The practical approach is to do kernel implementations of the fencing
methods that can be implemented there (including mine!) and offload any
messy userspace ones to a non-filesystem node.

>>So let's just do something really minimal that gives us a plugin
>>interface and move on to harder problems. If you do eventually figure out
>>how to move the whole cluster manager to userspace then you replace the
>>module scheme in favor of a dso scheme.
> 
> Well, I'm wondering how we're going to support all the different fencing
> methods using kernel modules ;)

Choose your poison:

   1) A kernel fencing method sends messages to a dedicated fencing node
   that does not mount the filesystem.  This may waste a node and needs some
   additional mechanism to avoid becoming a single point of failure.

   2) A kernel fencing method sends messages to a userspace program written
   in C, memlocked, and running in (slight kernel hack here) PF_MEMALLOC mode.
   This might require a little more work than a Perl script, but then real men
   enjoy work.

   3) A kernel fencing method sends messages to a userspace program running
   in a resource sandbox (e.g. UML or XEN) that does whatever it wants to.
   This is really buzzword compatible, really wasteful, and a great use of
   administration time.

   4) You may find that you can implement in-kernel all of the fencing modules
   you need more easily and better than any of the above.  This is the case with me.

The thing we can't do is go on pretending that we can just shell to bash and
run anything we want.  That way lies deadlock.

Regards,

Daniel

Thread overview: 38+ messages
2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
2006-04-25 21:55 ` Christoph Hellwig
2006-04-25 22:24   ` Mark Fasheh
2006-04-26 16:50   ` Daniel Phillips
2006-04-26  4:11 ` Andi Kleen
2006-04-26 18:06   ` Mark Fasheh
2006-04-26 18:08     ` Andi Kleen
2006-04-26 18:34       ` Daniel Phillips
2006-04-27 20:25 ` Paul Taysom
2006-05-03 23:04 ` [Ocfs2-devel] OCFS2 features RFC - separate journal? Daniel Phillips
2006-05-04  0:29   ` Zach Brown
2006-05-04  0:46     ` Daniel Phillips
2006-05-04 20:56       ` Zach Brown
2006-05-04 20:59         ` Wim Coekaerts
2006-05-04 22:23         ` Daniel Phillips
2006-05-04 22:30           ` Mark Fasheh
2006-05-05  3:05             ` Daniel Phillips
2006-05-05 18:25               ` Mark Fasheh
2006-05-06  3:09                 ` Daniel Phillips
2006-05-05 17:12             ` Paul Taysom
2006-05-05 18:06               ` Daniel Phillips
2006-05-05 18:57               ` Sunil Mushran
2006-05-08 14:28             ` Paul Taysom
2006-05-08 17:43               ` Daniel Phillips
2006-05-08 18:00             ` Paul Taysom
2006-05-08 18:22               ` Daniel Phillips
2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
2006-05-11 20:40   ` Paul Taysom
2006-05-11 20:55     ` Joel Becker
2006-05-11 21:16   ` Daniel Phillips
2006-05-17  1:44   ` Mark Fasheh
     [not found]     ` <446BBCF5.7040903@google.com>
     [not found]       ` <20060518024638.GY21588@ca-server1.us.oracle.com>
2006-05-19  0:35         ` Daniel Phillips
2006-05-19 15:16           ` J. Bruce Fields
2006-05-20  6:11           ` Mark Fasheh
2006-05-22 19:18             ` Daniel Phillips
2006-05-22 17:01     ` Paul Taysom
  -- strict thread matches above, loose matches on Subject: below --
2006-05-02 18:22 [Ocfs2-devel] OCFS2 Features RFC Brian Long
2006-05-02 20:29 ` Sunil Mushran