* [Ocfs2-devel] OCFS2 features RFC
@ 2006-04-25 18:35 Mark Fasheh
2006-04-25 21:55 ` Christoph Hellwig
` (4 more replies)
0 siblings, 5 replies; 38+ messages in thread
From: Mark Fasheh @ 2006-04-25 18:35 UTC (permalink / raw)
To: ocfs2-devel
The OCFS2 team is in the preliminary stages of planning major features for
our next cycle of development. The goal of this e-mail then is to stimulate
some discussion as to how features should be prioritized going forward. Some
disclaimers apply:
* The following list is very preliminary and is sure to change.
* I've probably missed some things.
* Development priorities within Oracle can be influenced but are ultimately
up to management. That's not stopping anyone from contributing though, and
patches are always welcome.
So I'll start with changes that can be completely contained within the file
system (no cluster stack changes needed):
-Sparse file support: Self explanatory. We need this for various reasons
including performance, correctness and space usage.
-Htree support
-Extended attributes: This might be another area where we
steal^H^H^H^H^Hcopy some good code from Ext3 :) On top of this one can
trivially implement posix acls. We're not likely to support EA block
sharing though as it becomes difficult to manage across the cluster.
-Removal of the vote mechanism: The most trivial dentry type network votes
can go quite easily by replacing them with a cluster lock. This is critical
in speeding up unlink and rename operations in the cluster. The remaining
votes (mount, unmount, delete_inode) look like they'll require cluster
stack adjustments.
-Data in inode blocks: Should speed up local node data operations with small
files significantly.
-Shared writeable mmap: This looks like it might require changes to the
kernel (outside of OCFS2). We need to investigate further...
Now on to file system features which require cluster stack changes. I'll
have a lot more to say about the cluster stack in a bit, but it's worth
listing these out here for completeness.
-Cluster consistent Flock / Lockf
-Online file system resize
-Removal of remaining FS votes: If we can get rid of the delete_inode vote,
I don't believe we'll need the mount / umount ones anymore (and if we still
do, then a proper group services could handle that)
-Allow the file system to go "hard read only" when it loses its connection
to the disk, rather than the kernel panic we have today. This allows
applications using the file system to gracefully shut down. Other
applications on the system continue unharmed. "Hard read only" in the OCFS2
context means that the RO node does not look mounted to the other nodes on
that file system. Absolutely no disk writes are allowed. File data and
meta data can be stale or otherwise invalid. We never want to return
invalid data to userspace, so file reads return -EIO.
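The "hard read only" behaviour described in the last item can be illustrated
with a toy model. This is only a sketch under stated assumptions: the class
and method names are hypothetical, the volume is an in-memory dict, and the
use of EROFS for rejected writes is an assumption (the text above only
specifies that reads return -EIO and that no disk writes are allowed).

```python
import errno


class HardReadOnlyVolume:
    """Toy model of a mounted volume that can drop to 'hard read only'."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block number -> bytes
        self.hard_ro = False        # set once the disk connection is lost

    def lose_disk(self):
        # Instead of panicking the kernel, flip the volume to hard read-only.
        self.hard_ro = True

    def write(self, blkno, data):
        if self.hard_ro:
            # Assumption: EROFS is used here; the source does not name the errno.
            raise OSError(errno.EROFS, "volume is hard read-only")
        self.blocks[blkno] = data

    def read(self, blkno):
        if self.hard_ro:
            # Cached data may be stale, so never hand it to userspace.
            raise OSError(errno.EIO, "data may be stale or invalid")
        return self.blocks[blkno]
```

Applications on the affected node see errors and can shut down cleanly, while
the rest of the system keeps running.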
As far as the existing cluster stack goes, currently most of the OCFS2 team
feels that the code has gone as far as it can and should go. It would
therefore be prudent to allow pluggable cluster stacks. Jeff Mahoney at
Novell has already done some integration work implementing a userspace
clustering interface. We probably want to do more in that area though.
There are several good reasons why we might want to integrate with external
cluster stacks. The most obvious is code reuse. The list of cluster stack
features we require for our next phase of development is very large (some
are listed below). There is no reason to implement those features unless
we're certain existing software doesn't provide them and can't be extended.
This will also allow a greater amount of choice for the end user. What stack
works well for one environment might not work as well for another. There's
also the fact that current resources are limited. It's enough work designing
and implementing a file system. If we can get out of the business of
maintaining a cluster stack, we should do so.
So the question then becomes, "What is it that we require of our cluster
stack going forward?"
- We'd like as much of it to be user space code as is possible and
practical.
- The node manager should support dynamic cluster topology updates,
including removing nodes from the cluster, propagating new configurations to
existing nodes, etc.
- A pluggable fencing mechanism is a priority.
- We'd like some group services implementation to handle things like
membership of a mount point, dlm domain/lockspace, etc.
- On the DLM side, we'd like things like directory based mastery, a range
locking API, and some extra LVB recovery bits.
So that's it for now. Hopefully this will spur some interesting discussion.
Please keep in mind that any of this is subject to change - cluster stack
requirements especially are things we've only recently begun discussing.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
@ 2006-04-25 21:55 ` Christoph Hellwig
2006-04-25 22:24 ` Mark Fasheh
2006-04-26 16:50 ` Daniel Phillips
2006-04-26 4:11 ` Andi Kleen
` (3 subsequent siblings)
4 siblings, 2 replies; 38+ messages in thread
From: Christoph Hellwig @ 2006-04-25 21:55 UTC (permalink / raw)
To: ocfs2-devel
On Tue, Apr 25, 2006 at 11:35:53AM -0700, Mark Fasheh wrote:
> -Htree support
Please not. htree is just the worst possible directory format around.
Do some nice hashed or btree directories, but don't try this odd hack
again. Especially as the only reason it was developed for in ext2/3
doesn't work very well in a cluster filesystem anyway - to access the
new htree all nodes would have to support the format anyway, so the
whole easy up/downgrade thing doesn't matter at all.
> -Extended attributes: This might be another area where we
> steal^H^H^H^H^Hcopy some good code from Ext3 :) On top of this one can
> trivially implement posix acls. We're not likely to support EA block
> sharing though as it becomes difficult to manage across the cluster.
Again, the ext3 implementation might not be the best. I'd say look at
jfs or xfs (in the latter case of course with a less monstrous btree
implementation)
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-04-25 21:55 ` Christoph Hellwig
@ 2006-04-25 22:24 ` Mark Fasheh
2006-04-26 16:50 ` Daniel Phillips
1 sibling, 0 replies; 38+ messages in thread
From: Mark Fasheh @ 2006-04-25 22:24 UTC (permalink / raw)
To: ocfs2-devel
On Tue, Apr 25, 2006 at 11:55:48PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 25, 2006 at 11:35:53AM -0700, Mark Fasheh wrote:
> > -Htree support
>
> Please not. htree is just the worst possible directory format around.
> Do some nice hashed or btree directories, but don't try this odd hack
> again. Especially as the only reason it was developed for in ext2/3
> doesn't work very well in a cluster filesystem anyway - to access the
> new htree all nodes would have to support the format anyway, so the
> whole easy up/downgrade thing doesn't matter at all.
Interesting. You make a good point about the up/downgrade code - we
certainly couldn't use that (at least not without jumping through some
hoops). I have to admit that I haven't looked very deeply into htree yet,
but if it's that bad and we won't be compatible in any case, it certainly
makes sense to try something new. Would you mind pointing out a few of the
htree issues that make it so poor?
>
> > -Extended attributes: This might be another area where we
> > steal^H^H^H^H^Hcopy some good code from Ext3 :) On top of this one can
> > trivially implement posix acls. We're not likely to support EA block
> > sharing though as it becomes difficult to manage across the cluster.
>
> again the ext3 implementation might not be the best. I'd say look at
> jfs or xfs (in the latter case of course with a less monsterous btree
> implementation)
I agree the XFS implementation seems a bit overboard... The problem I'm
having is that I can't seem to determine what size the average set of
extended attributes will be. Basically, as far as I can tell, ext3 will
allow about 1 block plus whatever will fit in the inode, minus overhead.
We'd like to have inlined EA but want to be able to move them out to a
block in the case that the number of extents we need grows to the end of
the inode block - this is to avoid having to create an allocation btree.
So then if we take the one-block-attached-to-the-inode approach, we'd have
a capacity a little less than ext3. I've also noticed that, while the ext3
EA entries are stored in sorted order, the search for them is linear. I
wonder if that could be improved upon (or if it even matters if you're just
limited to one block).
If one block is insufficient, then certainly we need to look at some other
format. My first inclination would be to have a single level tree with
pointers to leaf nodes stored in hashed order to speed up lookups.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
^ permalink raw reply [flat|nested] 38+ messages in thread
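As an illustration of the point about sorted EA entries being searched
linearly, the sketch below keeps the entries of one attribute block sorted by
name and looks them up with a binary search. It is a minimal model, not the
ext3 or OCFS2 on-disk format: the `EABlock` class is hypothetical, and
sorting by plain attribute name is an assumption made for the example.

```python
import bisect


class EABlock:
    """Toy extended-attribute block: entries kept sorted by name."""

    def __init__(self):
        self.names = []   # sorted attribute names
        self.values = []  # values[i] belongs to names[i]

    def set(self, name, value):
        # Find the insertion point that keeps self.names sorted.
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            self.values[i] = value          # overwrite existing attribute
        else:
            self.names.insert(i, name)      # insert new attribute in order
            self.values.insert(i, value)

    def get(self, name):
        # Binary search: O(log n) comparisons instead of a linear scan.
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            return self.values[i]
        return None
```

With only one block of entries the gain is likely small, which matches the
doubt raised in the message about whether it matters at that size.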
* [Ocfs2-devel] OCFS2 features RFC
2006-04-25 21:55 ` Christoph Hellwig
2006-04-25 22:24 ` Mark Fasheh
@ 2006-04-26 16:50 ` Daniel Phillips
1 sibling, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-04-26 16:50 UTC (permalink / raw)
To: ocfs2-devel
Christoph Hellwig wrote:
> On Tue, Apr 25, 2006 at 11:35:53AM -0700, Mark Fasheh wrote:
>>-Htree support
>
> Please not. htree is just the worst possible directory format around.
> Do some nice hashed or btree directories, but don't try this odd hack
> again.
Could you be specific about what you think is odd about it?
> Especially as the only reason it was developed for in ext2/3
> doesn't work very well in a cluster filesystem anyway
In what way?
> to access the
> new htree all nodes would have to support the format anyway, so the
> whole easy up/downgrade thing doesn't matter at all.
Good point, and this only affects the leaf node format.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
2006-04-25 21:55 ` Christoph Hellwig
@ 2006-04-26 4:11 ` Andi Kleen
2006-04-26 18:06 ` Mark Fasheh
2006-04-27 20:25 ` Paul Taysom
` (2 subsequent siblings)
4 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2006-04-26 4:11 UTC (permalink / raw)
To: ocfs2-devel
Mark Fasheh <mark.fasheh@oracle.com> writes:
>
> - We'd like as much of it to be user space code as is possible and
> practical.
Won't you get into deadlocks then when the system is low on memory?
(freeing memory might require write outs on OCFS2 and the user space
cluster might be stuck already)
Or rather if you rely on user space you would need to make sure
that the basic block write out path works without such possible
deadlocks.
-Andi
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-04-26 4:11 ` Andi Kleen
@ 2006-04-26 18:06 ` Mark Fasheh
2006-04-26 18:08 ` Andi Kleen
0 siblings, 1 reply; 38+ messages in thread
From: Mark Fasheh @ 2006-04-26 18:06 UTC (permalink / raw)
To: ocfs2-devel
On Wed, Apr 26, 2006 at 06:11:04AM +0200, Andi Kleen wrote:
> Won't you get into deadlocks then when the system is low on memory?
> (freeing memory might require write outs on OCFS2 and the user space
> cluster might be stuck already)
>
> Or rather if you rely on user space you would need to make sure
> that the basic block write out path works without such possible
> deadlocks.
The DLM certainly wouldn't be in userspace - there's also a convincing
performance argument for it being in kernel.
Primarily then I think we're worried about that in the context of something
like heartbeat. In that case, we probably want something that can do its
work within some preallocated, mlock'd area. I'm not sure (yet) how the
various stacks handle this problem, or even if they do.
I need to think about membership software. I want to say that I don't think
this would be an issue there, but I have a feeling I could concoct a case
during node recovery.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-04-26 18:06 ` Mark Fasheh
@ 2006-04-26 18:08 ` Andi Kleen
2006-04-26 18:34 ` Daniel Phillips
0 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2006-04-26 18:08 UTC (permalink / raw)
To: ocfs2-devel
On Wednesday 26 April 2006 20:06, Mark Fasheh wrote:
> On Wed, Apr 26, 2006 at 06:11:04AM +0200, Andi Kleen wrote:
> > Won't you get into deadlocks then when the system is low on memory?
> > (freeing memory might require write outs on OCFS2 and the user space
> > cluster might be stuck already)
> >
> > Or rather if you rely on user space you would need to make sure
> > that the basic block write out path works without such possible
> > deadlocks.
> The DLM certainly wouldn't be in userspace - there's also a convincing
> performance argument for it being in kernel.
>
> Primarily then I think we're worred about that in the context of something
> like heartbeat. In that case, we probably want something that can do it's
> work within some preallocated, mlock'd area.
That's not enough - it wouldn't be able to do anything that requires
memory allocation in the critical path. This includes most system calls.
> I'm not sure (yet) how the
> various stacks handle this problem
I suspect they don't.
-Andi
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-04-26 18:08 ` Andi Kleen
@ 2006-04-26 18:34 ` Daniel Phillips
0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-04-26 18:34 UTC (permalink / raw)
To: ocfs2-devel
Andi Kleen wrote:
> On Wednesday 26 April 2006 20:06, Mark Fasheh wrote:
>>On Wed, Apr 26, 2006 at 06:11:04AM +0200, Andi Kleen wrote:
>>
>>>Won't you get into deadlocks then when the system is low on memory?
>>>(freeing memory might require write outs on OCFS2 and the user space
>>>cluster might be stuck already)
>>>
>>>Or rather if you rely on user space you would need to make sure
>>>that the basic block write out path works without such possible
>>>deadlocks.
>>
>>The DLM certainly wouldn't be in userspace - there's also a convincing
>>performance argument for it being in kernel.
>>
>>Primarily then I think we're worred about that in the context of something
>>like heartbeat. In that case, we probably want something that can do it's
>>work within some preallocated, mlock'd area.
>
> That's not enough - it wouldn't be able to do anything that requires
> memory allocation in the critical path. This includes most system calls.
Indeed. In general, what we have to do is give such a userspace process
access to the PF_MEMALLOC reserve, simply by setting that flag. This
introduces a requirement to audit the task's memory usage, but this isn't
different from what we have to do in kernel anyway. So we can do this if
we want to, but it isn't clear to me why we want heartbeat in userspace.
Advantages for heartbeat in kernel:
* Easier to manage reserve memory
* No memlock requirement
* Can act on heartbeat timeout with higher precision, possibly hard
realtime precision
Disadvantages:
* Handling heartbeat timeout looks a lot like policy
* Need to invent a mechanism for communicating with userspace helpers
I am biased towards heartbeat in kernel, but the issues really need to be
talked out in detail.
The ground rule is that *everything* that can execute in the block writeout
path has to have access to reserve memory. This includes everything in the
failover path, fencing for example.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
2006-04-25 21:55 ` Christoph Hellwig
2006-04-26 4:11 ` Andi Kleen
@ 2006-04-27 20:25 ` Paul Taysom
2006-05-03 23:04 ` [Ocfs2-devel] OCFS2 features RFC - separate journal? Daniel Phillips
2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
4 siblings, 0 replies; 38+ messages in thread
From: Paul Taysom @ 2006-04-27 20:25 UTC (permalink / raw)
To: ocfs2-devel
I've done some experiments with h-trees on ext3 and have found one case
where h-trees get confused. If I create several thousand files in a single
directory and then try to remove the directory (rm -r), I get an error that
one of the files has not been removed but when I check the directory, the
file is not there. I repeat the command and the directory is removed. I
suspect the h-tree code is using the hash for the cookie for readdir and I'm
getting a hash collision. ReiserFS solves this problem by having 24 bits of
hash and 8 bits of uniqueness to resolve hash collisions.
Paul Taysom
^ permalink raw reply [flat|nested] 38+ messages in thread
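The 24-bit hash plus 8-bit uniqueness scheme mentioned above can be sketched
as follows. This is an illustration only: the function names are
hypothetical, `zlib.crc32` is a stand-in and not the hash ReiserFS actually
uses, and the in-memory counter replaces whatever on-disk bookkeeping a real
file system would need.

```python
import zlib

HASH_BITS = 24  # bits of name hash in the cookie
GEN_BITS = 8    # bits of uniqueness ("generation") to separate collisions


def name_hash(name: bytes) -> int:
    # Stand-in hash truncated to 24 bits (assumption: not the real fs hash).
    return zlib.crc32(name) & ((1 << HASH_BITS) - 1)


def assign_cookies(names, hash_fn=name_hash):
    """Return {name: cookie} where cookie = (hash << 8) | generation."""
    next_gen = {}  # hash value -> next unused generation number
    cookies = {}
    for name in names:
        h = hash_fn(name)
        gen = next_gen.get(h, 0)
        if gen >= (1 << GEN_BITS):
            raise OverflowError("more than 256 names share one hash value")
        next_gen[h] = gen + 1
        cookies[name] = (h << GEN_BITS) | gen
    return cookies
```

Two names with the same hash receive different generation numbers, so a
readdir position expressed as a cookie always identifies exactly one entry,
up to the limit of 256 colliding names per hash value.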
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
` (2 preceding siblings ...)
2006-04-27 20:25 ` Paul Taysom
@ 2006-05-03 23:04 ` Daniel Phillips
2006-05-04 0:29 ` Zach Brown
2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
4 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2006-05-03 23:04 UTC (permalink / raw)
To: ocfs2-devel
Mark Fasheh wrote:
> The OCFS2 team is in the preliminary stages of planning major features for
> our next cycle of development. The goal of this e-mail then is to stimulate
> some discussion as to how features should be prioritized going forward. Some
> disclaimers apply:
Hi guys,
Sorry about the lag. Here's an easy feature nobody has mentioned so far, and
from my reading isn't supported: separate journal, like Ext3. The journals
stay per-node, but they can be on a different (shared) volume than the
filesystem proper. This should be dead simple to do and it can make a huge
difference to write latency, by putting the journals on separate spindles or
(what I actually have in mind) in NVRAM.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-03 23:04 ` [Ocfs2-devel] OCFS2 features RFC - separate journal? Daniel Phillips
@ 2006-05-04 0:29 ` Zach Brown
2006-05-04 0:46 ` Daniel Phillips
0 siblings, 1 reply; 38+ messages in thread
From: Zach Brown @ 2006-05-04 0:29 UTC (permalink / raw)
To: ocfs2-devel
Daniel Phillips wrote:
> Sorry about the lag. Here's an easy feature nobody has mentioned so far, and
> from my reading isn't supported: separate journal, like Ext3.
Yeah, I think this would be a fine piece to have some day.
I'm not sure it's a high priority, though, given that the vast majority
of deployments are already using hardware that has either some form of
write caching or so many spindles that external journals just aren't
worth the time they take to configure.
I'd be interested in seeing more careful write ordering in JBD before
worrying about external journals, personally.
- z
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-04 0:29 ` Zach Brown
@ 2006-05-04 0:46 ` Daniel Phillips
2006-05-04 20:56 ` Zach Brown
0 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2006-05-04 0:46 UTC (permalink / raw)
To: ocfs2-devel
Zach Brown wrote:
> Daniel Phillips wrote:
>>Sorry about the lag. Here's an easy feature nobody has mentioned so far, and
>>from my reading isn't supported: separate journal, like Ext3.
>
> Yeah, I think this would be a fine piece to have some day.
Ext3 has it today.
> I'm not sure it's a high priority, though, given that the vast majority
> of deployments are already using hardware that has either some form of
> write caching or so many spindles that external journals just aren't
> worth the time they take to configure.
The journal has different, less demanding mirroring requirements than the
filesystem proper. It is unnecessary and redundant to have a dirty map for
the journal mirror. It is also unnecessary and stupid to snapshot the
journal. These two things add up to a _huge_ performance boost for the
journal, if it can be separated.
It is worth remembering that not every OCFS2 user will be running it on a
big expensive SAN. Probably not even the majority.
> I'd be interested in seeing more careful write ordering in JBD before
> worrying about external journals, personally.
IMHO, the separate journal on NVRAM will yield a much bigger gain and be
much less work besides. Agreed that improvements to JBD are good. They
are also scary.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-04 0:46 ` Daniel Phillips
@ 2006-05-04 20:56 ` Zach Brown
2006-05-04 20:59 ` Wim Coekaerts
2006-05-04 22:23 ` Daniel Phillips
0 siblings, 2 replies; 38+ messages in thread
From: Zach Brown @ 2006-05-04 20:56 UTC (permalink / raw)
To: ocfs2-devel
> journal. These two things add up to a _huge_ performance boost for the
> journal, if it can be separated.
Sure, I don't doubt the high level theory. Does anyone have numbers to
show its relative effect in practice? That'd be interesting.
> It is worth remembering that not every OCFS2 user will be running it on a
> big expensive SAN. Probably not even the majority.
Well, that's debatable. My only point, though, is that there are higher
priority things that we should get to first because they affect *everyone*.
If the lack of external journals makes you sad, well, I'm sorry to hear
that. We certainly wouldn't turn away patches if someone got to it
before us.
> IMHO, the separate journal on NVRAM will yield a much bigger gain and be
> much less work besides.
So noted. I'm curious, though. What sort of NVRAM hardware do you have
in mind?
- z
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-04 20:56 ` Zach Brown
@ 2006-05-04 20:59 ` Wim Coekaerts
2006-05-04 22:23 ` Daniel Phillips
1 sibling, 0 replies; 38+ messages in thread
From: Wim Coekaerts @ 2006-05-04 20:59 UTC (permalink / raw)
To: ocfs2-devel
>
> So noted. I'm curious, though. What sort of NVRAM hardware do you have
> in mind?
>
and can be shared across nodes so that you can do recovery ;-) I think
that's a reasonable requirement to have ;)
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-04 20:56 ` Zach Brown
2006-05-04 20:59 ` Wim Coekaerts
@ 2006-05-04 22:23 ` Daniel Phillips
2006-05-04 22:30 ` Mark Fasheh
1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2006-05-04 22:23 UTC (permalink / raw)
To: ocfs2-devel
Zach Brown wrote:
>>journal. These two things add up to a _huge_ performance boost for the
>>journal, if it can be separated.
>
> Sure, I don't doubt the high level theory. Does anyone have numbers to
> show it's relative effect in practice? That'd be interesting.
I will have Ext3 numbers pretty soon.
>>It is worth remembering that not every OCFS2 user will be running it on a
>>big expensive SAN. Probably not even the majority.
>
> Well, that's debatable. My only point, though, is that there are higher
> priority things that we should get to first because they affect *everyone*.
By all means, prioritize them. Did separate journals make the list, even
if well towards the end?
> If the lack of external journals makes you sad, well, I'm sorry to hear
> that. We certainly wouldn't turn away patches if someone got to it
> before us.
By proposing a feature I do not also implicitly propose that Oracle
employees have to do the work. This is a quick hack after all, I would be
happy to contribute.
>>IMHO, the separate journal on NVRAM will yield a much bigger gain and be
>>much less work besides.
>
> So noted. I'm curious, though. What sort of NVRAM hardware do you have
> in mind?
For the moment, iRAM cards. Yes I know they suck for throughput, but there
are faster SATA NVRAM cards coming down the pipe, they are still much faster
than IDE disks, and there is no seeking.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-04 22:23 ` Daniel Phillips
@ 2006-05-04 22:30 ` Mark Fasheh
2006-05-05 3:05 ` Daniel Phillips
` (3 more replies)
0 siblings, 4 replies; 38+ messages in thread
From: Mark Fasheh @ 2006-05-04 22:30 UTC (permalink / raw)
To: ocfs2-devel
On Thu, May 04, 2006 at 03:23:47PM -0700, Daniel Phillips wrote:
> >>It is worth remembering that not every OCFS2 user will be running it on a
> >>big expensive SAN. Probably not even the majority.
> >
> >Well, that's debatable. My only point, though, is that there are higher
> >priority things that we should get to first because they affect *everyone*.
>
> By all means, prioritize them. Did separate journals make the list, even
> if well towards the end?
It's on the list now.
>
> >If the lack of external journals makes you sad, well, I'm sorry to hear
> >that. We certainly wouldn't turn away patches if someone got to it
> >before us.
>
> By proposing a feature that I do not also implicitly propose that Oracle
> employees have to do the work. This is a quick hack after all, I would be
> happy to contribute.
By all means. It should be a fairly straightforward change.
Out of curiosity, are we talking about a single journal device (all slot
journals on one disk) or one device per journal?
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal? 2006-05-04 22:30 ` Mark Fasheh @ 2006-05-05 3:05 ` Daniel Phillips 2006-05-05 18:25 ` Mark Fasheh 2006-05-05 17:12 ` Paul Taysom ` (2 subsequent siblings) 3 siblings, 1 reply; 38+ messages in thread From: Daniel Phillips @ 2006-05-05 3:05 UTC (permalink / raw) To: ocfs2-devel Mark Fasheh wrote: > Out of curiousity, are we talking about a single journal device (all slot > journals on one disk) or one device per journal? Hi Mark, For me, all journals on one disk, but that is just what I want for my one particular project. The user should be able to specify slot by slot which device the journal is on, if it is not on the main volume. This is just the logical extension of the Ext3 scheme. I don't see that there is anything to be gained by requiring the user to specify a different device for each journal since the user tools already have to handle the case where all the journals are on the same device. The configuration I am most interested at the moment has two nodes, each of which exports one NVRAM disk and one normal disk to the other. The NVRAM disks form a mirror with two journals on it. The normal disks likewise form a mirror with the OCFS2 fs proper on it. The latter volume needs to be snapshotted and its mirror needs a dirty map. The dirty map will live on the (NVRAM) journal volume. See how big a deal it is to be able to factor out the journals like that? As I mentioned earlier, the journals don't need to be snapshotted and the mirror doesn't need a dirty map, which is a really big help considering that typical write latency is determined by the journal, and the latency of a snapshoted, mirrored device with a persistent dirty map can get really high. 
A picture:

                 Node0  <---- GigE cable ---->  Node1

    NVRAM:   Slot0 Journal              Mirror of Slot0 Journal
             Slot1 Journal              Mirror of Slot1 Journal
             HDISK Dirty Map            Mirror of HDISK Dirty Map

    HDISK:   OCFS2 FS proper            Mirror of OCFS2 FS proper
             OCFS2 FS Snapshot Store    Mirror of OCFS2 FS Snapshot Store

As a side note, separate journals will allow the user to be much less
conservative about setting the number of slots.

Regards,

Daniel
* [Ocfs2-devel] OCFS2 features RFC - separate journal? 2006-05-05 3:05 ` Daniel Phillips @ 2006-05-05 18:25 ` Mark Fasheh 2006-05-06 3:09 ` Daniel Phillips 0 siblings, 1 reply; 38+ messages in thread From: Mark Fasheh @ 2006-05-05 18:25 UTC (permalink / raw) To: ocfs2-devel Hi Daniel, On Thu, May 04, 2006 at 08:05:16PM -0700, Daniel Phillips wrote: > The user should be able to specify slot by slot which > device the journal is on, if it is not on the main volume. This is just > the logical extension of the Ext3 scheme. To be honest, that sounds a little bit like overkill to me. For example, I was imagining that the user could create a seperate, rootless file system on the journal device - similar to how we do heartbeat only file systems. The normal file system would have the journal file system UUID stored in it's superblock. This way mount.ocfs2 could find the proper disk on the system and pass it along to the file system. If we had multiple possible journal devices, it would at least mean a much larget set of UUID's to store, necessitating a seperate area on disk for them. I'm sure there are other implications as well. > The configuration I am most interested at the moment has two nodes, each > of which exports one NVRAM disk and one normal disk to the other. The > NVRAM disks form a mirror with two journals on it. The normal disks > likewise form a mirror with the OCFS2 fs proper on it. The latter > volume needs to be snapshotted and its mirror needs a dirty map. The > dirty map will live on the (NVRAM) journal volume. See how big a deal > it is to be able to factor out the journals like that? As I mentioned > earlier, the journals don't need to be snapshotted and the mirror > doesn't need a dirty map, which is a really big help considering that > typical write latency is determined by the journal, and the latency of > a snapshoted, mirrored device with a persistent dirty map can get really > high. Thanks for explaining your proposed setup. 
What are you using to mirror the devices?
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
* [Ocfs2-devel] OCFS2 features RFC - separate journal? 2006-05-05 18:25 ` Mark Fasheh @ 2006-05-06 3:09 ` Daniel Phillips 0 siblings, 0 replies; 38+ messages in thread From: Daniel Phillips @ 2006-05-06 3:09 UTC (permalink / raw) To: ocfs2-devel Mark Fasheh wrote: > Hi Daniel, > On Thu, May 04, 2006 at 08:05:16PM -0700, Daniel Phillips wrote: >>The user should be able to specify slot by slot which >>device the journal is on, if it is not on the main volume. This is just >>the logical extension of the Ext3 scheme. > > To be honest, that sounds a little bit like overkill to me. > > For example, I was imagining that the user could create a seperate, rootless > file system on the journal device - similar to how we do heartbeat only file > systems. The normal file system would have the journal file system UUID > stored in it's superblock. This way mount.ocfs2 could find the proper disk > on the system and pass it along to the file system. If we had multiple > possible journal devices, it would at least mean a much larget set of UUID's > to store, necessitating a seperate area on disk for them. I'm sure there are > other implications as well. Hi Mark, Why do you want to wrap the separate journals in a filesystem instead of just being devices? > Thanks for explaining your proposed setup. What are you using to mirror the > devices? DDRaid over NBD or iSCSI, probably NBD (which leads the performance race at the moment). Regards, Daniel ^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-04 22:30 ` Mark Fasheh
2006-05-05 3:05 ` Daniel Phillips
@ 2006-05-05 17:12 ` Paul Taysom
2006-05-05 18:06 ` Daniel Phillips
2006-05-05 18:57 ` Sunil Mushran
2006-05-08 14:28 ` Paul Taysom
2006-05-08 18:00 ` Paul Taysom
3 siblings, 2 replies; 38+ messages in thread
From: Paul Taysom @ 2006-05-05 17:12 UTC (permalink / raw)
To: ocfs2-devel

The performance you might gain from a separate journaling device will
be very dependent on exactly how the journal is done. On NSS, the
Netware journaled file system, we ran experiments with the journal
turned off (just didn't do the write) and found it had little impact on
benchmarks like NetBench. Part of the reason for this is that the
journal writes were asynchronous to the main flow of the system. Normal
operations would normally not ever wait for journal writes.

Paul
* [Ocfs2-devel] OCFS2 features RFC - separate journal? 2006-05-05 17:12 ` Paul Taysom @ 2006-05-05 18:06 ` Daniel Phillips 2006-05-05 18:57 ` Sunil Mushran 1 sibling, 0 replies; 38+ messages in thread From: Daniel Phillips @ 2006-05-05 18:06 UTC (permalink / raw) To: ocfs2-devel Paul Taysom wrote: > The performance you might gain from a separate journaling device will > be very dependent on exactly how the journal is done. On NSS, the > Netware journaled file system, we ran experiments with the journal > turned off (just didn't do the write) and found it had little impact on > benchmarks like NetBench. Part of the reason for this is that the > journal writes were asynchronous to the main flow of the system. Normal > operations would normally not ever wait for journal writes. That is one load. Did you try NFS with synchronous mount? Regards, Daniel ^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-05 17:12 ` Paul Taysom
2006-05-05 18:06 ` Daniel Phillips
@ 2006-05-05 18:57 ` Sunil Mushran
1 sibling, 0 replies; 38+ messages in thread
From: Sunil Mushran @ 2006-05-05 18:57 UTC (permalink / raw)
To: ocfs2-devel

jbd is also asynch. That's not the issue. The issue is more the size
of the journal. The larger the journal, the less need there is to flush
it. In ocfs2, as each slot has a separate journal, there is a desire
to limit the journal size so as to make more space available to actual
data. Also, as the fs is clustered, flushes could be triggered by other
nodes.

So, having a separate device makes sense. It adds complexity to the
configuration, but that is to be expected. ;)

Paul Taysom wrote:
> The performance you might gain from a separate journaling device will
> be very dependent on exactly how the journal is done. On NSS, the
> Netware journaled file system, we ran experiments with the journal
> turned off (just didn't do the write) and found it had little impact on
> benchmarks like NetBench. Part of the reason for this is that the
> journal writes were asynchronous to the main flow of the system. Normal
> operations would normally not ever wait for journal writes.
>
> Paul
* [Ocfs2-devel] OCFS2 features RFC - separate journal?
2006-05-04 22:30 ` Mark Fasheh
2006-05-05 3:05 ` Daniel Phillips
2006-05-05 17:12 ` Paul Taysom
@ 2006-05-08 14:28 ` Paul Taysom
2006-05-08 17:43 ` Daniel Phillips
2006-05-08 18:00 ` Paul Taysom
3 siblings, 1 reply; 38+ messages in thread
From: Paul Taysom @ 2006-05-08 14:28 UTC (permalink / raw)
To: ocfs2-devel

If I was worried about NFS performance, I'd rather use NVRAM as an
immediate reply disk drive.

Paul

>That is one load. Did you try NFS with synchronous mount?
>Regards,
>Daniel
* [Ocfs2-devel] OCFS2 features RFC - separate journal? 2006-05-08 14:28 ` Paul Taysom @ 2006-05-08 17:43 ` Daniel Phillips 0 siblings, 0 replies; 38+ messages in thread From: Daniel Phillips @ 2006-05-08 17:43 UTC (permalink / raw) To: ocfs2-devel Paul Taysom wrote: > If I was worried about NFS performance, I'd rather use NVRAM as an > immediate reply disk drive. What makes you think that that is any faster than just having a fast journal on the filesystem? It is certainly messier and adds two more data copies. Plus it only helps NFS, what if there are other servers on the node? And how do you maintain cache consistency with the data written to the NFS reply journal when it has been acknowledged but is not actually in the filesystem? On a snapshot, the NFS reply journal would be one more thing that needs to be flushed, this is one more thing needing administration attention. How much latency do you think is saved by a dedicated reply journal vs a fast filesystem journal? I doubt it is as much as you suppose, it is on the order of microseconds per write and the reply journal will eventually have to pay double for that anyway. Also, somebody has to implement your NFS reply journal, further messing up knfsd. I am having a hard time seeing what is good about a dedicated NFS reply journal. Regards, Daniel ^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal? 2006-05-04 22:30 ` Mark Fasheh ` (2 preceding siblings ...) 2006-05-08 14:28 ` Paul Taysom @ 2006-05-08 18:00 ` Paul Taysom 2006-05-08 18:22 ` Daniel Phillips 3 siblings, 1 reply; 38+ messages in thread From: Paul Taysom @ 2006-05-08 18:00 UTC (permalink / raw) To: ocfs2-devel Network Appliance has been very successful with exactly this architecture. Paul >>> Daniel Phillips <phillips@google.com> 05/08/06 11:43 am >>> Paul Taysom wrote: > If I was worried about NFS performance, I'd rather use NVRAM as an > immediate reply disk drive. What makes you think that that is any faster than just having a fast journal on the filesystem? It is certainly messier and adds two more data copies. Plus it only helps NFS, what if there are other servers on the node? And how do you maintain cache consistency with the data written to the NFS reply journal when it has been acknowledged but is not actually in the filesystem? On a snapshot, the NFS reply journal would be one more thing that needs to be flushed, this is one more thing needing administration attention. How much latency do you think is saved by a dedicated reply journal vs a fast filesystem journal? I doubt it is as much as you suppose, it is on the order of microseconds per write and the reply journal will eventually have to pay double for that anyway. Also, somebody has to implement your NFS reply journal, further messing up knfsd. I am having a hard time seeing what is good about a dedicated NFS reply journal. Regards, Daniel ^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC - separate journal? 2006-05-08 18:00 ` Paul Taysom @ 2006-05-08 18:22 ` Daniel Phillips 0 siblings, 0 replies; 38+ messages in thread From: Daniel Phillips @ 2006-05-08 18:22 UTC (permalink / raw) To: ocfs2-devel Paul Taysom wrote: > Network Appliance has been very successful with exactly this > architecture. > Paul Perhaps alternative architectures exist that are just as good, if not better? Regards, Daniel >>>>Daniel Phillips <phillips@google.com> 05/08/06 11:43 am >>> > > Paul Taysom wrote: > >>If I was worried about NFS performance, I'd rather use NVRAM as an >>immediate reply disk drive. > > > What makes you think that that is any faster than just having a fast > journal on the filesystem? It is certainly messier and adds two more > data copies. Plus it only helps NFS, what if there are other servers > on the node? And how do you maintain cache consistency with the data > written to the NFS reply journal when it has been acknowledged but is > not actually in the filesystem? > > On a snapshot, the NFS reply journal would be one more thing that > needs to be flushed, this is one more thing needing administration > attention. > > How much latency do you think is saved by a dedicated reply journal vs > a fast filesystem journal? I doubt it is as much as you suppose, it > is on the order of microseconds per write and the reply journal will > eventually have to pay double for that anyway. > > Also, somebody has to implement your NFS reply journal, further > messing > up knfsd. I am having a hard time seeing what is good about a > dedicated NFS reply journal. > > Regards, > > Daniel > ^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
` (3 preceding siblings ...)
2006-05-03 23:04 ` [Ocfs2-devel] OCFS2 features RFC - separate journal? Daniel Phillips
@ 2006-05-11 20:04 ` Jeff Mahoney
2006-05-11 20:40 ` Paul Taysom
` (2 more replies)
4 siblings, 3 replies; 38+ messages in thread
From: Jeff Mahoney @ 2006-05-11 20:04 UTC (permalink / raw)
To: ocfs2-devel

Mark Fasheh wrote:
> The OCFS2 team is in the preliminary stages of planning major features for
> our next cycle of development. The goal of this e-mail then is to stimulate
> some discussion as to how features should be prioritized going forward. Some
> disclaimers apply:
>
> * The following list is very preliminary and is sure to change.
>
> * I've probably missed some things.
>
> * Development priorities within Oracle can be influenced but are ultimately
> up to management. That's not stopping anyone from contributing though, and
> patches are always welcome.
>

While performance enhancements are always welcome, the two big features
we'd like to see in future OCFS2 releases are features that will make
using OCFS2 more transparent and more like a "local" file system. The
features we want are cluster wide lockf/flock and shared writable mmap.

From a data integrity perspective, it shouldn't make a difference to an
application whether competing reader/writers are on the same node or a
different node. If standard locking primitives are already in use by the
application, they should "just work" if the competing process is on
another node.

> So I'll start with changes that can be completely contained within the file
> system (no cluster stack changes needed):
>
> -Sparse file support: Self explanatory. We need this for various reasons
> including performance, correctness and space usage.

I think we all want this one. Once upon a time, ReiserFS didn't support
sparse files and it made doing things that expected sparse files an
exercise in torture.

> -Htree support

Hashed directories in some form, but I think the comments against ext3
style h-trees are valid.

> Now on to file system features which require cluster stack changes. I'll
> have alot more to say about the cluster stack in a bit, but it's worth
> listing these out here for completeness.

> -Online file system resize

This would be nice, and I think easily done in the same manner ext3
does. Anything outside the file system's current view of the block
device can be initialized in userspace, and the last block group,
bitmaps, and superblock would be adjusted by an ioctl in kernelspace.

> -Allow the file system to go "hard read only" when it loses it's connection
> to the disk, rather than the kernel panic we have today. This allows
> applications using the file system to gracefully shut down. Other
> applications on the system continue unharmed. "Hard read only" in the OCFS2
> context means that the RO node does not look mounted to the other nodes on
> that file system. Absolutely no disk writes are allowed. File data and
> meta data can be stale or otherwise invalid. We never want to return
> invalid data to userspace, so file reads return -EIO.

This is a big one as well. If a node knows to fence itself, it can put
itself in an error state as well. fence={panic,ro} would be a decent start.

> As far as the existing cluster stack goes, currently most of the OCFS2 team
> feels that the code has gone as far as it can and should go. It would
> therefore be prudent to allow pluggable cluster stacks. Jeff Mahoney at
> Novell has already done some integration work implementing a userspace
> clustering interface. We probably want to do more in that area though.
>
> There are several good reasons why we might want to integrate with external
> cluster stacks. The most obvious is code reuse. The list of cluster stack
> features we require for our next phase of development is very large (some
> are listed below). There is no reason to implement those features unless
> we're certain existing software doesn't provide them and can't be extended.
> This will also allow a greater amount of choice for the end user. What stack
> works well for one environment might not work as well for another. There's
> also the fact that current resources are limited. It's enough work designing
> and implementing a file system. If we can get out of the business of
> maintaining a cluster stack, we should do so.
>
> So the question then becomes, "What is it that we require of our cluster
> stack going forward?"
>
> - We'd like as much of it to be user space code as is possible and
> practical.

The heartbeat project does a pretty good job on the userspace end, but
as Andi pointed out, it has the usual shortcomings of anything in
userspace involved with writing data inside the kernel. It is prone to
deadlocks and we could miss node topology events.

-Jeff

--
Jeff Mahoney
SUSE Labs
* [Ocfs2-devel] OCFS2 features RFC
2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
@ 2006-05-11 20:40 ` Paul Taysom
2006-05-11 20:55 ` Joel Becker
2006-05-11 21:16 ` Daniel Phillips
2006-05-17 1:44 ` Mark Fasheh
2 siblings, 1 reply; 38+ messages in thread
From: Paul Taysom @ 2006-05-11 20:40 UTC (permalink / raw)
To: ocfs2-devel

What makes the online file system resize tricky is updating all the
allocation chains. The last block of each of the existing chains needs
to be updated to point to the new blocks in the chains.

Would it be possible to get rid of chains and just compute the next
block in the chain?

Paul

>>Online file system resize

>This would be nice, and I think easily done in the same manner ext3
>does. Anything outside the file system's current view of the block
>device can be initialized in userspace, and the last block group,
>bitmaps, and superblock would be adjusted by an ioctl in kernelspace.

>>> Jeff Mahoney <jeffm@suse.com> 05/11/06 2:04 pm >>>
* [Ocfs2-devel] OCFS2 features RFC
2006-05-11 20:40 ` Paul Taysom
@ 2006-05-11 20:55 ` Joel Becker
0 siblings, 0 replies; 38+ messages in thread
From: Joel Becker @ 2006-05-11 20:55 UTC (permalink / raw)
To: ocfs2-devel

On Thu, May 11, 2006 at 02:40:46PM -0600, Paul Taysom wrote:
> What makes the online file system resize tricky is updating all the
> allocation chains. The last block of each of the existing chains needs
> to be updated to point to the new blocks in the chains.

Nope, not tricky at all. As clean new allocation groups, we just
insert them at the front of the chains. The new chain has a pointer to
the existing chains initialized by userspace. The chain allocator inode
has its chain pointers moved to the new group as a single write during
the in-kernel update. Only one block write to update all chains in the
inode.

> Would it be possible to get rid of chains and just compute the next
> block in the chain?

You can't mathematically compute the next group for anything
other than the cluster allocator. In addition, the chain-reorder logic
allows fewer reads to find a relatively empty chain, an optimization
we'd lose.

Joel

--
"One of the symptoms of an approaching nervous breakdown is the
belief that one's work is terribly important."
- Bertrand Russell

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
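The front-insertion scheme Joel describes can be sketched in a few lines of Python. This is only a toy in-memory model under stated assumptions, not OCFS2 code: `Group`, `ChainAllocatorInode`, `prepare_new_groups` and the round-robin assignment of new groups to chains are all hypothetical names and choices made for illustration. The point it tries to show is that the new groups are linked to the old chain heads first, and the allocator inode's head pointers are then swapped in one write, so no existing group is touched.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Group:
    """One allocation group; `next` links it to the following group in its chain."""
    name: str
    next: Optional["Group"] = None


@dataclass
class ChainAllocatorInode:
    """Hypothetical stand-in for the chain allocator inode: one head per chain."""
    heads: List[Optional[Group]]
    block_writes: int = 0

    def commit_heads(self, new_heads: List[Optional[Group]]) -> None:
        # All head pointers are modeled as living in one inode block,
        # so replacing every head counts as a single block write.
        self.heads = list(new_heads)
        self.block_writes += 1


def prepare_new_groups(inode: ChainAllocatorInode, names: List[str]) -> List[Optional[Group]]:
    """'Userspace' step: build clean groups that already point at the current heads."""
    new_heads = list(inode.heads)
    for i, name in enumerate(names):
        chain = i % len(new_heads)          # assumed placement policy
        new_heads[chain] = Group(name, next=new_heads[chain])
    return new_heads                         # the inode itself is not modified yet


def walk(head: Optional[Group]) -> List[str]:
    """Return the group names of one chain, front to back."""
    names = []
    while head is not None:
        names.append(head.name)
        head = head.next
    return names


if __name__ == "__main__":
    inode = ChainAllocatorInode(heads=[Group("g0"), Group("g1")])
    inode.commit_heads(prepare_new_groups(inode, ["g2", "g3", "g4"]))
    for n, head in enumerate(inode.heads):
        print("chain", n, walk(head))
    print("inode block writes:", inode.block_writes)
```

Appending at the tail, as Paul's message assumes, would instead require rewriting the last group of every chain that grows.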
* [Ocfs2-devel] OCFS2 features RFC 2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney 2006-05-11 20:40 ` Paul Taysom @ 2006-05-11 21:16 ` Daniel Phillips 2006-05-17 1:44 ` Mark Fasheh 2 siblings, 0 replies; 38+ messages in thread From: Daniel Phillips @ 2006-05-11 21:16 UTC (permalink / raw) To: ocfs2-devel Jeff Mahoney wrote: > While performance enhancements are always welcome, the two big features > we'd like to see in future OCFS2 releases are features that will make > using OCFS2 more transparent and more like a "local" file system. The > features we want are cluster wide lockf/flock and shared writable mmap. These are both already on the list, so I suppose you are just voting for priority? I agree re priority: these two items stand in the way of full local Posix semantics. They should be number one and two on the list. > Hashed directories in some form, but I think the comments against ext3 > style h-trees are valid. I do not know which "comments against" you are refering to. I only saw an unsupported, non-technical assertion from Christoph. Perhaps Christoph would be kind enough to share with us the technical details of how XFS deals with the 31 bit telldir cookie problem. Hash directories or btrees of any form all have the same telldir issue as Htree, so if you advocate hashed directories, you also advocate coming up with some scheme to try to reduce the severity of the telldir problem. The only schemes that make the telldir problem actually go away are ones that stick with a directory scheme modelled on UFS. I only know of one of those, the FSF hashing scheme, which has a major problem: the hash index is not persistent. It has to be recreated on initial access to the directory and kept around in memory, competing with other hashed objects. This does not scale well. 
Another problem is, since the holes in this scheme are so obvious there is not a lot of incentive to put time into it, knowing it will eventually be tossed out in favor of something else. But feel free :-) The reason people like HTree is, it is really, really fast and minimizes disk accesses. It is also mostly debugged, though we still tend to see a new issue every now and then. It's been more than a year since I saw the last one, and that was an outright bug. One thing that we tried to do with HTree is work within a 31 bit cookie limitation to accomodate NFSv2. I am thinking that maybe we should have just made NFSv2 fall back to not using the index, which is easy to do with HTree, and thereby give ourselves the 62 bits of cookie we really need. I will float this idea on ext2-devel. Regards, Daniel ^ permalink raw reply [flat|nested] 38+ messages in thread
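As a rough illustration of why the cookie width matters, the following Python sketch estimates how likely it is that two names in one directory end up with the same cookie. It assumes, purely for the sake of the estimate, that each cookie is a uniformly distributed hash of the name; the thread does not say how HTree actually derives its cookies, and `collision_probability` is a hypothetical helper, not part of any file system.

```python
import math


def collision_probability(entries: int, cookie_bits: int) -> float:
    """Birthday-bound estimate that at least two of `entries` names share a cookie.

    Assumes cookies are independent and uniform over 2**cookie_bits values.
    """
    pairs = entries * (entries - 1) / 2
    # 1 - exp(-pairs / space), written with expm1 to stay accurate near zero.
    return -math.expm1(-pairs / 2.0 ** cookie_bits)


if __name__ == "__main__":
    for entries in (1_000, 100_000, 10_000_000):
        p31 = collision_probability(entries, 31)
        p62 = collision_probability(entries, 62)
        print(f"{entries:>10} entries: 31-bit {p31:.3g}, 62-bit {p62:.3g}")
```

Under this simplified model a directory of 100,000 entries almost certainly contains at least one colliding pair with 31 bits, while with 62 bits the chance is negligible.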
* [Ocfs2-devel] OCFS2 features RFC 2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney 2006-05-11 20:40 ` Paul Taysom 2006-05-11 21:16 ` Daniel Phillips @ 2006-05-17 1:44 ` Mark Fasheh [not found] ` <446BBCF5.7040903@google.com> 2006-05-22 17:01 ` Paul Taysom 2 siblings, 2 replies; 38+ messages in thread From: Mark Fasheh @ 2006-05-17 1:44 UTC (permalink / raw) To: ocfs2-devel Hi Jeff, On Thu, May 11, 2006 at 04:04:35PM -0400, Jeff Mahoney wrote: > While performance enhancements are always welcome, the two big features > we'd like to see in future OCFS2 releases are features that will make > using OCFS2 more transparent and more like a "local" file system. The > features we want are cluster wide lockf/flock and shared writable mmap. I'm trying to do more research on the user locking stuff right now actually. My aim is twofold. The first is to nail down exactly what each type of locking entails, and secondly I want to know what impact a lack of cluster-aware locking has on at least one existing application. Just to break it down, lockf() seems to be a (POSIX compliant?) library wrapper around fcntl() locking, which is range based, optionally mandatory, and provides deadlock detection. Ranges can encompass any part of the file, with a special case that allows to lock all possible (present and future) bytes from a given offset. Along with the usual blocking / nonblocking variants on read / exclusive locks, fcntl supports the F_GETLK operation which allows userspace to query information about a range, including the pids of processes holding incompatible locks. flock() on the other hand is always advisory and does not support ranges. No explicit deadlock detection seems to be done, though deadlocks can be broken by the user sending a signal (including kill -9) to one of the waiting processes. It also supports shared, exclusive and trylock type operations. 
And finally, quoting from the fcntl() man page: "Since kernel 2.0, there is no interaction between the types of lock placed by flock(2) and fcntl(2)." Now, to get to an actual example of application usage, I took a look at the apache 2.2.2 source. It seems that they do file locking in the apr functions apr_file_lock() / apr_file_unlock() (located in srclib/apr/file_io/unix/flock.c). On Linux, these use fcntl(). The only consumer of those functions I could find in the httpd tarball are the "sdbm" routines in and around srclib/apr-util/dbm/sdbm/ And that's about where my apache expertise ends :/ There's many more apps to look at for information though - sendmail immediately comes to mind. An strace on my machine here reveals that rpm uses fcntl() locking. So the question for the current OCFS2 code base is what impact the lack of cluster-aware fcntl() locking has on the particular set of software which we're going to worry about right now. Whenever we chose to do it (and we _will_ do it), it will take a long time to develop - fcntl() locking alone encompasses about two thirds of our non-trivial dlm feature wishlist. > - From a data integrity perspective, it shouldn't make a difference to an > application whether competing reader/writers are on the same node or a > different node. If standard locking primitives are already in use by the > application, they should "just work" if the competing process is on > another node. I agree 100%. > > -Online file system resize > > This would be nice, and I think easily done in the same manner ext3 > does. Anything outside the file system's current view of the block > device can be initialized in userspace, and the last block group, > bitmaps, and superblock would be adjusted by an ioctl in kernelspace. Yep, that's basically how we plan to approach it. > > - We'd like as much of it to be user space code as is possible and > > practical. 
> > The heartbeat project does a pretty good job on the userspace end, but > as Andi pointed out, it has the usual shortcomings of anything in > userspace involved with writing data inside the kernel. It is prone to > deadlocks and we could miss node topology events. Ahh ok, so that explains how heartbeat handles it (aka, it doesn't right now). It really seems to me that we're going to need to find a solution to this sort of problem sooner or later (why do I get the feeling that I'm being naive?). Speaking from experience, having cluster stack components in kernel means a much longer development time. Even small focused ones like the OCFS2 stack. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com ^ permalink raw reply [flat|nested] 38+ messages in thread
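To make the fcntl() versus flock() distinction in Mark's message concrete, here is a small single-node sketch in Python. It uses the standard `fcntl` module (`fcntl.lockf` is Python's wrapper around fcntl() byte-range locks, `fcntl.flock` wraps flock()); the helper names `try_fcntl_range`, `try_flock`, `in_child` and `demo` are made up for this example. A child process is forked because fcntl() locks belong to a process, so a conflict only appears when a different process asks for an overlapping range. This shows local behavior only, which is what a cluster-aware implementation would have to reproduce across nodes.

```python
import fcntl
import os
import tempfile


def try_fcntl_range(path, start, length):
    """Try a non-blocking exclusive fcntl() lock on bytes [start, start+length)."""
    with open(path, "r+b") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        except OSError:
            return False
        return True


def try_flock(path):
    """Try a non-blocking exclusive flock() on the whole file."""
    with open(path, "r+b") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        return True


def in_child(func, *args):
    """Run func(*args) in a forked child and return its boolean result."""
    pid = os.fork()
    if pid == 0:
        code = 2
        try:
            code = 0 if func(*args) else 1
        finally:
            os._exit(code)          # never fall back into the parent's code path
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def demo():
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"\0" * 100)
        tmp.flush()
        # The parent takes an exclusive fcntl() lock on bytes 0..9 only.
        fcntl.lockf(tmp, fcntl.LOCK_EX, 10, 0, os.SEEK_SET)
        return {
            "overlapping": in_child(try_fcntl_range, tmp.name, 5, 10),
            "disjoint": in_child(try_fcntl_range, tmp.name, 50, 10),
            # Per the man page quote above, on Linux this is expected to be
            # unaffected by the parent's fcntl() lock; other systems may differ.
            "flock_whole_file": in_child(try_flock, tmp.name),
        }


if __name__ == "__main__":
    print(demo())
```

The overlapping request is refused while the disjoint one is granted, which is the range-based behavior that flock() cannot express.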
* [Ocfs2-devel] OCFS2 features RFC
[not found] ` <20060518024638.GY21588@ca-server1.us.oracle.com>
@ 2006-05-19 0:35 ` Daniel Phillips
2006-05-19 15:16 ` J. Bruce Fields
2006-05-20 6:11 ` Mark Fasheh
0 siblings, 2 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-19 0:35 UTC (permalink / raw)
To: ocfs2-devel

(a dialog between Mark and me that inadvertently became private)

Mark Fasheh wrote:
> On Wed, May 17, 2006 at 05:16:53PM -0700, Daniel Phillips wrote:
>>Does clustered NFS count as software we're going to worry about right now?
>>The impact is, if OCFS2 does provide cluster-aware fcntl locking then the
>>cluster locking hacks lockd needs can possibly be smaller. Otherwise,
>>lockd must do the job itself, and as a consequence, any applications running
>>on the (clustered) NFS server nodes will not see locks held by NFS clients.
>
> Clustered NFS is definitely something we care about. We have people using it
> today, with the caveat that file locking won't be cluster aware. It's
> actually pretty interesting how far people get with that. We'd love to
> support the whole thing of course. As far as NFS with file locking though, I
> have to admit that we've heard many more requests from people wanting to do
> things like apache, sendmail, etc on OCFS2.

Ok, I just figured out how to be really lazy and do cluster-consistent
NFS locking across clustered NFS servers without doing much work. In the
duh category, only one node will actually run lockd and all other NFS
server nodes will just port-forward the NLM traffic to/from it. Sure,
you can bottleneck this scheme with a little effort, but to be honest we
aren't that interested in NFS locking performance, we are more interested
in actual file operations.

So strike NFS serving off the list of applications that care about
cluster fcntl locking.

>>Unless I have missed something major, fcntl locking does not have any
>>overlap with your existing DLM, so you can implement it with a separate
>>mechanism. Does that help?
>
> Eh, unfortunately not that much... It's still a large amount of work :/
> Doing it outside a dlm would just mean one has to reproduce existing
> mechanisms (such as determining lock mastery for instance).

You don't have to distribute the fcntl locking, you can instead manage it
with a single server active on just one node at a time. So go ahead and
distribute it if you really enjoy futzing with the DLM, but going for the
server approach should reduce your stress considerably. As a fringe benefit,
you are then forced to consider how to accommodate classic server failover
within the cluster manager framework, which should not be very hard and is
absolutely necessary.

>>Starting with one obvious requirement, the cluster stack needs to be able
>>to handle different kinds of fencing methods or even mixed fencing methods.
>>If the stack stays in kernel, what is the instancing framework? Modules?
>>I do believe we can make that work.
>
> call_usermodehelper()?

Bad idea, this gets you back into memory deadlock zone. Avoiding memory
deadlock is considerably easier in kernel and is nigh on impossible with
call_usermodehelper.

Sure, it's totally possible to do all that in kernel.
>
> But we're getting ahead of ourselves - I don't want to implement yet another
> cluster stack - I'd rather fit the file system into an existing framework -
> one which already has all the fencing methods worked out for instance.

Like the Red Hat framework? Ahem. Maybe not. For one thing, they never
even got close to figuring out how to avoid memory deadlock. For another,
it's a rambling bloated pig with lots of bogus factoring. Honestly, what
you have now is a much better starting point, you should be thinking about
how to evolve it in the direction it needs to go rather than cutting over
to an existing framework, that was designed with the mindset of usermode
cluster apps, not the more stringent requirements of a cluster filesystem.

>>Consider this: if we define the fencing interface entirely in terms of
>>messages over sockets then the cluster stack does not need to know or care
>>whether the other end lives in kernel or userland. Comments?
>
> Interesting, and I'll have to think about whether I can poke holes in that
> or not. Of course, I'm not sure the file system ever has to call out to
> fencing directly, so maybe it's something it never has to worry about.

No, the filesystem never calls fencing, only the cluster manager does.
As I understand it, what happens is:

  1) Somebody (heartbeat) reports a dead node to cluster manager
  2) Cluster manager issues a fence request for the dead node
  3) Cluster manager receives confirmation that the node was fenced
  4) Cluster manager sends out dead node messages to cluster managers
     on other nodes
  5) Some cluster manager receives dead node message, notifies DLM
  6) DLM receives dead node message, initiates lock recovery

Step (2) is where we need plugins, where each plugin registers a fencing
method and somehow each node becomes associated with a particular fencing
method (setting up this association is an excellent example of a component
that can and should be in userspace because this part never executes in the
block IO path). The right interface to initiate fencing is probably a
direct (kernel-to-kernel) call, there is actually no good reason to use
a socket interface here.

However, the fencing confirmation is an asynchronous event and might as
well come in over a socket. There are alternatives (e.g., linked list
event queue) but the socket is most natural because the cluster manager
already needs one to receive events from other sources.

Actually, fencing has no divine right to be a separate subsystem and is
properly part of the cluster manager. It's better to think of it that
way. As such, the cluster manager <=> fencing api is internal, there is
no need to get into interminable discussions of how to standardize it.

So let's just do something really minimal that gives us a plugin
interface and move on to harder problems. If you do eventually figure out
how to move the whole cluster manager to userspace then you replace the
module scheme in favor of a dso scheme.

Anyway, assuming both bits are in-kernel then initiating fencing should
just be a method on the (in-kernel) node object and confirmation of fencing
is just an event sent to the node manager's event pipe. Simple, no?

In summary, I retract my point about using the socket to abstract away the
question of whether fencing lives in kernel or userspace and instead assert
that the fencing harness should live wherever the cluster manager lives,
which is in kernel right now and ought to stay there for the time being.
Socket is still the right way to receive messages from a fencing module,
but a method call is a better way to initiate fencing.

Regards,

Daniel

^ permalink raw reply [flat|nested] 38+ messages in thread
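As a rough illustration of the six-step sequence in the message above, here is a minimal Python sketch. It assumes the split Daniel proposes: fencing is started by a direct method call on the node object, and confirmation comes back as an asynchronous event. All names (`Node`, `ClusterManager`, `Dlm`, `power_switch_fence`) are hypothetical, and an in-process `queue.Queue` stands in for the cluster manager's event socket; none of this is OCFS2 code.

```python
import queue


class Node:
    """Cluster member; fence_method is the plugin associated with it."""

    def __init__(self, name, fence_method):
        self.name = name
        self.fence_method = fence_method

    def fence(self, events):
        # Step 2: fencing is initiated by a direct call, not a message.
        self.fence_method(self, events)


def power_switch_fence(node, events):
    """Hypothetical plugin: pretend to cut power, then confirm."""
    # Step 3: confirmation is delivered as an asynchronous event.
    events.put(("fenced", node.name))


class Dlm:
    def __init__(self, owner, log):
        self.owner = owner
        self.log = log

    def node_dead(self, name):
        # Step 6: the DLM starts lock recovery for the dead node.
        self.log.append(f"dlm@{self.owner}: recover locks of {name}")


class ClusterManager:
    def __init__(self, name, nodes, log):
        self.name = name
        self.nodes = nodes            # node name -> Node
        self.log = log
        self.dlm = Dlm(name, log)
        self.peers = []               # managers on other nodes
        self.events = queue.Queue()   # stand-in for the event socket

    def report_dead(self, name):
        # Step 1: heartbeat reports a dead node.
        self.events.put(("dead", name))

    def run_once(self):
        """Process every event currently queued."""
        while not self.events.empty():
            kind, name = self.events.get()
            if kind == "dead":
                self.log.append(f"{self.name}: fence {name}")
                self.nodes[name].fence(self.events)
            elif kind == "fenced":
                # Step 4: tell the managers on the other nodes.
                self.log.append(f"{self.name}: {name} fenced")
                for peer in self.peers:
                    peer.events.put(("node_dead", name))
            elif kind == "node_dead":
                # Step 5: a peer manager notifies its DLM.
                self.log.append(f"{self.name}: {name} is dead")
                self.dlm.node_dead(name)


def demo():
    log = []
    nodes = {"c": Node("c", power_switch_fence)}
    a = ClusterManager("a", nodes, log)
    b = ClusterManager("b", nodes, log)
    a.peers.append(b)
    a.report_dead("c")
    a.run_once()
    b.run_once()
    return log


if __name__ == "__main__":
    print("\n".join(demo()))
```

Swapping `power_switch_fence` for another callable changes the fencing method without touching the manager, which is roughly what a minimal plugin interface would have to provide.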
* [Ocfs2-devel] OCFS2 features RFC
2006-05-19 0:35 ` Daniel Phillips
@ 2006-05-19 15:16 ` J. Bruce Fields
2006-05-20 6:11 ` Mark Fasheh
1 sibling, 0 replies; 38+ messages in thread
From: J. Bruce Fields @ 2006-05-19 15:16 UTC (permalink / raw)
To: ocfs2-devel

On Thu, May 18, 2006 at 05:35:27PM -0700, Daniel Phillips wrote:
> Ok, I just figured out how to be really lazy and do cluster-consistent
> NFS locking across clustered NFS servers without doing much work. In the
> duh category, only one node will actually run lockd and all other NFS
> server nodes will just port-forward the NLM traffic to/from it.

Yeah, I can't see why that wouldn't work with v2/v3. The same trick won't
work with NFSv4 since it has the locking integrated into the protocol.

It shouldn't be that much work to make lockd/nfsd use whatever locking
the filesystem provides--see

http://linux-nfs.org/cgi-bin/gitweb.cgi?p=bfields-2.6.git;a=shortlog;h=server-cluster-locking-api

for one attempt. Of course the hard part is providing the locking
support in the filesystem in the first place! And the main obstacle to
our work has been the lack of an in-kernel filesystem that does this....
(The only testing has been done with GPFS.)

--b.

^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-05-19 0:35 ` Daniel Phillips
2006-05-19 15:16 ` J. Bruce Fields
@ 2006-05-20 6:11 ` Mark Fasheh
2006-05-22 19:18 ` Daniel Phillips
1 sibling, 1 reply; 38+ messages in thread
From: Mark Fasheh @ 2006-05-20 6:11 UTC (permalink / raw)
To: ocfs2-devel

On Thu, May 18, 2006 at 05:35:27PM -0700, Daniel Phillips wrote:
> Ok, I just figured out how to be really lazy and do cluster-consistent
> NFS locking across clustered NFS servers without doing much work. In the
> duh category, only one node will actually run lockd and all other NFS
> server nodes will just port-forward the NLM traffic to/from it. Sure,
> you can bottleneck this scheme with a little effort, but to be honest we
> aren't that interested in NFS locking performance, we are more interested
> in actual file operations.

Out of curiosity, how will a failure on the lockd node be handled? Or is
this something that you're not worried about?

> >call_usermodehelper()?
>
> Bad idea, this gets you back into memory deadlock zone. Avoiding memory
> deadlock is considerably easier in kernel and is nigh on impossible with
> call_usermodehelper.

Good catch, I threw that out without fully evaluating the implications :)

> Like the Red Hat framework? Ahem. Maybe not. For one thing, they never
> even got close to figuring out how to avoid memory deadlock. For another,
> it's a rambling bloated pig with lots of bogus factoring. Honestly, what
> you have now is a much better starting point,

Well, I should've said "multiple existing frameworks" - so people could run
whatever fits their needs the best. So folks could pick the feature sets
that suit their needs the best. Besides, I think you're being somewhat
unfair to the Red Hat framework. It does _a lot_ more than the OCFS2 stack
can even dream of handling right now.

And we haven't even talked about Linux-HA yet.

> you should be thinking about how to evolve it in the direction it needs to
> go rather than cutting over to an existing framework, that was designed
> with the mindset of usermode cluster apps, not the more stringent
> requirements of a cluster filesystem.

I hear they have this thing called "GFS" ;)

What we are thinking about right now is how we can reuse code - building on
other people's bug fixes, feature patches, etc. What we have today just
bootstraps our file system into the world of the cluster. Deciding to go the
full blown home grown cluster route isn't some decision we make based on
one (admittedly difficult) bug or design issue. Nor is it something that
we will undertake without having fully explored all other alternatives.

> No, the filesystem never calls fencing, only the cluster manager does.
> As I understand it, what happens is:
>
>   1) Somebody (heartbeat) reports a dead node to cluster manager
>   2) Cluster manager issues a fence request for the dead node
>   3) Cluster manager receives confirmation that the node was fenced
>   4) Cluster manager sends out dead node messages to cluster managers
>      on other nodes
>   5) Some cluster manager receives dead node message, notifies DLM
>   6) DLM receives dead node message, initiates lock recovery

That sounds a lot closer to how it should happen, IMHO.

> Step (2) is where we need plugins, where each plugin registers a fencing
> method and somehow each node becomes associated with a particular fencing
> method (setting up this association is an excellent example of a component
> that can and should be in userspace because this part never executes in the
> block IO path). The right interface to initiate fencing is probably a
> direct (kernel-to-kernel) call, there is actually no good reason to use
> a socket interface here.

Fencing plugins by the way can tend to do a variety of things, ranging from
direct device access, to being able to telnet or ssh into a switch. The
plugin system therefore needs to be fairly generic, to the level of
running a binary that could be written in perl, C, etc.

> However, the fencing confirmation is an asynchronous event and might as
> well come in over a socket. There are alternatives (e.g., linked list
> event queue) but the socket is most natural because the cluster manager
> already needs one to receive events from other sources.
>
> Actually, fencing has no divine right to be a separate subsystem and is
> properly part of the cluster manager. It's better to think of it that
> way. As such, the cluster manager <=> fencing api is internal, there is
> no need to get into interminable discussions of how to standardize it.

Sure.

> So let's just do something really minimal that gives us a plugin
> interface and move on to harder problems. If you do eventually figure out
> how to move the whole cluster manager to userspace then you replace the
> module scheme in favor of a dso scheme.

Well, I'm wondering how we're going to support all the different fencing
methods using kernel modules ;)
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com

^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-05-20 6:11 ` Mark Fasheh
@ 2006-05-22 19:18 ` Daniel Phillips
0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2006-05-22 19:18 UTC (permalink / raw)
To: ocfs2-devel

Mark Fasheh wrote:
> On Thu, May 18, 2006 at 05:35:27PM -0700, Daniel Phillips wrote:
>>Ok, I just figured out how to be really lazy and do cluster-consistent
>>NFS locking across clustered NFS servers without doing much work. In the
>>duh category, only one node will actually run lockd and all other NFS
>>server nodes will just port-forward the NLM traffic to/from it. Sure,
>>you can bottleneck this scheme with a little effort, but to be honest we
>>aren't that interested in NFS locking performance, we are more interested
>>in actual file operations.
>
> Out of curiosity, how will a failure on the lockd node be handled? Or is
> this something that you're not worried about?

Of course I'm worried about it! Luckily, normal NFS reboot semantics can
be repurposed to provide failover. Client lockds are notified of a server
failure via NSM/statd. Our cluster manager invokes a failover method (this
harness yet to be designed) that activates a new lockd on some other node
and updates the NLM port forward addresses on all other nodes. When all is
ready, the new server announces via NLM that it is up and clients retake
their locks as they would for a server reboot.

I don't think this part of it is new, anybody who has attempted nfs serving
from a cluster must have noticed it. The port forwarding idea may be new,
I did not notice anybody mention it out there.

>>>call_usermodehelper()?
>>Like the Red Hat framework? Ahem. Maybe not. For one thing, they never
>>even got close to figuring out how to avoid memory deadlock. For another,
>>it's a rambling bloated pig with lots of bogus factoring. Honestly, what
>>you have now is a much better starting point,
>
> Well, I should've said "multiple existing frameworks" - so people could run
> whatever fits their needs the best. So folks could pick the feature sets
> that suit their needs the best. Besides, I think you're being somewhat
> unfair to the Red Hat framework. It does _a lot_ more than the OCFS2 stack
> can even dream of handling right now.

In a reasonable way? I think not. The only bit you might lust after is
the range locking, and that was never tested to any great extent. I still
think you have a better, more sensible base to work from, and what's more,
it's attached to a relatively stable, in-tree cluster filesystem.

Curious... have you tried the Red Hat cluster stack? Which version(s)?

> And we haven't even talked about Linux-HA yet.

And we should, briefly. Linux-HA looks great to me but it can't be directly
used by OCFS2 because it is in userspace with no thought at all invested in
dealing with memory deadlock. You might be able to interface with Linux-HA
one day in order to unify the handling of membership and failover, however
I doubt that the easiest path there is to try to fix up the Linux-HA
internals to avoid memory pitfalls. Much better to fix your much smaller
in-kernel framework, and then evolve it in the direction of interfacing to
Linux-HA.

Note that fencing, membership, heartbeat and failover all lie in the block
IO path, so they all have to obey rigorous rules that Linux-HA knows nothing
about. What has to be done here is adapt Linux-HA's structure to expose the
OCFS2 implementation, so for example Linux-HA would not directly send
heartbeats, but would receive your stack's up/down messages. But this is
getting way ahead of things. First, OCFS2 needs to establish itself as a
filesystem, before projects like Linux-HA can look at how to do the grand
unification.

>>No, the filesystem never calls fencing, only the cluster manager does.
>>As I understand it, what happens is:
>>
>>  1) Somebody (heartbeat) reports a dead node to cluster manager
>>  2) Cluster manager issues a fence request for the dead node
>>  3) Cluster manager receives confirmation that the node was fenced
>>  4) Cluster manager sends out dead node messages to cluster managers
>>     on other nodes
>>  5) Some cluster manager receives dead node message, notifies DLM
>>  6) DLM receives dead node message, initiates lock recovery
>
> That sounds a lot closer to how it should happen, IMHO.
>
> Fencing plugins by the way can tend to do a variety of things, ranging from
> direct device access, to being able to telnet or ssh into a switch. The
> plugin system therefore needs to be fairly generic, to the level of
> running a binary that could be written in perl, C, etc.

Then you would implement a kernel fencing method that interfaces to user
space, and cross your fingers. Fencing lies in the block IO path so it has
to obey anti-memory deadlock rules. Perl and bash certainly will not, so if
somebody insists on writing their fence scripts that way, then they will
need to run them on a separate node that does not mount the OCFS2
filesystem, or inside a resource sandbox, for example a UML instance that
has all its resources pre-allocated. By the time you have done all the
setup required for that, you would have gotten the job done faster and
better by rewriting the script in C. Then you still have to do memlocking,
and run syscalls like connect in PF_MEMALLOC mode, but you would need that
for the UML sandbox anyway, with rather more work to do to audit all the
call paths.

The practical approach is to do kernel implementations of the fencing
methods that can be implemented there (including mine!) and offload any
messy userspace ones to a non-filesystem node.

>>So let's just do something really minimal that gives us a plugin
>>interface and move on to harder problems. If you do eventually figure out
>>how to move the whole cluster manager to userspace then you replace the
>>module scheme in favor of a dso scheme.
>
> Well, I'm wondering how we're going to support all the different fencing
> methods using kernel modules ;)

Choose your poison:

  1) A kernel fencing method sends messages to a dedicated fencing node
     that does not mount the filesystem. This may waste a node and needs
     some additional mechanism to avoid becoming a single point of failure.

  2) A kernel fencing method sends messages to a userspace program written
     in C, memlocked, and running in (slight kernel hack here) PF_MEMALLOC
     mode. This might require a little more work than a Perl script, but
     then real men enjoy work.

  3) A kernel fencing method sends messages to a userspace program running
     in a resource sandbox (e.g. UML or XEN) that does whatever it wants to.
     This is really buzzword compatible, really wasteful, and a great use
     of administration time.

  4) You may find that you can implement in-kernel all of the fencing
     modules you need easier and better than any of the above. This is
     the case with me.

The thing we can't do is go on pretending that we can just shell to bash and
run anything we want. That way lies deadlock.

Regards,

Daniel

^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 features RFC
2006-05-17 1:44 ` Mark Fasheh
[not found] ` <446BBCF5.7040903@google.com>
@ 2006-05-22 17:01 ` Paul Taysom
1 sibling, 0 replies; 38+ messages in thread
From: Paul Taysom @ 2006-05-22 17:01 UTC (permalink / raw)
To: ocfs2-devel

Two major applications that use byte range locks are Open Office and
Microsoft Office (Word, Excel, ...). They use them to coordinate sharing
a document when more than one person opens the file. These applications
typically get a byte range lock on a single byte at a predetermined
offset, then write data into the file about who has the file open. This
way, when someone else opens the file, they can find out who else has the
file open. Word is of course going through SAMBA to access the file system.

Paul Taysom

^ permalink raw reply [flat|nested] 38+ messages in thread
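The pattern Paul describes, a one-byte lock at a fixed offset plus a note in the file saying who has it open, can be sketched with POSIX byte-range locks. This is a minimal Python illustration under stated assumptions, not how either office suite actually implements it: `LOCK_OFFSET`, `OWNER_OFFSET`, `OWNER_SIZE` and `open_shared` are made-up names and values.

```python
import fcntl
import os
import tempfile

LOCK_OFFSET = 1 << 30   # hypothetical "predetermined offset"
OWNER_OFFSET = 0        # hypothetical location of the owner record
OWNER_SIZE = 64         # hypothetical fixed size of the owner record


def open_shared(path, me):
    """Open path and return (fd, holder), where holder owns the lock byte."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Try to lock exactly one byte at LOCK_OFFSET, without blocking.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1,
                    LOCK_OFFSET, os.SEEK_SET)
    except OSError:
        # Another process holds the byte: report the recorded owner.
        record = os.pread(fd, OWNER_SIZE, OWNER_OFFSET)
        return fd, record.rstrip(b"\0").decode()
    # We hold the lock: record our name, NUL-padded to a fixed size.
    record = me.encode()[:OWNER_SIZE].ljust(OWNER_SIZE, b"\0")
    os.pwrite(fd, record, OWNER_OFFSET)
    return fd, me


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "shared.doc")
        fd, holder = open_shared(path, "alice")
        print("lock holder:", holder)
        os.close(fd)  # closing the descriptor also drops the lock
```

POSIX locks belong to the process, so the conflict only appears when a second process calls `open_shared` on the same file. For that to work across cluster nodes, the file system has to provide cluster-aware fcntl locking, which is the feature under discussion in this thread.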
* [Ocfs2-devel] OCFS2 Features RFC
@ 2006-05-02 18:22 Brian Long
2006-05-02 20:29 ` Sunil Mushran
0 siblings, 1 reply; 38+ messages in thread
From: Brian Long @ 2006-05-02 18:22 UTC (permalink / raw)
To: ocfs2-devel
Hello,
I just subscribed to this list because I saw this posting in the
archives:
http://oss.oracle.com/pipermail/ocfs2-devel/2006-April/000931.html
Is there any reason you wouldn't ask the ocfs2-users community for
feedback on features as well? I hadn't subscribed to -devel because I
figured it was solely for folks actually developing the OCFS2 code :)
In my opinion, the proposed feature about "hard read only" is the
most-wanted. My team is in the middle of testing 10gR2 RAC on OCFS2 for
production deployments on RHEL 4 (hopefully your x86_64 certification is
coming soon). I assume Oracle RAC would like the "hard read only" more
than the current panic.
Also, while I saw one end user complain about your ideas of implementing
ext3 code inside OCFS2, please remember the rest of us that survive just
fine with ext3 in Red Hat's Enterprise Linux. :)
Third, are there any thoughts on integrating LVM support or using
something like Red Hat's CLVM to allow OCFS2 to layer on top of LVs
instead of just individual disks?
The biggest drawback I see in my environment is that my storage team
provides 34GB and 68GB metas from the EMC Frames. I'd rather not have a
ton of 68GB OCFS2 filesystems but rather a larger, host-controlled LV.
Trying to get the storage team to provide a 200+GB LUN and grow it on
the fly in the future is a tough task. If I could control the LV on the
host _and_ grow OCFS2 into larger LVs, that would rock.
Thanks.
/Brian/
--
Brian Long | | |
IT Data Center Systems | .|||. .|||.
Cisco Linux Developer | ..:|||||||:...:|||||||:..
Phone: (919) 392-7363 | C i s c o S y s t e m s
^ permalink raw reply [flat|nested] 38+ messages in thread
* [Ocfs2-devel] OCFS2 Features RFC
2006-05-02 18:22 [Ocfs2-devel] OCFS2 Features RFC Brian Long
@ 2006-05-02 20:29 ` Sunil Mushran
0 siblings, 0 replies; 38+ messages in thread
From: Sunil Mushran @ 2006-05-02 20:29 UTC (permalink / raw)
To: ocfs2-devel

Brian Long wrote:
> Is there any reason you wouldn't ask the ocfs2-users community for
> feedback on features as well? I hadn't subscribed to -devel because I
> figured it was solely for folks actually developing the OCFS2 code :)
>
-devel is for all discussion regarding ocfs2 development. It is not
limited to developers. We could have posted this to -users too, but I
guess one is trying not to cross the "spam" line.

> In my opinion, the proposed feature about "hard read only" is the
> most-wanted. My team is in the middle of testing 10gR2 RAC on OCFS2 for
> production deployments on RHEL 4 (hopefully your x86_64 certification is
> coming soon). I assume Oracle RAC would like the "hard read only" more
> than the current panic.
>
> Also, while I saw one end user complain about your ideas of implementing
> ext3 code inside OCFS2, please remember the rest of us that survive just
> fine with ext3 in Red Hat's Enterprise Linux. :)
>
:)

> Third, are there any thoughts on integrating LVM support or using
> something like Red Hat's CLVM to allow OCFS2 to layer on top of LVs
> instead of just individual disks?
>
> The biggest drawback I see in my environment is that my storage team
> provides 34GB and 68GB metas from the EMC Frames. I'd rather not have a
> ton of 68GB OCFS2 filesystems but rather a larger, host-controlled LV.
> Trying to get the storage team to provide a 200+GB LUN and grow it on
> the fly in the future is a tough task. If I could control the LV on the
> host _and_ grow OCFS2 into larger LVs, that would rock.
>
We are looking into this.

^ permalink raw reply [flat|nested] 38+ messages in thread
Thread overview: 38+ messages
2006-04-25 18:35 [Ocfs2-devel] OCFS2 features RFC Mark Fasheh
2006-04-25 21:55 ` Christoph Hellwig
2006-04-25 22:24 ` Mark Fasheh
2006-04-26 16:50 ` Daniel Phillips
2006-04-26 4:11 ` Andi Kleen
2006-04-26 18:06 ` Mark Fasheh
2006-04-26 18:08 ` Andi Kleen
2006-04-26 18:34 ` Daniel Phillips
2006-04-27 20:25 ` Paul Taysom
2006-05-03 23:04 ` [Ocfs2-devel] OCFS2 features RFC - separate journal? Daniel Phillips
2006-05-04 0:29 ` Zach Brown
2006-05-04 0:46 ` Daniel Phillips
2006-05-04 20:56 ` Zach Brown
2006-05-04 20:59 ` Wim Coekaerts
2006-05-04 22:23 ` Daniel Phillips
2006-05-04 22:30 ` Mark Fasheh
2006-05-05 3:05 ` Daniel Phillips
2006-05-05 18:25 ` Mark Fasheh
2006-05-06 3:09 ` Daniel Phillips
2006-05-05 17:12 ` Paul Taysom
2006-05-05 18:06 ` Daniel Phillips
2006-05-05 18:57 ` Sunil Mushran
2006-05-08 14:28 ` Paul Taysom
2006-05-08 17:43 ` Daniel Phillips
2006-05-08 18:00 ` Paul Taysom
2006-05-08 18:22 ` Daniel Phillips
2006-05-11 20:04 ` [Ocfs2-devel] OCFS2 features RFC Jeff Mahoney
2006-05-11 20:40 ` Paul Taysom
2006-05-11 20:55 ` Joel Becker
2006-05-11 21:16 ` Daniel Phillips
2006-05-17 1:44 ` Mark Fasheh
[not found] ` <446BBCF5.7040903@google.com>
[not found] ` <20060518024638.GY21588@ca-server1.us.oracle.com>
2006-05-19 0:35 ` Daniel Phillips
2006-05-19 15:16 ` J. Bruce Fields
2006-05-20 6:11 ` Mark Fasheh
2006-05-22 19:18 ` Daniel Phillips
2006-05-22 17:01 ` Paul Taysom
-- strict thread matches above, loose matches on Subject: below --
2006-05-02 18:22 [Ocfs2-devel] OCFS2 Features RFC Brian Long
2006-05-02 20:29 ` Sunil Mushran