From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Fasheh
Date: Tue, 16 May 2006 18:44:20 -0700
Subject: [Ocfs2-devel] OCFS2 features RFC
In-Reply-To: <446398D3.7010508@suse.com>
References: <20060425183553.GB10524@ca-server1.us.oracle.com> <446398D3.7010508@suse.com>
Message-ID: <20060517014419.GS21588@ca-server1.us.oracle.com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: ocfs2-devel@oss.oracle.com

Hi Jeff,

On Thu, May 11, 2006 at 04:04:35PM -0400, Jeff Mahoney wrote:
> While performance enhancements are always welcome, the two big features
> we'd like to see in future OCFS2 releases are features that will make
> using OCFS2 more transparent and more like a "local" file system. The
> features we want are cluster wide lockf/flock and shared writable mmap.

I'm trying to do more research on the user locking stuff right now actually.
My aim is twofold. First, I want to nail down exactly what each type of
locking entails, and second, I want to know what impact a lack of
cluster-aware locking has on at least one existing application.

Just to break it down, lockf() seems to be a (POSIX compliant?) library
wrapper around fcntl() locking, which is range based, optionally mandatory,
and provides deadlock detection. Ranges can encompass any part of the file,
with a special case that allows locking all possible (present and future)
bytes from a given offset. Along with the usual blocking / nonblocking
variants on read / exclusive locks, fcntl() supports the F_GETLK operation,
which allows userspace to query information about a range, including the
pids of processes holding incompatible locks.

flock() on the other hand is always advisory and does not support ranges. No
explicit deadlock detection seems to be done, though deadlocks can be broken
by the user sending a signal (including kill -9) to one of the waiting
processes. It also supports shared, exclusive and trylock type operations.

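To make the difference concrete, here is a minimal sketch of both interfaces
on a single node. It is written in Python only for brevity: Python's
fcntl.lockf() is a thin wrapper around the fcntl() locking calls and
fcntl.flock() wraps flock(2). The helper names (try_lock_in_child, demo) are
made up for this illustration, and the sketch assumes a Unix-like system with
a local temporary directory. It says nothing about how a cluster-aware
version would behave.

```python
import fcntl
import os
import tempfile


def try_lock_in_child(path, lock_fn):
    """Fork a child that opens `path` afresh and calls lock_fn(fd).

    Returns True if the child got its lock, False if the request was refused.
    (Hypothetical helper for this sketch only.)
    """
    pid = os.fork()
    if pid == 0:
        status = 1  # assume failure unless the lock call returns normally
        try:
            fd = os.open(path, os.O_RDWR)
            lock_fn(fd)
            status = 0
        except OSError:
            status = 1  # EACCES/EAGAIN: someone holds an incompatible lock
        finally:
            os._exit(status)  # leave without running the parent's cleanup
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


def fcntl_range(start, length):
    """Build a nonblocking exclusive fcntl-style request on a byte range."""
    def request(fd):
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start, os.SEEK_SET)
    return request


def flock_whole(mode):
    """Build a nonblocking flock request; flock always covers the whole file."""
    def request(fd):
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
    return request


def demo():
    results = {}
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"\0" * 100)
        tmp.flush()
        fd = tmp.fileno()

        # fcntl-style lock: the parent takes bytes 0..9 exclusively.
        fcntl.lockf(fd, fcntl.LOCK_EX, 10, 0, os.SEEK_SET)
        # Bytes 5..14 overlap the parent's range, so this is refused.
        results["fcntl_overlap"] = try_lock_in_child(tmp.name, fcntl_range(5, 10))
        # Bytes 50..59 do not overlap, so this is granted.
        results["fcntl_disjoint"] = try_lock_in_child(tmp.name, fcntl_range(50, 10))
        # Length 0 is the special case: from offset 5 to the end of the file
        # and beyond, which overlaps the parent's range.
        results["fcntl_to_eof"] = try_lock_in_child(tmp.name, fcntl_range(5, 0))
        fcntl.lockf(fd, fcntl.LOCK_UN, 10, 0, os.SEEK_SET)

        # flock: no ranges, the parent locks the whole file exclusively.
        fcntl.flock(fd, fcntl.LOCK_EX)
        results["flock_exclusive"] = try_lock_in_child(tmp.name, flock_whole(fcntl.LOCK_EX))
        results["flock_shared"] = try_lock_in_child(tmp.name, flock_whole(fcntl.LOCK_SH))
        fcntl.flock(fd, fcntl.LOCK_UN)
        # Once the parent has unlocked, the same request is granted.
        results["flock_after_unlock"] = try_lock_in_child(tmp.name, flock_whole(fcntl.LOCK_EX))
    return results


if __name__ == "__main__":
    for name, granted in demo().items():
        print(name, "granted" if granted else "refused")
```

The child reopens the file by name rather than reusing the inherited
descriptor. That matters for flock(), whose locks belong to the open file
description and would otherwise be shared with the parent across fork().
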
And finally, quoting from the fcntl() man page:

"Since kernel 2.0, there is no interaction between the types of lock placed
by flock(2) and fcntl(2)."

Now, to get to an actual example of application usage, I took a look at the
apache 2.2.2 source. It seems that they do file locking in the apr functions
apr_file_lock() / apr_file_unlock() (located in
srclib/apr/file_io/unix/flock.c). On Linux, these use fcntl(). The only
consumers of those functions I could find in the httpd tarball are the "sdbm"
routines in and around srclib/apr-util/dbm/sdbm/

And that's about where my apache expertise ends :/

There are many more apps to look at for information though - sendmail
immediately comes to mind. An strace on my machine here reveals that rpm
uses fcntl() locking.

So the question for the current OCFS2 code base is what impact the lack of
cluster-aware fcntl() locking has on the particular set of software which
we're going to worry about right now. Whenever we choose to do it (and we
_will_ do it), it will take a long time to develop - fcntl() locking alone
encompasses about two thirds of our non-trivial dlm feature wishlist.

> - From a data integrity perspective, it shouldn't make a difference to an
> application whether competing reader/writers are on the same node or a
> different node. If standard locking primitives are already in use by the
> application, they should "just work" if the competing process is on
> another node.

I agree 100%.

> > -Online file system resize
>
> This would be nice, and I think easily done in the same manner ext3
> does. Anything outside the file system's current view of the block
> device can be initialized in userspace, and the last block group,
> bitmaps, and superblock would be adjusted by an ioctl in kernelspace.

Yep, that's basically how we plan to approach it.

> > - We'd like as much of it to be user space code as is possible and
> >   practical.
>
> The heartbeat project does a pretty good job on the userspace end, but
> as Andi pointed out, it has the usual shortcomings of anything in
> userspace involved with writing data inside the kernel. It is prone to
> deadlocks and we could miss node topology events.

Ahh ok, so that explains how heartbeat handles it (aka, it doesn't right
now). It really seems to me that we're going to need to find a solution to
this sort of problem sooner or later (why do I get the feeling that I'm
being naive?). Speaking from experience, having cluster stack components in
kernel means a much longer development time. Even small focused ones like
the OCFS2 stack.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com