From: Wendy Cheng <wcheng@redhat.com>
To: suparna@in.ibm.com
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC][WIP] DIO simplification and AIO-DIO stability
Date: Thu, 23 Feb 2006 14:12:04 -0500
Message-ID: <43FE0904.3020900@redhat.com>
In-Reply-To: <20060223072955.GA14244@in.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 1929 bytes --]
Suparna Bhattacharya wrote:
>http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
>(also inlined below)
>
>
Hi, Suparna,
It would be nice to ensure that the lock sequence will not cause issues
for out-of-tree external kernel modules (e.g. cluster file systems) that
require extra locking for various purposes. We've found several
deadlock issues in the Global File System (GFS) Direct IO path due to the
lock order enforced by the VFS layer:
1) In sys_ftruncate()->do_truncate(), the VFS layer grabs
* i_sem (i_mutex)
* then i_alloc_sem
* and then calls the filesystem's setattr().
2) In a Direct IO read, the VFS layer calls
* the filesystem's direct_IO(), which
* grabs i_sem (i_mutex)
* followed by i_alloc_sem (inside __blockdev_direct_IO()).
In our case, both gfs_setattr() and gfs_direct_IO() need their own
(global) locks to synchronize inter-node (and inter-process) access to
control structures, but gfs_direct_IO() later ends up in the
__blockdev_direct_IO() path, which deadlocks against i_sem (i_mutex) and
i_alloc_sem.
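To make the inversion concrete, here is a minimal userspace sketch (plain
pthreads, not kernel code; the two mutexes merely stand in for i_sem and
the GFS cluster lock, and the gfs_* call sites are only mirrored in the
comments):

/*
 * ABBA illustration only: truncate_path() mimics
 * sys_ftruncate()->do_truncate()->gfs_setattr(), dio_read_path() mimics
 * an O_DIRECT read through gfs_direct_IO()->__blockdev_direct_IO().
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t i_sem = PTHREAD_MUTEX_INITIALIZER; /* inode->i_sem (i_mutex)  */
static pthread_mutex_t glock = PTHREAD_MUTEX_INITIALIZER; /* GFS global/cluster lock */

static void *truncate_path(void *arg)
{
	pthread_mutex_lock(&i_sem);	/* do_truncate(): i_sem first        */
	pthread_mutex_lock(&glock);	/* gfs_setattr(): then the glock ... */
	pthread_mutex_unlock(&glock);
	pthread_mutex_unlock(&i_sem);
	return NULL;
}

static void *dio_read_path(void *arg)
{
	pthread_mutex_lock(&glock);	/* gfs_direct_IO(): glock first           */
	pthread_mutex_lock(&i_sem);	/* __blockdev_direct_IO(): then i_sem ... */
	pthread_mutex_unlock(&i_sem);
	pthread_mutex_unlock(&glock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, truncate_path, NULL);
	pthread_create(&t2, NULL, dio_read_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("lucky timing, no deadlock this run\n");
	return 0;
}

With unlucky scheduling the first thread holds i_sem and waits for the
glock while the second holds the glock and waits for i_sem -- the same
hang we hit in GFS.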
A new DIO flag was added to our distribution (2.6.9 based) to work
around the problem by moving the inode semaphore acquisition done within
__blockdev_direct_IO() (patch attached) out into the GFS code path (so the
lock order can be rearranged). The new lock granularity is not ideal, but it
gets us out of this deadlock.
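Roughly, the rearranged caller looks like the sketch below (illustration
only, not the actual GFS code: gfs_glock_lock()/gfs_glock_unlock() and
gfs_get_blocks() are placeholder names, i_alloc_sem and AIO completion
handling are omitted, and the only point is that the order now matches the
truncate path, i.e. i_sem before the cluster lock):

/* Sketch against a 2.6.9-era tree with the attached patch applied. */
#include <linux/fs.h>
#include <linux/uio.h>

static ssize_t gfs_direct_IO(int rw, struct kiocb *iocb,
			     const struct iovec *iov, loff_t offset,
			     unsigned long nr_segs)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	ssize_t ret;

	if (rw == READ)
		down(&inode->i_sem);	/* writers already hold i_sem via the VFS */
	gfs_glock_lock(inode);		/* cluster lock taken after i_sem, matching
					 * the i_sem -> glock order of gfs_setattr() */

	/* DIO_CLUSTER_LOCKING: __blockdev_direct_IO() skips i_sem/i_alloc_sem */
	ret = blockdev_direct_IO_cluster_locking(rw, iocb, inode,
			inode->i_sb->s_bdev, iov, offset, nr_segs,
			gfs_get_blocks, NULL);

	gfs_glock_unlock(inode);
	if (rw == READ)
		up(&inode->i_sem);
	return ret;
}

Whether the filesystem should also take i_alloc_sem here (and keep it held
across AIO completion the way DIO_LOCKING does) is a separate question; the
point of the flag is only that the filesystem gets to pick the order.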
We haven't had a chance to go through your mail (and patch) in detail yet,
but would like to bring up this issue early, before it gets messy.
-- Wendy
[-- Attachment #2: linux-2.6.9-dio-gfs-locking.patch --]
[-- Type: text/plain, Size: 2648 bytes --]
--- linux-2.6.9-22.EL/include/linux/fs.h 2005-12-07 12:43:55.000000000 -0500
+++ linux.truncate/include/linux/fs.h 2005-12-02 00:25:22.000000000 -0500
@@ -1509,7 +1509,8 @@ ssize_t __blockdev_direct_IO(int rw, str
int lock_type);
enum {
- DIO_LOCKING = 1, /* need locking between buffered and direct access */
+ DIO_CLUSTER_LOCKING = 0, /* allow (cluster) fs handle its own locking */
+ DIO_LOCKING, /* need locking between buffered and direct access */
DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */
DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
};
@@ -1541,6 +1542,15 @@ static inline ssize_t blockdev_direct_IO
nr_segs, get_blocks, end_io, DIO_OWN_LOCKING);
}
+static inline ssize_t blockdev_direct_IO_cluster_locking(int rw, struct kiocb *iocb,
+ struct inode *inode, struct block_device *bdev, const struct iovec *iov,
+ loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks,
+ dio_iodone_t end_io)
+{
+ return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
+ nr_segs, get_blocks, end_io, DIO_CLUSTER_LOCKING);
+}
+
extern struct file_operations generic_ro_fops;
#define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))
--- linux-2.6.9-22.EL/fs/direct-io.c 2005-11-09 17:26:02.000000000 -0500
+++ linux.truncate/fs/direct-io.c 2005-12-07 12:27:17.000000000 -0500
@@ -515,7 +515,7 @@ static int get_more_blocks(struct dio *d
fs_count++;
create = dio->rw == WRITE;
- if (dio->lock_type == DIO_LOCKING) {
+ if ((dio->lock_type == DIO_LOCKING) || (dio->lock_type == DIO_CLUSTER_LOCKING)) {
if (dio->block_in_file < (i_size_read(dio->inode) >>
dio->blkbits))
create = 0;
@@ -1183,9 +1183,16 @@ __blockdev_direct_IO(int rw, struct kioc
* For regular files using DIO_OWN_LOCKING,
* neither readers nor writers take any locks here
* (i_sem is already held and release for writers here)
+ * DIO_CLUSTER_LOCKING allows a (cluster) filesystem to manage its own
+ * locking (bypassing the i_sem and i_alloc_sem handling within
+ * __blockdev_direct_IO()).
*/
+
dio->lock_type = dio_lock_type;
- if (dio_lock_type != DIO_NO_LOCKING) {
+ if (dio_lock_type == DIO_CLUSTER_LOCKING)
+ goto cluster_skip_locking;
+
+ if (dio_lock_type != DIO_NO_LOCKING) {
if (rw == READ) {
struct address_space *mapping;
@@ -1205,6 +1212,9 @@ __blockdev_direct_IO(int rw, struct kioc
if (dio_lock_type == DIO_LOCKING)
down_read(&inode->i_alloc_sem);
}
+
+cluster_skip_locking:
+
/*
* For file extending writes updating i_size before data
* writeouts complete can expose uninitialized blocks. So