* [PATCH 0/2] handle O_(D)SYNC for AIO
@ 2012-11-23 7:55 Christoph Hellwig
2012-11-23 7:55 ` [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions Christoph Hellwig
2012-11-23 7:55 ` [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO Christoph Hellwig
0 siblings, 2 replies; 14+ messages in thread
From: Christoph Hellwig @ 2012-11-23 7:55 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Jan Kara, Jeff Moyer, Darrick J. Wong
This is my version of the AIO O_SYNC series, which is a lot simpler than
the previous attempts. The first patch adds generic support for deferring
AIO completions to a workqueue by taking the XFS code that alter got copied
to ext4 to the core. At least the XFS code could be further simplified
after this and a memory allocation could be dropped, but due to Dave having
other changes in that area pending I'm not going to add those bits yet.
The second patch simply adds a call to generic_write_sync from the
I/O completion handler and makes sure it is called from user context
if we have to. This works fine for the "normal" blkdev_direct_IO callers
and XFS, but it currently breaks ext4 due to the mess that ext4 I/O
completions are.
We could work around this by requiring the filesystem to call
generic_write_sync from its own end_io handler if we we absolutely need to,
but I'd be much happier if someone who understands ext4 better than I
could fix it. More details in the second patch.
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions
2012-11-23 7:55 [PATCH 0/2] handle O_(D)SYNC for AIO Christoph Hellwig
@ 2012-11-23 7:55 ` Christoph Hellwig
2012-11-27 16:17 ` Jeff Moyer
2012-12-06 19:05 ` Jan Kara
2012-11-23 7:55 ` [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO Christoph Hellwig
1 sibling, 2 replies; 14+ messages in thread
From: Christoph Hellwig @ 2012-11-23 7:55 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Jan Kara, Jeff Moyer, Darrick J. Wong
[-- Attachment #1: directio-defer-completions --]
[-- Type: text/plain, Size: 15829 bytes --]
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue. This replaces opencoded and less efficient
code in XFS and ext4 and will be needed to properly support O_(D)SYNC
for AIO.
The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.
Currently this creates a per-superblock unbnound workqueue for these
completions, which is taken from an earlier patch by Jan Kara. I'm
not really convince about this use and would prefer a "normal" global
workqueue with a high concurrently limit, but this needs further discussion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/direct-io.c | 50 +++++++++++++++++++++++++++++---------------
fs/ext4/ext4.h | 5 ----
fs/ext4/inode.c | 15 +------------
fs/ext4/page-io.c | 4 ---
fs/ocfs2/aops.c | 8 -------
fs/super.c | 20 +++++++++--------
fs/xfs/xfs_aops.c | 28 ++++--------------------
fs/xfs/xfs_aops.h | 3 --
include/linux/buffer_head.h | 2 +
include/linux/fs.h | 7 ++++--
10 files changed, 61 insertions(+), 81 deletions(-)
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c 2012-11-06 13:33:43.446660128 +0100
+++ linux-2.6/fs/direct-io.c 2012-11-21 21:22:57.875141232 +0100
@@ -126,6 +126,7 @@ struct dio {
spinlock_t bio_lock; /* protects BIO fields below */
int page_errors; /* errno from get_user_pages() */
int is_async; /* is IO async ? */
+ bool defer_completion; /* defer AIO completion to workqueue? */
int io_error; /* IO error in completion path */
unsigned long refcount; /* direct_io_worker() and bios */
struct bio *bio_list; /* singly linked via bi_private */
@@ -140,7 +141,10 @@ struct dio {
* allocation time. Don't add new fields after pages[] unless you
* wish that they not be zeroed.
*/
- struct page *pages[DIO_PAGES]; /* page buffer */
+ union {
+ struct page *pages[DIO_PAGES]; /* page buffer */
+ struct work_struct complete_work;/* deferred AIO completion */
+ };
} ____cacheline_aligned_in_smp;
static struct kmem_cache *dio_cache __read_mostly;
@@ -220,16 +224,16 @@ static inline struct page *dio_get_page(
* dio_complete() - called when all DIO BIO I/O has been completed
* @offset: the byte offset in the file of the completed operation
*
- * This releases locks as dictated by the locking type, lets interested parties
- * know that a DIO operation has completed, and calculates the resulting return
- * code for the operation.
+ * This drops i_dio_count, lets interested parties know that a DIO operation
+ * has completed, and calculates the resulting return code for the operation.
*
* It lets the filesystem know if it registered an interest earlier via
* get_block. Pass the private field of the map buffer_head so that
* filesystems can use it to hold additional state between get_block calls and
* dio_complete.
*/
-static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
+static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
+ bool is_async)
{
ssize_t transferred = 0;
@@ -257,19 +261,26 @@ static ssize_t dio_complete(struct dio *
if (ret == 0)
ret = transferred;
- if (dio->end_io && dio->result) {
- dio->end_io(dio->iocb, offset, transferred,
- dio->private, ret, is_async);
- } else {
- if (is_async)
- aio_complete(dio->iocb, ret, 0);
- inode_dio_done(dio->inode);
- }
+ if (dio->result && dio->end_io)
+ dio->end_io(dio->iocb, offset, transferred, dio->private);
+
+ if (is_async)
+ aio_complete(dio->iocb, ret, 0);
+ inode_dio_done(dio->inode);
+ kmem_cache_free(dio_cache, dio);
return ret;
}
+static void dio_aio_complete_work(struct work_struct *work)
+{
+ struct dio *dio = container_of(work, struct dio, complete_work);
+
+ dio_complete(dio, dio->iocb->ki_pos, 0, true);
+}
+
static int dio_bio_complete(struct dio *dio, struct bio *bio);
+
/*
* Asynchronous IO callback.
*/
@@ -289,8 +300,13 @@ static void dio_bio_end_aio(struct bio *
spin_unlock_irqrestore(&dio->bio_lock, flags);
if (remaining == 0) {
- dio_complete(dio, dio->iocb->ki_pos, 0, true);
- kmem_cache_free(dio_cache, dio);
+ if (dio->result && dio->defer_completion) {
+ INIT_WORK(&dio->complete_work, dio_aio_complete_work);
+ queue_work(dio->inode->i_sb->s_dio_done_wq,
+ &dio->complete_work);
+ } else {
+ dio_complete(dio, dio->iocb->ki_pos, 0, true);
+ }
}
}
@@ -579,6 +595,9 @@ static int get_more_blocks(struct dio *d
/* Store for completion */
dio->private = map_bh->b_private;
+
+ if (buffer_defer_completion(map_bh))
+ dio->defer_completion = true;
}
return ret;
}
@@ -1271,7 +1290,6 @@ do_blockdev_direct_IO(int rw, struct kio
if (drop_refcount(dio) == 0) {
retval = dio_complete(dio, offset, retval, false);
- kmem_cache_free(dio_cache, dio);
} else
BUG_ON(retval != -EIOCBQUEUED);
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c 2012-11-06 13:33:43.706660126 +0100
+++ linux-2.6/fs/super.c 2012-11-21 21:19:50.000000000 +0100
@@ -152,15 +152,13 @@ static struct super_block *alloc_super(s
static const struct super_operations default_op;
if (s) {
- if (security_sb_alloc(s)) {
- /*
- * We cannot call security_sb_free() without
- * security_sb_alloc() succeeding. So bail out manually
- */
- kfree(s);
- s = NULL;
- goto out;
- }
+ s->s_dio_done_wq = alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
+ if (!s->s_dio_done_wq)
+ goto out_free_sb;
+
+ if (security_sb_alloc(s))
+ goto out_destroy_wq;
+
#ifdef CONFIG_SMP
s->s_files = alloc_percpu(struct list_head);
if (!s->s_files)
@@ -228,6 +226,9 @@ err_out:
free_percpu(s->s_files);
#endif
destroy_sb_writers(s);
+out_destroy_wq:
+ destroy_workqueue(s->s_dio_done_wq);
+out_free_sb:
kfree(s);
s = NULL;
goto out;
@@ -244,6 +245,7 @@ static inline void destroy_super(struct
#ifdef CONFIG_SMP
free_percpu(s->s_files);
#endif
+ destroy_workqueue(s->s_dio_done_wq);
destroy_sb_writers(s);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2012-11-06 13:33:44.526660122 +0100
+++ linux-2.6/include/linux/fs.h 2012-11-21 21:19:50.000000000 +0100
@@ -44,6 +44,7 @@ struct vm_area_struct;
struct vfsmount;
struct cred;
struct swap_info_struct;
+struct workqueue_struct;
extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -61,8 +62,7 @@ struct buffer_head;
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
- ssize_t bytes, void *private, int ret,
- bool is_async);
+ ssize_t bytes, void *private);
#define MAY_EXEC 0x00000001
#define MAY_WRITE 0x00000002
@@ -1321,6 +1321,9 @@ struct super_block {
/* Being remounted read-only */
int s_readonly_remount;
+
+ /* AIO completions deferred from interrupt context */
+ struct workqueue_struct *s_dio_done_wq;
};
/* superblock cache pruning functions */
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c 2012-11-06 13:33:43.483326794 +0100
+++ linux-2.6/fs/ext4/inode.c 2012-11-21 21:19:35.227136042 +0100
@@ -2876,15 +2876,13 @@ static int ext4_get_block_write_nolock(s
}
static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
- ssize_t size, void *private, int ret,
- bool is_async)
+ ssize_t size, void *private)
{
- struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
ext4_io_end_t *io_end = iocb->private;
/* if not async direct IO or dio with 0 bytes write, just return */
if (!io_end || !size)
- goto out;
+ return;
ext_debug("ext4_end_io_dio(): io_end 0x%p "
"for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
@@ -2896,19 +2894,11 @@ static void ext4_end_io_dio(struct kiocb
/* if not aio dio with unwritten extents, just free io and return */
if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
ext4_free_io_end(io_end);
-out:
- if (is_async)
- aio_complete(iocb, ret, 0);
- inode_dio_done(inode);
return;
}
io_end->offset = offset;
io_end->size = size;
- if (is_async) {
- io_end->iocb = iocb;
- io_end->result = ret;
- }
ext4_add_complete_io(io_end);
}
@@ -3044,7 +3034,6 @@ static ssize_t ext4_ext_direct_IO(int rw
ret = -ENOMEM;
goto retake_lock;
}
- io_end->flag |= EXT4_IO_END_DIRECT;
iocb->private = io_end;
/*
* we save the io structure for current async
Index: linux-2.6/fs/ext4/page-io.c
===================================================================
--- linux-2.6.orig/fs/ext4/page-io.c 2012-11-06 13:33:43.486660128 +0100
+++ linux-2.6/fs/ext4/page-io.c 2012-11-21 19:51:44.475001087 +0100
@@ -104,11 +104,7 @@ static int ext4_end_io(ext4_io_end_t *io
"(inode %lu, offset %llu, size %zd, error %d)",
inode->i_ino, offset, size, ret);
}
- if (io->iocb)
- aio_complete(io->iocb, io->result, 0);
- if (io->flag & EXT4_IO_END_DIRECT)
- inode_dio_done(inode);
/* Wake up anyone waiting on unwritten extent conversion */
if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
wake_up_all(ext4_ioend_wq(io->inode));
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c 2012-11-06 13:33:43.646660127 +0100
+++ linux-2.6/fs/ocfs2/aops.c 2012-11-21 19:52:05.291001619 +0100
@@ -565,9 +565,7 @@ bail:
static void ocfs2_dio_end_io(struct kiocb *iocb,
loff_t offset,
ssize_t bytes,
- void *private,
- int ret,
- bool is_async)
+ void *private)
{
struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
int level;
@@ -592,10 +590,6 @@ static void ocfs2_dio_end_io(struct kioc
level = ocfs2_iocb_rw_locked_level(iocb);
ocfs2_rw_unlock(inode, level);
-
- if (is_async)
- aio_complete(iocb, ret, 0);
- inode_dio_done(inode);
}
/*
Index: linux-2.6/fs/xfs/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_aops.c 2012-11-21 10:58:21.356982409 +0100
+++ linux-2.6/fs/xfs/xfs_aops.c 2012-11-21 21:19:50.000000000 +0100
@@ -85,14 +85,6 @@ xfs_destroy_ioend(
bh->b_end_io(bh, !ioend->io_error);
}
- if (ioend->io_iocb) {
- if (ioend->io_isasync) {
- aio_complete(ioend->io_iocb, ioend->io_error ?
- ioend->io_error : ioend->io_result, 0);
- }
- inode_dio_done(ioend->io_inode);
- }
-
mempool_free(ioend, xfs_ioend_pool);
}
@@ -290,7 +282,6 @@ xfs_alloc_ioend(
* all the I/O from calling the completion routine too early.
*/
atomic_set(&ioend->io_remaining, 1);
- ioend->io_isasync = 0;
ioend->io_isdirect = 0;
ioend->io_error = 0;
ioend->io_list = NULL;
@@ -300,8 +291,6 @@ xfs_alloc_ioend(
ioend->io_buffer_tail = NULL;
ioend->io_offset = 0;
ioend->io_size = 0;
- ioend->io_iocb = NULL;
- ioend->io_result = 0;
ioend->io_append_trans = NULL;
INIT_WORK(&ioend->io_work, xfs_end_io);
@@ -1280,8 +1269,10 @@ __xfs_get_blocks(
if (create || !ISUNWRITTEN(&imap))
xfs_map_buffer(inode, bh_result, &imap, offset);
if (create && ISUNWRITTEN(&imap)) {
- if (direct)
+ if (direct) {
bh_result->b_private = inode;
+ set_buffer_defer_completion(bh_result);
+ }
set_buffer_unwritten(bh_result);
}
}
@@ -1378,9 +1369,7 @@ xfs_end_io_direct_write(
struct kiocb *iocb,
loff_t offset,
ssize_t size,
- void *private,
- int ret,
- bool is_async)
+ void *private)
{
struct xfs_ioend *ioend = iocb->private;
@@ -1402,17 +1391,10 @@ xfs_end_io_direct_write(
ioend->io_offset = offset;
ioend->io_size = size;
- ioend->io_iocb = iocb;
- ioend->io_result = ret;
if (private && size > 0)
ioend->io_type = XFS_IO_UNWRITTEN;
- if (is_async) {
- ioend->io_isasync = 1;
- xfs_finish_ioend(ioend);
- } else {
- xfs_finish_ioend_sync(ioend);
- }
+ xfs_finish_ioend_sync(ioend);
}
STATIC ssize_t
Index: linux-2.6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.orig/fs/ext4/ext4.h 2012-11-06 13:33:43.479993461 +0100
+++ linux-2.6/fs/ext4/ext4.h 2012-11-21 21:19:50.000000000 +0100
@@ -142,7 +142,7 @@ struct ext4_allocation_request {
*/
#define EXT4_MAP_NEW (1 << BH_New)
#define EXT4_MAP_MAPPED (1 << BH_Mapped)
-#define EXT4_MAP_UNWRITTEN (1 << BH_Unwritten)
+#define EXT4_MAP_UNWRITTEN ((1 << BH_Unwritten) | (1 << BH_Defer_Completion))
#define EXT4_MAP_BOUNDARY (1 << BH_Boundary)
#define EXT4_MAP_UNINIT (1 << BH_Uninit)
/* Sometimes (in the bigalloc case, from ext4_da_get_block_prep) the caller of
@@ -185,7 +185,6 @@ struct mpage_da_data {
#define EXT4_IO_END_UNWRITTEN 0x0001
#define EXT4_IO_END_ERROR 0x0002
#define EXT4_IO_END_QUEUED 0x0004
-#define EXT4_IO_END_DIRECT 0x0008
struct ext4_io_page {
struct page *p_page;
@@ -209,8 +208,6 @@ typedef struct ext4_io_end {
loff_t offset; /* offset in the file */
ssize_t size; /* size of the extent */
struct work_struct work; /* data work queue */
- struct kiocb *iocb; /* iocb struct for AIO */
- int result; /* error value for AIO */
int num_io_pages; /* for writepages() */
struct ext4_io_page *pages[MAX_IO_PAGES]; /* for writepages() */
} ext4_io_end_t;
Index: linux-2.6/fs/xfs/xfs_aops.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_aops.h 2012-11-06 13:33:43.736660126 +0100
+++ linux-2.6/fs/xfs/xfs_aops.h 2012-11-21 20:11:42.211031753 +0100
@@ -45,7 +45,6 @@ typedef struct xfs_ioend {
unsigned int io_type; /* delalloc / unwritten */
int io_error; /* I/O error code */
atomic_t io_remaining; /* hold count */
- unsigned int io_isasync : 1; /* needs aio_complete */
unsigned int io_isdirect : 1;/* direct I/O */
struct inode *io_inode; /* file being written to */
struct buffer_head *io_buffer_head;/* buffer linked list head */
@@ -54,8 +53,6 @@ typedef struct xfs_ioend {
xfs_off_t io_offset; /* offset in the file */
struct work_struct io_work; /* xfsdatad work queue */
struct xfs_trans *io_append_trans;/* xact. for size update */
- struct kiocb *io_iocb;
- int io_result;
} xfs_ioend_t;
extern const struct address_space_operations xfs_address_space_operations;
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h 2012-11-21 21:19:35.227136042 +0100
+++ linux-2.6/include/linux/buffer_head.h 2012-11-21 21:19:50.000000000 +0100
@@ -34,6 +34,7 @@ enum bh_state_bits {
BH_Write_EIO, /* I/O error on write */
BH_Unwritten, /* Buffer is allocated on disk but not written */
BH_Quiet, /* Buffer Error Prinks to be quiet */
+ BH_Defer_Completion, /* Defer AIO completion to workqueue */
BH_PrivateStart,/* not a state bit, but the first bit available
* for private allocation by other entities
@@ -124,6 +125,7 @@ BUFFER_FNS(Delay, delay)
BUFFER_FNS(Boundary, boundary)
BUFFER_FNS(Write_EIO, write_io_error)
BUFFER_FNS(Unwritten, unwritten)
+BUFFER_FNS(Defer_Completion, defer_completion)
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
#define touch_buffer(bh) mark_page_accessed(bh->b_page)
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO
2012-11-23 7:55 [PATCH 0/2] handle O_(D)SYNC for AIO Christoph Hellwig
2012-11-23 7:55 ` [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions Christoph Hellwig
@ 2012-11-23 7:55 ` Christoph Hellwig
2012-11-27 16:19 ` Jeff Moyer
1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2012-11-23 7:55 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Jan Kara, Jeff Moyer, Darrick J. Wong
[-- Attachment #1: directio-handle-aio-O_SYNC --]
[-- Type: text/plain, Size: 4934 bytes --]
Call generic_write_sync from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Note: this currently breaks ext4 due to it's convoluted unwritten
extent conversion code. I've tried to understand it and as far
as I can see it's a workaround for the fact that ext4 marks page
writeback as completed before converting unwritten extents.
Ext4 should follow xfs on this and only mark writeback as completed
when it really is and at that point can remove the big hairy mess to
force unwritten extent conversions from fsync, truncate and a few other
places.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/block_dev.c | 2 +-
fs/btrfs/file.c | 2 +-
fs/cifs/file.c | 2 +-
fs/direct-io.c | 22 +++++++++++++++++++++-
fs/ext4/file.c | 2 +-
mm/filemap.c | 2 +-
6 files changed, 26 insertions(+), 6 deletions(-)
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c 2012-11-21 21:19:34.075136013 +0100
+++ linux-2.6/fs/block_dev.c 2012-11-21 21:23:51.227142598 +0100
@@ -1631,7 +1631,7 @@ ssize_t blkdev_aio_write(struct kiocb *i
percpu_down_read(&bdev->bd_block_size_semaphore);
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
ssize_t err;
err = generic_write_sync(file, pos, ret);
Index: linux-2.6/fs/btrfs/file.c
===================================================================
--- linux-2.6.orig/fs/btrfs/file.c 2012-11-21 21:19:34.075136013 +0100
+++ linux-2.6/fs/btrfs/file.c 2012-11-21 21:23:51.231142597 +0100
@@ -1495,7 +1495,7 @@ static ssize_t btrfs_file_aio_write(stru
* one running right now.
*/
BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
- if (num_written > 0 || num_written == -EIOCBQUEUED) {
+ if (num_written > 0) {
err = generic_write_sync(file, pos, num_written);
if (err < 0 && num_written > 0)
num_written = err;
Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c 2012-11-21 21:19:34.075136013 +0100
+++ linux-2.6/fs/cifs/file.c 2012-11-21 21:23:51.231142597 +0100
@@ -2464,7 +2464,7 @@ cifs_writev(struct kiocb *iocb, const st
mutex_unlock(&inode->i_mutex);
}
- if (rc > 0 || rc == -EIOCBQUEUED) {
+ if (rc > 0) {
ssize_t err;
err = generic_write_sync(file, pos, rc);
Index: linux-2.6/fs/ext4/file.c
===================================================================
--- linux-2.6.orig/fs/ext4/file.c 2012-11-21 21:19:34.075136013 +0100
+++ linux-2.6/fs/ext4/file.c 2012-11-21 21:23:51.231142597 +0100
@@ -155,7 +155,7 @@ ext4_file_dio_write(struct kiocb *iocb,
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);
- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
ssize_t err;
err = generic_write_sync(file, pos, ret);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2012-11-21 21:19:34.075136013 +0100
+++ linux-2.6/mm/filemap.c 2012-11-21 21:23:51.235142597 +0100
@@ -2532,7 +2532,7 @@ ssize_t generic_file_aio_write(struct ki
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);
- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
ssize_t err;
err = generic_write_sync(file, pos, ret);
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c 2012-11-21 21:22:57.875141232 +0100
+++ linux-2.6/fs/direct-io.c 2012-11-21 21:23:51.235142597 +0100
@@ -264,8 +264,19 @@ static ssize_t dio_complete(struct dio *
if (dio->result && dio->end_io)
dio->end_io(dio->iocb, offset, transferred, dio->private);
- if (is_async)
+ if (is_async) {
+ if (dio->rw & WRITE) {
+ int err;
+
+ err = generic_write_sync(dio->iocb->ki_filp, offset,
+ transferred);
+ if (err < 0 && ret > 0)
+ ret = err;
+ }
+
aio_complete(dio->iocb, ret, 0);
+ }
+
inode_dio_done(dio->inode);
kmem_cache_free(dio_cache, dio);
@@ -1163,6 +1174,15 @@ do_blockdev_direct_IO(int rw, struct kio
dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
(end > i_size_read(inode)));
+ /*
+ * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
+ * so that we can call ->fsync.
+ */
+ if (dio->is_async && (rw & WRITE) &&
+ ((iocb->ki_filp->f_flags & O_DSYNC) ||
+ IS_SYNC(iocb->ki_filp->f_mapping->host)))
+ dio->defer_completion = true;
+
retval = 0;
dio->inode = inode;
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions
2012-11-23 7:55 ` [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions Christoph Hellwig
@ 2012-11-27 16:17 ` Jeff Moyer
2012-12-06 19:05 ` Jan Kara
1 sibling, 0 replies; 14+ messages in thread
From: Jeff Moyer @ 2012-11-27 16:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, Jan Kara, Darrick J. Wong
Christoph Hellwig <hch@infradead.org> writes:
> Add support to the core direct-io code to defer AIO completions to user
> context using a workqueue. This replaces opencoded and less efficient
> code in XFS and ext4 and will be needed to properly support O_(D)SYNC
> for AIO.
>
> The communication between the filesystem and the direct I/O code requires
> a new buffer head flag, which is a bit ugly but not avoidable until the
> direct I/O code stops abusing the buffer_head structure for communicating
> with the filesystems.
>
> Currently this creates a per-superblock unbnound workqueue for these
> completions, which is taken from an earlier patch by Jan Kara. I'm
> not really convince about this use and would prefer a "normal" global
> workqueue with a high concurrently limit, but this needs further discussion.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
I like this a lot, thanks Christoph.
Acked-by: Jeff Moyer <jmoyer@redhat.com>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO
2012-11-23 7:55 ` [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO Christoph Hellwig
@ 2012-11-27 16:19 ` Jeff Moyer
2012-11-28 0:26 ` Darrick J. Wong
2012-12-05 13:08 ` [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO Jan Kara
0 siblings, 2 replies; 14+ messages in thread
From: Jeff Moyer @ 2012-11-27 16:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, Jan Kara, Darrick J. Wong
Christoph Hellwig <hch@infradead.org> writes:
> Call generic_write_sync from the deferred I/O completion handler if
> O_DSYNC is set for a write request. Also make sure various callers
> don't call generic_write_sync if the direct I/O code returns
> -EIOCBQUEUED.
>
> Note: this currently breaks ext4 due to it's convoluted unwritten
> extent conversion code. I've tried to understand it and as far
> as I can see it's a workaround for the fact that ext4 marks page
> writeback as completed before converting unwritten extents.
> Ext4 should follow xfs on this and only mark writeback as completed
> when it really is and at that point can remove the big hairy mess to
> force unwritten extent conversions from fsync, truncate and a few other
> places.
Well, I've tested the xfs bits, and there are no regressions. I won't
bother testing ext4 until a full solution is in place. Jan, is this
something you have the time for and/or interest in doing?
Cheers,
Jeff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO
2012-11-27 16:19 ` Jeff Moyer
@ 2012-11-28 0:26 ` Darrick J. Wong
2012-11-28 0:30 ` Christoph Hellwig
2012-12-05 13:08 ` [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO Jan Kara
1 sibling, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2012-11-28 0:26 UTC (permalink / raw)
To: Jeff Moyer; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara, linux-ext4
On Tue, Nov 27, 2012 at 11:19:41AM -0500, Jeff Moyer wrote:
> Christoph Hellwig <hch@infradead.org> writes:
>
> > Call generic_write_sync from the deferred I/O completion handler if
> > O_DSYNC is set for a write request. Also make sure various callers
> > don't call generic_write_sync if the direct I/O code returns
> > -EIOCBQUEUED.
> >
> > Note: this currently breaks ext4 due to it's convoluted unwritten
> > extent conversion code. I've tried to understand it and as far
> > as I can see it's a workaround for the fact that ext4 marks page
> > writeback as completed before converting unwritten extents.
> > Ext4 should follow xfs on this and only mark writeback as completed
> > when it really is and at that point can remove the big hairy mess to
> > force unwritten extent conversions from fsync, truncate and a few other
> > places.
>
> Well, I've tested the xfs bits, and there are no regressions. I won't
> bother testing ext4 until a full solution is in place. Jan, is this
> something you have the time for and/or interest in doing?
I tested the raw block device use case and it seems to be ok too.
cc'ing the ext4 list, since this affects ext4 pretty heavily.
--D
>
> Cheers,
> Jeff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO
2012-11-28 0:26 ` Darrick J. Wong
@ 2012-11-28 0:30 ` Christoph Hellwig
2012-11-28 8:02 ` [RFC PATCH] ext4: Convert unwritten extents during end_io processing Darrick J. Wong
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2012-11-28 0:30 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Jeff Moyer, Christoph Hellwig, linux-fsdevel, Jan Kara,
linux-ext4
On Tue, Nov 27, 2012 at 04:26:56PM -0800, Darrick J. Wong wrote:
> cc'ing the ext4 list, since this affects ext4 pretty heavily.
Oops, I really should have done that for the full series.
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH] ext4: Convert unwritten extents during end_io processing
2012-11-28 0:30 ` Christoph Hellwig
@ 2012-11-28 8:02 ` Darrick J. Wong
2012-11-28 14:34 ` Christoph Hellwig
0 siblings, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2012-11-28 8:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jeff Moyer, linux-fsdevel, Jan Kara, linux-ext4
Here's a lightly tested (it passed enough of xfstests and an aio+dio+osync
tester on ext4 on x64...) patch that rips out the whole wq mess to convert
unwritten extents from endio processing. This has the effect that unwritten
extents are now converted as part of writeback, not fsync/truncate/punch_hole.
I have a suspicion that the reason why ext4 had that behavior was to reduce
churn in the extent tree if one writes a bunch of adjacent sections of hole.
Oh well. I haven't seen any huge regressions yet, but then I'm really just
posting this early to see if anyone spots obvious bugs.
Christoph, was this what you had in mind?
--D
---
When writing into an unwritten part of a file, convert the unwritten extent
while doing the endio processing instead of deferring it to another workqueue.
This enables us to reduce the endio processing to only one workqueue and also
means that writeback doesn't end until the extent conversion is complete.
This patch is intended to be a follow-on to Christoph's "handle O_(D)SYNC for
AIO" patchset.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
fs/ext4/ext4.h | 9 ---
fs/ext4/extents.c | 14 ++---
fs/ext4/fsync.c | 4 -
fs/ext4/indirect.c | 6 --
fs/ext4/inode.c | 10 +---
fs/ext4/page-io.c | 139 ++++------------------------------------------------
fs/ext4/super.c | 18 -------
7 files changed, 23 insertions(+), 177 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6129dd5..98d2b92 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -904,9 +904,6 @@ struct ext4_inode_info {
qsize_t i_reserved_quota;
#endif
- /* completed IOs that might need unwritten extents handling */
- struct list_head i_completed_io_list;
- spinlock_t i_completed_io_lock;
atomic_t i_ioend_count; /* Number of outstanding io_end structs */
atomic_t i_unwritten; /* Nr. of inflight conversions pending */
@@ -1273,9 +1270,6 @@ struct ext4_sb_info {
struct flex_groups *s_flex_groups;
ext4_group_t s_flex_groups_allocated;
- /* workqueue for dio unwritten */
- struct workqueue_struct *dio_unwritten_wq;
-
/* timer for periodic error stats printing */
struct timer_list s_err_report;
@@ -1944,7 +1938,6 @@ extern void ext4_htree_free_dir_info(struct dir_private_info *p);
/* fsync.c */
extern int ext4_sync_file(struct file *, loff_t, loff_t, int);
-extern int ext4_flush_unwritten_io(struct inode *);
/* hash.c */
extern int ext4fs_dirhash(const char *name, int len, struct
@@ -2416,7 +2409,7 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
/* page-io.c */
extern int __init ext4_init_pageio(void);
-extern void ext4_add_complete_io(ext4_io_end_t *io_end);
+extern void ext4_complete_io(ext4_io_end_t *io_end);
extern void ext4_exit_pageio(void);
extern void ext4_ioend_wait(struct inode *);
extern void ext4_free_io_end(ext4_io_end_t *io);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7011ac9..f3fec5f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4300,10 +4300,10 @@ void ext4_ext_truncate(struct inode *inode)
int err = 0;
/*
- * finish any pending end_io work so we won't run the risk of
- * converting any truncated blocks to initialized later
+ * Wait for pending extent conversion so we won't accidentally
+ * convert truncated blocks to initialized later.
*/
- ext4_flush_unwritten_io(inode);
+ ext4_unwritten_wait(inode);
/*
* probably first extent we're gonna free will be last in block
@@ -4464,8 +4464,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (len <= EXT_UNINIT_MAX_LEN << blkbits)
flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
- /* Prevent race condition between unwritten */
- ext4_flush_unwritten_io(inode);
+ /* Don't race with unwritten */
+ ext4_unwritten_wait(inode);
retry:
while (ret >= 0 && ret < max_blocks) {
map.m_lblk = map.m_lblk + ret;
@@ -4885,9 +4885,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
/* Wait all existing dio workers, newcomers will block on i_mutex */
ext4_inode_block_unlocked_dio(inode);
- err = ext4_flush_unwritten_io(inode);
- if (err)
- goto out_dio;
+ ext4_unwritten_wait(inode);
inode_dio_wait(inode);
credits = ext4_writepage_trans_blocks(inode);
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index be1d89f..94f5d3e 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -138,9 +138,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (inode->i_sb->s_flags & MS_RDONLY)
goto out;
- ret = ext4_flush_unwritten_io(inode);
- if (ret < 0)
- goto out;
+ ext4_unwritten_wait(inode);
if (!journal) {
ret = __sync_inode(inode, datasync);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 792e388..f4b65f2 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,11 +807,7 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
retry:
if (rw == READ && ext4_should_dioread_nolock(inode)) {
- if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
- mutex_lock(&inode->i_mutex);
- ext4_flush_unwritten_io(inode);
- mutex_unlock(&inode->i_mutex);
- }
+ ext4_unwritten_wait(inode);
/*
* Nolock dioread optimization may be dynamically disabled
* via ext4_inode_block_unlocked_dio(). Check inode's state
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0446f28..3c9f6a0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2891,16 +2891,10 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
iocb->private = NULL;
- /* if not aio dio with unwritten extents, just free io and return */
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
- ext4_free_io_end(io_end);
- return;
- }
-
io_end->offset = offset;
io_end->size = size;
- ext4_add_complete_io(io_end);
+ ext4_complete_io(io_end);
}
static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
@@ -2925,7 +2919,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
*/
inode = io_end->inode;
ext4_set_io_unwritten_flag(inode, io_end);
- ext4_add_complete_io(io_end);
+ ext4_complete_io(io_end);
out:
bh->b_private = NULL;
bh->b_end_io = NULL;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index df1d5b7..2adf23f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -71,7 +71,6 @@ void ext4_free_io_end(ext4_io_end_t *io)
int i;
BUG_ON(!io);
- BUG_ON(!list_empty(&io->list));
BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
if (io->page)
@@ -92,9 +91,8 @@ static int ext4_end_io(ext4_io_end_t *io)
ssize_t size = io->size;
int ret = 0;
- ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
- "list->prev 0x%p\n",
- io, inode->i_ino, io->list.next, io->list.prev);
+ ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu\n",
+ io, inode->i_ino);
ret = ext4_convert_unwritten_extents(inode, offset, size);
if (ret < 0) {
@@ -111,125 +109,19 @@ static int ext4_end_io(ext4_io_end_t *io)
return ret;
}
-static void dump_completed_IO(struct inode *inode)
-{
-#ifdef EXT4FS_DEBUG
- struct list_head *cur, *before, *after;
- ext4_io_end_t *io, *io0, *io1;
- unsigned long flags;
-
- if (list_empty(&EXT4_I(inode)->i_completed_io_list)) {
- ext4_debug("inode %lu completed_io list is empty\n",
- inode->i_ino);
- return;
- }
-
- ext4_debug("Dump inode %lu completed_io list\n", inode->i_ino);
- list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list) {
- cur = &io->list;
- before = cur->prev;
- io0 = container_of(before, ext4_io_end_t, list);
- after = cur->next;
- io1 = container_of(after, ext4_io_end_t, list);
-
- ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
- io, inode->i_ino, io0, io1);
- }
-#endif
-}
-
-/* Add the io_end to per-inode completed end_io list. */
-void ext4_add_complete_io(ext4_io_end_t *io_end)
-{
- struct ext4_inode_info *ei = EXT4_I(io_end->inode);
- struct workqueue_struct *wq;
- unsigned long flags;
-
- BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
- wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
-
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- if (list_empty(&ei->i_completed_io_list)) {
- io_end->flag |= EXT4_IO_END_QUEUED;
- queue_work(wq, &io_end->work);
- }
- list_add_tail(&io_end->list, &ei->i_completed_io_list);
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
-}
-
-static int ext4_do_flush_completed_IO(struct inode *inode,
- ext4_io_end_t *work_io)
-{
- ext4_io_end_t *io;
- struct list_head unwritten, complete, to_free;
- unsigned long flags;
- struct ext4_inode_info *ei = EXT4_I(inode);
- int err, ret = 0;
-
- INIT_LIST_HEAD(&complete);
- INIT_LIST_HEAD(&to_free);
-
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- dump_completed_IO(inode);
- list_replace_init(&ei->i_completed_io_list, &unwritten);
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
-
- while (!list_empty(&unwritten)) {
- io = list_entry(unwritten.next, ext4_io_end_t, list);
- BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
- list_del_init(&io->list);
-
- err = ext4_end_io(io);
- if (unlikely(!ret && err))
- ret = err;
-
- list_add_tail(&io->list, &complete);
- }
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- while (!list_empty(&complete)) {
- io = list_entry(complete.next, ext4_io_end_t, list);
- io->flag &= ~EXT4_IO_END_UNWRITTEN;
- /* end_io context can not be destroyed now because it still
- * used by queued worker. Worker thread will destroy it later */
- if (io->flag & EXT4_IO_END_QUEUED)
- list_del_init(&io->list);
- else
- list_move(&io->list, &to_free);
- }
- /* If we are called from worker context, it is time to clear queued
- * flag, and destroy it's end_io if it was converted already */
- if (work_io) {
- work_io->flag &= ~EXT4_IO_END_QUEUED;
- if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
- list_add_tail(&work_io->list, &to_free);
- }
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
-
- while (!list_empty(&to_free)) {
- io = list_entry(to_free.next, ext4_io_end_t, list);
- list_del_init(&io->list);
- ext4_free_io_end(io);
- }
- return ret;
-}
-
/*
- * work on completed aio dio IO, to convert unwritten extents to extents
+ * Complete endio processing by converting unwritten extents (if necessary) and
+ * freeing the endio.
*/
-static void ext4_end_io_work(struct work_struct *work)
+void ext4_complete_io(ext4_io_end_t *io_end)
{
- ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
- ext4_do_flush_completed_IO(io->inode, io);
-}
+ if (!(io_end->flag & EXT4_IO_END_UNWRITTEN))
+ goto out;
-int ext4_flush_unwritten_io(struct inode *inode)
-{
- int ret;
- WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
- !(inode->i_state & I_FREEING));
- ret = ext4_do_flush_completed_IO(inode, NULL);
- ext4_unwritten_wait(inode);
- return ret;
+ ext4_end_io(io_end);
+ io_end->flag &= ~EXT4_IO_END_UNWRITTEN;
+out:
+ ext4_free_io_end(io_end);
}
ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
@@ -238,8 +130,6 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
if (io) {
atomic_inc(&EXT4_I(inode)->i_ioend_count);
io->inode = inode;
- INIT_WORK(&io->work, ext4_end_io_work);
- INIT_LIST_HEAD(&io->list);
}
return io;
}
@@ -315,12 +205,7 @@ static void ext4_end_bio(struct bio *bio, int error)
bi_sector >> (inode->i_blkbits - 9));
}
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
- ext4_free_io_end(io_end);
- return;
- }
-
- ext4_add_complete_io(io_end);
+ ext4_complete_io(io_end);
}
void ext4_io_submit(struct ext4_io_submit *io)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 80928f7..1691209 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -848,9 +848,6 @@ static void ext4_put_super(struct super_block *sb)
ext4_unregister_li_request(sb);
dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED);
- flush_workqueue(sbi->dio_unwritten_wq);
- destroy_workqueue(sbi->dio_unwritten_wq);
-
if (sbi->s_journal) {
err = jbd2_journal_destroy(sbi->s_journal);
sbi->s_journal = NULL;
@@ -953,8 +950,6 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_reserved_quota = 0;
#endif
ei->jinode = NULL;
- INIT_LIST_HEAD(&ei->i_completed_io_list);
- spin_lock_init(&ei->i_completed_io_lock);
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
atomic_set(&ei->i_ioend_count, 0);
@@ -3903,17 +3898,6 @@ no_journal:
}
/*
- * The maximum number of concurrent works can be high and
- * concurrency isn't really necessary. Limit it to 1.
- */
- EXT4_SB(sb)->dio_unwritten_wq =
- alloc_workqueue("ext4-dio-unwritten", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
- if (!EXT4_SB(sb)->dio_unwritten_wq) {
- printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
- goto failed_mount_wq;
- }
-
- /*
* The jbd2_journal_load will have done any necessary log recovery,
* so we can safely mount the rest of the filesystem now.
*/
@@ -4045,7 +4029,6 @@ failed_mount4a:
sb->s_root = NULL;
failed_mount4:
ext4_msg(sb, KERN_ERR, "mount failed");
- destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
failed_mount_wq:
if (sbi->s_journal) {
jbd2_journal_destroy(sbi->s_journal);
@@ -4493,7 +4476,6 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
struct ext4_sb_info *sbi = EXT4_SB(sb);
trace_ext4_sync_fs(sb, wait);
- flush_workqueue(sbi->dio_unwritten_wq);
/*
* Writeback quota in non-journalled quota case - journalled quota has
* no dirty dquots
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] ext4: Convert unwritten extents during end_io processing
2012-11-28 8:02 ` [RFC PATCH] ext4: Convert unwritten extents during end_io processing Darrick J. Wong
@ 2012-11-28 14:34 ` Christoph Hellwig
2012-11-29 19:47 ` Darrick J. Wong
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2012-11-28 14:34 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Jeff Moyer, linux-fsdevel, Jan Kara,
linux-ext4
On Wed, Nov 28, 2012 at 12:02:54AM -0800, Darrick J. Wong wrote:
> Here's a lightly tested (it passed enough of xfstests and an aio+dio+osync
> tester on ext4 on x64...) patch that rips out the whole wq mess to convert
> unwritten extents from endio processing. This has the effect that unwritten
> extents are now converted as part of writeback, not fsync/truncate/punch_hole.
> I have a suspicion that the reason why ext4 had that behavior was to reduce
> churn in the extent tree if one writes a bunch of adjacent sections of hole.
> Oh well. I haven't seen any huge regressions yet, but then I'm really just
> posting this early to see if anyone spots obvious bugs.
>
> Christoph, was this what you had in mind?
Can you actually call ext4_convert_unwritten_extents from irq context
safely for the buffered I/O case? At least for the XFS equivalent we
need user context, which is why we have these workqueues in the first
place.
But what we're doing is to make sure unwritten extent conversion happens
before marking the page writeback complete, so that
filemap_write_and_wait and friends implicitly wait for this conversion
when waiting for page I/O to complete, and thus removing the need for
all the explicit flushing infrastructure.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] ext4: Convert unwritten extents during end_io processing
2012-11-28 14:34 ` Christoph Hellwig
@ 2012-11-29 19:47 ` Darrick J. Wong
0 siblings, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2012-11-29 19:47 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jeff Moyer, linux-fsdevel, Jan Kara, linux-ext4
On Wed, Nov 28, 2012 at 09:34:05AM -0500, Christoph Hellwig wrote:
> On Wed, Nov 28, 2012 at 12:02:54AM -0800, Darrick J. Wong wrote:
> > Here's a lightly tested (it passed enough of xfstests and an aio+dio+osync
> > tester on ext4 on x64...) patch that rips out the whole wq mess to convert
> > unwritten extents from endio processing. This has the effect that unwritten
> > extents are now converted as part of writeback, not fsync/truncate/punch_hole.
> > I have a suspicion that the reason why ext4 had that behavior was to reduce
> > churn in the extent tree if one writes a bunch of adjacent sections of hole.
> > Oh well. I haven't seen any huge regressions yet, but then I'm really just
> > posting this early to see if anyone spots obvious bugs.
> >
> > Christoph, was this what you had in mind?
>
> Can you actually call ext4_convert_unwritten_extents from irq context
> safely for the buffered I/O case? At least for the XFS equivalent we
> need user context, which is why we have these workqueues in the first
> place.
You can't call the conversion from irq context. It /looks/ like for the
buffered case the conversion seems to get done from the context of the calling
process, and it's only for dio that we need to do odd twists to make
dio_complete happen from a wq.
Sadly, I also discovered that I hadn't fixed all the cases where the conversion
could happen from irq context. I think I found the last two, but now I'm
suspicious that I've messed up the locking... it seems like the
generic_write_sync -> ext4_fsync_file path is encountering extents that are
still unconverted, and stalling there. Hm. Maybe I should have some lunch
first.
> But what we're doing is to make sure unwritten extent conversion happens
> before marking the page writeback complete, so that
> filemap_write_and_wait and friends implicitly wait for this conversion
> when waiting for page I/O to complete, and thus removing the need for
> all the explicit flushing infrastructure.
That's where I (hope) I'm headed too. :)
--D
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO
2012-11-27 16:19 ` Jeff Moyer
2012-11-28 0:26 ` Darrick J. Wong
@ 2012-12-05 13:08 ` Jan Kara
1 sibling, 0 replies; 14+ messages in thread
From: Jan Kara @ 2012-12-05 13:08 UTC (permalink / raw)
To: Jeff Moyer; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara, Darrick J. Wong
On Tue 27-11-12 11:19:41, Jeff Moyer wrote:
> Christoph Hellwig <hch@infradead.org> writes:
>
> > Call generic_write_sync from the deferred I/O completion handler if
> > O_DSYNC is set for a write request. Also make sure various callers
> > don't call generic_write_sync if the direct I/O code returns
> > -EIOCBQUEUED.
> >
> > Note: this currently breaks ext4 due to it's convoluted unwritten
> > extent conversion code. I've tried to understand it and as far
> > as I can see it's a workaround for the fact that ext4 marks page
> > writeback as completed before converting unwritten extents.
> > Ext4 should follow xfs on this and only mark writeback as completed
> > when it really is and at that point can remove the big hairy mess to
> > force unwritten extent conversions from fsync, truncate and a few other
> > places.
>
> Well, I've tested the xfs bits, and there are no regressions. I won't
> bother testing ext4 until a full solution is in place. Jan, is this
> something you have the time for and/or interest in doing?
Back from vacation... I see Darrick started looking at it but it didn't
quite work out? I'll have a look at it since that code in ext4 has been
nagging me for quite a while as well ... after I sort out the most urgent
stuff (maybe later this week or early next week).
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions
2012-11-23 7:55 ` [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions Christoph Hellwig
2012-11-27 16:17 ` Jeff Moyer
@ 2012-12-06 19:05 ` Jan Kara
2012-12-08 12:02 ` Christoph Hellwig
1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2012-12-06 19:05 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, Jan Kara, Jeff Moyer, Darrick J. Wong
On Fri 23-11-12 02:55:03, Christoph Hellwig wrote:
> Add support to the core direct-io code to defer AIO completions to user
> context using a workqueue. This replaces opencoded and less efficient
> code in XFS and ext4 and will be needed to properly support O_(D)SYNC
> for AIO.
>
> The communication between the filesystem and the direct I/O code requires
> a new buffer head flag, which is a bit ugly but not avoidable until the
> direct I/O code stops abusing the buffer_head structure for communicating
> with the filesystems.
>
> Currently this creates a per-superblock unbnound workqueue for these
> completions, which is taken from an earlier patch by Jan Kara. I'm
> not really convince about this use and would prefer a "normal" global
> workqueue with a high concurrently limit, but this needs further discussion.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
I like this, but this patch already breaks ext4, doesn't it?
> Index: linux-2.6/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/inode.c 2012-11-06 13:33:43.483326794 +0100
> +++ linux-2.6/fs/ext4/inode.c 2012-11-21 21:19:35.227136042 +0100
> @@ -2876,15 +2876,13 @@ static int ext4_get_block_write_nolock(s
> }
>
> static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> - ssize_t size, void *private, int ret,
> - bool is_async)
> + ssize_t size, void *private)
> {
> - struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
> ext4_io_end_t *io_end = iocb->private;
>
> /* if not async direct IO or dio with 0 bytes write, just return */
> if (!io_end || !size)
> - goto out;
> + return;
>
> ext_debug("ext4_end_io_dio(): io_end 0x%p "
> "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
> @@ -2896,19 +2894,11 @@ static void ext4_end_io_dio(struct kiocb
> /* if not aio dio with unwritten extents, just free io and return */
> if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> ext4_free_io_end(io_end);
> -out:
> - if (is_async)
> - aio_complete(iocb, ret, 0);
> - inode_dio_done(inode);
> return;
> }
>
> io_end->offset = offset;
> io_end->size = size;
> - if (is_async) {
> - io_end->iocb = iocb;
> - io_end->result = ret;
> - }
>
> ext4_add_complete_io(io_end);
^^^ Here ext4 offloads IO completion to a worker thread. So you now
complete AIO / DIO before ext4_end_io() runs which is a bug (ext4_end_io()
is responsible for example for calling end_page_writeback()). I'll modify
these patches to work for ext4 tomorrow I hope...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions
2012-12-06 19:05 ` Jan Kara
@ 2012-12-08 12:02 ` Christoph Hellwig
2012-12-20 11:15 ` Jan Kara
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2012-12-08 12:02 UTC (permalink / raw)
To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, Jeff Moyer, Darrick J. Wong
On Thu, Dec 06, 2012 at 08:05:37PM +0100, Jan Kara wrote:
> I like this, but this patch already breaks ext4, doesn't it?
> ^^^ Here ext4 offloads IO completion to a worker thread. So you now
> complete AIO / DIO before ext4_end_io() runs which is a bug (ext4_end_io()
> is responsible for example for calling end_page_writeback()). I'll modify
> these patches to work for ext4 tomorrow I hope...
You're right, patch 2 actively deadlocks ext4 under xfstests, but the
first one already breaks semantics. So any rework of the ext4 unwritten
extent handling really should go before these two patches.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions
2012-12-08 12:02 ` Christoph Hellwig
@ 2012-12-20 11:15 ` Jan Kara
0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2012-12-20 11:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, Jeff Moyer, Darrick J. Wong
On Sat 08-12-12 07:02:03, Christoph Hellwig wrote:
> On Thu, Dec 06, 2012 at 08:05:37PM +0100, Jan Kara wrote:
> > I like this, but this patch already breaks ext4, doesn't it?
>
> > ^^^ Here ext4 offloads IO completion to a worker thread. So you now
> > complete AIO / DIO before ext4_end_io() runs which is a bug (ext4_end_io()
> > is responsible for example for calling end_page_writeback()). I'll modify
> > these patches to work for ext4 tomorrow I hope...
>
> You're right, patch 2 actively deadlocks ext4 under xfstests, but the
> first one already breaks semantics. So any rework of the ext4 unwritten
> extent handling really should go before these two patches.
Just to let you know, I didn't forget about this. I even have patches
to change PageWriteback handling in ext4 but I realized it can cause
deadlocks.
When converting unwritten extents to written ones, we need to start a
transaction, that may block because of lack of journal space waiting for
currently running transaction to finish. And that running transaction never
finishes because someone (e.g. ext4_write_begin()) has a handle to it waiting
for PageWriteback bit (e.g. in grab_cache_page_write_begin()).
I'm pondering how we could fix this in ext4 because I'd really like to get
rid of that PageWriteback handling oddity in ext4. Just it's not simple...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2012-12-20 11:15 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-23 7:55 [PATCH 0/2] handle O_(D)SYNC for AIO Christoph Hellwig
2012-11-23 7:55 ` [PATCH 1/2] [PATCH 1/2] direct-io: implement generic deferred AIO completions Christoph Hellwig
2012-11-27 16:17 ` Jeff Moyer
2012-12-06 19:05 ` Jan Kara
2012-12-08 12:02 ` Christoph Hellwig
2012-12-20 11:15 ` Jan Kara
2012-11-23 7:55 ` [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO Christoph Hellwig
2012-11-27 16:19 ` Jeff Moyer
2012-11-28 0:26 ` Darrick J. Wong
2012-11-28 0:30 ` Christoph Hellwig
2012-11-28 8:02 ` [RFC PATCH] ext4: Convert unwritten extents during end_io processing Darrick J. Wong
2012-11-28 14:34 ` Christoph Hellwig
2012-11-29 19:47 ` Darrick J. Wong
2012-12-05 13:08 ` [PATCH 2/2] [PATCH 2/2] direct-io: handle handle O_(D)SYNC AIO Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).