* [PATCH 1/7] ext4: ext4_inode_info diet
2012-09-09 17:27 [PATCH 0/7] ext4: Bunch of DIO/AIO fixes Dmitry Monakhov
@ 2012-09-09 17:27 ` Dmitry Monakhov
2012-09-13 10:50 ` Zheng Liu
2012-09-09 17:27 ` [PATCH 2/7] ext4: completed_io locking cleanup Dmitry Monakhov
` (5 subsequent siblings)
6 siblings, 1 reply; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-09 17:27 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov
Generic inode has unused i_private pointer which may be used as cur_aio_dio
storage.
TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
to have concurent AIO_DIO requests.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ext4.h | 5 +++--
fs/ext4/extents.c | 4 ++--
fs/ext4/inode.c | 6 +++---
fs/ext4/super.c | 1 -
4 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f9024a6..b3f3c55 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -912,8 +912,6 @@ struct ext4_inode_info {
struct list_head i_completed_io_list;
spinlock_t i_completed_io_lock;
atomic_t i_ioend_count; /* Number of outstanding io_end structs */
- /* current io_end structure for async DIO write*/
- ext4_io_end_t *cur_aio_dio;
atomic_t i_aiodio_unwritten; /* Nr. of inflight conversions pending */
spinlock_t i_block_reservation_lock;
@@ -929,6 +927,9 @@ struct ext4_inode_info {
__u32 i_csum_seed;
};
+/* current io_end structure for async DIO write */
+#define EXT4_CUR_AIO_DIO(inode) (inode)->i_private
+
/*
* File system states
*/
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 10f0afd..e993879 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3644,7 +3644,7 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
{
int ret = 0;
int err = 0;
- ext4_io_end_t *io = EXT4_I(inode)->cur_aio_dio;
+ ext4_io_end_t *io = (ext4_io_end_t*) EXT4_CUR_AIO_DIO(inode);
ext_debug("ext4_ext_handle_uninitialized_extents: inode %lu, logical "
"block %llu, max_blocks %u, flags %x, allocated %u\n",
@@ -3902,7 +3902,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
unsigned int allocated = 0, offset = 0;
unsigned int allocated_clusters = 0;
struct ext4_allocation_request ar;
- ext4_io_end_t *io = EXT4_I(inode)->cur_aio_dio;
+ ext4_io_end_t *io = (ext4_io_end_t*) EXT4_CUR_AIO_DIO(inode);
ext4_lblk_t cluster_offset;
ext_debug("blocks %u/%u requested for inode %lu\n",
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d12d30e..202ae3f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3060,7 +3060,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
* hook to the iocb.
*/
iocb->private = NULL;
- EXT4_I(inode)->cur_aio_dio = NULL;
+ EXT4_CUR_AIO_DIO(inode) = NULL;
if (!is_sync_kiocb(iocb)) {
ext4_io_end_t *io_end =
ext4_init_io_end(inode, GFP_NOFS);
@@ -3077,7 +3077,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
* is a unwritten extents needs to be converted
* when IO is completed.
*/
- EXT4_I(inode)->cur_aio_dio = iocb->private;
+ EXT4_CUR_AIO_DIO(inode) = iocb->private;
}
if (overwrite)
@@ -3097,7 +3097,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
NULL,
DIO_LOCKING);
if (iocb->private)
- EXT4_I(inode)->cur_aio_dio = NULL;
+ EXT4_CUR_AIO_DIO(inode) = NULL;
/*
* The io_end structure takes a reference to the inode,
* that structure needs to be destroyed and the
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4a7092b..a6904b3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -967,7 +967,6 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->jinode = NULL;
INIT_LIST_HEAD(&ei->i_completed_io_list);
spin_lock_init(&ei->i_completed_io_lock);
- ei->cur_aio_dio = NULL;
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
atomic_set(&ei->i_ioend_count, 0);
--
1.7.7.6
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 1/7] ext4: ext4_inode_info diet
2012-09-09 17:27 ` [PATCH 1/7] ext4: ext4_inode_info diet Dmitry Monakhov
@ 2012-09-13 10:50 ` Zheng Liu
2012-09-13 11:15 ` Dmitry Monakhov
0 siblings, 1 reply; 30+ messages in thread
From: Zheng Liu @ 2012-09-13 10:50 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Sun, Sep 09, 2012 at 09:27:08PM +0400, Dmitry Monakhov wrote:
> Generic inode has unused i_private pointer which may be used as cur_aio_dio
> storage.
>
> TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
> to have concurent AIO_DIO requests.
Out of curiosity, could you please describe your idea about concurrent
AIO_DIO requests in get_block_t? Thanks!
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Looks good to me. You can add:
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
Zheng
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/7] ext4: ext4_inode_info diet
2012-09-13 10:50 ` Zheng Liu
@ 2012-09-13 11:15 ` Dmitry Monakhov
2012-09-15 15:53 ` Theodore Ts'o
0 siblings, 1 reply; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-13 11:15 UTC (permalink / raw)
To: Zheng Liu; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Thu, 13 Sep 2012 18:50:22 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> On Sun, Sep 09, 2012 at 09:27:08PM +0400, Dmitry Monakhov wrote:
> > Generic inode has unused i_private pointer which may be used as cur_aio_dio
> > storage.
> >
> > TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
> > to have concurent AIO_DIO requests.
>
> Out of curiosity, could you please describe your idea about concurrent
> AIO_DIO requests in get_block_t? Thanks!
Current buffer.c API layering looks sub-optimal
->xxx_fs_routine: May create different contexts
->generic_buffer_methods(inode, bh..)
->xxx_fs_get_block(inode, bh,...): There is no efficient way to pass fscontext
Both xxx_fs_routine and xxx_fs_get_block are fs specific, but
the only way to pass fscontext down to get_block is to pass it by
attaching it to inode, which make it implicit serialization point.
I just want to add fsprivate argument to get_block_t callback similar
write_begin/write_end and iocb->private, so filesystem will able to pass
it to it's get_block callback.
>
> >
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
>
> Looks good to me. You can add:
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
>
> Regards,
> Zheng
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/7] ext4: ext4_inode_info diet
2012-09-13 11:15 ` Dmitry Monakhov
@ 2012-09-15 15:53 ` Theodore Ts'o
0 siblings, 0 replies; 30+ messages in thread
From: Theodore Ts'o @ 2012-09-15 15:53 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Zheng Liu, linux-ext4, jack, wenqing.lz
On Thu, Sep 13, 2012 at 03:15:40PM +0400, Dmitry Monakhov wrote:
> Current buffer.c API layering looks sub-optimal
>
> ->xxx_fs_routine: May create different contexts
> ->generic_buffer_methods(inode, bh..)
> ->xxx_fs_get_block(inode, bh,...): There is no efficient way to pass fscontext
> Both xxx_fs_routine and xxx_fs_get_block are fs specific, but
> the only way to pass fscontext down to get_block is to pass it by
> attaching it to inode, which make it implicit serialization point.
>
> I just want to add fsprivate argument to get_block_t callback similar
> write_begin/write_end and iocb->private, so filesystem will able to pass
> it to it's get_block callback.
If we're going to change the function prototype for get_block_t, it
might also be good to fix a long-standing ugliness with this interface
which is that the get_block() interface does two very different things
when bh->b_size is greater a single block. If b_size is a greater a
block, then instead of bh being a "real" buffer_head, it's a pseudo
map_bh. This is annoying because map_bh is generally allocated on the
stack, but most of the fields are unused, since it's not really a
_real_ buffer_head, but just a countainer for a handful of fields used
by the AIO/DIO and mpage.c functions.
Instead of changing get_block(), which would require lots of changes
all over the kernel, what I'd suggest doing instead is creating a new
function which I'd propose calling get_map() which fills in a new
structure which replaces map_bh, and which includes the aio_dio
structure. It could be made mandatory for the few file systems which
support DIO, but that way we don't havec to change the dozens of other
file systems which don't use the DIO code paths.
Regards,
- Ted
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 2/7] ext4: completed_io locking cleanup
2012-09-09 17:27 [PATCH 0/7] ext4: Bunch of DIO/AIO fixes Dmitry Monakhov
2012-09-09 17:27 ` [PATCH 1/7] ext4: ext4_inode_info diet Dmitry Monakhov
@ 2012-09-09 17:27 ` Dmitry Monakhov
2012-09-10 9:23 ` Jan Kara
2012-09-13 10:48 ` Zheng Liu
2012-09-09 17:27 ` [PATCH 3/7] ext4: serialize dio nolocked reads with defrag workers V2 Dmitry Monakhov
` (4 subsequent siblings)
6 siblings, 2 replies; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-09 17:27 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov
Current unwritten extent conversion state-machine is very fuzzy.
- By unknown reason it want perform conversion under i_mutex. What for?
All this games with mutex_trylock result in following deadlock
truncate: kworker:
ext4_setattr ext4_end_io_work
mutex_lock(i_mutex)
inode_dio_wait(inode) ->BLOCK
DEADLOCK<- mutex_trylock()
inode_dio_done()
#TEST_CASE1_BEGIN
MNT=/mnt_scrach
unlink $MNT/file
fallocate -l $((1024*1024*1024)) $MNT/file
aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
sleep 2
truncate -s 0 $MNT/file
#TEST_CASE1_END
This patch makes state machine simple and clean:
(1) ext4_flush_completed_IO is responsible for handling all pending
end_io from ei->i_completed_io_list(per inode list)
NOTE1: i_completed_io_lock is acquired only once
NOTE2: i_mutex is not required because it does not protect
any data guarded by i_mutex
(2) xxx_end_io schedule end_io context completion simply by pushing it
to the inode's list.
NOTE1: because of (1) work should be queued only if
->i_completed_io_list was empty at the moment, otherwise it
work is scheduled already.
- remove useless END_IO_XXX flags
- Improve smp scalability by removing useless i_mutex which does not
protect anything
- Reduce lock contention on i_completed_io_lock
- Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ext4.h | 5 +--
fs/ext4/fsync.c | 47 +++++++++++---------------------
fs/ext4/indirect.c | 6 +---
fs/ext4/inode.c | 25 +---------------
fs/ext4/page-io.c | 76 ++++++++++++++-------------------------------------
5 files changed, 43 insertions(+), 116 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b3f3c55..be976ca 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -184,9 +184,7 @@ struct mpage_da_data {
*/
#define EXT4_IO_END_UNWRITTEN 0x0001
#define EXT4_IO_END_ERROR 0x0002
-#define EXT4_IO_END_QUEUED 0x0004
-#define EXT4_IO_END_DIRECT 0x0008
-#define EXT4_IO_END_IN_FSYNC 0x0010
+#define EXT4_IO_END_DIRECT 0x0004
struct ext4_io_page {
struct page *p_page;
@@ -2405,6 +2403,7 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
/* page-io.c */
extern int __init ext4_init_pageio(void);
+extern void ext4_add_complete_io(ext4_io_end_t *io_end);
extern void ext4_exit_pageio(void);
extern void ext4_ioend_wait(struct inode *);
extern void ext4_free_io_end(ext4_io_end_t *io);
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 323eb15..24f3719 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -73,46 +73,31 @@ static void dump_completed_IO(struct inode * inode)
* might needs to do the conversion. This function walks through
* the list and convert the related unwritten extents for completed IO
* to written.
- * The function return the number of pending IOs on success.
+ * The function return first error;
*/
int ext4_flush_completed_IO(struct inode *inode)
{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ unsigned long flags;
+ struct list_head complete_list;
+ int err, ret = 0;
ext4_io_end_t *io;
- struct ext4_inode_info *ei = EXT4_I(inode);
- unsigned long flags;
- int ret = 0;
- int ret2 = 0;
dump_completed_IO(inode);
+
spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- while (!list_empty(&ei->i_completed_io_list)){
- io = list_entry(ei->i_completed_io_list.next,
- ext4_io_end_t, list);
+ list_replace_init(&ei->i_completed_io_list, &complete_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+
+ while(!list_empty(&complete_list)) {
+ io = list_entry(complete_list.next, ext4_io_end_t, list);
list_del_init(&io->list);
- io->flag |= EXT4_IO_END_IN_FSYNC;
- /*
- * Calling ext4_end_io_nolock() to convert completed
- * IO to written.
- *
- * When ext4_sync_file() is called, run_queue() may already
- * about to flush the work corresponding to this io structure.
- * It will be upset if it founds the io structure related
- * to the work-to-be schedule is freed.
- *
- * Thus we need to keep the io structure still valid here after
- * conversion finished. The io structure has a flag to
- * avoid double converting from both fsync and background work
- * queue work.
- */
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- ret = ext4_end_io_nolock(io);
- if (ret < 0)
- ret2 = ret;
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- io->flag &= ~EXT4_IO_END_IN_FSYNC;
+ err = ext4_end_io_nolock(io);
+ ext4_free_io_end(io);
+ if (unlikely(err && !ret))
+ ret = err;
}
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- return (ret2 < 0) ? ret2 : 0;
+ return ret;
}
/*
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 830e1b2..61f13e5 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,11 +807,9 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
retry:
if (rw == READ && ext4_should_dioread_nolock(inode)) {
- if (unlikely(!list_empty(&ei->i_completed_io_list))) {
- mutex_lock(&inode->i_mutex);
+ if (unlikely(!list_empty(&ei->i_completed_io_list)))
ext4_flush_completed_IO(inode);
- mutex_unlock(&inode->i_mutex);
- }
+
ret = __blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 202ae3f..762b955 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2885,9 +2885,6 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
{
struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
ext4_io_end_t *io_end = iocb->private;
- struct workqueue_struct *wq;
- unsigned long flags;
- struct ext4_inode_info *ei;
/* if not async direct IO or dio with 0 bytes write, just return */
if (!io_end || !size)
@@ -2916,24 +2913,14 @@ out:
io_end->iocb = iocb;
io_end->result = ret;
}
- wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
-
- /* Add the io_end to per-inode completed aio dio list*/
- ei = EXT4_I(io_end->inode);
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- list_add_tail(&io_end->list, &ei->i_completed_io_list);
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- /* queue the work to convert unwritten extents to written */
- queue_work(wq, &io_end->work);
+ ext4_add_complete_io(io_end);
}
static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
{
ext4_io_end_t *io_end = bh->b_private;
- struct workqueue_struct *wq;
struct inode *inode;
- unsigned long flags;
if (!test_clear_buffer_uninit(bh) || !io_end)
goto out;
@@ -2952,15 +2939,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
*/
inode = io_end->inode;
ext4_set_io_unwritten_flag(inode, io_end);
-
- /* Add the io_end to per-inode completed io list*/
- spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
- list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
- spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
-
- wq = EXT4_SB(inode->i_sb)->dio_unwritten_wq;
- /* queue the work to convert unwritten extents to written */
- queue_work(wq, &io_end->work);
+ ext4_add_complete_io(io_end);
out:
bh->b_private = NULL;
bh->b_end_io = NULL;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index dcdeef1..c369419 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -57,6 +57,22 @@ void ext4_ioend_wait(struct inode *inode)
wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
}
+/* Add the io_end to per-inode completed aio dio list. */
+void ext4_add_complete_io(ext4_io_end_t *io_end)
+{
+ struct ext4_inode_info *ei = EXT4_I(io_end->inode);
+ struct workqueue_struct *wq;
+ unsigned long flags;
+
+ wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
+
+ spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+ if (list_empty(&ei->i_completed_io_list))
+ queue_work(wq, &io_end->work);
+ list_add_tail(&io_end->list, &ei->i_completed_io_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+}
+
static void put_io_page(struct ext4_io_page *io_page)
{
if (atomic_dec_and_test(&io_page->p_count)) {
@@ -81,12 +97,7 @@ void ext4_free_io_end(ext4_io_end_t *io)
kmem_cache_free(io_end_cachep, io);
}
-/*
- * check a range of space and convert unwritten extents to written.
- *
- * Called with inode->i_mutex; we depend on this when we manipulate
- * io->flag, since we could otherwise race with ext4_flush_completed_IO()
- */
+/* check a range of space and convert unwritten extents to written. */
int ext4_end_io_nolock(ext4_io_end_t *io)
{
struct inode *inode = io->inode;
@@ -94,6 +105,8 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
ssize_t size = io->size;
int ret = 0;
+ BUG_ON(!list_empty(&io->list));
+
ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
"list->prev 0x%p\n",
io, inode->i_ino, io->list.next, io->list.prev);
@@ -124,45 +137,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
static void ext4_end_io_work(struct work_struct *work)
{
ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
- struct inode *inode = io->inode;
- struct ext4_inode_info *ei = EXT4_I(inode);
- unsigned long flags;
-
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- if (io->flag & EXT4_IO_END_IN_FSYNC)
- goto requeue;
- if (list_empty(&io->list)) {
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- goto free;
- }
-
- if (!mutex_trylock(&inode->i_mutex)) {
- bool was_queued;
-requeue:
- was_queued = !!(io->flag & EXT4_IO_END_QUEUED);
- io->flag |= EXT4_IO_END_QUEUED;
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- /*
- * Requeue the work instead of waiting so that the work
- * items queued after this can be processed.
- */
- queue_work(EXT4_SB(inode->i_sb)->dio_unwritten_wq, &io->work);
- /*
- * To prevent the ext4-dio-unwritten thread from keeping
- * requeueing end_io requests and occupying cpu for too long,
- * yield the cpu if it sees an end_io request that has already
- * been requeued.
- */
- if (was_queued)
- yield();
- return;
- }
- list_del_init(&io->list);
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- (void) ext4_end_io_nolock(io);
- mutex_unlock(&inode->i_mutex);
-free:
- ext4_free_io_end(io);
+ (void) ext4_flush_completed_IO(io->inode);
}
ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
@@ -195,9 +170,7 @@ static void buffer_io_error(struct buffer_head *bh)
static void ext4_end_bio(struct bio *bio, int error)
{
ext4_io_end_t *io_end = bio->bi_private;
- struct workqueue_struct *wq;
struct inode *inode;
- unsigned long flags;
int i;
sector_t bi_sector = bio->bi_sector;
@@ -255,14 +228,7 @@ static void ext4_end_bio(struct bio *bio, int error)
return;
}
- /* Add the io_end to per-inode completed io list*/
- spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
- list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
- spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 2/7] ext4: completed_io locking cleanup
2012-09-09 17:27 ` [PATCH 2/7] ext4: completed_io locking cleanup Dmitry Monakhov
@ 2012-09-10 9:23 ` Jan Kara
2012-09-10 10:19 ` Dmitry Monakhov
2012-09-13 10:48 ` Zheng Liu
1 sibling, 1 reply; 30+ messages in thread
From: Jan Kara @ 2012-09-10 9:23 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Sun 09-09-12 21:27:09, Dmitry Monakhov wrote:
> Current unwritten extent conversion state-machine is very fuzzy.
> - By unknown reason it want perform conversion under i_mutex. What for?
> All this games with mutex_trylock result in following deadlock
> truncate: kworker:
> ext4_setattr ext4_end_io_work
> mutex_lock(i_mutex)
> inode_dio_wait(inode) ->BLOCK
> DEADLOCK<- mutex_trylock()
> inode_dio_done()
> #TEST_CASE1_BEGIN
> MNT=/mnt_scrach
> unlink $MNT/file
> fallocate -l $((1024*1024*1024)) $MNT/file
> aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
> sleep 2
> truncate -s 0 $MNT/file
> #TEST_CASE1_END
>
> This patch makes state machine simple and clean:
>
> (1) ext4_flush_completed_IO is responsible for handling all pending
> end_io from ei->i_completed_io_list(per inode list)
> NOTE1: i_completed_io_lock is acquired only once
> NOTE2: i_mutex is not required because it does not protect
> any data guarded by i_mutex
Removing of i_mutex might be OK but you really need to add more
justification (both into the changelog and end_io code) than just "it does
not protect any data guarded by i_mutex". So far i_mutex is supposed to
protect all modifications of extent tree, now that won't be true anymore.
We have i_data_sem protecting extent tree as well but that is acquired for
single extent operation only so it cannot be used for guarding "global
properties of the extent tree" - e.g. if end_io code would race with
truncate it could happen that end_io would create some extents beyond EOF.
Now maybe that cannot happen because of other synchronization mechanisms
(e.g. inode_dio_wait(), or waiting for PageWriteback while invalidating
page cache - although extra care needs to be taken when extents straddle
i_size and there can be IO running on the part of the extent before
i_size). So I don't immediaetly see a problem but please add more
justification to convince me (and future readers of the code) here...
Honza
>
> (2) xxx_end_io schedule end_io context completion simply by pushing it
> to the inode's list.
> NOTE1: because of (1) work should be queued only if
> ->i_completed_io_list was empty at the moment, otherwise it
> work is scheduled already.
>
> - remove useless END_IO_XXX flags
> - Improve smp scalability by removing useless i_mutex which does not
> protect anything
> - Reduce lock contention on i_completed_io_lock
> - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/ext4.h | 5 +--
> fs/ext4/fsync.c | 47 +++++++++++---------------------
> fs/ext4/indirect.c | 6 +---
> fs/ext4/inode.c | 25 +---------------
> fs/ext4/page-io.c | 76 ++++++++++++++-------------------------------------
> 5 files changed, 43 insertions(+), 116 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b3f3c55..be976ca 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -184,9 +184,7 @@ struct mpage_da_data {
> */
> #define EXT4_IO_END_UNWRITTEN 0x0001
> #define EXT4_IO_END_ERROR 0x0002
> -#define EXT4_IO_END_QUEUED 0x0004
> -#define EXT4_IO_END_DIRECT 0x0008
> -#define EXT4_IO_END_IN_FSYNC 0x0010
> +#define EXT4_IO_END_DIRECT 0x0004
>
> struct ext4_io_page {
> struct page *p_page;
> @@ -2405,6 +2403,7 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>
> /* page-io.c */
> extern int __init ext4_init_pageio(void);
> +extern void ext4_add_complete_io(ext4_io_end_t *io_end);
> extern void ext4_exit_pageio(void);
> extern void ext4_ioend_wait(struct inode *);
> extern void ext4_free_io_end(ext4_io_end_t *io);
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 323eb15..24f3719 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -73,46 +73,31 @@ static void dump_completed_IO(struct inode * inode)
> * might needs to do the conversion. This function walks through
> * the list and convert the related unwritten extents for completed IO
> * to written.
> - * The function return the number of pending IOs on success.
> + * The function return first error;
> */
> int ext4_flush_completed_IO(struct inode *inode)
> {
> + struct ext4_inode_info *ei = EXT4_I(inode);
> + unsigned long flags;
> + struct list_head complete_list;
> + int err, ret = 0;
> ext4_io_end_t *io;
> - struct ext4_inode_info *ei = EXT4_I(inode);
> - unsigned long flags;
> - int ret = 0;
> - int ret2 = 0;
>
> dump_completed_IO(inode);
> +
> spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> - while (!list_empty(&ei->i_completed_io_list)){
> - io = list_entry(ei->i_completed_io_list.next,
> - ext4_io_end_t, list);
> + list_replace_init(&ei->i_completed_io_list, &complete_list);
> + spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> +
> + while(!list_empty(&complete_list)) {
> + io = list_entry(complete_list.next, ext4_io_end_t, list);
> list_del_init(&io->list);
> - io->flag |= EXT4_IO_END_IN_FSYNC;
> - /*
> - * Calling ext4_end_io_nolock() to convert completed
> - * IO to written.
> - *
> - * When ext4_sync_file() is called, run_queue() may already
> - * about to flush the work corresponding to this io structure.
> - * It will be upset if it founds the io structure related
> - * to the work-to-be schedule is freed.
> - *
> - * Thus we need to keep the io structure still valid here after
> - * conversion finished. The io structure has a flag to
> - * avoid double converting from both fsync and background work
> - * queue work.
> - */
> - spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> - ret = ext4_end_io_nolock(io);
> - if (ret < 0)
> - ret2 = ret;
> - spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> - io->flag &= ~EXT4_IO_END_IN_FSYNC;
> + err = ext4_end_io_nolock(io);
> + ext4_free_io_end(io);
> + if (unlikely(err && !ret))
> + ret = err;
> }
> - spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> - return (ret2 < 0) ? ret2 : 0;
> + return ret;
> }
>
> /*
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 830e1b2..61f13e5 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -807,11 +807,9 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>
> retry:
> if (rw == READ && ext4_should_dioread_nolock(inode)) {
> - if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> - mutex_lock(&inode->i_mutex);
> + if (unlikely(!list_empty(&ei->i_completed_io_list)))
> ext4_flush_completed_IO(inode);
> - mutex_unlock(&inode->i_mutex);
> - }
> +
> ret = __blockdev_direct_IO(rw, iocb, inode,
> inode->i_sb->s_bdev, iov,
> offset, nr_segs,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 202ae3f..762b955 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2885,9 +2885,6 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> {
> struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
> ext4_io_end_t *io_end = iocb->private;
> - struct workqueue_struct *wq;
> - unsigned long flags;
> - struct ext4_inode_info *ei;
>
> /* if not async direct IO or dio with 0 bytes write, just return */
> if (!io_end || !size)
> @@ -2916,24 +2913,14 @@ out:
> io_end->iocb = iocb;
> io_end->result = ret;
> }
> - wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> -
> - /* Add the io_end to per-inode completed aio dio list*/
> - ei = EXT4_I(io_end->inode);
> - spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> - list_add_tail(&io_end->list, &ei->i_completed_io_list);
> - spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
>
> - /* queue the work to convert unwritten extents to written */
> - queue_work(wq, &io_end->work);
> + ext4_add_complete_io(io_end);
> }
>
> static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
> {
> ext4_io_end_t *io_end = bh->b_private;
> - struct workqueue_struct *wq;
> struct inode *inode;
> - unsigned long flags;
>
> if (!test_clear_buffer_uninit(bh) || !io_end)
> goto out;
> @@ -2952,15 +2939,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
> */
> inode = io_end->inode;
> ext4_set_io_unwritten_flag(inode, io_end);
> -
> - /* Add the io_end to per-inode completed io list*/
> - spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
> - list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
> - spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
> -
> - wq = EXT4_SB(inode->i_sb)->dio_unwritten_wq;
> - /* queue the work to convert unwritten extents to written */
> - queue_work(wq, &io_end->work);
> + ext4_add_complete_io(io_end);
> out:
> bh->b_private = NULL;
> bh->b_end_io = NULL;
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index dcdeef1..c369419 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -57,6 +57,22 @@ void ext4_ioend_wait(struct inode *inode)
> wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
> }
>
> +/* Add the io_end to per-inode completed aio dio list. */
> +void ext4_add_complete_io(ext4_io_end_t *io_end)
> +{
> + struct ext4_inode_info *ei = EXT4_I(io_end->inode);
> + struct workqueue_struct *wq;
> + unsigned long flags;
> +
> + wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> +
> + spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> + if (list_empty(&ei->i_completed_io_list))
> + queue_work(wq, &io_end->work);
> + list_add_tail(&io_end->list, &ei->i_completed_io_list);
> + spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> +}
> +
> static void put_io_page(struct ext4_io_page *io_page)
> {
> if (atomic_dec_and_test(&io_page->p_count)) {
> @@ -81,12 +97,7 @@ void ext4_free_io_end(ext4_io_end_t *io)
> kmem_cache_free(io_end_cachep, io);
> }
>
> -/*
> - * check a range of space and convert unwritten extents to written.
> - *
> - * Called with inode->i_mutex; we depend on this when we manipulate
> - * io->flag, since we could otherwise race with ext4_flush_completed_IO()
> - */
> +/* check a range of space and convert unwritten extents to written. */
> int ext4_end_io_nolock(ext4_io_end_t *io)
> {
> struct inode *inode = io->inode;
> @@ -94,6 +105,8 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
> ssize_t size = io->size;
> int ret = 0;
>
> + BUG_ON(!list_empty(&io->list));
> +
> ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
> "list->prev 0x%p\n",
> io, inode->i_ino, io->list.next, io->list.prev);
> @@ -124,45 +137,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
> static void ext4_end_io_work(struct work_struct *work)
> {
> ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> - struct inode *inode = io->inode;
> - struct ext4_inode_info *ei = EXT4_I(inode);
> - unsigned long flags;
> -
> - spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> - if (io->flag & EXT4_IO_END_IN_FSYNC)
> - goto requeue;
> - if (list_empty(&io->list)) {
> - spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> - goto free;
> - }
> -
> - if (!mutex_trylock(&inode->i_mutex)) {
> - bool was_queued;
> -requeue:
> - was_queued = !!(io->flag & EXT4_IO_END_QUEUED);
> - io->flag |= EXT4_IO_END_QUEUED;
> - spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> - /*
> - * Requeue the work instead of waiting so that the work
> - * items queued after this can be processed.
> - */
> - queue_work(EXT4_SB(inode->i_sb)->dio_unwritten_wq, &io->work);
> - /*
> - * To prevent the ext4-dio-unwritten thread from keeping
> - * requeueing end_io requests and occupying cpu for too long,
> - * yield the cpu if it sees an end_io request that has already
> - * been requeued.
> - */
> - if (was_queued)
> - yield();
> - return;
> - }
> - list_del_init(&io->list);
> - spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> - (void) ext4_end_io_nolock(io);
> - mutex_unlock(&inode->i_mutex);
> -free:
> - ext4_free_io_end(io);
> + (void) ext4_flush_completed_IO(io->inode);
> }
>
> ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
> @@ -195,9 +170,7 @@ static void buffer_io_error(struct buffer_head *bh)
> static void ext4_end_bio(struct bio *bio, int error)
> {
> ext4_io_end_t *io_end = bio->bi_private;
> - struct workqueue_struct *wq;
> struct inode *inode;
> - unsigned long flags;
> int i;
> sector_t bi_sector = bio->bi_sector;
>
> @@ -255,14 +228,7 @@ static void ext4_end_bio(struct bio *bio, int error)
> return;
> }
>
> - /* Add the io_end to per-inode completed io list*/
> - spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
> - list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
> - spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
> -
> - wq = EXT4_SB(inode->i_sb)->dio_unwritten_wq;
> - /* queue the work to convert unwritten extents to written */
> - queue_work(wq, &io_end->work);
> + ext4_add_complete_io(io_end);
> }
>
> void ext4_io_submit(struct ext4_io_submit *io)
> --
> 1.7.7.6
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/7] ext4: completed_io locking cleanup
2012-09-10 9:23 ` Jan Kara
@ 2012-09-10 10:19 ` Dmitry Monakhov
0 siblings, 0 replies; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-10 10:19 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Mon, 10 Sep 2012 11:23:12 +0200, Jan Kara <jack@suse.cz> wrote:
> On Sun 09-09-12 21:27:09, Dmitry Monakhov wrote:
> > Current unwritten extent conversion state-machine is very fuzzy.
> > - By unknown reason it want perform conversion under i_mutex. What for?
> > All this games with mutex_trylock result in following deadlock
> > truncate: kworker:
> > ext4_setattr ext4_end_io_work
> > mutex_lock(i_mutex)
> > inode_dio_wait(inode) ->BLOCK
> > DEADLOCK<- mutex_trylock()
> > inode_dio_done()
> > #TEST_CASE1_BEGIN
> > MNT=/mnt_scrach
> > unlink $MNT/file
> > fallocate -l $((1024*1024*1024)) $MNT/file
> > aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
> > sleep 2
> > truncate -s 0 $MNT/file
> > #TEST_CASE1_END
> >
> > This patch makes state machine simple and clean:
> >
> > (1) ext4_flush_completed_IO is responsible for handling all pending
> > end_io from ei->i_completed_io_list(per inode list)
> > NOTE1: i_completed_io_lock is acquired only once
> > NOTE2: i_mutex is not required because it does not protect
> > any data guarded by i_mutex
> Removing of i_mutex might be OK but you really need to add more
> justification (both into the changelog and end_io code) than just "it does
> not protect any data guarded by i_mutex". So far i_mutex is supposed to
> protect all modifications of extent tree, now that won't be true anymore.
> We have i_data_sem protecting extent tree as well but that is acquired for
> single extent operation only so it cannot be used for guarding "global
> properties of the extent tree" - e.g. if end_io code would race with
> truncate it could happen that end_io would create some extents beyond EOF.
> Now maybe that cannot happen because of other synchronization mechanisms
> (e.g. inode_dio_wait(), or waiting for PageWriteback while invalidating
> page cache - although extra care needs to be taken when extents straddle
> i_size and there can be IO running on the part of the extent before
> i_size). So I don't immediaetly see a problem but please add more
> justification to convince me (and future readers of the code) here...
Ok you right. Actually I've already tried to figure out correct BUG_ON
condition to spot extent beyond EOF, but this condition is not trivial
and need justification. My current assumption:
Since EXT4_GET_BLOCKS_UNINIT_EXT is used only then file size is not
changed it seems to be correct to prohibit
ext4_convert_unwritten_extents() to handle blocks beyond
EXT4_I(inode)->i_disksize (inode->i_size is not suitable here
because truncate may already reduce it, and now wait for existing dio)
As a naive word in my defence i should say that blocks beyond EOF issue easily
triggered on current kernel via xfstests, but not happen w/ my patch-queue.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 2/7] ext4: completed_io locking cleanup
2012-09-09 17:27 ` [PATCH 2/7] ext4: completed_io locking cleanup Dmitry Monakhov
2012-09-10 9:23 ` Jan Kara
@ 2012-09-13 10:48 ` Zheng Liu
1 sibling, 0 replies; 30+ messages in thread
From: Zheng Liu @ 2012-09-13 10:48 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Sun, Sep 09, 2012 at 09:27:09PM +0400, Dmitry Monakhov wrote:
> Current unwritten extent conversion state-machine is very fuzzy.
> - By unknown reason it want perform conversion under i_mutex. What for?
> All this games with mutex_trylock result in following deadlock
> truncate: kworker:
> ext4_setattr ext4_end_io_work
> mutex_lock(i_mutex)
> inode_dio_wait(inode) ->BLOCK
> DEADLOCK<- mutex_trylock()
> inode_dio_done()
> #TEST_CASE1_BEGIN
> MNT=/mnt_scrach
> unlink $MNT/file
> fallocate -l $((1024*1024*1024)) $MNT/file
> aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
> sleep 2
> truncate -s 0 $MNT/file
> #TEST_CASE1_END
>
> This patch makes state machine simple and clean:
>
> (1) ext4_flush_completed_IO is responsible for handling all pending
> end_io from ei->i_completed_io_list(per inode list)
> NOTE1: i_completed_io_lock is acquired only once
> NOTE2: i_mutex is not required because it does not protect
> any data guarded by i_mutex
>
> (2) xxx_end_io schedule end_io context completion simply by pushing it
> to the inode's list.
> NOTE1: because of (1) work should be queued only if
> ->i_completed_io_list was empty at the moment, otherwise it
> work is scheduled already.
>
> - remove useless END_IO_XXX flags
> - Improve smp scalability by removing useless i_mutex which does not
> protect anything
> - Reduce lock contention on i_completed_io_lock
> - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Looks good to me. You can add:
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
Zheng
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 3/7] ext4: serialize dio nolocked reads with defrag workers V2
2012-09-09 17:27 [PATCH 0/7] ext4: Bunch of DIO/AIO fixes Dmitry Monakhov
2012-09-09 17:27 ` [PATCH 1/7] ext4: ext4_inode_info diet Dmitry Monakhov
2012-09-09 17:27 ` [PATCH 2/7] ext4: completed_io locking cleanup Dmitry Monakhov
@ 2012-09-09 17:27 ` Dmitry Monakhov
2012-09-10 9:31 ` Jan Kara
2012-09-09 17:27 ` [PATCH 4/7] ext4: fsync should wait for DIO writers Dmitry Monakhov
` (3 subsequent siblings)
6 siblings, 1 reply; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-09 17:27 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov
Inode's block defrag and ext4_change_inode_journal_flag() may
affect nonlocked DIO reads result, so proper synchronization
required.
- Add missed inode_dio_wait() calls where appropriate
- Check inode state under extra i_dio_count reference.
Changes from V1:
- add missed memory bariers
- move DIOREAD_LOCK state check out from generic should_dioread_nolock
otherwise it will affect existing DIO readers.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/indirect.c | 23 +++++++++++++++++++++++
fs/ext4/inode.c | 4 ++++
fs/ext4/move_extent.c | 5 +++++
4 files changed, 35 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index be976ca..0377d2b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1348,6 +1348,8 @@ enum {
EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_DELALLOC_RESERVED, /* blks already reserved for delalloc */
+ EXT4_STATE_DIOREAD_LOCK, /* Disable support for dio read
+ nolocking */
};
#define EXT4_INODE_BIT_FNS(name, field, offset) \
@@ -2015,6 +2017,7 @@ extern void ext4_da_update_reserve_space(struct inode *inode,
/* indirect.c */
extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags);
+extern void ext4_inode_dio_wait(struct inode *inode, int resume);
extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
const struct iovec *iov, loff_t offset,
unsigned long nr_segs);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 61f13e5..3714413 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -760,6 +760,19 @@ out:
return err;
}
+/* Wait for existing dio workers
+ * Disable DIO read nolock optimization, so new dioreaders will
+ * be forced to grab i_mutex
+*/
+void ext4_inode_dio_wait(struct inode *inode, int resume_readers)
+{
+ ext4_set_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
+ smp_mb();
+ inode_dio_wait(inode);
+ if (resume_readers)
+ ext4_set_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
+}
+
/*
* O_DIRECT for ext3 (or indirect map) based files
*
@@ -810,10 +823,20 @@ retry:
if (unlikely(!list_empty(&ei->i_completed_io_list)))
ext4_flush_completed_IO(inode);
+ /* Nolock dioread optimization may be dynamically disabled.
+ * Check inode's state while holding extra i_dio_count ref. */
+ atomic_inc(&inode->i_dio_count);
+ smp_mb();
+ if (!unlikely(ext4_test_inode_state(inode,
+ EXT4_STATE_DIOREAD_LOCK))) {
+ inode_dio_done(inode);
+ goto retry;
+ }
ret = __blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
ext4_get_block, NULL, NULL, 0);
+ inode_dio_done(inode);
} else {
ret = blockdev_direct_IO(rw, iocb, inode, iov,
offset, nr_segs, ext4_get_block);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 762b955..ffb4a27 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4720,6 +4720,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
return err;
}
+ /* Wait for all existing dio workers */
+ ext4_inode_dio_wait(inode, 0);
+
jbd2_journal_lock_updates(journal);
/*
@@ -4739,6 +4742,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_aops(inode);
jbd2_journal_unlock_updates(journal);
+ ext4_clear_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
/* Finally we can mark the inode as dirty. */
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index c5826c6..aea0e7a 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -1214,6 +1214,9 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp,
if (ret1 < 0)
return ret1;
+ /* Wait for all existing dio workers */
+ ext4_inode_dio_wait(orig_inode, 0);
+ ext4_inode_dio_wait(donor_inode, 0);
/* Protect extent tree against block allocations via delalloc */
double_down_write_data_sem(orig_inode, donor_inode);
/* Check the filesystem environment whether move_extent can be done */
@@ -1412,6 +1415,8 @@ out:
kfree(holecheck_path);
}
double_up_write_data_sem(orig_inode, donor_inode);
+ ext4_clear_inode_state(orig_inode, EXT4_STATE_DIOREAD_LOCK);
+ ext4_clear_inode_state(donor_inode, EXT4_STATE_DIOREAD_LOCK);
ret2 = mext_inode_double_unlock(orig_inode, donor_inode);
if (ret1)
--
1.7.7.6
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 3/7] ext4: serialize dio nolocked reads with defrag workers V2
2012-09-09 17:27 ` [PATCH 3/7] ext4: serialize dio nolocked reads with defrag workers V2 Dmitry Monakhov
@ 2012-09-10 9:31 ` Jan Kara
2012-09-10 10:00 ` Jan Kara
0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2012-09-10 9:31 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Sun 09-09-12 21:27:10, Dmitry Monakhov wrote:
> Inode's block defrag and ext4_change_inode_journal_flag() may
> affect nonlocked DIO reads result, so proper synchronization
> required.
>
> - Add missed inode_dio_wait() calls where appropriate
> - Check inode state under extra i_dio_count reference.
>
> Changes from V1:
> - add missed memory bariers
> - move DIOREAD_LOCK state check out from generic should_dioread_nolock
> otherwise it will affect existing DIO readers.
Looks good. Just maybe I would rename ext4_inode_dio_wait() to
ext4_inode_block_unlocked_dio(). Anyway you can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/ext4.h | 3 +++
> fs/ext4/indirect.c | 23 +++++++++++++++++++++++
> fs/ext4/inode.c | 4 ++++
> fs/ext4/move_extent.c | 5 +++++
> 4 files changed, 35 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index be976ca..0377d2b 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1348,6 +1348,8 @@ enum {
> EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
> EXT4_STATE_NEWENTRY, /* File just added to dir */
> EXT4_STATE_DELALLOC_RESERVED, /* blks already reserved for delalloc */
> + EXT4_STATE_DIOREAD_LOCK, /* Disable support for dio read
> + nolocking */
> };
>
> #define EXT4_INODE_BIT_FNS(name, field, offset) \
> @@ -2015,6 +2017,7 @@ extern void ext4_da_update_reserve_space(struct inode *inode,
> /* indirect.c */
> extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
> struct ext4_map_blocks *map, int flags);
> +extern void ext4_inode_dio_wait(struct inode *inode, int resume);
> extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
> const struct iovec *iov, loff_t offset,
> unsigned long nr_segs);
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 61f13e5..3714413 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -760,6 +760,19 @@ out:
> return err;
> }
>
> +/* Wait for existing dio workers
> + * Disable DIO read nolock optimization, so new dioreaders will
> + * be forced to grab i_mutex
> +*/
> +void ext4_inode_dio_wait(struct inode *inode, int resume_readers)
> +{
> + ext4_set_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
> + smp_mb();
> + inode_dio_wait(inode);
> + if (resume_readers)
> + ext4_set_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
> +}
> +
> /*
> * O_DIRECT for ext3 (or indirect map) based files
> *
> @@ -810,10 +823,20 @@ retry:
> if (unlikely(!list_empty(&ei->i_completed_io_list)))
> ext4_flush_completed_IO(inode);
>
> + /* Nolock dioread optimization may be dynamically disabled.
> + * Check inode's state while holding extra i_dio_count ref. */
> + atomic_inc(&inode->i_dio_count);
> + smp_mb();
> + if (!unlikely(ext4_test_inode_state(inode,
> + EXT4_STATE_DIOREAD_LOCK))) {
> + inode_dio_done(inode);
> + goto retry;
> + }
> ret = __blockdev_direct_IO(rw, iocb, inode,
> inode->i_sb->s_bdev, iov,
> offset, nr_segs,
> ext4_get_block, NULL, NULL, 0);
> + inode_dio_done(inode);
> } else {
> ret = blockdev_direct_IO(rw, iocb, inode, iov,
> offset, nr_segs, ext4_get_block);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 762b955..ffb4a27 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4720,6 +4720,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> return err;
> }
>
> + /* Wait for all existing dio workers */
> + ext4_inode_dio_wait(inode, 0);
> +
> jbd2_journal_lock_updates(journal);
>
> /*
> @@ -4739,6 +4742,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> ext4_set_aops(inode);
>
> jbd2_journal_unlock_updates(journal);
> + ext4_clear_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
>
> /* Finally we can mark the inode as dirty. */
>
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index c5826c6..aea0e7a 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -1214,6 +1214,9 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp,
> if (ret1 < 0)
> return ret1;
>
> + /* Wait for all existing dio workers */
> + ext4_inode_dio_wait(orig_inode, 0);
> + ext4_inode_dio_wait(donor_inode, 0);
> /* Protect extent tree against block allocations via delalloc */
> double_down_write_data_sem(orig_inode, donor_inode);
> /* Check the filesystem environment whether move_extent can be done */
> @@ -1412,6 +1415,8 @@ out:
> kfree(holecheck_path);
> }
> double_up_write_data_sem(orig_inode, donor_inode);
> + ext4_clear_inode_state(orig_inode, EXT4_STATE_DIOREAD_LOCK);
> + ext4_clear_inode_state(donor_inode, EXT4_STATE_DIOREAD_LOCK);
> ret2 = mext_inode_double_unlock(orig_inode, donor_inode);
>
> if (ret1)
> --
> 1.7.7.6
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 3/7] ext4: serialize dio nolocked reads with defrag workers V2
2012-09-10 9:31 ` Jan Kara
@ 2012-09-10 10:00 ` Jan Kara
0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2012-09-10 10:00 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Mon 10-09-12 11:31:26, Jan Kara wrote:
> On Sun 09-09-12 21:27:10, Dmitry Monakhov wrote:
> > Inode's block defrag and ext4_change_inode_journal_flag() may
> > affect nonlocked DIO reads result, so proper synchronization
> > required.
> >
> > - Add missed inode_dio_wait() calls where appropriate
> > - Check inode state under extra i_dio_count reference.
> >
> > Changes from V1:
> > - add missed memory bariers
> > - move DIOREAD_LOCK state check out from generic should_dioread_nolock
> > otherwise it will affect existing DIO readers.
> Looks good. Just maybe I would rename ext4_inode_dio_wait() to
> ext4_inode_block_unlocked_dio(). Anyway you can add:
> Reviewed-by: Jan Kara <jack@suse.cz>
Oh, I found one more thing:
> > +/* Wait for existing dio workers
> > + * Disable DIO read nolock optimization, so new dioreaders will
> > + * be forced to grab i_mutex
> > +*/
> > +void ext4_inode_dio_wait(struct inode *inode, int resume_readers)
> > +{
> > + ext4_set_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
> > + smp_mb();
> > + inode_dio_wait(inode);
> > + if (resume_readers)
> > + ext4_set_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
> > +}
> > +
I've just now noticed the resume_readers argument.
a) You probably meant to call ext4_clear_inode_state() when that argument
is set?
b) You need a smp_mb() before clearing DIOREAD_LOCK?
c) Please remove the resume_readers argument and make a separate function
from that. That new function could be called:
ext4_inode_unblock_unlocked_dio().
Honza
> > /*
> > * O_DIRECT for ext3 (or indirect map) based files
> > *
> > @@ -810,10 +823,20 @@ retry:
> > if (unlikely(!list_empty(&ei->i_completed_io_list)))
> > ext4_flush_completed_IO(inode);
> >
> > + /* Nolock dioread optimization may be dynamically disabled.
> > + * Check inode's state while holding extra i_dio_count ref. */
> > + atomic_inc(&inode->i_dio_count);
> > + smp_mb();
> > + if (!unlikely(ext4_test_inode_state(inode,
> > + EXT4_STATE_DIOREAD_LOCK))) {
> > + inode_dio_done(inode);
> > + goto retry;
> > + }
> > ret = __blockdev_direct_IO(rw, iocb, inode,
> > inode->i_sb->s_bdev, iov,
> > offset, nr_segs,
> > ext4_get_block, NULL, NULL, 0);
> > + inode_dio_done(inode);
> > } else {
> > ret = blockdev_direct_IO(rw, iocb, inode, iov,
> > offset, nr_segs, ext4_get_block);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 762b955..ffb4a27 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4720,6 +4720,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> > return err;
> > }
> >
> > + /* Wait for all existing dio workers */
> > + ext4_inode_dio_wait(inode, 0);
> > +
> > jbd2_journal_lock_updates(journal);
> >
> > /*
> > @@ -4739,6 +4742,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> > ext4_set_aops(inode);
> >
> > jbd2_journal_unlock_updates(journal);
> > + ext4_clear_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
> >
> > /* Finally we can mark the inode as dirty. */
> >
> > diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> > index c5826c6..aea0e7a 100644
> > --- a/fs/ext4/move_extent.c
> > +++ b/fs/ext4/move_extent.c
> > @@ -1214,6 +1214,9 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp,
> > if (ret1 < 0)
> > return ret1;
> >
> > + /* Wait for all existing dio workers */
> > + ext4_inode_dio_wait(orig_inode, 0);
> > + ext4_inode_dio_wait(donor_inode, 0);
> > /* Protect extent tree against block allocations via delalloc */
> > double_down_write_data_sem(orig_inode, donor_inode);
> > /* Check the filesystem environment whether move_extent can be done */
> > @@ -1412,6 +1415,8 @@ out:
> > kfree(holecheck_path);
> > }
> > double_up_write_data_sem(orig_inode, donor_inode);
> > + ext4_clear_inode_state(orig_inode, EXT4_STATE_DIOREAD_LOCK);
> > + ext4_clear_inode_state(donor_inode, EXT4_STATE_DIOREAD_LOCK);
> > ret2 = mext_inode_double_unlock(orig_inode, donor_inode);
> >
> > if (ret1)
> > --
> > 1.7.7.6
> >
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 4/7] ext4: fsync should wait for DIO writers
2012-09-09 17:27 [PATCH 0/7] ext4: Bunch of DIO/AIO fixes Dmitry Monakhov
` (2 preceding siblings ...)
2012-09-09 17:27 ` [PATCH 3/7] ext4: serialize dio nolocked reads with defrag workers V2 Dmitry Monakhov
@ 2012-09-09 17:27 ` Dmitry Monakhov
2012-09-10 9:51 ` Jan Kara
2012-09-13 10:46 ` Zheng Liu
2012-09-09 17:27 ` [PATCH 5/7] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
` (2 subsequent siblings)
6 siblings, 2 replies; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-09 17:27 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov
fsync and punch_hole are the places where we have to wait for all
existing writers (writeback, aio, dio), but currently we simply
flush pended end_io request which is not sufficient.
Even more i_mutex is not holded while punch_hole which obviously
result in dangerous data corruption due to write-after-free.
This patch performs following changes:
- Guard punch_hole with i_mutex
- fsync and punch_hole now wait for all writers in flight
NOTE: XXX write-after-free race is still possible because
truncate_pagecache_range() is not completely reliable and where
is no easy way to stop writeback while punch_hole is in progress.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/extents.c | 10 ++++++++--
fs/ext4/fsync.c | 1 +
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e993879..8252651 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4845,6 +4845,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
return err;
}
+ mutex_lock(&inode->i_mutex);
/* Now release the pages */
if (last_page_offset > first_page_offset) {
truncate_pagecache_range(inode, first_page_offset,
@@ -4852,12 +4853,15 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
}
/* finish any pending end_io work */
+ inode_dio_wait(inode);
ext4_flush_completed_IO(inode);
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto out_mutex;
+ }
err = ext4_orphan_add(handle, inode);
if (err)
@@ -4951,6 +4955,8 @@ out:
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
+out_mutex:
+ mutex_unlock(&inode->i_mutex);
return err;
}
int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 24f3719..290c5cf 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -204,6 +204,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (inode->i_sb->s_flags & MS_RDONLY)
goto out;
+ inode_dio_wait(inode);
ret = ext4_flush_completed_IO(inode);
if (ret < 0)
goto out;
--
1.7.7.6
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 4/7] ext4: fsync should wait for DIO writers
2012-09-09 17:27 ` [PATCH 4/7] ext4: fsync should wait for DIO writers Dmitry Monakhov
@ 2012-09-10 9:51 ` Jan Kara
2012-09-10 10:56 ` Dmitry Monakhov
2012-09-12 5:40 ` Zheng Liu
2012-09-13 10:46 ` Zheng Liu
1 sibling, 2 replies; 30+ messages in thread
From: Jan Kara @ 2012-09-10 9:51 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Sun 09-09-12 21:27:11, Dmitry Monakhov wrote:
> fsync and punch_hole are the places where we have to wait for all
> existing writers (writeback, aio, dio), but currently we simply
> flush pended end_io request which is not sufficient.
Why not? I guess you mean the fact that there can be DIO in flight for
which end_io() was not called so it is not queued in the queue? But that is
OK - we have not yet called aio_complete() for that IO so for userspace the
write has not happened yet. Thus there's no need to flush it to disk -
fsync() does not say anything about writes in progress while fsync is
called.
> Even more i_mutex is not holded while punch_hole which obviously
> result in dangerous data corruption due to write-after-free.
Yes, that's a bug. I also noticed that but didn't get to fixing it (I'm
actually working on a more long term fix using range locking but that's
more of a research project so having somehow fixed at least the most
blatant locking problems is good).
Honza
>
> This patch performs following changes:
>
> - Guard punch_hole with i_mutex
> - fsync and punch_hole now wait for all writers in flight
> NOTE: XXX write-after-free race is still possible because
> truncate_pagecache_range() is not completely reliable and where
> is no easy way to stop writeback while punch_hole is in progress.
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/extents.c | 10 ++++++++--
> fs/ext4/fsync.c | 1 +
> 2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index e993879..8252651 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4845,6 +4845,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> return err;
> }
>
> + mutex_lock(&inode->i_mutex);
> /* Now release the pages */
> if (last_page_offset > first_page_offset) {
> truncate_pagecache_range(inode, first_page_offset,
> @@ -4852,12 +4853,15 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> }
>
> /* finish any pending end_io work */
> + inode_dio_wait(inode);
> ext4_flush_completed_IO(inode);
>
> credits = ext4_writepage_trans_blocks(inode);
> handle = ext4_journal_start(inode, credits);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> + if (IS_ERR(handle)) {
> + err = PTR_ERR(handle);
> + goto out_mutex;
> + }
>
> err = ext4_orphan_add(handle, inode);
> if (err)
> @@ -4951,6 +4955,8 @@ out:
> inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> ext4_mark_inode_dirty(handle, inode);
> ext4_journal_stop(handle);
> +out_mutex:
> + mutex_unlock(&inode->i_mutex);
> return err;
> }
> int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 24f3719..290c5cf 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -204,6 +204,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> if (inode->i_sb->s_flags & MS_RDONLY)
> goto out;
>
> + inode_dio_wait(inode);
> ret = ext4_flush_completed_IO(inode);
> if (ret < 0)
> goto out;
> --
> 1.7.7.6
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/7] ext4: fsync should wait for DIO writers
2012-09-10 9:51 ` Jan Kara
@ 2012-09-10 10:56 ` Dmitry Monakhov
2012-09-12 14:02 ` Jan Kara
2012-09-12 5:40 ` Zheng Liu
1 sibling, 1 reply; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-10 10:56 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Mon, 10 Sep 2012 11:51:35 +0200, Jan Kara <jack@suse.cz> wrote:
> On Sun 09-09-12 21:27:11, Dmitry Monakhov wrote:
> > fsync and punch_hole are the places where we have to wait for all
> > existing writers (writeback, aio, dio), but currently we simply
> > flush pended end_io request which is not sufficient.
> Why not? I guess you mean the fact that there can be DIO in flight for
> which end_io() was not called so it is not queued in the queue? But that is
> OK - we have not yet called aio_complete() for that IO so for userspace the
> write has not happened yet. Thus there's no need to flush it to disk -
> fsync() does not say anything about writes in progress while fsync is
> called.
>From posix point of view(which is good one) may be wait for aio-dio's is
overwhelming guarantee.
> > Even more i_mutex is not holded while punch_hole which obviously
> > result in dangerous data corruption due to write-after-free.
> Yes, that's a bug. I also noticed that but didn't get to fixing it (I'm
> actually working on a more long term fix using range locking but that's
> more of a research project so having somehow fixed at least the most
> blatant locking problems is good).
Yes you right. In order to do things right we should block:
1) direct io
2) pagecache /mmap users (writeback, readpage)
A assumes I've fixed (1) but (2) is still exist
My current assumption is to do actions similar to writeback
down_write(EXT4_I(inode)->i_data_sem)
while (index <= end && pagevec_lookup(&pvec, mapping, index,...) {
lock_page(pvec[i]);
zero_user_page(pvec[i], 0, PAGE_SIZE);
ret = try_to_release_page(pvec[i]);
}
/* At this moment we know that we locked all pages in range,
* NOTE!!!! currently ext_remove_space may drop i_data_sem internally
* so it should be modified to exit once i_mutex was dropped
*/
ret = ext4_ext_remove_space(inode, from, to, NO_RELOCK)
while (pvec_num)
unlock_page(pvec[i])
}
up_write(EXT4_I(inode)->i_data_sem)
Number of locked pages should not be too large
Or even more instead of massive page locking, we can lock page
one by one, and simulate fake writeback, so all new writers will
wait on that bit, but readers will see zeroes.
down_write(EXT4_I(inode)->i_data_sem)
while (index <= end && pagevec_lookup(&pvec, mapping, index,...) {
lock_page(pvec[i]);
zero_user_page(pvec[i], 0, PAGE_SIZE);
ret = try_to_release_page(pvec[i]);
set_page_writeback(pvec[i]);
unlock_page(pvec[i])
}
ret = ext4_ext_remove_space(inode, from, to, NO_RELOCK)
while (pvec_num) {
end_page_writeback(pvec[i])
}
up_write(EXT4_I(inode)->i_data_sem)
>
> Honza
>
> >
> > This patch performs following changes:
> >
> > - Guard punch_hole with i_mutex
> > - fsync and punch_hole now wait for all writers in flight
> > NOTE: XXX write-after-free race is still possible because
> > truncate_pagecache_range() is not completely reliable and where
> > is no easy way to stop writeback while punch_hole is in progress.
> >
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> > fs/ext4/extents.c | 10 ++++++++--
> > fs/ext4/fsync.c | 1 +
> > 2 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index e993879..8252651 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -4845,6 +4845,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> > return err;
> > }
> >
> > + mutex_lock(&inode->i_mutex);
> > /* Now release the pages */
> > if (last_page_offset > first_page_offset) {
> > truncate_pagecache_range(inode, first_page_offset,
> > @@ -4852,12 +4853,15 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> > }
> >
> > /* finish any pending end_io work */
> > + inode_dio_wait(inode);
> > ext4_flush_completed_IO(inode);
> >
> > credits = ext4_writepage_trans_blocks(inode);
> > handle = ext4_journal_start(inode, credits);
> > - if (IS_ERR(handle))
> > - return PTR_ERR(handle);
> > + if (IS_ERR(handle)) {
> > + err = PTR_ERR(handle);
> > + goto out_mutex;
> > + }
> >
> > err = ext4_orphan_add(handle, inode);
> > if (err)
> > @@ -4951,6 +4955,8 @@ out:
> > inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> > ext4_mark_inode_dirty(handle, inode);
> > ext4_journal_stop(handle);
> > +out_mutex:
> > + mutex_unlock(&inode->i_mutex);
> > return err;
> > }
> > int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> > index 24f3719..290c5cf 100644
> > --- a/fs/ext4/fsync.c
> > +++ b/fs/ext4/fsync.c
> > @@ -204,6 +204,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> > if (inode->i_sb->s_flags & MS_RDONLY)
> > goto out;
> >
> > + inode_dio_wait(inode);
> > ret = ext4_flush_completed_IO(inode);
> > if (ret < 0)
> > goto out;
> > --
> > 1.7.7.6
> >
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/7] ext4: fsync should wait for DIO writers
2012-09-10 10:56 ` Dmitry Monakhov
@ 2012-09-12 14:02 ` Jan Kara
0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2012-09-12 14:02 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, tytso, wenqing.lz
On Mon 10-09-12 14:56:04, Dmitry Monakhov wrote:
> On Mon, 10 Sep 2012 11:51:35 +0200, Jan Kara <jack@suse.cz> wrote:
> > > Even more i_mutex is not holded while punch_hole which obviously
> > > result in dangerous data corruption due to write-after-free.
> > Yes, that's a bug. I also noticed that but didn't get to fixing it (I'm
> > actually working on a more long term fix using range locking but that's
> > more of a research project so having somehow fixed at least the most
> > blatant locking problems is good).
> Yes you right. In order to do things right we should block:
> 1) direct io
> 2) pagecache /mmap users (writeback, readpage)
>
> A assumes I've fixed (1) but (2) is still exist
>
> My current assumption is to do actions similar to writeback
>
> down_write(EXT4_I(inode)->i_data_sem)
> while (index <= end && pagevec_lookup(&pvec, mapping, index,...) {
> lock_page(pvec[i]);
Here you need to use trylock to avoid possible deadlocks...
> zero_user_page(pvec[i], 0, PAGE_SIZE);
> ret = try_to_release_page(pvec[i]);
> }
> /* At this moment we know that we locked all pages in range,
> * NOTE!!!! currently ext_remove_space may drop i_data_sem internally
> * so it should be modified to exit once i_mutex was dropped
> */
> ret = ext4_ext_remove_space(inode, from, to, NO_RELOCK)
> while (pvec_num)
> unlock_page(pvec[i])
> }
> up_write(EXT4_I(inode)->i_data_sem)
>
> Number of locked pages should not be too large
> Or even more instead of massive page locking, we can lock page
> one by one, and simulate fake writeback, so all new writers will
> wait on that bit, but readers will see zeroes.
> down_write(EXT4_I(inode)->i_data_sem)
> while (index <= end && pagevec_lookup(&pvec, mapping, index,...) {
> lock_page(pvec[i]);
> zero_user_page(pvec[i], 0, PAGE_SIZE);
> ret = try_to_release_page(pvec[i]);
> set_page_writeback(pvec[i]);
> unlock_page(pvec[i])
> }
>
> ret = ext4_ext_remove_space(inode, from, to, NO_RELOCK)
> while (pvec_num) {
> end_page_writeback(pvec[i])
> }
> up_write(EXT4_I(inode)->i_data_sem)
Oh, that's a hack. Please don't do that. Using page locks is cleaner
although I agree it's not very good either. That's why I decided not to
loose time with suboptimal solutions and rather look into range locking...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/7] ext4: fsync should wait for DIO writers
2012-09-10 9:51 ` Jan Kara
2012-09-10 10:56 ` Dmitry Monakhov
@ 2012-09-12 5:40 ` Zheng Liu
1 sibling, 0 replies; 30+ messages in thread
From: Zheng Liu @ 2012-09-12 5:40 UTC (permalink / raw)
To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4, tytso, wenqing.lz
On Mon, Sep 10, 2012 at 11:51:35AM +0200, Jan Kara wrote:
> On Sun 09-09-12 21:27:11, Dmitry Monakhov wrote:
> > fsync and punch_hole are the places where we have to wait for all
> > existing writers (writeback, aio, dio), but currently we simply
> > flush pended end_io request which is not sufficient.
> Why not? I guess you mean the fact that there can be DIO in flight for
> which end_io() was not called so it is not queued in the queue? But that is
> OK - we have not yet called aio_complete() for that IO so for userspace the
> write has not happened yet. Thus there's no need to flush it to disk -
> fsync() does not say anything about writes in progress while fsync is
> called.
>
> > Even more i_mutex is not holded while punch_hole which obviously
> > result in dangerous data corruption due to write-after-free.
> Yes, that's a bug. I also noticed that but didn't get to fixing it (I'm
> actually working on a more long term fix using range locking but that's
> more of a research project so having somehow fixed at least the most
> blatant locking problems is good).
Hi Jan,
Could you please share more detailed information about range locking
with me? Actually, the goal of extent status tree is to implement a
range locking in ext4 [1], and I am working on it. So I think that
you have some good ideas to share with me. :-)
1. http://www.spinics.net/lists/linux-ext4/msg32661.html
If you have some problems, please let me know. Thanks!
Regards,
Zheng
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/7] ext4: fsync should wait for DIO writers
2012-09-09 17:27 ` [PATCH 4/7] ext4: fsync should wait for DIO writers Dmitry Monakhov
2012-09-10 9:51 ` Jan Kara
@ 2012-09-13 10:46 ` Zheng Liu
2012-09-13 11:01 ` Dmitry Monakhov
1 sibling, 1 reply; 30+ messages in thread
From: Zheng Liu @ 2012-09-13 10:46 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Sun, Sep 09, 2012 at 09:27:11PM +0400, Dmitry Monakhov wrote:
> fsync and punch_hole are the places where we have to wait for all
> existing writers (writeback, aio, dio), but currently we simply
> flush pended end_io request which is not sufficient.
> Even more i_mutex is not holded while punch_hole which obviously
> result in dangerous data corruption due to write-after-free.
Hi Dmitry,
Lukas already has a patch to take i_mutex locking before punching a
hole. Just a reminding. :-)
https://patchwork.kernel.org/patch/1247271/
Regards,
Zheng
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/7] ext4: fsync should wait for DIO writers
2012-09-13 10:46 ` Zheng Liu
@ 2012-09-13 11:01 ` Dmitry Monakhov
2012-09-13 12:36 ` Zheng Liu
0 siblings, 1 reply; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-13 11:01 UTC (permalink / raw)
To: Zheng Liu; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Thu, 13 Sep 2012 18:46:21 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> On Sun, Sep 09, 2012 at 09:27:11PM +0400, Dmitry Monakhov wrote:
> > fsync and punch_hole are the places where we have to wait for all
> > existing writers (writeback, aio, dio), but currently we simply
> > flush pended end_io request which is not sufficient.
> > Even more i_mutex is not holded while punch_hole which obviously
> > result in dangerous data corruption due to write-after-free.
>
> Hi Dmitry,
>
> Lukas already has a patch to take i_mutex locking before punching a
> hole. Just a reminding. :-)
Yes, i've found it after patch was submitted, but this bug make me
nervous a bit because we have broken punch_hole implementation
long time ago, it allow to destroy data easily, user are able to
call it if has WR permission for a file.
So if you ask be i'll vote for hide it under CAP_SYS_RESOURCE until
proper implementation appears.
Same it true for EXT4_IO_MOVE_EXT because it allow to kernel panic
since v2.6.30-6558-g748de67, so all primary distros (RH6,Deb6) are
affected :(
>
> https://patchwork.kernel.org/patch/1247271/
>
> Regards,
> Zheng
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/7] ext4: fsync should wait for DIO writers
2012-09-13 11:01 ` Dmitry Monakhov
@ 2012-09-13 12:36 ` Zheng Liu
0 siblings, 0 replies; 30+ messages in thread
From: Zheng Liu @ 2012-09-13 12:36 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Thu, Sep 13, 2012 at 03:01:25PM +0400, Dmitry Monakhov wrote:
> On Thu, 13 Sep 2012 18:46:21 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> > On Sun, Sep 09, 2012 at 09:27:11PM +0400, Dmitry Monakhov wrote:
> > > fsync and punch_hole are the places where we have to wait for all
> > > existing writers (writeback, aio, dio), but currently we simply
> > > flush pended end_io request which is not sufficient.
> > > Even more i_mutex is not holded while punch_hole which obviously
> > > result in dangerous data corruption due to write-after-free.
> >
> > Hi Dmitry,
> >
> > Lukas already has a patch to take i_mutex locking before punching a
> > hole. Just a reminding. :-)
> Yes, i've found it after patch was submitted, but this bug make me
> nervous a bit because we have broken punch_hole implementation
> long time ago, it allow to destroy data easily, user are able to
> call it if has WR permission for a file.
> So if you ask be i'll vote for hide it under CAP_SYS_RESOURCE until
> proper implementation appears.
> Same it true for EXT4_IO_MOVE_EXT because it allow to kernel panic
> since v2.6.30-6558-g748de67, so all primary distros (RH6,Deb6) are
> affected :(
IMHO, it will be better when this patch is applied, and I prefer to fix
the problem rather than hidding it using CAP_SYS_RESOURCE. ;-)
Regards,
Zheng
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 5/7] ext4: serialize unlocked dio reads with truncate
2012-09-09 17:27 [PATCH 0/7] ext4: Bunch of DIO/AIO fixes Dmitry Monakhov
` (3 preceding siblings ...)
2012-09-09 17:27 ` [PATCH 4/7] ext4: fsync should wait for DIO writers Dmitry Monakhov
@ 2012-09-09 17:27 ` Dmitry Monakhov
2012-09-10 9:54 ` Jan Kara
2012-09-09 17:27 ` [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
2012-09-09 17:27 ` [PATCH 7/7] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
6 siblings, 1 reply; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-09 17:27 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov
Current serialization will works only for DIO which holds
i_mutex, but nonlocked DIO following race is possible:
dio_nolock_read_task truncate_task
->ext4_setattr()
->inode_dio_wait()
->ext4_ext_direct_IO
->ext4_ind_direct_IO
->__blockdev_direct_IO
->ext4_get_block
->truncate_setsize()
->ext4_truncate()
#alloc truncated blocks
#to other inode
->submit_io()
#INFORMATION LEAK
In order to serialize with unlocked DIO reads we have to
rearrange wait sequence
1) update i_size first
2) if i_size about to be reduced wait for outstanding DIO requests
3) and only after that truncate inode blocks
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/inode.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ffb4a27..93e6b09 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4283,7 +4283,6 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
}
if (attr->ia_valid & ATTR_SIZE) {
- inode_dio_wait(inode);
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -4332,8 +4331,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
}
if (attr->ia_valid & ATTR_SIZE) {
- if (attr->ia_size != i_size_read(inode))
+ if (attr->ia_size != inode->i_size) {
truncate_setsize(inode, attr->ia_size);
+ /* Inode size will be reduced, wait for dio in flight */
+ if (orphan)
+ inode_dio_wait(inode);
+ }
ext4_truncate(inode);
}
--
1.7.7.6
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 5/7] ext4: serialize unlocked dio reads with truncate
2012-09-09 17:27 ` [PATCH 5/7] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
@ 2012-09-10 9:54 ` Jan Kara
0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2012-09-10 9:54 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Sun 09-09-12 21:27:12, Dmitry Monakhov wrote:
> Current serialization will works only for DIO which holds
> i_mutex, but nonlocked DIO following race is possible:
>
> dio_nolock_read_task truncate_task
> ->ext4_setattr()
> ->inode_dio_wait()
> ->ext4_ext_direct_IO
> ->ext4_ind_direct_IO
> ->__blockdev_direct_IO
> ->ext4_get_block
> ->truncate_setsize()
> ->ext4_truncate()
> #alloc truncated blocks
> #to other inode
> ->submit_io()
> #INFORMATION LEAK
>
> In order to serialize with unlocked DIO reads we have to
> rearrange wait sequence
> 1) update i_size first
> 2) if i_size about to be reduced wait for outstanding DIO requests
> 3) and only after that truncate inode blocks
Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/inode.c | 7 +++++--
> 1 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ffb4a27..93e6b09 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4283,7 +4283,6 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> }
>
> if (attr->ia_valid & ATTR_SIZE) {
> - inode_dio_wait(inode);
>
> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
> struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> @@ -4332,8 +4331,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> }
>
> if (attr->ia_valid & ATTR_SIZE) {
> - if (attr->ia_size != i_size_read(inode))
> + if (attr->ia_size != inode->i_size) {
> truncate_setsize(inode, attr->ia_size);
> + /* Inode size will be reduced, wait for dio in flight */
> + if (orphan)
> + inode_dio_wait(inode);
> + }
> ext4_truncate(inode);
> }
>
> --
> 1.7.7.6
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2
2012-09-09 17:27 [PATCH 0/7] ext4: Bunch of DIO/AIO fixes Dmitry Monakhov
` (4 preceding siblings ...)
2012-09-09 17:27 ` [PATCH 5/7] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
@ 2012-09-09 17:27 ` Dmitry Monakhov
2012-09-13 10:41 ` Zheng Liu
2012-09-09 17:27 ` [PATCH 7/7] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
6 siblings, 1 reply; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-09 17:27 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov
If we have enough aggressive DIO readers, truncate and other dio
waiters will wait forever inside inode_dio_wait(). It is reasonable
to disable nonlock DIO read optimization during truncate.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/extents.c | 2 +-
fs/ext4/fsync.c | 2 +-
fs/ext4/inode.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8252651..b5b801f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4853,7 +4853,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
}
/* finish any pending end_io work */
- inode_dio_wait(inode);
+ ext4_inode_dio_wait(inode, 1);
ext4_flush_completed_IO(inode);
credits = ext4_writepage_trans_blocks(inode);
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 290c5cf..bdf6bfd 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -204,7 +204,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (inode->i_sb->s_flags & MS_RDONLY)
goto out;
- inode_dio_wait(inode);
+ ext4_inode_dio_wait(inode, 1);
ret = ext4_flush_completed_IO(inode);
if (ret < 0)
goto out;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 93e6b09..a850026 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4335,7 +4335,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
truncate_setsize(inode, attr->ia_size);
/* Inode size will be reduced, wait for dio in flight */
if (orphan)
- inode_dio_wait(inode);
+ ext4_inode_dio_wait(inode, 1);
}
ext4_truncate(inode);
}
--
1.7.7.6
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2
2012-09-09 17:27 ` [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
@ 2012-09-13 10:41 ` Zheng Liu
2012-09-13 12:07 ` Jan Kara
0 siblings, 1 reply; 30+ messages in thread
From: Zheng Liu @ 2012-09-13 10:41 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
Hi Dmitry,
Could you please provide more detailed workload to convince me? I
am thinking about whether we really need to disable dioread_nolock
feature in here. In our benchmarks, we don't see this problem.
Regards,
Zheng
On Sun, Sep 09, 2012 at 09:27:13PM +0400, Dmitry Monakhov wrote:
> If we have enough aggressive DIO readers, truncate and other dio
> waiters will wait forever inside inode_dio_wait(). It is reasonable
> to disable nonlock DIO read optimization during truncate.
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/extents.c | 2 +-
> fs/ext4/fsync.c | 2 +-
> fs/ext4/inode.c | 2 +-
> 3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 8252651..b5b801f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4853,7 +4853,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> }
>
> /* finish any pending end_io work */
> - inode_dio_wait(inode);
> + ext4_inode_dio_wait(inode, 1);
> ext4_flush_completed_IO(inode);
>
> credits = ext4_writepage_trans_blocks(inode);
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 290c5cf..bdf6bfd 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -204,7 +204,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> if (inode->i_sb->s_flags & MS_RDONLY)
> goto out;
>
> - inode_dio_wait(inode);
> + ext4_inode_dio_wait(inode, 1);
> ret = ext4_flush_completed_IO(inode);
> if (ret < 0)
> goto out;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 93e6b09..a850026 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4335,7 +4335,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> truncate_setsize(inode, attr->ia_size);
> /* Inode size will be reduced, wait for dio in flight */
> if (orphan)
> - inode_dio_wait(inode);
> + ext4_inode_dio_wait(inode, 1);
> }
> ext4_truncate(inode);
> }
> --
> 1.7.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2
2012-09-13 10:41 ` Zheng Liu
@ 2012-09-13 12:07 ` Jan Kara
2012-09-13 12:57 ` Zheng Liu
0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2012-09-13 12:07 UTC (permalink / raw)
To: Zheng Liu; +Cc: Dmitry Monakhov, linux-ext4, tytso, jack, wenqing.lz
Hello,
On Thu 13-09-12 18:41:36, Zheng Liu wrote:
> Could you please provide more detailed workload to convince me? I
> am thinking about whether we really need to disable dioread_nolock
> feature in here. In our benchmarks, we don't see this problem.
I just did:
# Create file
dd if=/dev/zero of=/mnt/file bs=1M count=30
sync
# Start 10 DIO dio readers in parallel reading the file in a loop
for (( i = 0; i < 10; i++ )); do
while true; do
dd if=/mnt/file bs=4k iflag=direct of=/dev/null
done &
done
sleep 1
# Try to truncate the file - never finishes.
truncate -s 16 /mnt/file
It is pretty easy to hit this. Besides being a DOS attack vector (but I
won't be too concerned about this - there are plenty of ways how local
process can screw you) I can easily imagine some application to get bitten
by this.
Honza
>
> Regards,
> Zheng
>
> On Sun, Sep 09, 2012 at 09:27:13PM +0400, Dmitry Monakhov wrote:
> > If we have enough aggressive DIO readers, truncate and other dio
> > waiters will wait forever inside inode_dio_wait(). It is reasonable
> > to disable nonlock DIO read optimization during truncate.
> >
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> > fs/ext4/extents.c | 2 +-
> > fs/ext4/fsync.c | 2 +-
> > fs/ext4/inode.c | 2 +-
> > 3 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 8252651..b5b801f 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -4853,7 +4853,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> > }
> >
> > /* finish any pending end_io work */
> > - inode_dio_wait(inode);
> > + ext4_inode_dio_wait(inode, 1);
> > ext4_flush_completed_IO(inode);
> >
> > credits = ext4_writepage_trans_blocks(inode);
> > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> > index 290c5cf..bdf6bfd 100644
> > --- a/fs/ext4/fsync.c
> > +++ b/fs/ext4/fsync.c
> > @@ -204,7 +204,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> > if (inode->i_sb->s_flags & MS_RDONLY)
> > goto out;
> >
> > - inode_dio_wait(inode);
> > + ext4_inode_dio_wait(inode, 1);
> > ret = ext4_flush_completed_IO(inode);
> > if (ret < 0)
> > goto out;
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 93e6b09..a850026 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4335,7 +4335,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> > truncate_setsize(inode, attr->ia_size);
> > /* Inode size will be reduced, wait for dio in flight */
> > if (orphan)
> > - inode_dio_wait(inode);
> > + ext4_inode_dio_wait(inode, 1);
> > }
> > ext4_truncate(inode);
> > }
> > --
> > 1.7.7.6
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2
2012-09-13 12:07 ` Jan Kara
@ 2012-09-13 12:57 ` Zheng Liu
2012-09-13 14:34 ` Jan Kara
0 siblings, 1 reply; 30+ messages in thread
From: Zheng Liu @ 2012-09-13 12:57 UTC (permalink / raw)
To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4, tytso, wenqing.lz
On Thu, Sep 13, 2012 at 02:07:36PM +0200, Jan Kara wrote:
> Hello,
>
> On Thu 13-09-12 18:41:36, Zheng Liu wrote:
> > Could you please provide more detailed workload to convince me? I
> > am thinking about whether we really need to disable dioread_nolock
> > feature in here. In our benchmarks, we don't see this problem.
> I just did:
>
> # Create file
> dd if=/dev/zero of=/mnt/file bs=1M count=30
> sync
> # Start 10 DIO dio readers in parallel reading the file in a loop
> for (( i = 0; i < 10; i++ )); do
> while true; do
> dd if=/mnt/file bs=4k iflag=direct of=/dev/null
> done &
> done
> sleep 1
>
> # Try to truncate the file - never finishes.
> truncate -s 16 /mnt/file
>
> It is pretty easy to hit this. Besides being a DOS attack vector (but I
> won't be too concerned about this - there are plenty of ways how local
> process can screw you) I can easily imagine some application to get bitten
> by this.
Hi Jan,
Thanks for your explanation, but in my desktop I cannot reproduce this
problem. The size of `file' is 16. Am I missing something?
Regards,
Zheng
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2
2012-09-13 12:57 ` Zheng Liu
@ 2012-09-13 14:34 ` Jan Kara
2012-09-13 23:31 ` Zheng Liu
0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2012-09-13 14:34 UTC (permalink / raw)
To: Zheng Liu; +Cc: Jan Kara, Dmitry Monakhov, linux-ext4, tytso, wenqing.lz
On Thu 13-09-12 20:57:26, Zheng Liu wrote:
> On Thu, Sep 13, 2012 at 02:07:36PM +0200, Jan Kara wrote:
> > Hello,
> >
> > On Thu 13-09-12 18:41:36, Zheng Liu wrote:
> > > Could you please provide more detailed workload to convince me? I
> > > am thinking about whether we really need to disable dioread_nolock
> > > feature in here. In our benchmarks, we don't see this problem.
> > I just did:
> >
> > # Create file
> > dd if=/dev/zero of=/mnt/file bs=1M count=30
> > sync
> > # Start 10 DIO dio readers in parallel reading the file in a loop
> > for (( i = 0; i < 10; i++ )); do
> > while true; do
> > dd if=/mnt/file bs=4k iflag=direct of=/dev/null
> > done &
> > done
> > sleep 1
> >
> > # Try to truncate the file - never finishes.
> > truncate -s 16 /mnt/file
> >
> > It is pretty easy to hit this. Besides being a DOS attack vector (but I
> > won't be too concerned about this - there are plenty of ways how local
> > process can screw you) I can easily imagine some application to get bitten
> > by this.
>
> Hi Jan,
>
> Thanks for your explanation, but in my desktop I cannot reproduce this
> problem. The size of `file' is 16. Am I missing something?
Hum, on my test machine with 3.6-rc1 it does not... Maybe for your
desktop you need a larger sleep before running truncate so that readers
have time to start up? Also I suppose you have ext4 mounted with
dioread_nolock mount option?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2
2012-09-13 14:34 ` Jan Kara
@ 2012-09-13 23:31 ` Zheng Liu
0 siblings, 0 replies; 30+ messages in thread
From: Zheng Liu @ 2012-09-13 23:31 UTC (permalink / raw)
To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4, tytso, wenqing.lz
On Thu, Sep 13, 2012 at 04:34:55PM +0200, Jan Kara wrote:
> On Thu 13-09-12 20:57:26, Zheng Liu wrote:
> > On Thu, Sep 13, 2012 at 02:07:36PM +0200, Jan Kara wrote:
> > > Hello,
> > >
> > > On Thu 13-09-12 18:41:36, Zheng Liu wrote:
> > > > Could you please provide more detailed workload to convince me? I
> > > > am thinking about whether we really need to disable dioread_nolock
> > > > feature in here. In our benchmarks, we don't see this problem.
> > > I just did:
> > >
> > > # Create file
> > > dd if=/dev/zero of=/mnt/file bs=1M count=30
> > > sync
> > > # Start 10 DIO dio readers in parallel reading the file in a loop
> > > for (( i = 0; i < 10; i++ )); do
> > > while true; do
> > > dd if=/mnt/file bs=4k iflag=direct of=/dev/null
> > > done &
> > > done
> > > sleep 1
> > >
> > > # Try to truncate the file - never finishes.
> > > truncate -s 16 /mnt/file
> > >
> > > It is pretty easy to hit this. Besides being a DOS attack vector (but I
> > > won't be too concerned about this - there are plenty of ways how local
> > > process can screw you) I can easily imagine some application to get bitten
> > > by this.
> >
> > Hi Jan,
> >
> > Thanks for your explanation, but in my desktop I cannot reproduce this
> > problem. The size of `file' is 16. Am I missing something?
> Hum, on my test machine with 3.6-rc1 it does not... Maybe for your
> desktop you need a larger sleep before running truncate so that readers
> have time to start up? Also I suppose you have ext4 mounted with
> dioread_nolock mount option?
Yes, it can be reproduced after increasing sleep time. Thanks.
Regards,
Zheng
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 7/7] ext4: serialize truncate with owerwrite DIO workers V2
2012-09-09 17:27 [PATCH 0/7] ext4: Bunch of DIO/AIO fixes Dmitry Monakhov
` (5 preceding siblings ...)
2012-09-09 17:27 ` [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
@ 2012-09-09 17:27 ` Dmitry Monakhov
2012-09-13 10:37 ` Zheng Liu
6 siblings, 1 reply; 30+ messages in thread
From: Dmitry Monakhov @ 2012-09-09 17:27 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov
Jan Kara have spotted interesting issue:
There are potential data corruption issue with direct IO overwrites
racing with truncate:
Like:
dio write truncate_task
->ext4_ext_direct_IO
->overwrite == 1
->down_read(&EXT4_I(inode)->i_data_sem);
->mutex_unlock(&inode->i_mutex);
->ext4_setattr()
->inode_dio_wait()
->truncate_setsize()
->ext4_truncate()
->down_write(&EXT4_I(inode)->i_data_sem);
->__blockdev_direct_IO
->ext4_get_block
->submit_io()
->up_read(&EXT4_I(inode)->i_data_sem);
# truncate data blocks, allocate them to
# other inode - bad stuff happens because
# dio is still in flight.
In order to serialize with truncate dio worker should grab extra i_dio_count
reference before drop i_mutex.
Changes agains V1:
- wake up dio waiters before i_mutex.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/inode.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a850026..c5c4f9d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3014,6 +3014,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
overwrite = *((int *)iocb->private);
if (overwrite) {
+ atomic_inc(&inode->i_dio_count);
down_read(&EXT4_I(inode)->i_data_sem);
mutex_unlock(&inode->i_mutex);
}
@@ -3111,6 +3112,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
retake_lock:
/* take i_mutex locking again if we do a ovewrite dio */
if (overwrite) {
+ inode_dio_done(inode);
up_read(&EXT4_I(inode)->i_data_sem);
mutex_lock(&inode->i_mutex);
}
--
1.7.7.6
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 7/7] ext4: serialize truncate with owerwrite DIO workers V2
2012-09-09 17:27 ` [PATCH 7/7] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
@ 2012-09-13 10:37 ` Zheng Liu
0 siblings, 0 replies; 30+ messages in thread
From: Zheng Liu @ 2012-09-13 10:37 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz
On Sun, Sep 09, 2012 at 09:27:14PM +0400, Dmitry Monakhov wrote:
> Jan Kara have spotted interesting issue:
> There are potential data corruption issue with direct IO overwrites
> racing with truncate:
> Like:
> dio write truncate_task
> ->ext4_ext_direct_IO
> ->overwrite == 1
> ->down_read(&EXT4_I(inode)->i_data_sem);
> ->mutex_unlock(&inode->i_mutex);
> ->ext4_setattr()
> ->inode_dio_wait()
> ->truncate_setsize()
> ->ext4_truncate()
> ->down_write(&EXT4_I(inode)->i_data_sem);
> ->__blockdev_direct_IO
> ->ext4_get_block
> ->submit_io()
> ->up_read(&EXT4_I(inode)->i_data_sem);
> # truncate data blocks, allocate them to
> # other inode - bad stuff happens because
> # dio is still in flight.
>
> In order to serialize with truncate dio worker should grab extra i_dio_count
> reference before drop i_mutex.
>
> Changes agains V1:
> - wake up dio waiters before i_mutex.
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Hi Dmitry,
Indeed there is a data corruption. This patch looks good to me. You
can add:
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Regards,
Zheng
> ---
> fs/ext4/inode.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index a850026..c5c4f9d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3014,6 +3014,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
> overwrite = *((int *)iocb->private);
>
> if (overwrite) {
> + atomic_inc(&inode->i_dio_count);
> down_read(&EXT4_I(inode)->i_data_sem);
> mutex_unlock(&inode->i_mutex);
> }
> @@ -3111,6 +3112,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
> retake_lock:
> /* take i_mutex locking again if we do a ovewrite dio */
> if (overwrite) {
> + inode_dio_done(inode);
> up_read(&EXT4_I(inode)->i_data_sem);
> mutex_lock(&inode->i_mutex);
> }
> --
> 1.7.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 30+ messages in thread