* [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3
@ 2012-09-24 11:44 Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 01/10] ext4: ext4_inode_info diet Dmitry Monakhov
` (9 more replies)
0 siblings, 10 replies; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
There are number of races exist caused by lack of synchronization
between DIO workers in flight and truncate/fsync/punch_hole routines
This patch series try to optimize and fix existing DIO/AIO code
I observe significant performance improvements for big SMP hosts
performind DIO for single file.
Testcases which helps me to catch this type of bugs was posted here
http://www.spinics.net/lists/linux-fsdevel/msg58312.html
or available on my github repo:
https://github.com/dmonakhov/xfstests/tree/ce8e3adab629b2a9be8ba2e73db7dad49eb46614
plese run 286,287,288
TOC:
# first two are cleanups
ext4: ext4_inode_info diet
ext4: give i_aiodio_unwritten more appropriate name
# Cleanup and bug fixes
ext4: fix unwritten counter leakage
ext4: completed_io locking cleanup V3
ext4: serialize dio nonlocked reads with defrag workers V3
ext4: punch_hole should wait for DIO writers
ext4: serialize unlocked dio reads with truncate
ext4: endless truncate due to nonlocked dio readers V2
ext4: serialize truncate with owerwrite DIO workers V2
ext4: fix ext_remove_space for punch_hole case
Changes from V2:
Fix use-after-free for queued end_io_work.
fs/ext4/ext4.h | 33 ++++++++++---
fs/ext4/extents.c | 88 +++++++++++++++++++++++-------------
fs/ext4/file.c | 9 ---
fs/ext4/fsync.c | 85 -----------------------------------
fs/ext4/indirect.c | 19 ++++++-
fs/ext4/inode.c | 49 ++++++++------------
fs/ext4/move_extent.c | 8 +++
fs/ext4/page-io.c | 121 +++++++++++++++++++++++++++++---------------------
fs/ext4/super.c | 3 -
10 files changed, 201 insertions(+), 216 deletions(-)
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 01/10] ext4: ext4_inode_info diet
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-26 12:28 ` Jan Kara
2012-09-24 11:44 ` [PATCH 02/10] ext4: give i_aiodio_unwritten more appropriate name Dmitry Monakhov
` (8 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
Generic inode has unused i_private pointer which may be used as cur_aio_dio
storage.
TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
to have concurent AIO_DIO requests.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ext4.h | 5 +++--
fs/ext4/extents.c | 4 ++--
fs/ext4/inode.c | 6 +++---
fs/ext4/super.c | 1 -
4 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8b6902c..e59fbc5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -912,8 +912,6 @@ struct ext4_inode_info {
struct list_head i_completed_io_list;
spinlock_t i_completed_io_lock;
atomic_t i_ioend_count; /* Number of outstanding io_end structs */
- /* current io_end structure for async DIO write*/
- ext4_io_end_t *cur_aio_dio;
atomic_t i_aiodio_unwritten; /* Nr. of inflight conversions pending */
spinlock_t i_block_reservation_lock;
@@ -929,6 +927,9 @@ struct ext4_inode_info {
__u32 i_csum_seed;
};
+/* current io_end structure for async DIO write */
+#define EXT4_CUR_AIO_DIO(inode) (inode)->i_private
+
/*
* File system states
*/
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index cc6d2b9..6eb6b0c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3645,7 +3645,7 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
{
int ret = 0;
int err = 0;
- ext4_io_end_t *io = EXT4_I(inode)->cur_aio_dio;
+ ext4_io_end_t *io = (ext4_io_end_t*) EXT4_CUR_AIO_DIO(inode);
ext_debug("ext4_ext_handle_uninitialized_extents: inode %lu, logical "
"block %llu, max_blocks %u, flags %x, allocated %u\n",
@@ -3903,7 +3903,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
unsigned int allocated = 0, offset = 0;
unsigned int allocated_clusters = 0;
struct ext4_allocation_request ar;
- ext4_io_end_t *io = EXT4_I(inode)->cur_aio_dio;
+ ext4_io_end_t *io = (ext4_io_end_t*) EXT4_CUR_AIO_DIO(inode);
ext4_lblk_t cluster_offset;
ext_debug("blocks %u/%u requested for inode %lu\n",
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 462f4b3..05206ba 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3057,7 +3057,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
* hook to the iocb.
*/
iocb->private = NULL;
- EXT4_I(inode)->cur_aio_dio = NULL;
+ EXT4_CUR_AIO_DIO(inode) = NULL;
if (!is_sync_kiocb(iocb)) {
ext4_io_end_t *io_end =
ext4_init_io_end(inode, GFP_NOFS);
@@ -3074,7 +3074,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
* is a unwritten extents needs to be converted
* when IO is completed.
*/
- EXT4_I(inode)->cur_aio_dio = iocb->private;
+ EXT4_CUR_AIO_DIO(inode) = iocb->private;
}
if (overwrite)
@@ -3094,7 +3094,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
NULL,
DIO_LOCKING);
if (iocb->private)
- EXT4_I(inode)->cur_aio_dio = NULL;
+ EXT4_CUR_AIO_DIO(inode) = NULL;
/*
* The io_end structure takes a reference to the inode,
* that structure needs to be destroyed and the
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb7722a..64acf55 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -965,7 +965,6 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->jinode = NULL;
INIT_LIST_HEAD(&ei->i_completed_io_list);
spin_lock_init(&ei->i_completed_io_lock);
- ei->cur_aio_dio = NULL;
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
atomic_set(&ei->i_ioend_count, 0);
--
1.7.7.6
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 02/10] ext4: give i_aiodio_unwritten more appropriate name
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 01/10] ext4: ext4_inode_info diet Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-26 12:32 ` Jan Kara
2012-09-24 11:44 ` [PATCH 03/10] ext4: fix unwritten counter leakage Dmitry Monakhov
` (7 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
AIO/DIO prefix is wrong because it account unwritten extents which
also may be scheduled from buffered write endio
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ext4.h | 4 ++--
fs/ext4/file.c | 6 +++---
fs/ext4/page-io.c | 2 +-
fs/ext4/super.c | 2 +-
4 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e59fbc5..47f0778 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -912,7 +912,7 @@ struct ext4_inode_info {
struct list_head i_completed_io_list;
spinlock_t i_completed_io_lock;
atomic_t i_ioend_count; /* Number of outstanding io_end structs */
- atomic_t i_aiodio_unwritten; /* Nr. of inflight conversions pending */
+ atomic_t i_unwritten; /* Nr. of inflight conversions pending */
spinlock_t i_block_reservation_lock;
@@ -1335,7 +1335,7 @@ static inline void ext4_set_io_unwritten_flag(struct inode *inode,
{
if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
io_end->flag |= EXT4_IO_END_UNWRITTEN;
- atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+ atomic_inc(&EXT4_I(inode)->i_unwritten);
}
}
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3b0e3bd..39335bd 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -55,11 +55,11 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}
-static void ext4_aiodio_wait(struct inode *inode)
+static void ext4_unwritten_wait(struct inode *inode)
{
wait_queue_head_t *wq = ext4_ioend_wq(inode);
- wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_aiodio_unwritten) == 0));
+ wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
}
/*
@@ -116,7 +116,7 @@ ext4_file_dio_write(struct kiocb *iocb, const struct iovec *iov,
"performance will be poor.",
inode->i_ino, current->comm);
mutex_lock(ext4_aio_mutex(inode));
- ext4_aiodio_wait(inode);
+ ext4_unwritten_wait(inode);
}
BUG_ON(iocb->ki_pos != pos);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index dcdeef1..de77e31 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -113,7 +113,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
if (io->flag & EXT4_IO_END_DIRECT)
inode_dio_done(inode);
/* Wake up anyone waiting on unwritten extent conversion */
- if (atomic_dec_and_test(&EXT4_I(inode)->i_aiodio_unwritten))
+ if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
wake_up_all(ext4_ioend_wq(io->inode));
return ret;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 64acf55..0a37c75 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -968,7 +968,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
atomic_set(&ei->i_ioend_count, 0);
- atomic_set(&ei->i_aiodio_unwritten, 0);
+ atomic_set(&ei->i_unwritten, 0);
return &ei->vfs_inode;
}
--
1.7.7.6
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 03/10] ext4: fix unwritten counter leakage
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 01/10] ext4: ext4_inode_info diet Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 02/10] ext4: give i_aiodio_unwritten more appropriate name Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-26 13:07 ` Jan Kara
2012-09-24 11:44 ` [PATCH 04/10] ext4: completed_io locking cleanup V3 Dmitry Monakhov
` (6 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
once we mark end_io with END_IO_UNWRITTEN we have to revert it back
on error path.
- add missed error checks to prevent counter leakage
- ext4_end_io_nolock() will clear END_IO_UNWRITTEN flag to signal
that conversion finished.
- add BUGON to free_end_io() to prevent similar leackage in future.
Visiable effect of this bug is that unaligned aio_stress may deadlock
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/extents.c | 21 ++++++++++++++-------
fs/ext4/page-io.c | 6 +++++-
2 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6eb6b0c..739c21d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3660,6 +3660,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
ret = ext4_split_unwritten_extents(handle, inode, map,
path, flags);
+ if (ret <= 0)
+ goto out;
/*
* Flag the inode(non aio case) or end_io struct (aio case)
* that this IO needs to conversion to written when IO is
@@ -3905,6 +3907,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_allocation_request ar;
ext4_io_end_t *io = (ext4_io_end_t*) EXT4_CUR_AIO_DIO(inode);
ext4_lblk_t cluster_offset;
+ int set_unwritten = 0;
ext_debug("blocks %u/%u requested for inode %lu\n",
map->m_lblk, map->m_len, inode->i_ino);
@@ -4127,13 +4130,8 @@ got_allocated_blocks:
* For non asycn direct IO case, flag the inode state
* that we need to perform conversion when IO is done.
*/
- if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
- if (io)
- ext4_set_io_unwritten_flag(inode, io);
- else
- ext4_set_inode_state(inode,
- EXT4_STATE_DIO_UNWRITTEN);
- }
+ if ((flags & EXT4_GET_BLOCKS_PRE_IO))
+ set_unwritten = 1;
if (ext4_should_dioread_nolock(inode))
map->m_flags |= EXT4_MAP_UNINIT;
}
@@ -4145,6 +4143,15 @@ got_allocated_blocks:
if (!err)
err = ext4_ext_insert_extent(handle, inode, path,
&newex, flags);
+
+ if (!err && set_unwritten) {
+ if (io)
+ ext4_set_io_unwritten_flag(inode, io);
+ else
+ ext4_set_inode_state(inode,
+ EXT4_STATE_DIO_UNWRITTEN);
+ }
+
if (err && free_on_err) {
int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index de77e31..9970022 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -71,6 +71,8 @@ void ext4_free_io_end(ext4_io_end_t *io)
int i;
BUG_ON(!io);
+ BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
+
if (io->page)
put_page(io->page);
for (i = 0; i < io->num_io_pages; i++)
@@ -94,6 +96,8 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
ssize_t size = io->size;
int ret = 0;
+ BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
+
ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
"list->prev 0x%p\n",
io, inode->i_ino, io->list.next, io->list.prev);
@@ -106,7 +110,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
"(inode %lu, offset %llu, size %zd, error %d)",
inode->i_ino, offset, size, ret);
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 04/10] ext4: completed_io locking cleanup V3
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
` (2 preceding siblings ...)
2012-09-24 11:44 ` [PATCH 03/10] ext4: fix unwritten counter leakage Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-26 13:42 ` Jan Kara
2012-09-24 11:44 ` [PATCH 05/10] ext4: serialize dio nonlocked reads with defrag workers V3 Dmitry Monakhov
` (5 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
Current unwritten extent conversion state-machine is very fuzzy.
- By unknown reason it want perform conversion under i_mutex. What for?
It was initially added by Theodore. Please comment your initial assumption.
My diagnosis:
We already protect extent tree with i_data_sem, truncate should
wait for DIO in flight, so the only data we have to protect io->flags
modification, but only flush_completed_IO and work are modified this
flags and we can serialize them via i_completed_io_lock.
Currently all this games with mutex_trylock result in following deadlock
truncate: kworker:
ext4_setattr ext4_end_io_work
mutex_lock(i_mutex)
inode_dio_wait(inode) ->BLOCK
DEADLOCK<- mutex_trylock()
inode_dio_done()
#TEST_CASE1_BEGIN
MNT=/mnt_scrach
unlink $MNT/file
fallocate -l $((1024*1024*1024)) $MNT/file
aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
sleep 2
truncate -s 0 $MNT/file
#TEST_CASE1_END
Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
This patch makes state machine simple and clean:
(1) ext4_end_io_work is responsible for handling all pending
end_io from ei->i_completed_io_list(per inode list)
NOTE1: i_completed_io_lock is acquired only once
NOTE2: i_mutex is not required because it does not protect
any data guarded by i_mutex any more
(2) xxx_end_io schedule end_io context completion simply by pushing it
to the inode's list.
NOTE1: because of (1) work should be queued only if
->i_completed_io_list was empty at the moment, otherwise it
work is scheduled already.
(3) No one is able to free inode's blocks while pented io_completion
exist othervise may result in blocks beyond EOF, this
stated by the fact that all truncate routines wait for
all pended unwritten requets in flight
(4) Replace flush_completed_io() with ext4_unwritten_wait(). This
allow greatly simplify state machine because end_io conext
will be destroyed only in one place (end_io_work)
- remove EXT4_IO_END_QUEUED and EXT4_IO_END_FSYNC flags because
end_io is now destroyed from known context
- Improve SMP scalability by removing useless i_mutex which does not
protect io->flags anymore.
- Reduce lock contention on i_completed_io_lock by optimizing list walk.
- Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
Changes since V2:
Fix use-after-free caused by race truncate vs end_io_work
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ext4.h | 7 +--
fs/ext4/extents.c | 4 +-
fs/ext4/file.c | 7 ---
fs/ext4/fsync.c | 85 +--------------------------------------
fs/ext4/indirect.c | 8 +--
fs/ext4/inode.c | 25 +-----------
fs/ext4/page-io.c | 113 ++++++++++++++++++++++++++++++----------------------
7 files changed, 76 insertions(+), 173 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 47f0778..cc423cf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -184,9 +184,7 @@ struct mpage_da_data {
*/
#define EXT4_IO_END_UNWRITTEN 0x0001
#define EXT4_IO_END_ERROR 0x0002
-#define EXT4_IO_END_QUEUED 0x0004
-#define EXT4_IO_END_DIRECT 0x0008
-#define EXT4_IO_END_IN_FSYNC 0x0010
+#define EXT4_IO_END_DIRECT 0x0004
struct ext4_io_page {
struct page *p_page;
@@ -1939,7 +1937,6 @@ extern void ext4_htree_free_dir_info(struct dir_private_info *p);
/* fsync.c */
extern int ext4_sync_file(struct file *, loff_t, loff_t, int);
-extern int ext4_flush_completed_IO(struct inode *);
/* hash.c */
extern int ext4fs_dirhash(const char *name, int len, struct
@@ -2411,8 +2408,10 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
/* page-io.c */
extern int __init ext4_init_pageio(void);
+extern void ext4_add_complete_io(ext4_io_end_t *io_end);
extern void ext4_exit_pageio(void);
extern void ext4_ioend_wait(struct inode *);
+extern void ext4_unwritten_wait(struct inode *inode);
extern void ext4_free_io_end(ext4_io_end_t *io);
extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
extern int ext4_end_io_nolock(ext4_io_end_t *io);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 739c21d..6326874 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4293,7 +4293,7 @@ void ext4_ext_truncate(struct inode *inode)
* finish any pending end_io work so we won't run the risk of
* converting any truncated blocks to initialized later
*/
- ext4_flush_completed_IO(inode);
+ ext4_unwritten_wait(inode);
/*
* probably first extent we're gonna free will be last in block
@@ -4860,7 +4860,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
}
/* finish any pending end_io work */
- ext4_flush_completed_IO(inode);
+ ext4_unwritten_wait(inode);
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, credits);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 39335bd..bdafaff 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -55,13 +55,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}
-static void ext4_unwritten_wait(struct inode *inode)
-{
- wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
- wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
-}
-
/*
* This tests whether the IO in question is block-aligned or not.
* Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 323eb15..94f5d3e 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -34,87 +34,6 @@
#include <trace/events/ext4.h>
-static void dump_completed_IO(struct inode * inode)
-{
-#ifdef EXT4FS_DEBUG
- struct list_head *cur, *before, *after;
- ext4_io_end_t *io, *io0, *io1;
- unsigned long flags;
-
- if (list_empty(&EXT4_I(inode)->i_completed_io_list)){
- ext4_debug("inode %lu completed_io list is empty\n", inode->i_ino);
- return;
- }
-
- ext4_debug("Dump inode %lu completed_io list \n", inode->i_ino);
- spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
- list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list){
- cur = &io->list;
- before = cur->prev;
- io0 = container_of(before, ext4_io_end_t, list);
- after = cur->next;
- io1 = container_of(after, ext4_io_end_t, list);
-
- ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
- io, inode->i_ino, io0, io1);
- }
- spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
-#endif
-}
-
-/*
- * This function is called from ext4_sync_file().
- *
- * When IO is completed, the work to convert unwritten extents to
- * written is queued on workqueue but may not get immediately
- * scheduled. When fsync is called, we need to ensure the
- * conversion is complete before fsync returns.
- * The inode keeps track of a list of pending/completed IO that
- * might needs to do the conversion. This function walks through
- * the list and convert the related unwritten extents for completed IO
- * to written.
- * The function return the number of pending IOs on success.
- */
-int ext4_flush_completed_IO(struct inode *inode)
-{
- ext4_io_end_t *io;
- struct ext4_inode_info *ei = EXT4_I(inode);
- unsigned long flags;
- int ret = 0;
- int ret2 = 0;
-
- dump_completed_IO(inode);
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- while (!list_empty(&ei->i_completed_io_list)){
- io = list_entry(ei->i_completed_io_list.next,
- ext4_io_end_t, list);
- list_del_init(&io->list);
- io->flag |= EXT4_IO_END_IN_FSYNC;
- /*
- * Calling ext4_end_io_nolock() to convert completed
- * IO to written.
- *
- * When ext4_sync_file() is called, run_queue() may already
- * about to flush the work corresponding to this io structure.
- * It will be upset if it founds the io structure related
- * to the work-to-be schedule is freed.
- *
- * Thus we need to keep the io structure still valid here after
- * conversion finished. The io structure has a flag to
- * avoid double converting from both fsync and background work
- * queue work.
- */
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- ret = ext4_end_io_nolock(io);
- if (ret < 0)
- ret2 = ret;
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- io->flag &= ~EXT4_IO_END_IN_FSYNC;
- }
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- return (ret2 < 0) ? ret2 : 0;
-}
-
/*
* If we're not journaling and this is a just-created file, we have to
* sync our parent directory (if it was freshly created) since
@@ -219,9 +138,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (inode->i_sb->s_flags & MS_RDONLY)
goto out;
- ret = ext4_flush_completed_IO(inode);
- if (ret < 0)
- goto out;
+ ext4_unwritten_wait(inode);
if (!journal) {
ret = __sync_inode(inode, datasync);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 830e1b2..251e35f 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,11 +807,9 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
retry:
if (rw == READ && ext4_should_dioread_nolock(inode)) {
- if (unlikely(!list_empty(&ei->i_completed_io_list))) {
- mutex_lock(&inode->i_mutex);
- ext4_flush_completed_IO(inode);
- mutex_unlock(&inode->i_mutex);
- }
+ if (unlikely(!list_empty(&ei->i_completed_io_list)))
+ ext4_unwritten_wait(inode);
+
ret = __blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 05206ba..05d860e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2882,9 +2882,6 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
{
struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
ext4_io_end_t *io_end = iocb->private;
- struct workqueue_struct *wq;
- unsigned long flags;
- struct ext4_inode_info *ei;
/* if not async direct IO or dio with 0 bytes write, just return */
if (!io_end || !size)
@@ -2913,24 +2910,14 @@ out:
io_end->iocb = iocb;
io_end->result = ret;
}
- wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
-
- /* Add the io_end to per-inode completed aio dio list*/
- ei = EXT4_I(io_end->inode);
- spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- list_add_tail(&io_end->list, &ei->i_completed_io_list);
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- /* queue the work to convert unwritten extents to written */
- queue_work(wq, &io_end->work);
+ ext4_add_complete_io(io_end);
}
static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
{
ext4_io_end_t *io_end = bh->b_private;
- struct workqueue_struct *wq;
struct inode *inode;
- unsigned long flags;
if (!test_clear_buffer_uninit(bh) || !io_end)
goto out;
@@ -2949,15 +2936,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
*/
inode = io_end->inode;
ext4_set_io_unwritten_flag(inode, io_end);
-
- /* Add the io_end to per-inode completed io list*/
- spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
- list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
- spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
-
- wq = EXT4_SB(inode->i_sb)->dio_unwritten_wq;
- /* queue the work to convert unwritten extents to written */
- queue_work(wq, &io_end->work);
+ ext4_add_complete_io(io_end);
out:
bh->b_private = NULL;
bh->b_end_io = NULL;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 9970022..fa69bba 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -57,6 +57,29 @@ void ext4_ioend_wait(struct inode *inode)
wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
}
+void ext4_unwritten_wait(struct inode *inode)
+{
+ wait_queue_head_t *wq = ext4_ioend_wq(inode);
+
+ wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
+}
+
+/* Add the io_end to per-inode completed aio dio list. */
+void ext4_add_complete_io(ext4_io_end_t *io_end)
+{
+ struct ext4_inode_info *ei = EXT4_I(io_end->inode);
+ struct workqueue_struct *wq;
+ unsigned long flags;
+
+ wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
+
+ spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+ if (list_empty(&ei->i_completed_io_list))
+ queue_work(wq, &io_end->work);
+ list_add_tail(&io_end->list, &ei->i_completed_io_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+}
+
static void put_io_page(struct ext4_io_page *io_page)
{
if (atomic_dec_and_test(&io_page->p_count)) {
@@ -83,12 +106,7 @@ void ext4_free_io_end(ext4_io_end_t *io)
kmem_cache_free(io_end_cachep, io);
}
-/*
- * check a range of space and convert unwritten extents to written.
- *
- * Called with inode->i_mutex; we depend on this when we manipulate
- * io->flag, since we could otherwise race with ext4_flush_completed_IO()
- */
+/* check a range of space and convert unwritten extents to written. */
int ext4_end_io_nolock(ext4_io_end_t *io)
{
struct inode *inode = io->inode;
@@ -98,6 +116,8 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
+ BUG_ON(!list_empty(&io->list));
+
ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
"list->prev 0x%p\n",
io, inode->i_ino, io->list.next, io->list.prev);
@@ -122,6 +142,32 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
return ret;
}
+static void dump_completed_IO(struct inode * inode)
+{
+#ifdef EXT4FS_DEBUG
+ struct list_head *cur, *before, *after;
+ ext4_io_end_t *io, *io0, *io1;
+ unsigned long flags;
+
+ if (list_empty(&EXT4_I(inode)->i_completed_io_list)){
+ ext4_debug("inode %lu completed_io list is empty\n", inode->i_ino);
+ return;
+ }
+
+ ext4_debug("Dump inode %lu completed_io list \n", inode->i_ino);
+ list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list){
+ cur = &io->list;
+ before = cur->prev;
+ io0 = container_of(before, ext4_io_end_t, list);
+ after = cur->next;
+ io1 = container_of(after, ext4_io_end_t, list);
+
+ ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
+ io, inode->i_ino, io0, io1);
+ }
+#endif
+}
+
/*
* work on completed aio dio IO, to convert unwritten extents to extents
*/
@@ -131,44 +177,24 @@ static void ext4_end_io_work(struct work_struct *work)
struct inode *inode = io->inode;
struct ext4_inode_info *ei = EXT4_I(inode);
unsigned long flags;
+ struct list_head complete_list;
spin_lock_irqsave(&ei->i_completed_io_lock, flags);
- if (io->flag & EXT4_IO_END_IN_FSYNC)
- goto requeue;
- if (list_empty(&io->list)) {
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- goto free;
- }
+ dump_completed_IO(inode);
+ list_replace_init(&ei->i_completed_io_list, &complete_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- if (!mutex_trylock(&inode->i_mutex)) {
- bool was_queued;
-requeue:
- was_queued = !!(io->flag & EXT4_IO_END_QUEUED);
- io->flag |= EXT4_IO_END_QUEUED;
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- /*
- * Requeue the work instead of waiting so that the work
- * items queued after this can be processed.
- */
- queue_work(EXT4_SB(inode->i_sb)->dio_unwritten_wq, &io->work);
- /*
- * To prevent the ext4-dio-unwritten thread from keeping
- * requeueing end_io requests and occupying cpu for too long,
- * yield the cpu if it sees an end_io request that has already
- * been requeued.
- */
- if (was_queued)
- yield();
- return;
+ BUG_ON(list_empty(&complete_list));
+
+ while(!list_empty(&complete_list)) {
+ io = list_entry(complete_list.next, ext4_io_end_t, list);
+ list_del_init(&io->list);
+ ext4_end_io_nolock(io);
+ ext4_free_io_end(io);
}
- list_del_init(&io->list);
- spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
- (void) ext4_end_io_nolock(io);
- mutex_unlock(&inode->i_mutex);
-free:
- ext4_free_io_end(io);
}
+
ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
{
ext4_io_end_t *io = kmem_cache_zalloc(io_end_cachep, flags);
@@ -199,9 +225,7 @@ static void buffer_io_error(struct buffer_head *bh)
static void ext4_end_bio(struct bio *bio, int error)
{
ext4_io_end_t *io_end = bio->bi_private;
- struct workqueue_struct *wq;
struct inode *inode;
- unsigned long flags;
int i;
sector_t bi_sector = bio->bi_sector;
@@ -259,14 +283,7 @@ static void ext4_end_bio(struct bio *bio, int error)
return;
}
- /* Add the io_end to per-inode completed io list*/
- spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
- list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
- spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 05/10] ext4: serialize dio nonlocked reads with defrag workers V3
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
` (3 preceding siblings ...)
2012-09-24 11:44 ` [PATCH 04/10] ext4: completed_io locking cleanup V3 Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-26 13:49 ` Jan Kara
2012-09-24 11:44 ` [PATCH 06/10] ext4: punch_hole should wait for DIO writers V2 Dmitry Monakhov
` (4 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
Inode's block defrag and ext4_change_inode_journal_flag() may
affect nonlocked DIO reads result, so proper synchronization
required.
- Add missed inode_dio_wait() calls where appropriate
- Check inode state under extra i_dio_count reference.
Changes from V1:
- add missed memory bariers
- move DIOREAD_LOCK state check out from generic should_dioread_nolock
otherwise it will affect existing DIO readers.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ext4.h | 17 +++++++++++++++++
fs/ext4/indirect.c | 13 +++++++++++++
fs/ext4/inode.c | 5 +++++
fs/ext4/move_extent.c | 8 ++++++++
4 files changed, 43 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cc423cf..fccd263 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1350,6 +1350,8 @@ enum {
EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_DELALLOC_RESERVED, /* blks already reserved for delalloc */
+ EXT4_STATE_DIOREAD_LOCK, /* Disable support for dio read
+ nolocking */
};
#define EXT4_INODE_BIT_FNS(name, field, offset) \
@@ -2462,6 +2464,21 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh)
set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state);
}
+/*
+ * Disable DIO read nolock optimization, so new dioreaders will be forced
+ * to grab i_mutex
+ */
+static inline void ext4_inode_block_unlocked_dio(struct inode *inode)
+{
+ ext4_set_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
+ smp_mb();
+}
+static inline void ext4_inode_resume_unlocked_dio(struct inode *inode)
+{
+ smp_mb();
+ ext4_clear_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
+}
+
#define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1)
/* For ioend & aio unwritten conversion wait queues */
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 251e35f..ec38f81 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -810,10 +810,23 @@ retry:
if (unlikely(!list_empty(&ei->i_completed_io_list)))
ext4_unwritten_wait(inode);
+ /*
+ * Nolock dioread optimization may be dynamically disabled
+ * via ext4_inode_block_unlocked_dio(). Check inode's state
+ * while holding extra i_dio_count ref.
+ */
+ atomic_inc(&inode->i_dio_count);
+ smp_mb();
+ if (!unlikely(ext4_test_inode_state(inode,
+ EXT4_STATE_DIOREAD_LOCK))) {
+ inode_dio_done(inode);
+ goto retry;
+ }
ret = __blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
ext4_get_block, NULL, NULL, 0);
+ inode_dio_done(inode);
} else {
ret = blockdev_direct_IO(rw, iocb, inode, iov,
offset, nr_segs, ext4_get_block);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 05d860e..542a462 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4717,6 +4717,10 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
return err;
}
+ /* Wait for all existing dio workers */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
jbd2_journal_lock_updates(journal);
/*
@@ -4736,6 +4740,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_aops(inode);
jbd2_journal_unlock_updates(journal);
+ ext4_inode_resume_unlocked_dio(inode);
/* Finally we can mark the inode as dirty. */
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index c5826c6..ae0357b 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -1214,6 +1214,12 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp,
if (ret1 < 0)
return ret1;
+ /* Wait for all existing dio workers */
+ ext4_inode_block_unlocked_dio(orig_inode);
+ ext4_inode_block_unlocked_dio(donor_inode);
+ inode_dio_wait(orig_inode);
+ inode_dio_wait(donor_inode);
+
/* Protect extent tree against block allocations via delalloc */
double_down_write_data_sem(orig_inode, donor_inode);
/* Check the filesystem environment whether move_extent can be done */
@@ -1412,6 +1418,8 @@ out:
kfree(holecheck_path);
}
double_up_write_data_sem(orig_inode, donor_inode);
+ ext4_inode_block_unlocked_dio(orig_inode);
+ ext4_inode_block_unlocked_dio(donor_inode);
ret2 = mext_inode_double_unlock(orig_inode, donor_inode);
if (ret1)
--
1.7.7.6
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 06/10] ext4: punch_hole should wait for DIO writers V2
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
` (4 preceding siblings ...)
2012-09-24 11:44 ` [PATCH 05/10] ext4: serialize dio nonlocked reads with defrag workers V3 Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-26 13:56 ` Jan Kara
2012-09-24 11:44 ` [PATCH 07/10] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
` (3 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
punch_hole are the places where we have to wait for all existing writers
(writeback, aio, dio), but currently we simply flush pended end_io request
which is not sufficient. Even more i_mutex is not holded while punch_hole
which obviously result in dangerous data corruption due to
write-after-free.
This patch performs following changes:
- Guard punch_hole with i_mutex
- Recheck inode flags under i_mutex
- Block all new dio readers in order to prevent information leak caused by
read-after-free pattern.
- punch_hole now wait for all writers in flight
NOTE: XXX write-after-free race is still possible because
truncate_pagecache_range() is not completely reliable and where
is no easy way to stop writeback while punch_hole is in progress.
Changes from V1:
Add flag checks once we hold i_mutex
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/extents.c | 48 ++++++++++++++++++++++++++++++++----------------
1 files changed, 32 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6326874..2d58f4c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4821,9 +4821,29 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
loff_t first_page_offset, last_page_offset;
int credits, err = 0;
+ /*
+ * Write out all dirty pages to avoid race conditions
+ * Then release them.
+ */
+ if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+ err = filemap_write_and_wait_range(mapping,
+ offset, offset + length - 1);
+
+ if (err)
+ return err;
+ }
+
+ mutex_lock(&inode->i_mutex);
+ /* Need recheck file flags under mutex */
+ /* It's not possible punch hole on append only file */
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return -EPERM;
+ if (IS_SWAPFILE(inode))
+ return -ETXTBSY;
+
/* No need to punch hole beyond i_size */
if (offset >= inode->i_size)
- return 0;
+ goto out_mutex;
/*
* If the hole extends beyond i_size, set the hole
@@ -4841,31 +4861,23 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
first_page_offset = first_page << PAGE_CACHE_SHIFT;
last_page_offset = last_page << PAGE_CACHE_SHIFT;
- /*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- err = filemap_write_and_wait_range(mapping,
- offset, offset + length - 1);
-
- if (err)
- return err;
- }
-
/* Now release the pages */
if (last_page_offset > first_page_offset) {
truncate_pagecache_range(inode, first_page_offset,
last_page_offset - 1);
}
- /* finish any pending end_io work */
+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
ext4_unwritten_wait(inode);
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto out_dio;
+ }
/*
@@ -4955,6 +4967,10 @@ out:
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
+out_dio:
+ ext4_inode_resume_unlocked_dio(inode);
+out_mutex:
+ mutex_unlock(&inode->i_mutex);
return err;
}
int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
--
1.7.7.6
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 07/10] ext4: serialize unlocked dio reads with truncate
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
` (5 preceding siblings ...)
2012-09-24 11:44 ` [PATCH 06/10] ext4: punch_hole should wait for DIO writers V2 Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 08/10] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
` (2 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
Current serialization will works only for DIO which holds
i_mutex, but nonlocked DIO following race is possible:
dio_nolock_read_task truncate_task
->ext4_setattr()
->inode_dio_wait()
->ext4_ext_direct_IO
->ext4_ind_direct_IO
->__blockdev_direct_IO
->ext4_get_block
->truncate_setsize()
->ext4_truncate()
#alloc truncated blocks
#to other inode
->submit_io()
#INFORMATION LEAK
In order to serialize with unlocked DIO reads we have to
rearrange wait sequence
1) update i_size first
2) if i_size about to be reduced wait for outstanding DIO requests
3) and only after that truncate inode blocks
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/inode.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 542a462..32e9701 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4280,7 +4280,6 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
}
if (attr->ia_valid & ATTR_SIZE) {
- inode_dio_wait(inode);
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -4329,8 +4328,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
}
if (attr->ia_valid & ATTR_SIZE) {
- if (attr->ia_size != i_size_read(inode))
+ if (attr->ia_size != inode->i_size) {
truncate_setsize(inode, attr->ia_size);
+ /* Inode size will be reduced, wait for dio in flight */
+ if (orphan)
+ inode_dio_wait(inode);
+ }
ext4_truncate(inode);
}
--
1.7.7.6
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 08/10] ext4: endless truncate due to nonlocked dio readers V2
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
` (6 preceding siblings ...)
2012-09-24 11:44 ` [PATCH 07/10] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-26 14:05 ` Jan Kara
2012-09-24 11:44 ` [PATCH 09/10] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 10/10] ext4: fix ext_remove_space for punch_hole case Dmitry Monakhov
9 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
If we have enough aggressive DIO readers, truncate and other dio
waiters will wait forever inside inode_dio_wait(). It is reasonable
to disable nonlock DIO read optimization during truncate.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/inode.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 32e9701..d3f86e7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4330,9 +4330,13 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
if (attr->ia_valid & ATTR_SIZE) {
if (attr->ia_size != inode->i_size) {
truncate_setsize(inode, attr->ia_size);
- /* Inode size will be reduced, wait for dio in flight */
- if (orphan)
+ /* Inode size will be reduced, wait for dio in flight.
+ * Temproraly disable unlocked DIO to prevent livelock */
+ if (orphan) {
+ ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);
+ ext4_inode_resume_unlocked_dio(inode);
+ }
}
ext4_truncate(inode);
}
--
1.7.7.6
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 09/10] ext4: serialize truncate with owerwrite DIO workers V2
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
` (7 preceding siblings ...)
2012-09-24 11:44 ` [PATCH 08/10] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 10/10] ext4: fix ext_remove_space for punch_hole case Dmitry Monakhov
9 siblings, 0 replies; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
Jan Kara have spotted interesting issue:
There are potential data corruption issue with direct IO overwrites
racing with truncate:
Like:
dio write truncate_task
->ext4_ext_direct_IO
->overwrite == 1
->down_read(&EXT4_I(inode)->i_data_sem);
->mutex_unlock(&inode->i_mutex);
->ext4_setattr()
->inode_dio_wait()
->truncate_setsize()
->ext4_truncate()
->down_write(&EXT4_I(inode)->i_data_sem);
->__blockdev_direct_IO
->ext4_get_block
->submit_io()
->up_read(&EXT4_I(inode)->i_data_sem);
# truncate data blocks, allocate them to
# other inode - bad stuff happens because
# dio is still in flight.
In order to serialize with truncate dio worker should grab extra i_dio_count
reference before drop i_mutex.
Changes agains V1:
- wake up dio waiters before i_mutex.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/inode.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d3f86e7..00c3222 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3011,6 +3011,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
overwrite = *((int *)iocb->private);
if (overwrite) {
+ atomic_inc(&inode->i_dio_count);
down_read(&EXT4_I(inode)->i_data_sem);
mutex_unlock(&inode->i_mutex);
}
@@ -3108,6 +3109,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
retake_lock:
/* take i_mutex locking again if we do a ovewrite dio */
if (overwrite) {
+ inode_dio_done(inode);
up_read(&EXT4_I(inode)->i_data_sem);
mutex_lock(&inode->i_mutex);
}
--
1.7.7.6
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 10/10] ext4: fix ext_remove_space for punch_hole case
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
` (8 preceding siblings ...)
2012-09-24 11:44 ` [PATCH 09/10] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
@ 2012-09-24 11:44 ` Dmitry Monakhov
9 siblings, 0 replies; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-24 11:44 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, jack, lczerner, Dmitry Monakhov
Inode is allowed to have empty leaf only if it this is blockless inode
i.e. (depth == 0).
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/extents.c | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2d58f4c..57d6438 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2648,12 +2648,15 @@ again:
return PTR_ERR(path);
}
depth = ext_depth(inode);
+ /* Leaf not may not exist only if inode has no blocks at all */
ex = path[depth].p_ext;
if (!ex) {
- ext4_ext_drop_refs(path);
- kfree(path);
- path = NULL;
- goto cont;
+ if (depth) {
+ EXT4_ERROR_INODE(inode, "path[%d].p_hdr == NULL",
+ depth);
+ err = -EIO;
+ }
+ goto out;
}
ee_block = le32_to_cpu(ex->ee_block);
--
1.7.7.6
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 01/10] ext4: ext4_inode_info diet
2012-09-24 11:44 ` [PATCH 01/10] ext4: ext4_inode_info diet Dmitry Monakhov
@ 2012-09-26 12:28 ` Jan Kara
0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2012-09-26 12:28 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, lczerner
On Mon 24-09-12 15:44:11, Dmitry Monakhov wrote:
> Generic inode has unused i_private pointer which may be used as cur_aio_dio
> storage.
>
> TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
> to have concurent AIO_DIO requests.
>
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/ext4.h | 5 +++--
> fs/ext4/extents.c | 4 ++--
> fs/ext4/inode.c | 6 +++---
> fs/ext4/super.c | 1 -
> 4 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8b6902c..e59fbc5 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -912,8 +912,6 @@ struct ext4_inode_info {
> struct list_head i_completed_io_list;
> spinlock_t i_completed_io_lock;
> atomic_t i_ioend_count; /* Number of outstanding io_end structs */
> - /* current io_end structure for async DIO write*/
> - ext4_io_end_t *cur_aio_dio;
> atomic_t i_aiodio_unwritten; /* Nr. of inflight conversions pending */
>
> spinlock_t i_block_reservation_lock;
> @@ -929,6 +927,9 @@ struct ext4_inode_info {
> __u32 i_csum_seed;
> };
>
> +/* current io_end structure for async DIO write */
> +#define EXT4_CUR_AIO_DIO(inode) (inode)->i_private
Inline functions are prefered over macros (due to type checks they
provide etc.). Also that will allow you to get rid of those type casts
when using EXT4_CUR_AIO_DIO(). So please make a pair of get/set inline
functions out of this macro.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 02/10] ext4: give i_aiodio_unwritten more appropriate name
2012-09-24 11:44 ` [PATCH 02/10] ext4: give i_aiodio_unwritten more appropriate name Dmitry Monakhov
@ 2012-09-26 12:32 ` Jan Kara
0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2012-09-26 12:32 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, lczerner
On Mon 24-09-12 15:44:12, Dmitry Monakhov wrote:
> AIO/DIO prefix is wrong because it account unwritten extents which
> also may be scheduled from buffered write endio
Correct. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/ext4.h | 4 ++--
> fs/ext4/file.c | 6 +++---
> fs/ext4/page-io.c | 2 +-
> fs/ext4/super.c | 2 +-
> 4 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index e59fbc5..47f0778 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -912,7 +912,7 @@ struct ext4_inode_info {
> struct list_head i_completed_io_list;
> spinlock_t i_completed_io_lock;
> atomic_t i_ioend_count; /* Number of outstanding io_end structs */
> - atomic_t i_aiodio_unwritten; /* Nr. of inflight conversions pending */
> + atomic_t i_unwritten; /* Nr. of inflight conversions pending */
>
> spinlock_t i_block_reservation_lock;
>
> @@ -1335,7 +1335,7 @@ static inline void ext4_set_io_unwritten_flag(struct inode *inode,
> {
> if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> io_end->flag |= EXT4_IO_END_UNWRITTEN;
> - atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> + atomic_inc(&EXT4_I(inode)->i_unwritten);
> }
> }
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 3b0e3bd..39335bd 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -55,11 +55,11 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
> return 0;
> }
>
> -static void ext4_aiodio_wait(struct inode *inode)
> +static void ext4_unwritten_wait(struct inode *inode)
> {
> wait_queue_head_t *wq = ext4_ioend_wq(inode);
>
> - wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_aiodio_unwritten) == 0));
> + wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
> }
>
> /*
> @@ -116,7 +116,7 @@ ext4_file_dio_write(struct kiocb *iocb, const struct iovec *iov,
> "performance will be poor.",
> inode->i_ino, current->comm);
> mutex_lock(ext4_aio_mutex(inode));
> - ext4_aiodio_wait(inode);
> + ext4_unwritten_wait(inode);
> }
>
> BUG_ON(iocb->ki_pos != pos);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index dcdeef1..de77e31 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -113,7 +113,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
> if (io->flag & EXT4_IO_END_DIRECT)
> inode_dio_done(inode);
> /* Wake up anyone waiting on unwritten extent conversion */
> - if (atomic_dec_and_test(&EXT4_I(inode)->i_aiodio_unwritten))
> + if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
> wake_up_all(ext4_ioend_wq(io->inode));
> return ret;
> }
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 64acf55..0a37c75 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -968,7 +968,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> ei->i_sync_tid = 0;
> ei->i_datasync_tid = 0;
> atomic_set(&ei->i_ioend_count, 0);
> - atomic_set(&ei->i_aiodio_unwritten, 0);
> + atomic_set(&ei->i_unwritten, 0);
>
> return &ei->vfs_inode;
> }
> --
> 1.7.7.6
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 03/10] ext4: fix unwritten counter leakage
2012-09-24 11:44 ` [PATCH 03/10] ext4: fix unwritten counter leakage Dmitry Monakhov
@ 2012-09-26 13:07 ` Jan Kara
2012-09-27 12:19 ` Dmitry Monakhov
0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2012-09-26 13:07 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, lczerner
On Mon 24-09-12 15:44:13, Dmitry Monakhov wrote:
> ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
> once we mark end_io with END_IO_UNWRITTEN we have to revert it back
^^ EXT4_IO_END_UNWRITTEN
> on error path.
>
> - add missed error checks to prevent counter leakage
> - ext4_end_io_nolock() will clear END_IO_UNWRITTEN flag to signal
^^ EXT4_IO_END_UNWRITTEN
> that conversion finished.
> - add BUGON to free_end_io() to prevent similar leackage in future.
^^ BUG_ON ^^ext4_free_io_end() ^^ leakage
> Visiable effect of this bug is that unaligned aio_stress may deadlock
^^ Visible
Umm, and won't it be more foolproof it we just decrement i_unwritten in
ext4_free_io_end() when we see EXT4_IO_END_UNWRITTEN set?
That still leaves the mess with EXT4_STATE_DIO_UNWRITTEN unhandled. But
that's a separate issue. We seem to clear that flag only in
ext4_ext_direct_IO() although it could be set even when buffered write
converts extents. And error cases seem to be buggy as well.
Honza
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/extents.c | 21 ++++++++++++++-------
> fs/ext4/page-io.c | 6 +++++-
> 2 files changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 6eb6b0c..739c21d 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3660,6 +3660,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
> if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
> ret = ext4_split_unwritten_extents(handle, inode, map,
> path, flags);
> + if (ret <= 0)
> + goto out;
> /*
> * Flag the inode(non aio case) or end_io struct (aio case)
> * that this IO needs to conversion to written when IO is
> @@ -3905,6 +3907,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> struct ext4_allocation_request ar;
> ext4_io_end_t *io = (ext4_io_end_t*) EXT4_CUR_AIO_DIO(inode);
> ext4_lblk_t cluster_offset;
> + int set_unwritten = 0;
>
> ext_debug("blocks %u/%u requested for inode %lu\n",
> map->m_lblk, map->m_len, inode->i_ino);
> @@ -4127,13 +4130,8 @@ got_allocated_blocks:
> * For non asycn direct IO case, flag the inode state
> * that we need to perform conversion when IO is done.
> */
> - if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
> - if (io)
> - ext4_set_io_unwritten_flag(inode, io);
> - else
> - ext4_set_inode_state(inode,
> - EXT4_STATE_DIO_UNWRITTEN);
> - }
> + if ((flags & EXT4_GET_BLOCKS_PRE_IO))
> + set_unwritten = 1;
> if (ext4_should_dioread_nolock(inode))
> map->m_flags |= EXT4_MAP_UNINIT;
> }
> @@ -4145,6 +4143,15 @@ got_allocated_blocks:
> if (!err)
> err = ext4_ext_insert_extent(handle, inode, path,
> &newex, flags);
> +
> + if (!err && set_unwritten) {
> + if (io)
> + ext4_set_io_unwritten_flag(inode, io);
> + else
> + ext4_set_inode_state(inode,
> + EXT4_STATE_DIO_UNWRITTEN);
> + }
> +
> if (err && free_on_err) {
> int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
> EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index de77e31..9970022 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -71,6 +71,8 @@ void ext4_free_io_end(ext4_io_end_t *io)
> int i;
>
> BUG_ON(!io);
> + BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
> +
> if (io->page)
> put_page(io->page);
> for (i = 0; i < io->num_io_pages; i++)
> @@ -94,6 +96,8 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
> ssize_t size = io->size;
> int ret = 0;
>
> + BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
> +
> ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
> "list->prev 0x%p\n",
> io, inode->i_ino, io->list.next, io->list.prev);
> @@ -106,7 +110,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
> "(inode %lu, offset %llu, size %zd, error %d)",
> inode->i_ino, offset, size, ret);
> }
> -
> + io->flag &= ~EXT4_IO_END_UNWRITTEN;
> if (io->iocb)
> aio_complete(io->iocb, io->result, 0);
>
> --
> 1.7.7.6
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 04/10] ext4: completed_io locking cleanup V3
2012-09-24 11:44 ` [PATCH 04/10] ext4: completed_io locking cleanup V3 Dmitry Monakhov
@ 2012-09-26 13:42 ` Jan Kara
2012-09-27 11:24 ` Dmitry Monakhov
0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2012-09-26 13:42 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, lczerner
On Mon 24-09-12 15:44:14, Dmitry Monakhov wrote:
> Current unwritten extent conversion state-machine is very fuzzy.
> - By unknown reason it want perform conversion under i_mutex. What for?
> It was initially added by Theodore. Please comment your initial assumption.
> My diagnosis:
> We already protect extent tree with i_data_sem, truncate should
> wait for DIO in flight, so the only data we have to protect io->flags
> modification, but only flush_completed_IO and work are modified this
> flags and we can serialize them via i_completed_io_lock.
>
> Currently all this games with mutex_trylock result in following deadlock
> truncate: kworker:
> ext4_setattr ext4_end_io_work
> mutex_lock(i_mutex)
> inode_dio_wait(inode) ->BLOCK
> DEADLOCK<- mutex_trylock()
> inode_dio_done()
> #TEST_CASE1_BEGIN
> MNT=/mnt_scrach
> unlink $MNT/file
> fallocate -l $((1024*1024*1024)) $MNT/file
> aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
> sleep 2
> truncate -s 0 $MNT/file
> #TEST_CASE1_END
>
> Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
>
> This patch makes state machine simple and clean:
> (1) ext4_end_io_work is responsible for handling all pending
> end_io from ei->i_completed_io_list(per inode list)
> NOTE1: i_completed_io_lock is acquired only once
> NOTE2: i_mutex is not required because it does not protect
> any data guarded by i_mutex any more
>
> (2) xxx_end_io schedule end_io context completion simply by pushing it
> to the inode's list.
> NOTE1: because of (1) work should be queued only if
> ->i_completed_io_list was empty at the moment, otherwise it
> work is scheduled already.
>
> (3) No one is able to free inode's blocks while pented io_completion
> exist othervise may result in blocks beyond EOF, this
> stated by the fact that all truncate routines wait for
> all pended unwritten requets in flight
>
> (4) Replace flush_completed_io() with ext4_unwritten_wait(). This
> allow greatly simplify state machine because end_io conext
> will be destroyed only in one place (end_io_work)
>
>
> - remove EXT4_IO_END_QUEUED and EXT4_IO_END_FSYNC flags because
> end_io is now destroyed from known context
> - Improve SMP scalability by removing useless i_mutex which does not
> protect io->flags anymore.
> - Reduce lock contention on i_completed_io_lock by optimizing list walk.
> - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
>
> Changes since V2:
> Fix use-after-free caused by race truncate vs end_io_work
Nice work! Some comments below:
...
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 9970022..fa69bba 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -57,6 +57,29 @@ void ext4_ioend_wait(struct inode *inode)
> wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
> }
>
> +void ext4_unwritten_wait(struct inode *inode)
> +{
> + wait_queue_head_t *wq = ext4_ioend_wq(inode);
> +
> + wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
> +}
I would add WARN_ON_ONCE(!mutex_locked(inode->i_mutex)) here because
without i_mutex this could be easily livelocked... Also I'm somewhat uneasy
that we wait for worker to do the work but it can be rather busy with
completing work for other inodes. So won't this slow down e.g. fsync() or
truncate() when there is heavy writing to other inodes? I guess some
numbers would be appropriate here...
> @@ -83,12 +106,7 @@ void ext4_free_io_end(ext4_io_end_t *io)
> kmem_cache_free(io_end_cachep, io);
> }
>
> -/*
> - * check a range of space and convert unwritten extents to written.
> - *
> - * Called with inode->i_mutex; we depend on this when we manipulate
> - * io->flag, since we could otherwise race with ext4_flush_completed_IO()
> - */
> +/* check a range of space and convert unwritten extents to written. */
> int ext4_end_io_nolock(ext4_io_end_t *io)
> {
> struct inode *inode = io->inode;
ext4_end_io_nolock() is a misnomer now. So just make it ext4_end_io() and
make it static.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 05/10] ext4: serialize dio nonlocked reads with defrag workers V3
2012-09-24 11:44 ` [PATCH 05/10] ext4: serialize dio nonlocked reads with defrag workers V3 Dmitry Monakhov
@ 2012-09-26 13:49 ` Jan Kara
0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2012-09-26 13:49 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, lczerner
On Mon 24-09-12 15:44:15, Dmitry Monakhov wrote:
> Inode's block defrag and ext4_change_inode_journal_flag() may
> affect nonlocked DIO reads result, so proper synchronization
> required.
>
> - Add missed inode_dio_wait() calls where appropriate
> - Check inode state under extra i_dio_count reference.
>
>
> Changes from V1:
> - add missed memory bariers
> - move DIOREAD_LOCK state check out from generic should_dioread_nolock
> otherwise it will affect existing DIO readers.
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/ext4.h | 17 +++++++++++++++++
> fs/ext4/indirect.c | 13 +++++++++++++
> fs/ext4/inode.c | 5 +++++
> fs/ext4/move_extent.c | 8 ++++++++
> 4 files changed, 43 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index cc423cf..fccd263 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1350,6 +1350,8 @@ enum {
> EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
> EXT4_STATE_NEWENTRY, /* File just added to dir */
> EXT4_STATE_DELALLOC_RESERVED, /* blks already reserved for delalloc */
> + EXT4_STATE_DIOREAD_LOCK, /* Disable support for dio read
> + nolocking */
> };
>
> #define EXT4_INODE_BIT_FNS(name, field, offset) \
> @@ -2462,6 +2464,21 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh)
> set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state);
> }
>
> +/*
> + * Disable DIO read nolock optimization, so new dioreaders will be forced
> + * to grab i_mutex
> + */
> +static inline void ext4_inode_block_unlocked_dio(struct inode *inode)
> +{
> + ext4_set_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
> + smp_mb();
> +}
> +static inline void ext4_inode_resume_unlocked_dio(struct inode *inode)
> +{
> + smp_mb();
> + ext4_clear_inode_state(inode, EXT4_STATE_DIOREAD_LOCK);
> +}
> +
> #define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1)
>
> /* For ioend & aio unwritten conversion wait queues */
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 251e35f..ec38f81 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -810,10 +810,23 @@ retry:
> if (unlikely(!list_empty(&ei->i_completed_io_list)))
> ext4_unwritten_wait(inode);
>
> + /*
> + * Nolock dioread optimization may be dynamically disabled
> + * via ext4_inode_block_unlocked_dio(). Check inode's state
> + * while holding extra i_dio_count ref.
> + */
> + atomic_inc(&inode->i_dio_count);
> + smp_mb();
> + if (!unlikely(ext4_test_inode_state(inode,
> + EXT4_STATE_DIOREAD_LOCK))) {
Isn't this condition reverted? We want to retry if
EXT4_STATE_DIOREAD_LOCK is set, don't we?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 06/10] ext4: punch_hole should wait for DIO writers V2
2012-09-24 11:44 ` [PATCH 06/10] ext4: punch_hole should wait for DIO writers V2 Dmitry Monakhov
@ 2012-09-26 13:56 ` Jan Kara
0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2012-09-26 13:56 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, lczerner
On Mon 24-09-12 15:44:16, Dmitry Monakhov wrote:
> punch_hole are the places where we have to wait for all existing writers
> (writeback, aio, dio), but currently we simply flush pended end_io request
> which is not sufficient. Even more i_mutex is not holded while punch_hole
> which obviously result in dangerous data corruption due to
> write-after-free.
>
> This patch performs following changes:
> - Guard punch_hole with i_mutex
> - Recheck inode flags under i_mutex
> - Block all new dio readers in order to prevent information leak caused by
> read-after-free pattern.
> - punch_hole now wait for all writers in flight
> NOTE: XXX write-after-free race is still possible because
> truncate_pagecache_range() is not completely reliable and where
> is no easy way to stop writeback while punch_hole is in progress.
The race I know about is writing via mmap after
truncate_pagecache_range() but before buffers are really truncated. Do you
know about something else?
> Changes from V1:
> Add flag checks once we hold i_mutex
Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/extents.c | 48 ++++++++++++++++++++++++++++++++----------------
> 1 files changed, 32 insertions(+), 16 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 6326874..2d58f4c 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4821,9 +4821,29 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> loff_t first_page_offset, last_page_offset;
> int credits, err = 0;
>
> + /*
> + * Write out all dirty pages to avoid race conditions
> + * Then release them.
> + */
> + if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> + err = filemap_write_and_wait_range(mapping,
> + offset, offset + length - 1);
> +
> + if (err)
> + return err;
> + }
> +
> + mutex_lock(&inode->i_mutex);
> + /* Need recheck file flags under mutex */
> + /* It's not possible punch hole on append only file */
> + if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> + return -EPERM;
> + if (IS_SWAPFILE(inode))
> + return -ETXTBSY;
> +
> /* No need to punch hole beyond i_size */
> if (offset >= inode->i_size)
> - return 0;
> + goto out_mutex;
>
> /*
> * If the hole extends beyond i_size, set the hole
> @@ -4841,31 +4861,23 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> first_page_offset = first_page << PAGE_CACHE_SHIFT;
> last_page_offset = last_page << PAGE_CACHE_SHIFT;
>
> - /*
> - * Write out all dirty pages to avoid race conditions
> - * Then release them.
> - */
> - if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> - err = filemap_write_and_wait_range(mapping,
> - offset, offset + length - 1);
> -
> - if (err)
> - return err;
> - }
> -
> /* Now release the pages */
> if (last_page_offset > first_page_offset) {
> truncate_pagecache_range(inode, first_page_offset,
> last_page_offset - 1);
> }
>
> - /* finish any pending end_io work */
> + /* Wait all existing dio workers, newcomers will block on i_mutex */
> + ext4_inode_block_unlocked_dio(inode);
> + inode_dio_wait(inode);
> ext4_unwritten_wait(inode);
>
> credits = ext4_writepage_trans_blocks(inode);
> handle = ext4_journal_start(inode, credits);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> + if (IS_ERR(handle)) {
> + err = PTR_ERR(handle);
> + goto out_dio;
> + }
>
>
> /*
> @@ -4955,6 +4967,10 @@ out:
> inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> ext4_mark_inode_dirty(handle, inode);
> ext4_journal_stop(handle);
> +out_dio:
> + ext4_inode_resume_unlocked_dio(inode);
> +out_mutex:
> + mutex_unlock(&inode->i_mutex);
> return err;
> }
> int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> --
> 1.7.7.6
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 08/10] ext4: endless truncate due to nonlocked dio readers V2
2012-09-24 11:44 ` [PATCH 08/10] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
@ 2012-09-26 14:05 ` Jan Kara
2012-09-27 15:11 ` Dmitry Monakhov
0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2012-09-26 14:05 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, lczerner
On Mon 24-09-12 15:44:18, Dmitry Monakhov wrote:
> If we have enough aggressive DIO readers, truncate and other dio
> waiters will wait forever inside inode_dio_wait(). It is reasonable
> to disable nonlock DIO read optimization during truncate.
Umm, actually this is a problem with any inode_dio_wait() call in ext4,
isn't it? So I'd just create ext4_inode_dio_wait() doing
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);
ext4_inode_resume_unlocked_dio(inode);
and use it instead of inode_dio_wait().
Honza
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/inode.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 32e9701..d3f86e7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4330,9 +4330,13 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> if (attr->ia_valid & ATTR_SIZE) {
> if (attr->ia_size != inode->i_size) {
> truncate_setsize(inode, attr->ia_size);
> - /* Inode size will be reduced, wait for dio in flight */
> - if (orphan)
> + /* Inode size will be reduced, wait for dio in flight.
> + * Temproraly disable unlocked DIO to prevent livelock */
^^ Temporarily
> + if (orphan) {
> + ext4_inode_block_unlocked_dio(inode);
> inode_dio_wait(inode);
> + ext4_inode_resume_unlocked_dio(inode);
> + }
> }
> ext4_truncate(inode);
> }
> --
> 1.7.7.6
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 04/10] ext4: completed_io locking cleanup V3
2012-09-26 13:42 ` Jan Kara
@ 2012-09-27 11:24 ` Dmitry Monakhov
0 siblings, 0 replies; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-27 11:24 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, tytso, jack, lczerner
On Wed, 26 Sep 2012 15:42:12 +0200, Jan Kara <jack@suse.cz> wrote:
> On Mon 24-09-12 15:44:14, Dmitry Monakhov wrote:
> > Current unwritten extent conversion state-machine is very fuzzy.
> > - By unknown reason it want perform conversion under i_mutex. What for?
> > It was initially added by Theodore. Please comment your initial assumption.
> > My diagnosis:
> > We already protect extent tree with i_data_sem, truncate should
> > wait for DIO in flight, so the only data we have to protect io->flags
> > modification, but only flush_completed_IO and work are modified this
> > flags and we can serialize them via i_completed_io_lock.
> >
> > Currently all this games with mutex_trylock result in following deadlock
> > truncate: kworker:
> > ext4_setattr ext4_end_io_work
> > mutex_lock(i_mutex)
> > inode_dio_wait(inode) ->BLOCK
> > DEADLOCK<- mutex_trylock()
> > inode_dio_done()
> > #TEST_CASE1_BEGIN
> > MNT=/mnt_scrach
> > unlink $MNT/file
> > fallocate -l $((1024*1024*1024)) $MNT/file
> > aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
> > sleep 2
> > truncate -s 0 $MNT/file
> > #TEST_CASE1_END
> >
> > Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
> >
> > This patch makes state machine simple and clean:
> > (1) ext4_end_io_work is responsible for handling all pending
> > end_io from ei->i_completed_io_list(per inode list)
> > NOTE1: i_completed_io_lock is acquired only once
> > NOTE2: i_mutex is not required because it does not protect
> > any data guarded by i_mutex any more
> >
> > (2) xxx_end_io schedule end_io context completion simply by pushing it
> > to the inode's list.
> > NOTE1: because of (1) work should be queued only if
> > ->i_completed_io_list was empty at the moment, otherwise it
> > work is scheduled already.
> >
> > (3) No one is able to free inode's blocks while pented io_completion
> > exist othervise may result in blocks beyond EOF, this
> > stated by the fact that all truncate routines wait for
> > all pended unwritten requets in flight
> >
> > (4) Replace flush_completed_io() with ext4_unwritten_wait(). This
> > allow greatly simplify state machine because end_io conext
> > will be destroyed only in one place (end_io_work)
> >
> >
> > - remove EXT4_IO_END_QUEUED and EXT4_IO_END_FSYNC flags because
> > end_io is now destroyed from known context
> > - Improve SMP scalability by removing useless i_mutex which does not
> > protect io->flags anymore.
> > - Reduce lock contention on i_completed_io_lock by optimizing list walk.
> > - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
> >
> > Changes since V2:
> > Fix use-after-free caused by race truncate vs end_io_work
> Nice work! Some comments below:
>
> ...
> > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> > index 9970022..fa69bba 100644
> > --- a/fs/ext4/page-io.c
> > +++ b/fs/ext4/page-io.c
> > @@ -57,6 +57,29 @@ void ext4_ioend_wait(struct inode *inode)
> > wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
> > }
> >
> > +void ext4_unwritten_wait(struct inode *inode)
> > +{
> > + wait_queue_head_t *wq = ext4_ioend_wq(inode);
> > +
> > + wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
> > +}
> I would add WARN_ON_ONCE(!mutex_locked(inode->i_mutex)) here because
> without i_mutex this could be easily livelocked... Also I'm somewhat uneasy
> that we wait for worker to do the work but it can be rather busy with
> completing work for other inodes. So won't this slow down e.g. fsync() or
> truncate() when there is heavy writing to other inodes? I guess some
> numbers would be appropriate here...
Unfortunately such caller exist, it is called from nonlock dio read, and
live lock is really happen on crazy loads. So there are two possible way
to solve this:
1) guard i_unwritten_wait with i_mutex (as it was before this patch)
2) use old flush_completed_io logic and flush complete_io_list out of order.
I'll use (2) even if it makes code-flow a bit harder to understand.
Also i've realized that it would be reasonable to split this patch in to two:
1) reorganize complete_io state machine (guard all changes with i_complete_io_lock)
2) remove i_mutex where appropriate
P.S: agree with all other comments to this and other patches, will
prepare new version today, +8 hrs for autotest, and will submit it
tomorrow morning.
>
> > @@ -83,12 +106,7 @@ void ext4_free_io_end(ext4_io_end_t *io)
> > kmem_cache_free(io_end_cachep, io);
> > }
> >
> > -/*
> > - * check a range of space and convert unwritten extents to written.
> > - *
> > - * Called with inode->i_mutex; we depend on this when we manipulate
> > - * io->flag, since we could otherwise race with ext4_flush_completed_IO()
> > - */
> > +/* check a range of space and convert unwritten extents to written. */
> > int ext4_end_io_nolock(ext4_io_end_t *io)
> > {
> > struct inode *inode = io->inode;
> ext4_end_io_nolock() is a misnomer now. So just make it ext4_end_io() and
> make it static.
>
> Honza
>
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 03/10] ext4: fix unwritten counter leakage
2012-09-26 13:07 ` Jan Kara
@ 2012-09-27 12:19 ` Dmitry Monakhov
2012-09-27 12:34 ` Jan Kara
0 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-27 12:19 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, tytso, jack, lczerner
On Wed, 26 Sep 2012 15:07:14 +0200, Jan Kara <jack@suse.cz> wrote:
> On Mon 24-09-12 15:44:13, Dmitry Monakhov wrote:
> > ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
> > once we mark end_io with END_IO_UNWRITTEN we have to revert it back
> ^^ EXT4_IO_END_UNWRITTEN
> > on error path.
> >
> > - add missed error checks to prevent counter leakage
> > - ext4_end_io_nolock() will clear END_IO_UNWRITTEN flag to signal
> ^^ EXT4_IO_END_UNWRITTEN
> > that conversion finished.
> > - add BUGON to free_end_io() to prevent similar leackage in future.
> ^^ BUG_ON ^^ext4_free_io_end() ^^ leakage
>
> > Visiable effect of this bug is that unaligned aio_stress may deadlock
> ^^ Visible
>
> Umm, and won't it be more foolproof it we just decrement i_unwritten in
> ext4_free_io_end() when we see EXT4_IO_END_UNWRITTEN set?
I'd like to consider BUG_ON inside ext4_free_io_end as a sanity check to
force all callers to perform all necessary error checks in known context.
>
> That still leaves the mess with EXT4_STATE_DIO_UNWRITTEN unhandled. But
> that's a separate issue. We seem to clear that flag only in
> ext4_ext_direct_IO() although it could be set even when buffered write
> converts extents. And error cases seem to be buggy as well.
No, each unwritten extent will be added to i_complete_io_list regardless
to it's origin (buffered or DIO), and will be completed via
ext4_end_io_nolock(). So assertion is correct.
>
> Honza
>
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> > fs/ext4/extents.c | 21 ++++++++++++++-------
> > fs/ext4/page-io.c | 6 +++++-
> > 2 files changed, 19 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 6eb6b0c..739c21d 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -3660,6 +3660,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
> > if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
> > ret = ext4_split_unwritten_extents(handle, inode, map,
> > path, flags);
> > + if (ret <= 0)
> > + goto out;
> > /*
> > * Flag the inode(non aio case) or end_io struct (aio case)
> > * that this IO needs to conversion to written when IO is
> > @@ -3905,6 +3907,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> > struct ext4_allocation_request ar;
> > ext4_io_end_t *io = (ext4_io_end_t*) EXT4_CUR_AIO_DIO(inode);
> > ext4_lblk_t cluster_offset;
> > + int set_unwritten = 0;
> >
> > ext_debug("blocks %u/%u requested for inode %lu\n",
> > map->m_lblk, map->m_len, inode->i_ino);
> > @@ -4127,13 +4130,8 @@ got_allocated_blocks:
> > * For non asycn direct IO case, flag the inode state
> > * that we need to perform conversion when IO is done.
> > */
> > - if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
> > - if (io)
> > - ext4_set_io_unwritten_flag(inode, io);
> > - else
> > - ext4_set_inode_state(inode,
> > - EXT4_STATE_DIO_UNWRITTEN);
> > - }
> > + if ((flags & EXT4_GET_BLOCKS_PRE_IO))
> > + set_unwritten = 1;
> > if (ext4_should_dioread_nolock(inode))
> > map->m_flags |= EXT4_MAP_UNINIT;
> > }
> > @@ -4145,6 +4143,15 @@ got_allocated_blocks:
> > if (!err)
> > err = ext4_ext_insert_extent(handle, inode, path,
> > &newex, flags);
> > +
> > + if (!err && set_unwritten) {
> > + if (io)
> > + ext4_set_io_unwritten_flag(inode, io);
> > + else
> > + ext4_set_inode_state(inode,
> > + EXT4_STATE_DIO_UNWRITTEN);
> > + }
> > +
> > if (err && free_on_err) {
> > int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
> > EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
> > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> > index de77e31..9970022 100644
> > --- a/fs/ext4/page-io.c
> > +++ b/fs/ext4/page-io.c
> > @@ -71,6 +71,8 @@ void ext4_free_io_end(ext4_io_end_t *io)
> > int i;
> >
> > BUG_ON(!io);
> > + BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
> > +
> > if (io->page)
> > put_page(io->page);
> > for (i = 0; i < io->num_io_pages; i++)
> > @@ -94,6 +96,8 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
> > ssize_t size = io->size;
> > int ret = 0;
> >
> > + BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
> > +
> > ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
> > "list->prev 0x%p\n",
> > io, inode->i_ino, io->list.next, io->list.prev);
> > @@ -106,7 +110,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
> > "(inode %lu, offset %llu, size %zd, error %d)",
> > inode->i_ino, offset, size, ret);
> > }
> > -
> > + io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > if (io->iocb)
> > aio_complete(io->iocb, io->result, 0);
> >
> > --
> > 1.7.7.6
> >
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 03/10] ext4: fix unwritten counter leakage
2012-09-27 12:19 ` Dmitry Monakhov
@ 2012-09-27 12:34 ` Jan Kara
2012-09-27 12:54 ` Dmitry Monakhov
0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2012-09-27 12:34 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, tytso, lczerner
On Thu 27-09-12 16:19:01, Dmitry Monakhov wrote:
> On Wed, 26 Sep 2012 15:07:14 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Mon 24-09-12 15:44:13, Dmitry Monakhov wrote:
> > > ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
> > > once we mark end_io with END_IO_UNWRITTEN we have to revert it back
> > ^^ EXT4_IO_END_UNWRITTEN
> > > on error path.
> > >
> > > - add missed error checks to prevent counter leakage
> > > - ext4_end_io_nolock() will clear END_IO_UNWRITTEN flag to signal
> > ^^ EXT4_IO_END_UNWRITTEN
> > > that conversion finished.
> > > - add BUGON to free_end_io() to prevent similar leackage in future.
> > ^^ BUG_ON ^^ext4_free_io_end() ^^ leakage
> >
> > > Visiable effect of this bug is that unaligned aio_stress may deadlock
> > ^^ Visible
> >
> > Umm, and won't it be more foolproof it we just decrement i_unwritten in
> > ext4_free_io_end() when we see EXT4_IO_END_UNWRITTEN set?
> I'd like to consider BUG_ON inside ext4_free_io_end as a sanity check to
> force all callers to perform all necessary error checks in known context.
I'm not sure how "performing all necessary error checks in known context"
relates to ext4_free_io_end() cleaning up the structure on its own or
whether someone has to do it beforehand... Can you maybe elaborate a bit
more?
> > That still leaves the mess with EXT4_STATE_DIO_UNWRITTEN unhandled. But
> > that's a separate issue. We seem to clear that flag only in
> > ext4_ext_direct_IO() although it could be set even when buffered write
> > converts extents. And error cases seem to be buggy as well.
> No, each unwritten extent will be added to i_complete_io_list regardless
> to it's origin (buffered or DIO), and will be completed via
> ext4_end_io_nolock(). So assertion is correct.
Yes, I agree with what you say. My note was just an off-topic rambling
about inode flag EXT4_STATE_DIO_UNWRITTEN whose handling seem to be buggy
as well.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 03/10] ext4: fix unwritten counter leakage
2012-09-27 12:34 ` Jan Kara
@ 2012-09-27 12:54 ` Dmitry Monakhov
2012-09-27 13:07 ` Jan Kara
0 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-27 12:54 UTC (permalink / raw)
To: Jan Kara; +Cc: Jan Kara, linux-ext4, tytso, lczerner
On Thu, 27 Sep 2012 14:34:20 +0200, Jan Kara <jack@suse.cz> wrote:
> On Thu 27-09-12 16:19:01, Dmitry Monakhov wrote:
> > On Wed, 26 Sep 2012 15:07:14 +0200, Jan Kara <jack@suse.cz> wrote:
> > > On Mon 24-09-12 15:44:13, Dmitry Monakhov wrote:
> > > > ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
> > > > once we mark end_io with END_IO_UNWRITTEN we have to revert it back
> > > ^^ EXT4_IO_END_UNWRITTEN
> > > > on error path.
> > > >
> > > > - add missed error checks to prevent counter leakage
> > > > - ext4_end_io_nolock() will clear END_IO_UNWRITTEN flag to signal
> > > ^^ EXT4_IO_END_UNWRITTEN
> > > > that conversion finished.
> > > > - add BUGON to free_end_io() to prevent similar leackage in future.
> > > ^^ BUG_ON ^^ext4_free_io_end() ^^ leakage
> > >
> > > > Visiable effect of this bug is that unaligned aio_stress may deadlock
> > > ^^ Visible
> > >
> > > Umm, and won't it be more foolproof it we just decrement i_unwritten in
> > > ext4_free_io_end() when we see EXT4_IO_END_UNWRITTEN set?
> > I'd like to consider BUG_ON inside ext4_free_io_end as a sanity check to
> > force all callers to perform all necessary error checks in known context.
> I'm not sure how "performing all necessary error checks in known context"
> relates to ext4_free_io_end() cleaning up the structure on its own or
> whether someone has to do it beforehand... Can you maybe elaborate a bit
> more?
I assume that if end_io was tagged with UNWRITTEN flag it should goes trough
complete_io_list and end_io_nolock(), or caller should cancel it by
itself in case of error, otherwise we may miss valid unwritten end_io
but was not scheduled to complete_end_io routine by occasion (and endup
in silent data loss). In my opinion at the time then ext4_free_io_end()
was called all possible conversions should be completed.
>
> > > That still leaves the mess with EXT4_STATE_DIO_UNWRITTEN unhandled. But
> > > that's a separate issue. We seem to clear that flag only in
> > > ext4_ext_direct_IO() although it could be set even when buffered write
> > > converts extents. And error cases seem to be buggy as well.
> > No, each unwritten extent will be added to i_complete_io_list regardless
> > to it's origin (buffered or DIO), and will be completed via
> > ext4_end_io_nolock(). So assertion is correct.
> Yes, I agree with what you say. My note was just an off-topic rambling
> about inode flag EXT4_STATE_DIO_UNWRITTEN whose handling seem to be buggy
> as well.
>
> Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 03/10] ext4: fix unwritten counter leakage
2012-09-27 12:54 ` Dmitry Monakhov
@ 2012-09-27 13:07 ` Jan Kara
0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2012-09-27 13:07 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, tytso, lczerner
On Thu 27-09-12 16:54:13, Dmitry Monakhov wrote:
> On Thu, 27 Sep 2012 14:34:20 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Thu 27-09-12 16:19:01, Dmitry Monakhov wrote:
> > > On Wed, 26 Sep 2012 15:07:14 +0200, Jan Kara <jack@suse.cz> wrote:
> > > > On Mon 24-09-12 15:44:13, Dmitry Monakhov wrote:
> > > > > ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
> > > > > once we mark end_io with END_IO_UNWRITTEN we have to revert it back
> > > > ^^ EXT4_IO_END_UNWRITTEN
> > > > > on error path.
> > > > >
> > > > > - add missed error checks to prevent counter leakage
> > > > > - ext4_end_io_nolock() will clear END_IO_UNWRITTEN flag to signal
> > > > ^^ EXT4_IO_END_UNWRITTEN
> > > > > that conversion finished.
> > > > > - add BUGON to free_end_io() to prevent similar leackage in future.
> > > > ^^ BUG_ON ^^ext4_free_io_end() ^^ leakage
> > > >
> > > > > Visiable effect of this bug is that unaligned aio_stress may deadlock
> > > > ^^ Visible
> > > >
> > > > Umm, and won't it be more foolproof it we just decrement i_unwritten in
> > > > ext4_free_io_end() when we see EXT4_IO_END_UNWRITTEN set?
> > > I'd like to consider BUG_ON inside ext4_free_io_end as a sanity check to
> > > force all callers to perform all necessary error checks in known context.
> > I'm not sure how "performing all necessary error checks in known context"
> > relates to ext4_free_io_end() cleaning up the structure on its own or
> > whether someone has to do it beforehand... Can you maybe elaborate a bit
> > more?
> I assume that if end_io was tagged with UNWRITTEN flag it should goes trough
> complete_io_list and end_io_nolock(), or caller should cancel it by
> itself in case of error, otherwise we may miss valid unwritten end_io
> but was not scheduled to complete_end_io routine by occasion (and endup
> in silent data loss). In my opinion at the time then ext4_free_io_end()
> was called all possible conversions should be completed.
Fair enough. I just hated verifying all the paths in extent.c and
checking whether they could happen to have end_io with UNWRITTEN flag set.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 08/10] ext4: endless truncate due to nonlocked dio readers V2
2012-09-26 14:05 ` Jan Kara
@ 2012-09-27 15:11 ` Dmitry Monakhov
2012-09-27 15:23 ` Jan Kara
0 siblings, 1 reply; 25+ messages in thread
From: Dmitry Monakhov @ 2012-09-27 15:11 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, tytso, jack, lczerner
On Wed, 26 Sep 2012 16:05:38 +0200, Jan Kara <jack@suse.cz> wrote:
> On Mon 24-09-12 15:44:18, Dmitry Monakhov wrote:
> > If we have enough aggressive DIO readers, truncate and other dio
> > waiters will wait forever inside inode_dio_wait(). It is reasonable
> > to disable nonlock DIO read optimization during truncate.
> Umm, actually this is a problem with any inode_dio_wait() call in ext4,
> isn't it? So I'd just create ext4_inode_dio_wait() doing
> ext4_inode_block_unlocked_dio(inode);
> inode_dio_wait(inode);
> ext4_inode_resume_unlocked_dio(inode);
>
> and use it instead of inode_dio_wait().
Ops sorry miss that comment.
Actually all other places are very special, and guarded already:
1) ext4_ext_punch_hole()
2) ext4_move_extents()
3) ext4_change_inode_journal_flag()
Such functions require explicit scope where nonlocked DIO read should
be disabled. So ext4_setattr() is the only place where we wait for
existing dio, but nonlocked dio reads are allowed and may result in
temporal live-lock.
>
> Honza
>
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> > fs/ext4/inode.c | 8 ++++++--
> > 1 files changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 32e9701..d3f86e7 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4330,9 +4330,13 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> > if (attr->ia_valid & ATTR_SIZE) {
> > if (attr->ia_size != inode->i_size) {
> > truncate_setsize(inode, attr->ia_size);
> > - /* Inode size will be reduced, wait for dio in flight */
> > - if (orphan)
> > + /* Inode size will be reduced, wait for dio in flight.
> > + * Temproraly disable unlocked DIO to prevent livelock */
> ^^ Temporarily
>
> > + if (orphan) {
> > + ext4_inode_block_unlocked_dio(inode);
> > inode_dio_wait(inode);
> > + ext4_inode_resume_unlocked_dio(inode);
> > + }
> > }
> > ext4_truncate(inode);
> > }
> > --
> > 1.7.7.6
> >
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 08/10] ext4: endless truncate due to nonlocked dio readers V2
2012-09-27 15:11 ` Dmitry Monakhov
@ 2012-09-27 15:23 ` Jan Kara
0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2012-09-27 15:23 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, tytso, lczerner
On Thu 27-09-12 19:11:13, Dmitry Monakhov wrote:
> On Wed, 26 Sep 2012 16:05:38 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Mon 24-09-12 15:44:18, Dmitry Monakhov wrote:
> > > If we have enough aggressive DIO readers, truncate and other dio
> > > waiters will wait forever inside inode_dio_wait(). It is reasonable
> > > to disable nonlock DIO read optimization during truncate.
> > Umm, actually this is a problem with any inode_dio_wait() call in ext4,
> > isn't it? So I'd just create ext4_inode_dio_wait() doing
> > ext4_inode_block_unlocked_dio(inode);
> > inode_dio_wait(inode);
> > ext4_inode_resume_unlocked_dio(inode);
> >
> > and use it instead of inode_dio_wait().
> Ops sorry miss that comment.
> Actually all other places are very special, and guarded already:
> 1) ext4_ext_punch_hole()
> 2) ext4_move_extents()
> 3) ext4_change_inode_journal_flag()
> Such functions require explicit scope where nonlocked DIO read should
> be disabled. So ext4_setattr() is the only place where we wait for
> existing dio, but nonlocked dio reads are allowed and may result in
> temporal live-lock.
Ah, OK. Thanks for explanation. I'm just looking for a way how we could
avoid future bugs arising from someone adding inode_dio_wait() somewhere
without realizing there's this special unlocked DIO read issue... Even I
can forget that in an year or two.
But OTOH if we are unaware of that special case, you can get the exclusion
wrong anyway because you may well need to block unlocked DIO for a longer
time. What a mess. So I guess I'm fine with how the patch is now.
Honza
> > > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > > ---
> > > fs/ext4/inode.c | 8 ++++++--
> > > 1 files changed, 6 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 32e9701..d3f86e7 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -4330,9 +4330,13 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> > > if (attr->ia_valid & ATTR_SIZE) {
> > > if (attr->ia_size != inode->i_size) {
> > > truncate_setsize(inode, attr->ia_size);
> > > - /* Inode size will be reduced, wait for dio in flight */
> > > - if (orphan)
> > > + /* Inode size will be reduced, wait for dio in flight.
> > > + * Temproraly disable unlocked DIO to prevent livelock */
> > ^^ Temporarily
> >
> > > + if (orphan) {
> > > + ext4_inode_block_unlocked_dio(inode);
> > > inode_dio_wait(inode);
> > > + ext4_inode_resume_unlocked_dio(inode);
> > > + }
> > > }
> > > ext4_truncate(inode);
> > > }
> > > --
> > > 1.7.7.6
> > >
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2012-09-27 15:23 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-09-24 11:44 [PATCH 00/10] ext4: Bunch of DIO/AIO fixes V3 Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 01/10] ext4: ext4_inode_info diet Dmitry Monakhov
2012-09-26 12:28 ` Jan Kara
2012-09-24 11:44 ` [PATCH 02/10] ext4: give i_aiodio_unwritten more appropriate name Dmitry Monakhov
2012-09-26 12:32 ` Jan Kara
2012-09-24 11:44 ` [PATCH 03/10] ext4: fix unwritten counter leakage Dmitry Monakhov
2012-09-26 13:07 ` Jan Kara
2012-09-27 12:19 ` Dmitry Monakhov
2012-09-27 12:34 ` Jan Kara
2012-09-27 12:54 ` Dmitry Monakhov
2012-09-27 13:07 ` Jan Kara
2012-09-24 11:44 ` [PATCH 04/10] ext4: completed_io locking cleanup V3 Dmitry Monakhov
2012-09-26 13:42 ` Jan Kara
2012-09-27 11:24 ` Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 05/10] ext4: serialize dio nonlocked reads with defrag workers V3 Dmitry Monakhov
2012-09-26 13:49 ` Jan Kara
2012-09-24 11:44 ` [PATCH 06/10] ext4: punch_hole should wait for DIO writers V2 Dmitry Monakhov
2012-09-26 13:56 ` Jan Kara
2012-09-24 11:44 ` [PATCH 07/10] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 08/10] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
2012-09-26 14:05 ` Jan Kara
2012-09-27 15:11 ` Dmitry Monakhov
2012-09-27 15:23 ` Jan Kara
2012-09-24 11:44 ` [PATCH 09/10] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
2012-09-24 11:44 ` [PATCH 10/10] ext4: fix ext_remove_space for punch_hole case Dmitry Monakhov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).