* Re: [PATCH v3 2/3] ext4: fix races between changing inode journal mode and ext4_writepages
@ 2016-03-10 22:20 Daeho Jeong
0 siblings, 0 replies; 3+ messages in thread
From: Daeho Jeong @ 2016-03-10 22:20 UTC (permalink / raw)
To: Jan Kara
Cc: tytso@mit.edu, linux-ext4@vger.kernel.org,
정대호
> You need to change how dax_writeback_mapping_range() is called a few lines
> below so that it also exits via out_writepages: and not directly.
Oops, I see that I based this patch on a slightly older tree, in which
ext4_writepages() does not yet call dax_writeback_mapping_range().
Sorry about that. I'll fix this.
Thank you. :-)
* [PATCH v3 1/3] ext4: handle unwritten or delalloc buffers before enabling per-file data journaling
@ 2016-03-09 8:09 Daeho Jeong
2016-03-09 8:09 ` [PATCH v3 2/3] ext4: fix races between changing inode journal mode and ext4_writepages Daeho Jeong
0 siblings, 1 reply; 3+ messages in thread
From: Daeho Jeong @ 2016-03-09 8:09 UTC (permalink / raw)
To: tytso, jack, linux-ext4; +Cc: Daeho Jeong
We already allocate delalloc blocks before switching an inode to
"per-file data journal" mode so that no delalloc blocks are left
unallocated, but a related issue with the "BH_Unwritten" state still
exists. For example, fallocate() leaves several buffers in the
"BH_Unwritten" state, and ext4_alloc_da_blocks() cannot process those
buffers. They therefore stay unwritten after per-file data journaling
is enabled and can no longer be converted to written. If they are
journaled and eventually checkpointed, these unwritten buffers cause a
kernel panic through the BUG_ON() below in submit_bh_wbc() when they
are submitted during checkpointing.
static int submit_bh_wbc(int rw, struct buffer_head *bh,...
{
...
BUG_ON(buffer_unwritten(bh));
Moreover, when the "dioread_nolock" option is enabled, a buffer is
marked "BH_Unwritten" after write_begin() completes, and the flag is
cleared once the I/O is done. Therefore, if a buffer has been marked
unwritten but its I/O has not yet been submitted and completed, the
same problem can occur after per-file data journaling is enabled. You
can easily reproduce this bug by running the following command.
./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269
To resolve these problems and define a clear boundary between the
previous mode and per-file data journaling mode, we need to flush all
of a file's buffers and wait for that I/O to complete before enabling
per-file data journaling on the file.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/inode.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9cc57c3..9ecfb76 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5378,22 +5378,29 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
return 0;
if (is_journal_aborted(journal))
return -EROFS;
- /* We have to allocate physical blocks for delalloc blocks
- * before flushing journal. otherwise delalloc blocks can not
- * be allocated any more. even more truncate on delalloc blocks
- * could trigger BUG by flushing delalloc blocks in journal.
- * There is no delalloc block in non-journal data mode.
- */
- if (val && test_opt(inode->i_sb, DELALLOC)) {
- err = ext4_alloc_da_blocks(inode);
- if (err < 0)
- return err;
- }
/* Wait for all existing dio workers */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);
+ /*
+ * Before flushing the journal and switching inode's aops, we have
+ * to flush all dirty data the inode has. There can be outstanding
+ * delayed allocations, there can be unwritten extents created by
+ * fallocate or buffered writes in dioread_nolock mode covered by
+ * dirty data which can be converted only after flushing the dirty
+ * data (and journalled aops don't know how to handle these cases).
+ */
+ if (val) {
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ err = filemap_write_and_wait(inode->i_mapping);
+ if (err < 0) {
+ up_write(&EXT4_I(inode)->i_mmap_sem);
+ ext4_inode_resume_unlocked_dio(inode);
+ return err;
+ }
+ }
+
jbd2_journal_lock_updates(journal);
/*
@@ -5418,6 +5425,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_aops(inode);
jbd2_journal_unlock_updates(journal);
+ if (val)
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
/* Finally we can mark the inode as dirty. */
--
1.7.9.5
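The ordering argument in the patch above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: every
toy_* name is hypothetical, the state bits merely mimic the BH_* flags,
and toy_flush_and_wait() stands in for filemap_write_and_wait() without
modelling real I/O. It shows why the flush has to happen before the mode
switch: otherwise an unwritten buffer survives into journaled mode and
trips the check that models BUG_ON(buffer_unwritten(bh)).

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_BUFFERS 4

/* Hypothetical per-buffer state bits; they only mimic the BH_* flags. */
enum toy_bh_state {
	TOY_BH_DIRTY     = 1u << 0,
	TOY_BH_DELALLOC  = 1u << 1,
	TOY_BH_UNWRITTEN = 1u << 2,
};

struct toy_inode {
	unsigned int bh[TOY_NR_BUFFERS];	/* state bits of each buffer */
	bool journal_data;			/* per-file data journaling on? */
};

/*
 * Stand-in for filemap_write_and_wait(): after writeback completes, no
 * buffer is dirty, delayed-allocated, or unwritten any more.
 */
static void toy_flush_and_wait(struct toy_inode *inode)
{
	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		inode->bh[i] = 0;
}

/* Mode switch; flush_first selects the ordering the patch introduces. */
static void toy_set_journal_data(struct toy_inode *inode, bool flush_first)
{
	if (flush_first)
		toy_flush_and_wait(inode);
	inode->journal_data = true;
}

/* Models the BUG_ON() check: false means the BUG_ON() would fire. */
static bool toy_checkpoint_ok(const struct toy_inode *inode)
{
	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		if (inode->bh[i] & TOY_BH_UNWRITTEN)
			return false;
	return true;
}
```

Calling toy_set_journal_data() with flush_first == false on an inode that
still has a TOY_BH_UNWRITTEN buffer makes toy_checkpoint_ok() return
false; with flush_first == true it returns true.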
* [PATCH v3 2/3] ext4: fix races between changing inode journal mode and ext4_writepages
2016-03-09 8:09 [PATCH v3 1/3] ext4: handle unwritten or delalloc buffers before enabling per-file data journaling Daeho Jeong
@ 2016-03-09 8:09 ` Daeho Jeong
2016-03-10 10:04 ` Jan Kara
0 siblings, 1 reply; 3+ messages in thread
From: Daeho Jeong @ 2016-03-09 8:09 UTC (permalink / raw)
To: tytso, jack, linux-ext4; +Cc: Daeho Jeong
Currently, ext4 has a race between changing an inode's journal mode
and ext4_writepages(). While ext4_writepages() is running on an inode
in non-journalled mode, the inode's journal mode can be enabled by
ioctl(), and pages dirtied after the switch are then still exposed to
ext4_writepages() in non-journalled mode.
To resolve this problem, we use a fs-wide per-cpu rw semaphore, as
suggested by Jan Kara, because we don't want to waste space in
ext4_inode_info on this rare case.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/ext4/ext4.h | 4 ++++
fs/ext4/inode.c | 7 +++++++
fs/ext4/super.c | 4 ++++
kernel/locking/percpu-rwsem.c | 1 +
4 files changed, 16 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 157b458..c757a3d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -33,6 +33,7 @@
#include <linux/ratelimit.h>
#include <crypto/hash.h>
#include <linux/falloc.h>
+#include <linux/percpu-rwsem.h>
#ifdef __KERNEL__
#include <linux/compat.h>
#endif
@@ -1475,6 +1476,9 @@ struct ext4_sb_info {
struct ratelimit_state s_err_ratelimit_state;
struct ratelimit_state s_warning_ratelimit_state;
struct ratelimit_state s_msg_ratelimit_state;
+
+ /* Barrier between changing inodes' journal flags and writepages ops. */
+ struct percpu_rw_semaphore s_journal_flag_rwsem;
};
static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9ecfb76..1176142 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2476,6 +2476,7 @@ static int ext4_writepages(struct address_space *mapping,
struct blk_plug plug;
bool give_up_on_write = false;
+ percpu_down_read(&sbi->s_journal_flag_rwsem);
trace_ext4_writepages(inode, wbc);
/*
@@ -2646,6 +2647,7 @@ retry:
out_writepages:
trace_ext4_writepages_result(inode, wbc, ret,
nr_to_write - wbc->nr_to_write);
+ percpu_up_read(&sbi->s_journal_flag_rwsem);
return ret;
}
@@ -5362,6 +5364,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
journal_t *journal;
handle_t *handle;
int err;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
/*
* We have to be very careful here: changing a data block's
@@ -5401,6 +5404,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
}
}
+ percpu_down_write(&sbi->s_journal_flag_rwsem);
jbd2_journal_lock_updates(journal);
/*
@@ -5417,6 +5421,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
err = jbd2_journal_flush(journal);
if (err < 0) {
jbd2_journal_unlock_updates(journal);
+ percpu_up_write(&sbi->s_journal_flag_rwsem);
ext4_inode_resume_unlocked_dio(inode);
return err;
}
@@ -5425,6 +5430,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_aops(inode);
jbd2_journal_unlock_updates(journal);
+ percpu_up_write(&sbi->s_journal_flag_rwsem);
+
if (val)
up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3ed01ec..a12950d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -861,6 +861,7 @@ static void ext4_put_super(struct super_block *sb)
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+ percpu_free_rwsem(&sbi->s_journal_flag_rwsem);
brelse(sbi->s_sbh);
#ifdef CONFIG_QUOTA
for (i = 0; i < EXT4_MAXQUOTAS; i++)
@@ -3926,6 +3927,9 @@ no_journal:
if (!err)
err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0,
GFP_KERNEL);
+ if (!err)
+ err = percpu_init_rwsem(&sbi->s_journal_flag_rwsem);
+
if (err) {
ext4_msg(sb, KERN_ERR, "insufficient memory");
goto failed_mount6;
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f231e0b..bec0b64 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -37,6 +37,7 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
free_percpu(brw->fast_read_ctr);
brw->fast_read_ctr = NULL; /* catch use after free bugs */
}
+EXPORT_SYMBOL_GPL(percpu_free_rwsem);
/*
* This is the fast-path for down_read/up_read. If it succeeds we rely
--
1.7.9.5
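The exclusion rule that the patch above establishes can be sketched as a
deterministic, single-threaded model. All toy_* names are hypothetical,
and the toy lock is not thread-safe and is not the kernel's
percpu_rw_semaphore: where the real primitives would sleep, the "try"
helpers here simply return false so the behaviour can be checked without
threads. The point it illustrates is that a journal flag change cannot
take effect while any writepages pass holds the read side.

```c
#include <stdbool.h>

/* Toy reader/writer lock state; NOT thread-safe, for illustration only. */
struct toy_rwsem {
	int readers;
	bool writer;
};

struct toy_sb {
	struct toy_rwsem journal_flag_rwsem;	/* fs-wide, as in the patch */
};

struct toy_inode {
	struct toy_sb *sb;
	bool journal_data;
};

static bool toy_try_down_read(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;	/* the real lock would sleep here */
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	sem->readers--;
}

static bool toy_try_down_write(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;	/* the real lock would sleep here */
	sem->writer = true;
	return true;
}

static void toy_up_write(struct toy_rwsem *sem)
{
	sem->writer = false;
}

/* Start a writeback pass and report which journal mode it observed. */
static bool toy_writepages_begin(struct toy_inode *inode, bool *mode_seen)
{
	if (!toy_try_down_read(&inode->sb->journal_flag_rwsem))
		return false;
	*mode_seen = inode->journal_data;
	return true;
}

static void toy_writepages_end(struct toy_inode *inode)
{
	toy_up_read(&inode->sb->journal_flag_rwsem);
}

/* Change the mode only when no writeback pass is in flight. */
static bool toy_change_journal_flag(struct toy_inode *inode, bool val)
{
	if (!toy_try_down_write(&inode->sb->journal_flag_rwsem))
		return false;
	inode->journal_data = val;
	toy_up_write(&inode->sb->journal_flag_rwsem);
	return true;
}
```

In this model, toy_change_journal_flag() fails between
toy_writepages_begin() and toy_writepages_end(), so the mode observed at
the start of a pass stays valid until the pass ends.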
* Re: [PATCH v3 2/3] ext4: fix races between changing inode journal mode and ext4_writepages
2016-03-09 8:09 ` [PATCH v3 2/3] ext4: fix races between changing inode journal mode and ext4_writepages Daeho Jeong
@ 2016-03-10 10:04 ` Jan Kara
0 siblings, 0 replies; 3+ messages in thread
From: Jan Kara @ 2016-03-10 10:04 UTC (permalink / raw)
To: Daeho Jeong; +Cc: tytso, jack, linux-ext4
On Wed 09-03-16 17:09:46, Daeho Jeong wrote:
> Currently, ext4 has a race between changing an inode's journal mode
> and ext4_writepages(). While ext4_writepages() is running on an inode
> in non-journalled mode, the inode's journal mode can be enabled by
> ioctl(), and pages dirtied after the switch are then still exposed to
> ext4_writepages() in non-journalled mode.
> To resolve this problem, we use a fs-wide per-cpu rw semaphore, as
> suggested by Jan Kara, because we don't want to waste space in
> ext4_inode_info on this rare case.
>
> Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
The patch is almost fine except for one small issue:
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9ecfb76..1176142 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2476,6 +2476,7 @@ static int ext4_writepages(struct address_space *mapping,
> struct blk_plug plug;
> bool give_up_on_write = false;
>
> + percpu_down_read(&sbi->s_journal_flag_rwsem);
> trace_ext4_writepages(inode, wbc);
You need to change how dax_writeback_mapping_range() is called a few lines
below so that it also exits via out_writepages: and not directly.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
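The review comment above concerns a common C pattern: once a lock is
taken at the top of a function, every exit has to pass through the label
that releases it. The following is a minimal sketch of that pattern, not
the actual ext4 code. The toy_* names are hypothetical, a plain counter
stands in for percpu_down_read()/percpu_up_read(), and
toy_dax_writeback() stands in for dax_writeback_mapping_range().

```c
#include <stdbool.h>

/* A plain counter stands in for the read side of the fs-wide rwsem. */
struct toy_sb {
	int readers;
};

/* Stand-in for dax_writeback_mapping_range(); always succeeds here. */
static int toy_dax_writeback(void)
{
	return 0;
}

/* Stand-in for the regular writeback path; returns pages written. */
static int toy_regular_writeback(int nr_pages)
{
	return nr_pages;
}

static int toy_writepages(struct toy_sb *sb, bool is_dax, int nr_pages)
{
	int ret;

	sb->readers++;		/* models percpu_down_read() */

	if (is_dax) {
		/*
		 * A direct "return toy_dax_writeback();" here would skip
		 * the unlock below and leave the read side held forever.
		 */
		ret = toy_dax_writeback();
		goto out_writepages;
	}

	ret = toy_regular_writeback(nr_pages);

out_writepages:
	sb->readers--;		/* models percpu_up_read() */
	return ret;
}
```

With the single exit label, sb->readers is back to zero after either
path, which is the property the review asks the patch to preserve.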