[PATCH 0/4 v2] ext4: Fix data exposure after a crash

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4 v2] ext4: Fix data exposure after a crash
@ 2016-03-30 16:19 Jan Kara
  2016-03-30 16:19 ` [PATCH 1/4] " Jan Kara
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Jan Kara @ 2016-03-30 16:19 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Weller.Huang, Jan Kara

Hello,

this is a second version of the patches to fix data exposure after a crash
in data=ordered mode. Since previous revision I have implemented suggestion
by Ted that it is enough to wait for outstanding writeback from jbd2 thread
when IO was already submitted during delalloc writeback (see patches 3 and
4 for details).

I have checked that indeed jbd2 did not submit any data buffers under
appropriate load (I've used randwrite workload from fio) with the improvement
but I didn't observe any measurable improvement in transaction commit times
- waiting for already outstanding writeback was dominating the commit time and
the additional buffers jbd2 was originally submitting apparently didn't make
a big difference. At least for this workload. I have a hard time coming up with
some other workload where the optimization would make a big difference since 
flusher thread ends up submitting overwritten blocks anyway which makes jbd2
thread wait for the IO. But still conceptually the change makes sense so it
may be worth it. So Ted, feel free to either merge or drop the last two
patches in this series...

								Honza

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/4] ext4: Fix data exposure after a crash
  2016-03-30 16:19 [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara
@ 2016-03-30 16:19 ` Jan Kara
  2016-04-24  3:48   ` Theodore Ts'o
  2016-03-30 16:19 ` [PATCH 2/4] ext4: Remove EXT4_STATE_ORDERED_MODE Jan Kara
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2016-03-30 16:19 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Weller.Huang, Jan Kara, stable

Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dab84a2530ff..747b0e64b9d2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -684,6 +684,20 @@ out_sem:
 		ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
+
+		/*
+		 * Inodes with freshly allocated blocks where contents will be
+		 * visible after transaction commit must be on transaction's
+		 * ordered data list.
+		 */
+		if (map->m_flags & EXT4_MAP_NEW &&
+		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
+		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
+		    ext4_should_order_data(inode)) {
+			ret = ext4_jbd2_file_inode(handle, inode);
+			if (ret)
+				return ret;
+		}
 	}
 	return retval;
 }
@@ -1291,15 +1305,6 @@ static int ext4_write_end(struct file *file,
 	int i_size_changed = 0;

 	trace_ext4_write_end(inode, pos, len, copied);
-	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-		ret = ext4_jbd2_file_inode(handle, inode);
-		if (ret) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto errout;
-		}
-	}

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/4] ext4: Remove EXT4_STATE_ORDERED_MODE
  2016-03-30 16:19 [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara
  2016-03-30 16:19 ` [PATCH 1/4] " Jan Kara
@ 2016-03-30 16:19 ` Jan Kara
  2016-04-24  3:48   ` Theodore Ts'o
  2016-03-30 16:19 ` [PATCH 3/4] jbd2: Add support for avoiding data writes during transaction commits Jan Kara
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2016-03-30 16:19 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Weller.Huang, Jan Kara

This flag is just duplicating what ext4_should_order_data() tells you
and is used in a single place. Furthermore it doesn't reflect changes to
inode data journalling flag so it may be possibly misleading. Just
remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  | 1 -
 fs/ext4/inode.c | 5 +----
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c04743519865..8647d1253903 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1526,7 +1526,6 @@ enum {
 	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
 					   nolocking */
 	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
-	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
 	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
 };
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 747b0e64b9d2..8ffba0ec3b80 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3541,10 +3541,7 @@ void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
 	case EXT4_INODE_ORDERED_DATA_MODE:
-		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
-		break;
 	case EXT4_INODE_WRITEBACK_DATA_MODE:
-		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
 		break;
 	case EXT4_INODE_JOURNAL_DATA_MODE:
 		inode->i_mapping->a_ops = &ext4_journalled_aops;
@@ -3637,7 +3634,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 	} else {
 		err = 0;
 		mark_buffer_dirty(bh);
-		if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE))
+		if (ext4_should_order_data(inode))
 			err = ext4_jbd2_file_inode(handle, inode);
 	}
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/4] jbd2: Add support for avoiding data writes during transaction commits
  2016-03-30 16:19 [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara
  2016-03-30 16:19 ` [PATCH 1/4] " Jan Kara
  2016-03-30 16:19 ` [PATCH 2/4] ext4: Remove EXT4_STATE_ORDERED_MODE Jan Kara
@ 2016-03-30 16:19 ` Jan Kara
  2016-03-30 16:19 ` [PATCH 4/4] ext4: Do not ask jbd2 to write data for delalloc buffers Jan Kara
  2016-04-13 13:32 ` [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara
  4 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2016-03-30 16:19 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Weller.Huang, Jan Kara

Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.

Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.h   |  3 ++-
 fs/jbd2/commit.c      |  4 ++++
 fs/jbd2/journal.c     |  3 ++-
 fs/jbd2/transaction.c | 22 ++++++++++++++++++----
 fs/ocfs2/journal.h    |  2 +-
 include/linux/jbd2.h  | 13 +++++++++++--
 6 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 5f5846211095..f1c940b38b30 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -362,7 +362,8 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_add_write(handle,
+						    EXT4_I(inode)->jinode);
 	return 0;
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 517f2de784cf..ad6efdabca2c 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -219,6 +219,8 @@ static int journal_submit_data_buffers(journal_t *journal,
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		if (!(jinode->i_flags & JI_WRITE_DATA))
+			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
@@ -256,6 +258,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		if (!(jinode->i_flags & JI_WAIT_DATA))
+			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
 		err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index de73a9516a54..ad7de5f9aa69 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -94,7 +94,8 @@ EXPORT_SYMBOL(jbd2_journal_blocks_per_page);
 EXPORT_SYMBOL(jbd2_journal_invalidatepage);
 EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
-EXPORT_SYMBOL(jbd2_journal_file_inode);
+EXPORT_SYMBOL(jbd2_journal_inode_add_write);
+EXPORT_SYMBOL(jbd2_journal_inode_add_wait);
 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 01e4652d88f6..b2b25d65e994 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2462,7 +2462,8 @@ void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
 /*
  * File inode in the inode list of the handle's transaction
  */
-int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
+static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
+				   unsigned long flags)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -2487,12 +2488,14 @@ int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
 	 * and if jinode->i_next_transaction == transaction, commit code
 	 * will only file the inode where we want it.
 	 */
-	if (jinode->i_transaction == transaction ||
-	    jinode->i_next_transaction == transaction)
+	if ((jinode->i_transaction == transaction ||
+	    jinode->i_next_transaction == transaction) &&
+	    (jinode->i_flags & flags) == flags)
 		return 0;
 
 	spin_lock(&journal->j_list_lock);
-
+	jinode->i_flags |= flags;
+	/* Is inode already attached where we need it? */
 	if (jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction)
 		goto done;
@@ -2523,6 +2526,17 @@ done:
 	return 0;
 }
 
+int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *jinode)
+{
+	return jbd2_journal_file_inode(handle, jinode,
+				       JI_WRITE_DATA | JI_WAIT_DATA);
+}
+
+int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *jinode)
+{
+	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA);
+}
+
 /*
  * File truncate and transaction commit interact with each other in a
  * non-trivial way.  If a transaction writing data block A is
diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
index f4cd3c3e9fb7..497a4171ef61 100644
--- a/fs/ocfs2/journal.h
+++ b/fs/ocfs2/journal.h
@@ -619,7 +619,7 @@ static inline int ocfs2_calc_tree_trunc_credits(struct super_block *sb,
 
 static inline int ocfs2_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
-	return jbd2_journal_file_inode(handle, &OCFS2_I(inode)->ip_jinode);
+	return jbd2_journal_inode_add_write(handle, &OCFS2_I(inode)->ip_jinode);
 }
 
 static inline int ocfs2_begin_ordered_truncate(struct inode *inode,
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index fd1083c46c61..39511484ad10 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -403,11 +403,19 @@ static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh)
 
 /* Flags in jbd_inode->i_flags */
 #define __JI_COMMIT_RUNNING 0
-/* Commit of the inode data in progress. We use this flag to protect us from
+#define __JI_WRITE_DATA 1
+#define __JI_WAIT_DATA 2
+
+/*
+ * Commit of the inode data in progress. We use this flag to protect us from
  * concurrent deletion of inode. We cannot use reference to inode for this
  * since we cannot afford doing last iput() on behalf of kjournald
  */
 #define JI_COMMIT_RUNNING (1 << __JI_COMMIT_RUNNING)
+/* Write allocated dirty buffers in this inode before commit */
+#define JI_WRITE_DATA (1 << __JI_WRITE_DATA)
+/* Wait for outstanding data writes for this inode before commit */
+#define JI_WAIT_DATA (1 << __JI_WAIT_DATA)
 
 /**
  * struct jbd_inode is the structure linking inodes in ordered mode
@@ -1270,7 +1278,8 @@ extern int	   jbd2_journal_clear_err  (journal_t *);
 extern int	   jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
 extern int	   jbd2_journal_force_commit(journal_t *);
 extern int	   jbd2_journal_force_commit_nested(journal_t *);
-extern int	   jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *inode);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/4] ext4: Do not ask jbd2 to write data for delalloc buffers
  2016-03-30 16:19 [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara
                   ` (2 preceding siblings ...)
  2016-03-30 16:19 ` [PATCH 3/4] jbd2: Add support for avoiding data writes during transaction commits Jan Kara
@ 2016-03-30 16:19 ` Jan Kara
  2016-04-13 13:32 ` [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara
  4 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2016-03-30 16:19 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Weller.Huang, Jan Kara

Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().

Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h        |  3 +++
 fs/ext4/ext4_jbd2.h   | 12 +++++++++++-
 fs/ext4/inode.c       | 10 +++++++---
 fs/ext4/move_extent.c |  2 +-
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8647d1253903..118bd14c1c20 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -581,6 +581,9 @@ enum {
 #define EXT4_GET_BLOCKS_ZERO			0x0200
 #define EXT4_GET_BLOCKS_CREATE_ZERO		(EXT4_GET_BLOCKS_CREATE |\
 					EXT4_GET_BLOCKS_ZERO)
+	/* Caller will submit data before dropping transaction handle. This
+	 * allows jbd2 to avoid submitting data before commit. */
+#define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index f1c940b38b30..09c1ef38cbe6 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -359,7 +359,8 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 	return 0;
 }
 
-static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
+static inline int ext4_jbd2_inode_add_write(handle_t *handle,
+					    struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
 		return jbd2_journal_inode_add_write(handle,
@@ -367,6 +368,15 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 	return 0;
 }
 
+static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
+					   struct inode *inode)
+{
+	if (ext4_handle_valid(handle))
+		return jbd2_journal_inode_add_wait(handle,
+						   EXT4_I(inode)->jinode);
+	return 0;
+}
+
 static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 						 struct inode *inode,
 						 int datasync)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8ffba0ec3b80..58e05feb999f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -694,7 +694,10 @@ out_sem:
 		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    ext4_should_order_data(inode)) {
-			ret = ext4_jbd2_file_inode(handle, inode);
+			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
+				ret = ext4_jbd2_inode_add_wait(handle, inode);
+			else
+				ret = ext4_jbd2_inode_add_write(handle, inode);
 			if (ret)
 				return ret;
 		}
@@ -2320,7 +2323,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	 * the data was copied into the page cache.
 	 */
 	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
-			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
+			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
+			   EXT4_GET_BLOCKS_IO_SUBMIT;
 	dioread_nolock = ext4_should_dioread_nolock(inode);
 	if (dioread_nolock)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
@@ -3635,7 +3639,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		err = 0;
 		mark_buffer_dirty(bh);
 		if (ext4_should_order_data(inode))
-			err = ext4_jbd2_file_inode(handle, inode);
+			err = ext4_jbd2_inode_add_write(handle, inode);
 	}
 
 unlock:
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 4098acc701c3..93e9635bc65e 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -400,7 +400,7 @@ data_copy:
 
 	/* Even in case of data=writeback it is reasonable to pin
 	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_file_inode(handle, orig_inode);
+	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
 
 unlock_pages:
 	unlock_page(pagep[0]);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/4 v2] ext4: Fix data exposure after a crash
  2016-03-30 16:19 [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara
                   ` (3 preceding siblings ...)
  2016-03-30 16:19 ` [PATCH 4/4] ext4: Do not ask jbd2 to write data for delalloc buffers Jan Kara
@ 2016-04-13 13:32 ` Jan Kara
  4 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2016-04-13 13:32 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Weller.Huang, Jan Kara

Hello!

On Wed 30-03-16 18:19:23, Jan Kara wrote:
> this is a second version of the patches to fix data exposure after a crash
> in data=ordered mode. Since previous revision I have implemented suggestion
> by Ted that it is enough to wait for outstanding writeback from jbd2 thread
> when IO was already submitted during delalloc writeback (see patches 3 and
> 4 for details).
> 
> I have checked that indeed jbd2 did not submit any data buffers under
> appropriate load (I've used randwrite workload from fio) with the improvement
> but I didn't observe any measurable improvement in transaction commit times
> - waiting for already outstanding writeback was dominating the commit time and
> the additional buffers jbd2 was originally submitting apparently didn't make
> a big difference. At least for this workload. I have a hard time coming up with
> some other workload where the optimization would make a big difference since 
> flusher thread ends up submitting overwritten blocks anyway which makes jbd2
> thread wait for the IO. But still conceptually the change makes sense so it
> may be worth it. So Ted, feel free to either merge or drop the last two
> patches in this series...

Ping Ted? It would be good to merge at least the first patch in the series
during this cycle... Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] ext4: Fix data exposure after a crash
  2016-03-30 16:19 ` [PATCH 1/4] " Jan Kara
@ 2016-04-24  3:48   ` Theodore Ts'o
  2016-04-24  4:55     ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2016-04-24  3:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Weller.Huang, stable

On Wed, Mar 30, 2016 at 06:19:24PM +0200, Jan Kara wrote:
> Huang has reported that in his powerfail testing he is seeing stale
> block contents in some of recently allocated blocks although he mounts
> ext4 in data=ordered mode. After some investigation I have found out
> that indeed when delayed allocation is used, we don't add inode to
> transaction's list of inodes needing flushing before commit. Originally
> we were doing that but commit f3b59291a69d removed the logic with a
> flawed argument that it is not needed.
> 
> The problem is that although for delayed allocated blocks we write their
> contents immediately after allocating them, there is no guarantee that
> the IO scheduler or device doesn't reorder things and thus transaction
> allocating blocks and attaching them to inode can reach stable storage
> before actual block contents. Actually whenever we attach freshly
> allocated blocks to inode using a written extent, we should add inode to
> transaction's ordered inode list to make sure we properly wait for block
> contents to be written before committing the transaction. So that is
> what we do in this patch. This also handles other cases where stale data
> exposure was possible - like filling hole via mmap in
> data=ordered,nodelalloc mode.
> 
> The only exception to the above rule are extending direct IO writes where
> blkdev_direct_IO() waits for IO to complete before increasing i_size and
> thus stale data exposure is not possible. For now we don't complicate
> the code with optimizing this special case since the overhead is pretty
> low. In case this is observed to be a performance problem we can always
> handle it using a special flag to ext4_map_blocks().
> 
> CC: stable@vger.kernel.org
> Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
> Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
> Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied, thanks.

						- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/4] ext4: Remove EXT4_STATE_ORDERED_MODE
  2016-03-30 16:19 ` [PATCH 2/4] ext4: Remove EXT4_STATE_ORDERED_MODE Jan Kara
@ 2016-04-24  3:48   ` Theodore Ts'o
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Ts'o @ 2016-04-24  3:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Weller.Huang

On Wed, Mar 30, 2016 at 06:19:25PM +0200, Jan Kara wrote:
> This flag is just duplicating what ext4_should_order_data() tells you
> and is used in a single place. Furthermore it doesn't reflect changes to
> inode data journalling flag so it may be possibly misleading. Just
> remove it.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied, thanks.

					- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] ext4: Fix data exposure after a crash
  2016-04-24  3:48   ` Theodore Ts'o
@ 2016-04-24  4:55     ` Theodore Ts'o
  2016-04-25 10:24       ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2016-04-24  4:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Weller.Huang, stable

I had to add a !IS_NOQUOTA check:

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 576f64a..250c2df 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -693,6 +693,7 @@ out_sem:
 		if (map->m_flags & EXT4_MAP_NEW &&
 		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
+		    !IS_NOQUOTA(inode) &&
 		    ext4_should_order_data(inode)) {
 			ret = ext4_jbd2_file_inode(handle, inode);
 			if (ret)


In order to prevent crashes when writing to the quota file (see
below).

Cheers,

				- Ted

generic/244		[00:42:13]run fstests generic/244 at 2016-04-24 00:42:13
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8128a126>] jbd2_journal_file_inode+0x35/0xde
PGD 0 
Oops: 0000 [#1] SMP 
CPU: 0 PID: 3316 Comm: setquota Not tainted 4.6.0-rc4-ext4-00002-gc66f90b #246
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88007aa8f440 ti: ffff88007a970000 task.ti: ffff88007a970000
RIP: 0010:[<ffffffff8128a126>]  [<ffffffff8128a126>] jbd2_journal_file_inode+0x35/0xde
RSP: 0018:ffff88007a973980  EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff88007a973a18 RCX: 00000000a8bac810
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff88007bbbc440
RBP: ffff88007b9b8e00 R08: 0000000000000002 R09: ffff88007a973950
R10: 0000000000000003 R11: 47ffffffffffffff R12: 0000000000000000
R13: ffff88007a838000 R14: ffff88007bbbc440 R15: 0000000000000001
FS:  00007f8bf247c700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000007ab85000 CR4: 00000000000006f0
Stack:
 ffff88007a973a18 ffff880079dbd5f0 ffff880079dbd670 0000000000000000
 ffff88007bbbc440 ffffffff8123f810 ffff88007b8b3a88 ffff88007b8b3a88
 ffff88007b8b3a88 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8123f810>] ? ext4_map_blocks+0x45e/0x47d
 [<ffffffff8123fe0d>] ? ext4_getblk+0x3c/0x140
 [<ffffffff8123ff1d>] ? ext4_bread+0xc/0x60
 [<ffffffff81259598>] ? ext4_quota_write+0xe1/0x19d
 [<ffffffff81206b8f>] ? write_blk+0x2f/0x66
 [<ffffffff81206c8d>] ? get_free_dqblk+0x64/0x8f
 [<ffffffff81207545>] ? do_insert_tree+0x4c/0x384
 [<ffffffff812077fc>] ? do_insert_tree+0x303/0x384
 [<ffffffff812077fc>] ? do_insert_tree+0x303/0x384
 [<ffffffff812077fc>] ? do_insert_tree+0x303/0x384
 [<ffffffff811ad0c1>] ? __kmalloc+0x134/0x1a7
 [<ffffffff81207906>] ? qtree_write_dquot+0x89/0x15e
 [<ffffffff81207bd9>] ? qtree_read_dquot+0x1d6/0x1e3
 [<ffffffff81202b91>] ? dquot_acquire+0x8f/0xf9
 [<ffffffff812577e3>] ? ext4_acquire_dquot+0x76/0x94
 [<ffffffff81204198>] ? dqget+0x273/0x430
 [<ffffffff81204972>] ? dquot_get_dqblk+0xf/0x32
 [<ffffffff81208861>] ? quota_getquota+0x5f/0x111
 [<ffffffff811abba1>] ? cache_alloc_debugcheck_after.isra.17+0x4d/0x15a


   	    

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] ext4: Fix data exposure after a crash
  2016-04-24  4:55     ` Theodore Ts'o
@ 2016-04-25 10:24       ` Jan Kara
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2016-04-25 10:24 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4, Weller.Huang, stable

On Sun 24-04-16 00:55:51, Ted Tso wrote:
> I had to add a !IS_NOQUOTA check:
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 576f64a..250c2df 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -693,6 +693,7 @@ out_sem:
>  		if (map->m_flags & EXT4_MAP_NEW &&
>  		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
>  		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
> +		    !IS_NOQUOTA(inode) &&
>  		    ext4_should_order_data(inode)) {
>  			ret = ext4_jbd2_file_inode(handle, inode);
>  			if (ret)
> 
> 
> In order to prevent crashes when writing to the quota file (see
> below).

Good catch. The fix looks correct to me, thanks for fixing this up. Since
data for quota files is actually journalled, it may be actually a cleaner
fix to reflect this in ext4_inode_journal_mode() and thus
ext4_should_order_data() would catch this case. But this has more potential
for breakage elsewhere so I will do that as a separate cleanup patch.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-04-25 10:24 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-30 16:19 [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara
2016-03-30 16:19 ` [PATCH 1/4] " Jan Kara
2016-04-24  3:48   ` Theodore Ts'o
2016-04-24  4:55     ` Theodore Ts'o
2016-04-25 10:24       ` Jan Kara
2016-03-30 16:19 ` [PATCH 2/4] ext4: Remove EXT4_STATE_ORDERED_MODE Jan Kara
2016-04-24  3:48   ` Theodore Ts'o
2016-03-30 16:19 ` [PATCH 3/4] jbd2: Add support for avoiding data writes during transaction commits Jan Kara
2016-03-30 16:19 ` [PATCH 4/4] ext4: Do not ask jbd2 to write data for delalloc buffers Jan Kara
2016-04-13 13:32 ` [PATCH 0/4 v2] ext4: Fix data exposure after a crash Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).