linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] ext4: Fix data exposure after a crash
@ 2016-01-11 10:23 Jan Kara
  2016-01-11 10:23 ` [PATCH 2/2] ext4: Remove EXT4_STATE_ORDERED_MODE Jan Kara
  2016-02-19 18:44 ` [PATCH 1/2] ext4: Fix data exposure after a crash Jan Kara
  0 siblings, 2 replies; 3+ messages in thread
From: Jan Kara @ 2016-01-11 10:23 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, HUANG Weller (CM/ESW12-CN), Jan Kara, stable

Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ff2f3cd38522..b216a3eb41a8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -682,6 +682,20 @@ out_sem:
 		ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
+
+		/*
+		 * Inodes with freshly allocated blocks where contents will be
+		 * visible after transaction commit must be on transaction's
+		 * ordered data list.
+		 */
+		if (map->m_flags & EXT4_MAP_NEW &&
+		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
+		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
+		    ext4_should_order_data(inode)) {
+			ret = ext4_jbd2_file_inode(handle, inode);
+			if (ret)
+				return ret;
+		}
 	}
 	return retval;
 }
@@ -1135,15 +1149,6 @@ static int ext4_write_end(struct file *file,
 	int i_size_changed = 0;
 
 	trace_ext4_write_end(inode, pos, len, copied);
-	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-		ret = ext4_jbd2_file_inode(handle, inode);
-		if (ret) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto errout;
-		}
-	}
-
 	if (ext4_has_inline_data(inode)) {
 		ret = ext4_write_inline_data_end(inode, pos, len,
 						 copied, page);
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH 2/2] ext4: Remove EXT4_STATE_ORDERED_MODE
  2016-01-11 10:23 [PATCH 1/2] ext4: Fix data exposure after a crash Jan Kara
@ 2016-01-11 10:23 ` Jan Kara
  2016-02-19 18:44 ` [PATCH 1/2] ext4: Fix data exposure after a crash Jan Kara
  1 sibling, 0 replies; 3+ messages in thread
From: Jan Kara @ 2016-01-11 10:23 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, HUANG Weller (CM/ESW12-CN), Jan Kara

This flag is just duplicating what ext4_should_order_data() tells you
and is used in a single place. Furthermore it doesn't reflect changes to
inode data journalling flag so it may be possibly misleading. Just
remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  | 1 -
 fs/ext4/inode.c | 5 +----
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1e20fa94fcf6..bc4910bce4ff 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1489,7 +1489,6 @@ enum {
 	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
 					   nolocking */
 	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
-	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
 	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
 };
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b216a3eb41a8..173da9d467d1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3444,10 +3444,7 @@ void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
 	case EXT4_INODE_ORDERED_DATA_MODE:
-		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
-		break;
 	case EXT4_INODE_WRITEBACK_DATA_MODE:
-		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
 		break;
 	case EXT4_INODE_JOURNAL_DATA_MODE:
 		inode->i_mapping->a_ops = &ext4_journalled_aops;
@@ -3540,7 +3537,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 	} else {
 		err = 0;
 		mark_buffer_dirty(bh);
-		if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE))
+		if (ext4_should_order_data(inode))
 			err = ext4_jbd2_file_inode(handle, inode);
 	}
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH 1/2] ext4: Fix data exposure after a crash
  2016-01-11 10:23 [PATCH 1/2] ext4: Fix data exposure after a crash Jan Kara
  2016-01-11 10:23 ` [PATCH 2/2] ext4: Remove EXT4_STATE_ORDERED_MODE Jan Kara
@ 2016-02-19 18:44 ` Jan Kara
  1 sibling, 0 replies; 3+ messages in thread
From: Jan Kara @ 2016-02-19 18:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, HUANG Weller (CM/ESW12-CN), Jan Kara, stable

Hi Ted,

It seems this patch (and the following cleanup) got missed. Can you please
merge it? Thanks!

								Honza

On Mon 11-01-16 11:23:49, Jan Kara wrote:
> Huang has reported that in his powerfail testing he is seeing stale
> block contents in some of recently allocated blocks although he mounts
> ext4 in data=ordered mode. After some investigation I have found out
> that indeed when delayed allocation is used, we don't add inode to
> transaction's list of inodes needing flushing before commit. Originally
> we were doing that but commit f3b59291a69d removed the logic with a
> flawed argument that it is not needed.
> 
> The problem is that although for delayed allocated blocks we write their
> contents immediately after allocating them, there is no guarantee that
> the IO scheduler or device doesn't reorder things and thus transaction
> allocating blocks and attaching them to inode can reach stable storage
> before actual block contents. Actually whenever we attach freshly
> allocated blocks to inode using a written extent, we should add inode to
> transaction's ordered inode list to make sure we properly wait for block
> contents to be written before committing the transaction. So that is
> what we do in this patch. This also handles other cases where stale data
> exposure was possible - like filling hole via mmap in
> data=ordered,nodelalloc mode.
> 
> The only exception to the above rule are extending direct IO writes where
> blkdev_direct_IO() waits for IO to complete before increasing i_size and
> thus stale data exposure is not possible. For now we don't complicate
> the code with optimizing this special case since the overhead is pretty
> low. In case this is observed to be a performance problem we can always
> handle it using a special flag to ext4_map_blocks().
> 
> CC: stable@vger.kernel.org
> Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
> Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
> Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c | 23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ff2f3cd38522..b216a3eb41a8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -682,6 +682,20 @@ out_sem:
>  		ret = check_block_validity(inode, map);
>  		if (ret != 0)
>  			return ret;
> +
> +		/*
> +		 * Inodes with freshly allocated blocks where contents will be
> +		 * visible after transaction commit must be on transaction's
> +		 * ordered data list.
> +		 */
> +		if (map->m_flags & EXT4_MAP_NEW &&
> +		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
> +		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
> +		    ext4_should_order_data(inode)) {
> +			ret = ext4_jbd2_file_inode(handle, inode);
> +			if (ret)
> +				return ret;
> +		}
>  	}
>  	return retval;
>  }
> @@ -1135,15 +1149,6 @@ static int ext4_write_end(struct file *file,
>  	int i_size_changed = 0;
>  
>  	trace_ext4_write_end(inode, pos, len, copied);
> -	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
> -		ret = ext4_jbd2_file_inode(handle, inode);
> -		if (ret) {
> -			unlock_page(page);
> -			page_cache_release(page);
> -			goto errout;
> -		}
> -	}
> -
>  	if (ext4_has_inline_data(inode)) {
>  		ret = ext4_write_inline_data_end(inode, pos, len,
>  						 copied, page);
> -- 
> 2.6.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-02-19 18:44 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-11 10:23 [PATCH 1/2] ext4: Fix data exposure after a crash Jan Kara
2016-01-11 10:23 ` [PATCH 2/2] ext4: Remove EXT4_STATE_ORDERED_MODE Jan Kara
2016-02-19 18:44 ` [PATCH 1/2] ext4: Fix data exposure after a crash Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).