[PATCH 0/4 v2] ext4: Fix races between writeback and truncate

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4 v2] ext4: Fix races between writeback and truncate
@ 2013-08-05 13:52 Jan Kara
  2013-08-05 13:52 ` [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place Jan Kara
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Jan Kara @ 2013-08-05 13:52 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Dave Jones, Zheng Liu

  Hello,

  this patch set fixes two races between truncate and writeback code.
The first one was introduced by my rewrite of ext4 writeback path and
Dave and Zheng have spotted them in their testing. It could result in
some leaked fs blocks and delalloc accounting mismatch. The first two
patches fix that bug.

The second race has been there for ages and could result in i_size being
wrong on disk. The last two patches fix that race.

								Honza

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place
  2013-08-05 13:52 [PATCH 0/4 v2] ext4: Fix races between writeback and truncate Jan Kara
@ 2013-08-05 13:52 ` Jan Kara
  2013-08-17 13:59   ` Theodore Ts'o
  2013-08-05 13:52 ` [PATCH 2/4] ext4: Fix ext4_writepages() in presence of truncate Jan Kara
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2013-08-05 13:52 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Dave Jones, Zheng Liu, Jan Kara

Currently the logic whether the current buffer can be added to an extent
of buffers to map is split between mpage_add_bh_to_extent() and
add_page_bufs_to_extent(). Move the whole logic to
mpage_add_bh_to_extent() which makes things a bit more straightforward
and make following i_size fixes easier.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 49 +++++++++++++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ba33c67..afaf26b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1907,34 +1907,48 @@ static int ext4_writepage(struct page *page,
  *
  * @mpd - extent of blocks
  * @lblk - logical number of the block in the file
- * @b_state - b_state of the buffer head added
+ * @bh - buffer head we want to add to the extent
  *
- * the function is used to collect contig. blocks in same state
+ * The function is used to collect contig. blocks in the same state. If the
+ * buffer doesn't require mapping for writeback and we haven't started the
+ * extent of buffers to map yet, the function returns 'true' immediately - the
+ * caller can write the buffer right away. Otherwise the function returns true
+ * if the block has been added to the extent, false if the block couldn't be
+ * added.
  */
-static int mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
-				  unsigned long b_state)
+static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
+				   struct buffer_head *bh)
 {
 	struct ext4_map_blocks *map = &mpd->map;
 
-	/* Don't go larger than mballoc is willing to allocate */
-	if (map->m_len >= MAX_WRITEPAGES_EXTENT_LEN)
-		return 0;
+	/* Buffer that doesn't need mapping for writeback? */
+	if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
+	    (!buffer_delay(bh) && !buffer_unwritten(bh))) {
+		/* So far no extent to map => we write the buffer right away */
+		if (map->m_len == 0)
+			return true;
+		return false;
+	}
 
 	/* First block in the extent? */
 	if (map->m_len == 0) {
 		map->m_lblk = lblk;
 		map->m_len = 1;
-		map->m_flags = b_state & BH_FLAGS;
-		return 1;
+		map->m_flags = bh->b_state & BH_FLAGS;
+		return true;
 	}
 
+	/* Don't go larger than mballoc is willing to allocate */
+	if (map->m_len >= MAX_WRITEPAGES_EXTENT_LEN)
+		return false;
+
 	/* Can we merge the block to our big extent? */
 	if (lblk == map->m_lblk + map->m_len &&
-	    (b_state & BH_FLAGS) == map->m_flags) {
+	    (bh->b_state & BH_FLAGS) == map->m_flags) {
 		map->m_len++;
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
@@ -1949,18 +1963,13 @@ static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
 	do {
 		BUG_ON(buffer_locked(bh));
 
-		if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
-		    (!buffer_delay(bh) && !buffer_unwritten(bh)) ||
-		    lblk >= blocks) {
+		if (lblk >= blocks || !mpage_add_bh_to_extent(mpd, lblk, bh)) {
 			/* Found extent to map? */
 			if (mpd->map.m_len)
 				return false;
-			if (lblk >= blocks)
-				return true;
-			continue;
+			/* Everything mapped so far and we hit EOF */
+			return true;
 		}
-		if (!mpage_add_bh_to_extent(mpd, lblk, bh->b_state))
-			return false;
 	} while (lblk++, (bh = bh->b_this_page) != head);
 	return true;
 }
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place
  2013-08-05 13:52 ` [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place Jan Kara
@ 2013-08-17 13:59   ` Theodore Ts'o
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Ts'o @ 2013-08-17 13:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Dave Jones, Zheng Liu

On Mon, Aug 05, 2013 at 03:52:21PM +0200, Jan Kara wrote:
> Currently the logic whether the current buffer can be added to an extent
> of buffers to map is split between mpage_add_bh_to_extent() and
> add_page_bufs_to_extent(). Move the whole logic to
> mpage_add_bh_to_extent() which makes things a bit more straightforward
> and make following i_size fixes easier.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/4] ext4: Fix ext4_writepages() in presence of truncate
  2013-08-05 13:52 [PATCH 0/4 v2] ext4: Fix races between writeback and truncate Jan Kara
  2013-08-05 13:52 ` [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place Jan Kara
@ 2013-08-05 13:52 ` Jan Kara
  2013-08-17 14:08   ` Theodore Ts'o
  2013-08-05 13:52 ` [PATCH 3/4] ext4: Simplify truncation code in ext4_setattr() Jan Kara
  2013-08-05 13:52 ` [PATCH 4/4] ext4: Fix lost truncate due to race with writeback Jan Kara
  3 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2013-08-05 13:52 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Dave Jones, Zheng Liu, Jan Kara

Inode size can arbitrarily change while writeback is in progress. When
ext4_writepages() has prepared a long extent for mapping and truncate
then reduces i_size, mpage_map_and_submit_buffers() will always map just
one buffer in a page instead of all of them due to lblk < blocks check.
So we end up not using all blocks we've allocated (thus leaking them)
and also delalloc accounting goes wrong manifesting as a warning like:

ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
with only 0 reserved data blocks

Note that the problem can happen only when blocksize < pagesize because
otherwise we have only a single buffer in the page.

Fix the problem by removing the size check from the mapping loop. We
have an extent allocated so we have to use it all before checking for
i_size. We also rename add_page_bufs_to_extent() to
mpage_process_page_bufs() and make that function submit the page for IO
if all buffers (upto EOF) in it are mapped.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 107 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 66 insertions(+), 41 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index afaf26b..4263878 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1893,6 +1893,26 @@ static int ext4_writepage(struct page *page,
 	return ret;
 }
 
+static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
+{
+	int len;
+	loff_t size = i_size_read(mpd->inode);
+	int err;
+
+	BUG_ON(page->index != mpd->first_page);
+	if (page->index == size >> PAGE_CACHE_SHIFT)
+		len = size & ~PAGE_CACHE_MASK;
+	else
+		len = PAGE_CACHE_SIZE;
+	clear_page_dirty_for_io(page);
+	err = ext4_bio_write_page(&mpd->io_submit, page, len, mpd->wbc);
+	if (!err)
+		mpd->wbc->nr_to_write--;
+	mpd->first_page++;
+
+	return err;
+}
+
 #define BH_FLAGS ((1 << BH_Unwritten) | (1 << BH_Delay))
 
 /*
@@ -1951,12 +1971,29 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
 	return false;
 }
 
-static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
-				    struct buffer_head *head,
-				    struct buffer_head *bh,
-				    ext4_lblk_t lblk)
+/*
+ * mpage_process_page_bufs - submit page buffers for IO or add them to extent
+ *
+ * @mpd - extent of blocks for mapping
+ * @head - the first buffer in the page
+ * @bh - buffer we should start processing from
+ * @lblk - logical number of the block in the file corresponding to @bh
+ *
+ * Walk through page buffers from @bh upto @head (exclusive) and either submit
+ * the page for IO if all buffers in this page were mapped and there's no
+ * accumulated extent of buffers to map or add buffers in the page to the
+ * extent of buffers to map. The function returns 1 if the caller can continue
+ * by processing the next page, 0 if it should stop adding buffers to the
+ * extent to map because we cannot extend it anymore. It can also return value
+ * < 0 in case of error during IO submission.
+ */
+static int mpage_process_page_bufs(struct mpage_da_data *mpd,
+				   struct buffer_head *head,
+				   struct buffer_head *bh,
+				   ext4_lblk_t lblk)
 {
 	struct inode *inode = mpd->inode;
+	int err;
 	ext4_lblk_t blocks = (i_size_read(inode) + (1 << inode->i_blkbits) - 1)
 							>> inode->i_blkbits;
 
@@ -1966,32 +2003,18 @@ static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
 		if (lblk >= blocks || !mpage_add_bh_to_extent(mpd, lblk, bh)) {
 			/* Found extent to map? */
 			if (mpd->map.m_len)
-				return false;
+				return 0;
 			/* Everything mapped so far and we hit EOF */
-			return true;
+			break;
 		}
 	} while (lblk++, (bh = bh->b_this_page) != head);
-	return true;
-}
-
-static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
-{
-	int len;
-	loff_t size = i_size_read(mpd->inode);
-	int err;
-
-	BUG_ON(page->index != mpd->first_page);
-	if (page->index == size >> PAGE_CACHE_SHIFT)
-		len = size & ~PAGE_CACHE_MASK;
-	else
-		len = PAGE_CACHE_SIZE;
-	clear_page_dirty_for_io(page);
-	err = ext4_bio_write_page(&mpd->io_submit, page, len, mpd->wbc);
-	if (!err)
-		mpd->wbc->nr_to_write--;
-	mpd->first_page++;
-
-	return err;
+	/* So far everything mapped? Submit the page for IO. */
+	if (mpd->map.m_len == 0) {
+		err = mpage_submit_page(mpd, head->b_page);
+		if (err < 0)
+			return err;
+	}
+	return lblk < blocks;
 }
 
 /*
@@ -2015,8 +2038,6 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
 	struct inode *inode = mpd->inode;
 	struct buffer_head *head, *bh;
 	int bpp_bits = PAGE_CACHE_SHIFT - inode->i_blkbits;
-	ext4_lblk_t blocks = (i_size_read(inode) + (1 << inode->i_blkbits) - 1)
-							>> inode->i_blkbits;
 	pgoff_t start, end;
 	ext4_lblk_t lblk;
 	sector_t pblock;
@@ -2051,18 +2072,26 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
 					 */
 					mpd->map.m_len = 0;
 					mpd->map.m_flags = 0;
-					add_page_bufs_to_extent(mpd, head, bh,
-								lblk);
+					/*
+					 * FIXME: If dioread_nolock supports
+					 * blocksize < pagesize, we need to make
+					 * sure we add size mapped so far to
+					 * io_end->size as the following call
+					 * can submit the page for IO.
+					 */
+					err = mpage_process_page_bufs(mpd, head,
+								      bh, lblk);
 					pagevec_release(&pvec);
-					return 0;
+					if (err > 0)
+						err = 0;
+					return err;
 				}
 				if (buffer_delay(bh)) {
 					clear_buffer_delay(bh);
 					bh->b_blocknr = pblock++;
 				}
 				clear_buffer_unwritten(bh);
-			} while (++lblk < blocks &&
-				 (bh = bh->b_this_page) != head);
+			} while (lblk++, (bh = bh->b_this_page) != head);
 
 			/*
 			 * FIXME: This is going to break if dioread_nolock
@@ -2331,14 +2360,10 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			lblk = ((ext4_lblk_t)page->index) <<
 				(PAGE_CACHE_SHIFT - blkbits);
 			head = page_buffers(page);
-			if (!add_page_bufs_to_extent(mpd, head, head, lblk))
+			err = mpage_process_page_bufs(mpd, head, head, lblk);
+			if (err <= 0)
 				goto out;
-			/* So far everything mapped? Submit the page for IO. */
-			if (mpd->map.m_len == 0) {
-				err = mpage_submit_page(mpd, page);
-				if (err < 0)
-					goto out;
-			}
+			err = 0;
 
 			/*
 			 * Accumulated enough dirty pages? This doesn't apply
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] ext4: Fix ext4_writepages() in presence of truncate
  2013-08-05 13:52 ` [PATCH 2/4] ext4: Fix ext4_writepages() in presence of truncate Jan Kara
@ 2013-08-17 14:08   ` Theodore Ts'o
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Ts'o @ 2013-08-17 14:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Dave Jones, Zheng Liu

On Mon, Aug 05, 2013 at 03:52:22PM +0200, Jan Kara wrote:
> Inode size can arbitrarily change while writeback is in progress. When
> ext4_writepages() has prepared a long extent for mapping and truncate
> then reduces i_size, mpage_map_and_submit_buffers() will always map just
> one buffer in a page instead of all of them due to lblk < blocks check.
> So we end up not using all blocks we've allocated (thus leaking them)
> and also delalloc accounting goes wrong manifesting as a warning like:
> 
> ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
> with only 0 reserved data blocks
> 
> Note that the problem can happen only when blocksize < pagesize because
> otherwise we have only a single buffer in the page.
> 
> Fix the problem by removing the size check from the mapping loop. We
> have an extent allocated so we have to use it all before checking for
> i_size. We also rename add_page_bufs_to_extent() to
> mpage_process_page_bufs() and make that function submit the page for IO
> if all buffers (upto EOF) in it are mapped.
> 
> Reported-by: Dave Jones <davej@redhat.com>
> Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/4] ext4: Simplify truncation code in ext4_setattr()
  2013-08-05 13:52 [PATCH 0/4 v2] ext4: Fix races between writeback and truncate Jan Kara
  2013-08-05 13:52 ` [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place Jan Kara
  2013-08-05 13:52 ` [PATCH 2/4] ext4: Fix ext4_writepages() in presence of truncate Jan Kara
@ 2013-08-05 13:52 ` Jan Kara
  2013-08-17 14:08   ` Theodore Ts'o
  2013-08-05 13:52 ` [PATCH 4/4] ext4: Fix lost truncate due to race with writeback Jan Kara
  3 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2013-08-05 13:52 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Dave Jones, Zheng Liu, Jan Kara

Merge conditions in ext4_setattr() handling inode size changes, also
move ext4_begin_ordered_truncate() call somewhat earlier because it
simplifies error recovery in case of failure. Also add error handling in
case i_disksize update fails.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 109 +++++++++++++++++++++++++-------------------------------
 1 file changed, 49 insertions(+), 60 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4263878..e7d98d2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4560,7 +4560,9 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		ext4_journal_stop(handle);
 	}
 
-	if (attr->ia_valid & ATTR_SIZE) {
+	if (attr->ia_valid & ATTR_SIZE && attr->ia_size != inode->i_size) {
+		handle_t *handle;
+		loff_t oldsize = inode->i_size;
 
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
 			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -4568,73 +4570,60 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 			if (attr->ia_size > sbi->s_bitmap_maxbytes)
 				return -EFBIG;
 		}
-	}
-
-	if (S_ISREG(inode->i_mode) &&
-	    attr->ia_valid & ATTR_SIZE &&
-	    (attr->ia_size < inode->i_size)) {
-		handle_t *handle;
-
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
-		if (IS_ERR(handle)) {
-			error = PTR_ERR(handle);
-			goto err_out;
-		}
-		if (ext4_handle_valid(handle)) {
-			error = ext4_orphan_add(handle, inode);
-			orphan = 1;
-		}
-		EXT4_I(inode)->i_disksize = attr->ia_size;
-		rc = ext4_mark_inode_dirty(handle, inode);
-		if (!error)
-			error = rc;
-		ext4_journal_stop(handle);
-
-		if (ext4_should_order_data(inode)) {
-			error = ext4_begin_ordered_truncate(inode,
+		if (S_ISREG(inode->i_mode) &&
+		    (attr->ia_size < inode->i_size)) {
+			if (ext4_should_order_data(inode)) {
+				error = ext4_begin_ordered_truncate(inode,
 							    attr->ia_size);
-			if (error) {
-				/* Do as much error cleanup as possible */
-				handle = ext4_journal_start(inode,
-							    EXT4_HT_INODE, 3);
-				if (IS_ERR(handle)) {
-					ext4_orphan_del(NULL, inode);
+				if (error)
 					goto err_out;
-				}
-				ext4_orphan_del(handle, inode);
-				orphan = 0;
-				ext4_journal_stop(handle);
+			}
+			handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
+			if (IS_ERR(handle)) {
+				error = PTR_ERR(handle);
+				goto err_out;
+			}
+			if (ext4_handle_valid(handle)) {
+				error = ext4_orphan_add(handle, inode);
+				orphan = 1;
+			}
+			EXT4_I(inode)->i_disksize = attr->ia_size;
+			rc = ext4_mark_inode_dirty(handle, inode);
+			if (!error)
+				error = rc;
+			ext4_journal_stop(handle);
+			if (error) {
+				ext4_orphan_del(NULL, inode);
 				goto err_out;
 			}
 		}
-	}
-
-	if (attr->ia_valid & ATTR_SIZE) {
-		if (attr->ia_size != inode->i_size) {
-			loff_t oldsize = inode->i_size;
 
-			i_size_write(inode, attr->ia_size);
-			/*
-			 * Blocks are going to be removed from the inode. Wait
-			 * for dio in flight.  Temporarily disable
-			 * dioread_nolock to prevent livelock.
-			 */
-			if (orphan) {
-				if (!ext4_should_journal_data(inode)) {
-					ext4_inode_block_unlocked_dio(inode);
-					inode_dio_wait(inode);
-					ext4_inode_resume_unlocked_dio(inode);
-				} else
-					ext4_wait_for_tail_page_commit(inode);
-			}
-			/*
-			 * Truncate pagecache after we've waited for commit
-			 * in data=journal mode to make pages freeable.
-			 */
-			truncate_pagecache(inode, oldsize, inode->i_size);
+		i_size_write(inode, attr->ia_size);
+		/*
+		 * Blocks are going to be removed from the inode. Wait
+		 * for dio in flight.  Temporarily disable
+		 * dioread_nolock to prevent livelock.
+		 */
+		if (orphan) {
+			if (!ext4_should_journal_data(inode)) {
+				ext4_inode_block_unlocked_dio(inode);
+				inode_dio_wait(inode);
+				ext4_inode_resume_unlocked_dio(inode);
+			} else
+				ext4_wait_for_tail_page_commit(inode);
 		}
-		ext4_truncate(inode);
+		/*
+		 * Truncate pagecache after we've waited for commit
+		 * in data=journal mode to make pages freeable.
+		 */
+		truncate_pagecache(inode, oldsize, inode->i_size);
 	}
+	/*
+	 * We want to call ext4_truncate() even if attr->ia_size ==
+ 	 * inode->i_size for cases like truncation of fallocated space
+	 */
+	if (attr->ia_valid & ATTR_SIZE)
+		ext4_truncate(inode);
 
 	if (!rc) {
 		setattr_copy(inode, attr);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/4] ext4: Simplify truncation code in ext4_setattr()
  2013-08-05 13:52 ` [PATCH 3/4] ext4: Simplify truncation code in ext4_setattr() Jan Kara
@ 2013-08-17 14:08   ` Theodore Ts'o
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Ts'o @ 2013-08-17 14:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Dave Jones, Zheng Liu

On Mon, Aug 05, 2013 at 03:52:23PM +0200, Jan Kara wrote:
> Merge conditions in ext4_setattr() handling inode size changes, also
> move ext4_begin_ordered_truncate() call somewhat earlier because it
> simplifies error recovery in case of failure. Also add error handling in
> case i_disksize update fails.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied with one minor whitespace fixup.

							- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 4/4] ext4: Fix lost truncate due to race with writeback
  2013-08-05 13:52 [PATCH 0/4 v2] ext4: Fix races between writeback and truncate Jan Kara
                   ` (2 preceding siblings ...)
  2013-08-05 13:52 ` [PATCH 3/4] ext4: Simplify truncation code in ext4_setattr() Jan Kara
@ 2013-08-05 13:52 ` Jan Kara
  2013-08-17 14:12   ` Theodore Ts'o
  3 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2013-08-05 13:52 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Dave Jones, Zheng Liu, Jan Kara

The following race can lead to a loss of i_disksize update from truncate
thus resulting in a wrong inode size if the inode size isn't updated
again before inode is reclaimed:

ext4_setattr()				mpage_map_and_submit_extent()
  EXT4_I(inode)->i_disksize = attr->ia_size;
  ...					  ...
					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
					  /* False because i_size isn't
					   * updated yet */
					  if (disksize > i_size_read(inode))
					  /* True, because i_disksize is
					   * already truncated */
					  if (disksize > EXT4_I(inode)->i_disksize)
					    /* Overwrite i_disksize
					     * update from truncate */
					    ext4_update_i_disksize()
  i_size_write(inode, attr->ia_size);

For other places updating i_disksize such race cannot happen because
i_mutex prevents these races. Writeback is the only place where we do
not hold i_mutex and we cannot grab it there because of lock ordering.

We fix the race by doing both i_disksize and i_size update in truncate
atomically under i_data_sem and in mpage_map_and_submit_extent() we move
the check against i_size under i_data_sem as well.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  | 24 ++++++++++++++++++++----
 fs/ext4/inode.c | 17 ++++++++++++-----
 2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..648c5e6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2416,16 +2416,32 @@ do {								\
 #define EXT4_FREECLUSTERS_WATERMARK 0
 #endif
 
+/* Update i_disksize. Requires i_mutex to avoid races with truncate */
 static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
 {
-	/*
-	 * XXX: replace with spinlock if seen contended -bzzz
-	 */
+	WARN_ON_ONCE(S_ISREG(inode->i_mode) &&
+		     !mutex_is_locked(&inode->i_mutex));
+	down_write(&EXT4_I(inode)->i_data_sem);
+	if (newsize > EXT4_I(inode)->i_disksize)
+		EXT4_I(inode)->i_disksize = newsize;
+	up_write(&EXT4_I(inode)->i_data_sem);
+}
+
+/*
+ * Update i_disksize after writeback has been started. Races with truncate
+ * are avoided by checking i_size under i_data_sem.
+ */
+static inline void ext4_wb_update_i_disksize(struct inode *inode, loff_t newsize)
+{
+	loff_t i_size;
+
 	down_write(&EXT4_I(inode)->i_data_sem);
+	i_size = i_size_read(inode);
+	if (newsize > i_size)
+		newsize = i_size;
 	if (newsize > EXT4_I(inode)->i_disksize)
 		EXT4_I(inode)->i_disksize = newsize;
 	up_write(&EXT4_I(inode)->i_data_sem);
-	return ;
 }
 
 struct ext4_group_info {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e7d98d2..5d3706e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2240,12 +2240,10 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 
 	/* Update on-disk size after IO is submitted */
 	disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT;
-	if (disksize > i_size_read(inode))
-		disksize = i_size_read(inode);
 	if (disksize > EXT4_I(inode)->i_disksize) {
 		int err2;
 
-		ext4_update_i_disksize(inode, disksize);
+		ext4_wb_update_i_disksize(inode, disksize);
 		err2 = ext4_mark_inode_dirty(handle, inode);
 		if (err2)
 			ext4_error(inode->i_sb,
@@ -4587,18 +4585,27 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				error = ext4_orphan_add(handle, inode);
 				orphan = 1;
 			}
+			down_write(&EXT4_I(inode)->i_data_sem);
 			EXT4_I(inode)->i_disksize = attr->ia_size;
 			rc = ext4_mark_inode_dirty(handle, inode);
 			if (!error)
 				error = rc;
+			/*
+			 * We have to update i_size under i_data_sem together
+			 * with i_disksize to avoid races with writeback code
+			 * running ext4_wb_update_i_disksize().
+			 */
+			if (!error)
+				i_size_write(inode, attr->ia_size);
+			up_write(&EXT4_I(inode)->i_data_sem);
 			ext4_journal_stop(handle);
 			if (error) {
 				ext4_orphan_del(NULL, inode);
 				goto err_out;
 			}
-		}
+		} else
+			i_size_write(inode, attr->ia_size);
 
-		i_size_write(inode, attr->ia_size);
 		/*
 		 * Blocks are going to be removed from the inode. Wait
 		 * for dio in flight.  Temporarily disable
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback
  2013-08-05 13:52 ` [PATCH 4/4] ext4: Fix lost truncate due to race with writeback Jan Kara
@ 2013-08-17 14:12   ` Theodore Ts'o
  2013-08-26 19:01     ` Dave Jones
  0 siblings, 1 reply; 12+ messages in thread
From: Theodore Ts'o @ 2013-08-17 14:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Dave Jones, Zheng Liu

On Mon, Aug 05, 2013 at 03:52:24PM +0200, Jan Kara wrote:
> The following race can lead to a loss of i_disksize update from truncate
> thus resulting in a wrong inode size if the inode size isn't updated
> again before inode is reclaimed:
> 
> ext4_setattr()				mpage_map_and_submit_extent()
>   EXT4_I(inode)->i_disksize = attr->ia_size;
>   ...					  ...
> 					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
> 					  /* False because i_size isn't
> 					   * updated yet */
> 					  if (disksize > i_size_read(inode))
> 					  /* True, because i_disksize is
> 					   * already truncated */
> 					  if (disksize > EXT4_I(inode)->i_disksize)
> 					    /* Overwrite i_disksize
> 					     * update from truncate */
> 					    ext4_update_i_disksize()
>   i_size_write(inode, attr->ia_size);
> 
> For other places updating i_disksize such race cannot happen because
> i_mutex prevents these races. Writeback is the only place where we do
> not hold i_mutex and we cannot grab it there because of lock ordering.
> 
> We fix the race by doing both i_disksize and i_size update in truncate
> atomically under i_data_sem and in mpage_map_and_submit_extent() we move
> the check against i_size under i_data_sem as well.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied, thanks.

						- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback
  2013-08-17 14:12   ` Theodore Ts'o
@ 2013-08-26 19:01     ` Dave Jones
  2013-08-28 22:55       ` Theodore Ts'o
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Jones @ 2013-08-26 19:01 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4, Zheng Liu, Linus Torvalds

On Sat, Aug 17, 2013 at 10:12:27AM -0400, Theodore Ts'o wrote:
 > On Mon, Aug 05, 2013 at 03:52:24PM +0200, Jan Kara wrote:
 > > The following race can lead to a loss of i_disksize update from truncate
 > > thus resulting in a wrong inode size if the inode size isn't updated
 > > again before inode is reclaimed:
 > > 
 > > ext4_setattr()				mpage_map_and_submit_extent()
 > >   EXT4_I(inode)->i_disksize = attr->ia_size;
 > >   ...					  ...
 > > 					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
 > > 					  /* False because i_size isn't
 > > 					   * updated yet */
 > > 					  if (disksize > i_size_read(inode))
 > > 					  /* True, because i_disksize is
 > > 					   * already truncated */
 > > 					  if (disksize > EXT4_I(inode)->i_disksize)
 > > 					    /* Overwrite i_disksize
 > > 					     * update from truncate */
 > > 					    ext4_update_i_disksize()
 > >   i_size_write(inode, attr->ia_size);
 > > 
 > > For other places updating i_disksize such race cannot happen because
 > > i_mutex prevents these races. Writeback is the only place where we do
 > > not hold i_mutex and we cannot grab it there because of lock ordering.
 > > 
 > > We fix the race by doing both i_disksize and i_size update in truncate
 > > atomically under i_data_sem and in mpage_map_and_submit_extent() we move
 > > the check against i_size under i_data_sem as well.
 > > 
 > > Signed-off-by: Jan Kara <jack@suse.cz>
 > 
 > Applied, thanks.

Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.

	Dave


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback
  2013-08-26 19:01     ` Dave Jones
@ 2013-08-28 22:55       ` Theodore Ts'o
  2013-08-28 23:01         ` Dave Jones
  0 siblings, 1 reply; 12+ messages in thread
From: Theodore Ts'o @ 2013-08-28 22:55 UTC (permalink / raw)
  To: Dave Jones; +Cc: Jan Kara, linux-ext4, Zheng Liu, Linus Torvalds

On Mon, Aug 26, 2013 at 03:01:48PM -0400, Dave Jones wrote:
> 
> Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.

These patches fixed races that have been around for a while; it's not
a regression.  Given that they are fairly involved, I was nervous
sending them to Linus for 3.11, given the late date.

They are queued for the next merge window, and I'll mark them cc:
stable@vger.kernel.org.

						- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback
  2013-08-28 22:55       ` Theodore Ts'o
@ 2013-08-28 23:01         ` Dave Jones
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Jones @ 2013-08-28 23:01 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4, Zheng Liu, Linus Torvalds

On Wed, Aug 28, 2013 at 06:55:04PM -0400, Theodore Ts'o wrote:
 > On Mon, Aug 26, 2013 at 03:01:48PM -0400, Dave Jones wrote:
 > > 
 > > Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.
 > 
 > These patches fixed races that have been around for a while; it's not
 > a regression.  Given that they are fairly involved, I was nervous
 > sending them to Linus for 3.11, given the late date.

That's odd, because I can't reproduce the problem I'm seeing on 3.10

 > They are queued for the next merge window, and I'll mark them cc:
 > stable@vger.kernel.org.
 
Fair enough.

	Dave


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-08-28 23:02 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-08-05 13:52 [PATCH 0/4 v2] ext4: Fix races between writeback and truncate Jan Kara
2013-08-05 13:52 ` [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place Jan Kara
2013-08-17 13:59   ` Theodore Ts'o
2013-08-05 13:52 ` [PATCH 2/4] ext4: Fix ext4_writepages() in presence of truncate Jan Kara
2013-08-17 14:08   ` Theodore Ts'o
2013-08-05 13:52 ` [PATCH 3/4] ext4: Simplify truncation code in ext4_setattr() Jan Kara
2013-08-17 14:08   ` Theodore Ts'o
2013-08-05 13:52 ` [PATCH 4/4] ext4: Fix lost truncate due to race with writeback Jan Kara
2013-08-17 14:12   ` Theodore Ts'o
2013-08-26 19:01     ` Dave Jones
2013-08-28 22:55       ` Theodore Ts'o
2013-08-28 23:01         ` Dave Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).