* [PATCH 0/2] fix gfs2 truncate race @ 2016-06-14 22:02 Benjamin Marzinski 2016-06-14 22:02 ` [PATCH 1/2] fs: export __block_write_full_page Benjamin Marzinski 2016-06-14 22:02 ` [PATCH 2/2] gfs2: writeout truncated pages Benjamin Marzinski 0 siblings, 2 replies; 7+ messages in thread From: Benjamin Marzinski @ 2016-06-14 22:02 UTC (permalink / raw) To: linux-fsdevel; +Cc: cluster-devel If gfs2 tries to write out a page of a file with data journaling enabled while that file is being truncated, it can cause a kernel panic. The problem is that the page it is writing out may point past the end of the file. If this happens, gfs2 will try to invalidate the page. However, it can't invalidate a journaled page with buffers in the in-memory log without first revoking those buffers; the revokes ensure there is no chance of corruption if the machine crashes and later has its journal replayed. If the page is being written out by the log flush code, there is no way to add a revoke entry to the log during the flush. To solve this, gfs2 simply writes out journaled data pages that point past the end of the file. This is safe because the truncate is still in progress, and everything will be cleaned up before the file is unlocked or the blocks are marked free. Doing this involves both a gfs2 change and exporting an additional symbol from the vfs. Benjamin Marzinski (2): fs: export __block_write_full_page gfs2: writeout truncated pages fs/buffer.c | 3 ++- fs/gfs2/aops.c | 37 +++++++++++++++++++++++++++---------- include/linux/buffer_head.h | 3 +++ 3 files changed, 32 insertions(+), 11 deletions(-) -- 1.8.3.1 ^ permalink raw reply [flat|nested] 7+ messages in thread
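For readers unfamiliar with the writepage paths involved: "past the end of the file" is decided by comparing the page index against i_size. A minimal sketch of that check, mirroring the test that patch 2/2 removes from gfs2_writepage_common() (the variable names are taken from that code):

	loff_t i_size = i_size_read(inode);
	pgoff_t end_index = i_size >> PAGE_SHIFT;
	unsigned offset = i_size & (PAGE_SIZE - 1);

	/* Is the page fully outside i_size? Then a truncate is in progress. */
	if (page->index > end_index || (page->index == end_index && !offset)) {
		/* before this series: invalidate the page here          */
		/* with this series (jdata): just write the page out     */
	}

The series changes what the jdata path does with such pages (write them out instead of invalidating them); the truncate code then invalidates them and issues any revokes that are needed.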
* [PATCH 1/2] fs: export __block_write_full_page 2016-06-14 22:02 [PATCH 0/2] fix gfs2 truncate race Benjamin Marzinski @ 2016-06-14 22:02 ` Benjamin Marzinski 2016-06-14 22:02 ` [PATCH 2/2] gfs2: writeout truncated pages Benjamin Marzinski 1 sibling, 0 replies; 7+ messages in thread From: Benjamin Marzinski @ 2016-06-14 22:02 UTC (permalink / raw) To: linux-fsdevel; +Cc: cluster-devel gfs2 needs to be able to skip the check to see if a page is outside of the file size when writing it out. gfs2 can get into a situation where it needs to flush its in-memory log to disk while a truncate is in progress. If the file being truncated has data journaling enabled, it is possible that there are data blocks in the log that are past the end of the file. gfs2 can't finish the log flush without either writing these blocks out or revoking them. Otherwise, if the node crashed, it could overwrite subsequent changes made by other nodes in the cluster when its journal was replayed. Unfortunately, there is no way to add revoke entries to the log during a flush. So gfs2 simply writes out the page instead. This situation can only occur when the truncate code still has the file locked exclusively, and hasn't marked this block as free in the metadata (which happens later in trunc_dealloc). After gfs2 writes this page out, the truncation code will shortly invalidate it and write out any revokes if necessary. In order to make this work, gfs2 needs to be able to skip the check for writes outside the file size. Since the check exists in block_write_full_page, this patch exports __block_write_full_page, which doesn't have the check. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> --- fs/buffer.c | 3 ++- include/linux/buffer_head.h | 3 +++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/buffer.c b/fs/buffer.c index 754813a..6c15012 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1687,7 +1687,7 @@ static struct buffer_head *create_page_buffers(struct page *page, struct inode * * WB_SYNC_ALL, the writes are posted using WRITE_SYNC; this * causes the writes to be flagged as synchronous writes. */ -static int __block_write_full_page(struct inode *inode, struct page *page, +int __block_write_full_page(struct inode *inode, struct page *page, get_block_t *get_block, struct writeback_control *wbc, bh_end_io_t *handler) { @@ -1848,6 +1848,7 @@ recover: unlock_page(page); goto done; } +EXPORT_SYMBOL(__block_write_full_page); /* * If a page has any new buffers, zero them out here, and mark them uptodate diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index d48daa3..7e14e54 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -208,6 +208,9 @@ void block_invalidatepage(struct page *page, unsigned int offset, unsigned int length); int block_write_full_page(struct page *page, get_block_t *get_block, struct writeback_control *wbc); +int __block_write_full_page(struct inode *inode, struct page *page, + get_block_t *get_block, struct writeback_control *wbc, + bh_end_io_t *handler); int block_read_full_page(struct page*, get_block_t*); int block_is_partially_uptodate(struct page *page, unsigned long from, unsigned long count); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
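For context, block_write_full_page() is essentially a thin wrapper that handles the i_size cases and then calls __block_write_full_page() for the actual write-out. A rough sketch of that wrapper, paraphrased from fs/buffer.c of this era (details approximate, comments added):

	int block_write_full_page(struct page *page, get_block_t *get_block,
				  struct writeback_control *wbc)
	{
		struct inode * const inode = page->mapping->host;
		loff_t i_size = i_size_read(inode);
		const pgoff_t end_index = i_size >> PAGE_SHIFT;
		unsigned offset;

		/* Page fully inside i_size: write it out as-is. */
		if (page->index < end_index)
			return __block_write_full_page(inode, page, get_block, wbc,
						       end_buffer_async_write);

		/* Page fully outside i_size (truncate in progress): drop it. */
		offset = i_size & (PAGE_SIZE - 1);
		if (page->index >= end_index + 1 || !offset) {
			do_invalidatepage(page, 0, PAGE_SIZE);
			unlock_page(page);
			return 0;
		}

		/* Page straddles i_size: zero the part past EOF, then write. */
		zero_user_segment(page, offset, PAGE_SIZE);
		return __block_write_full_page(inode, page, get_block, wbc,
					       end_buffer_async_write);
	}

Exporting __block_write_full_page() lets gfs2 keep everything below this wrapper while substituting its own policy for pages at or past i_size, which is what patch 2/2 does.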
* [PATCH 2/2] gfs2: writeout truncated pages 2016-06-14 22:02 [PATCH 0/2] fix gfs2 truncate race Benjamin Marzinski 2016-06-14 22:02 ` [PATCH 1/2] fs: export __block_write_full_page Benjamin Marzinski @ 2016-06-14 22:02 ` Benjamin Marzinski 2016-06-15 17:08 ` [Cluster-devel] " Bob Peterson 2016-06-16 15:56 ` Steven Whitehouse 1 sibling, 2 replies; 7+ messages in thread From: Benjamin Marzinski @ 2016-06-14 22:02 UTC (permalink / raw) To: linux-fsdevel; +Cc: cluster-devel When gfs2 attempts to write a page to a file that is being truncated, and notices that the page is completely outside of the file size, it tries to invalidate it. However, this may require a transaction for journaled data files to revoke any buffers from the page on the active items list. Unfortunately, this can happen inside a log flush, where a transaction cannot be started. Also, gfs2 may need to be able to remove the buffer from the ail1 list before it can finish the log flush. To deal with this, gfs2 now skips the check to see if the write is outside the file size, and simply writes it anyway. This situation can only occur when the truncate code still has the file locked exclusively, and hasn't marked this block as free in the metadata (which happens later in trunc_dealloc). After gfs2 writes this page out, the truncation code will shortly invalidate it and write out any revokes if necessary. To do this, gfs2 now implements its own version of block_write_full_page without the check, and calls the newly exported __block_write_full_page. The check still exists in nobh_writepage, which gfs2 calls in the non-journaled case. But in this case the buffers will never be in the journal. Thus, no revokes will need to be issued and the write can safely be skipped without causing any possible journal replay issues. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> --- fs/gfs2/aops.c | 37 +++++++++++++++++++++++++++---------- 1 file changed, 27 insertions(+), 10 deletions(-) diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c index 37b7bc1..d3a7301 100644 --- a/fs/gfs2/aops.c +++ b/fs/gfs2/aops.c @@ -100,20 +100,11 @@ static int gfs2_writepage_common(struct page *page, struct inode *inode = page->mapping->host; struct gfs2_inode *ip = GFS2_I(inode); struct gfs2_sbd *sdp = GFS2_SB(inode); - loff_t i_size = i_size_read(inode); - pgoff_t end_index = i_size >> PAGE_SHIFT; - unsigned offset; if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) goto out; if (current->journal_info) goto redirty; - /* Is the page fully outside i_size? (truncate in progress) */ - offset = i_size & (PAGE_SIZE-1); - if (page->index > end_index || (page->index == end_index && !offset)) { - page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); - goto out; - } return 1; redirty: redirty_page_for_writepage(wbc, page); @@ -140,6 +131,32 @@ static int gfs2_writepage(struct page *page, struct writeback_control *wbc) return nobh_writepage(page, gfs2_get_block_noalloc, wbc); } +/* This is the same as calling block_write_full_page, but it also + * writes pages outside of i_size + */ +int gfs2_write_full_page(struct page *page, get_block_t *get_block, + struct writeback_control *wbc) +{ + struct inode * const inode = page->mapping->host; + loff_t i_size = i_size_read(inode); + const pgoff_t end_index = i_size >> PAGE_SHIFT; + unsigned offset; + + /* + * The page straddles i_size. It must be zeroed out on each and every + * writepage invocation because it may be mmapped. "A file is mapped + * in multiples of the page size.
For a file that is not a multiple of + * the page size, the remaining memory is zeroed when mapped, and + * writes to that region are not written out to the file." + */ + offset = i_size & (PAGE_SIZE-1); + if (page->index == end_index && offset) + zero_user_segment(page, offset, PAGE_SIZE); + + return __block_write_full_page(inode, page, get_block, wbc, + end_buffer_async_write); +} + /** * __gfs2_jdata_writepage - The core of jdata writepage * @page: The page to write @@ -165,7 +182,7 @@ static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *w } gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1); } - return block_write_full_page(page, gfs2_get_block_noalloc, wbc); + return gfs2_write_full_page(page, gfs2_get_block_noalloc, wbc); } /** -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
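The non-journaled path mentioned above goes through gfs2_writepage() -> nobh_writepage(), and the corresponding check there looks roughly like this (paraphrased from fs/buffer.c, details approximate) — note that it skips the write rather than invalidating the page:

	/* inside nobh_writepage(), before the actual write-out */
	offset = i_size & (PAGE_SIZE - 1);
	if (page->index >= end_index + 1 || !offset) {
		/*
		 * Page is fully past i_size: a truncate is in progress and
		 * will clean the page up shortly, so just skip the write.
		 */
		unlock_page(page);
		return 0;
	}

Since non-jdata buffers are never in the journal, skipping the write here cannot leave anything behind that would need a revoke.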
* Re: [Cluster-devel] [PATCH 2/2] gfs2: writeout truncated pages 2016-06-14 22:02 ` [PATCH 2/2] gfs2: writeout truncated pages Benjamin Marzinski @ 2016-06-15 17:08 ` Bob Peterson 2016-06-15 18:45 ` Benjamin Marzinski 2016-06-16 15:56 ` Steven Whitehouse 1 sibling, 1 reply; 7+ messages in thread From: Bob Peterson @ 2016-06-15 17:08 UTC (permalink / raw) To: Benjamin Marzinski; +Cc: linux-fsdevel, cluster-devel Hi Ben, Comments below. ----- Original Message ----- | When gfs2 attempts to write a page to a file that is being truncated, | and notices that the page is completely outside of the file size, it | tries to invalidate it. However, this may require a transaction for | journaled data files to revoke any buffers from the page on the active | items list. Unfortunately, this can happen inside a log flush, where a | transaction cannot be started. Also, gfs2 may need to be able to remove | the buffer from the ail1 list before it can finish the log flush. | | To deal with this, gfs2 now skips the check to see if the write is | outside the file size, and simply writes it anyway. This situation can | only occur when the truncate code still has the file locked exclusively, | and hasn't marked this block as free in the metadata (which happens | later in truc_dealloc). After gfs2 writes this page out, the truncation | code will shortly invalidate it and write out any revokes if necessary. | | To do this, gfs2 now implements its own version of block_write_full_page | without the check, and calls the newly exported __block_write_full_page. | The check still exists in nobh_writepage, which is gfs2 calls in the | non-journaled case. But in this case the buffers will never be in the | journal. Thus, no revokes will need to be issued and the write can | safely be skipped without causing any possible journal replay issues. | | Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> | --- | fs/gfs2/aops.c | 37 +++++++++++++++++++++++++++---------- | 1 file changed, 27 insertions(+), 10 deletions(-) | | diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c | index 37b7bc1..d3a7301 100644 | --- a/fs/gfs2/aops.c | +++ b/fs/gfs2/aops.c | @@ -100,20 +100,11 @@ static int gfs2_writepage_common(struct page *page, | struct inode *inode = page->mapping->host; | struct gfs2_inode *ip = GFS2_I(inode); | struct gfs2_sbd *sdp = GFS2_SB(inode); | - loff_t i_size = i_size_read(inode); | - pgoff_t end_index = i_size >> PAGE_SHIFT; | - unsigned offset; | | if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) | goto out; | if (current->journal_info) | goto redirty; | - /* Is the page fully outside i_size? (truncate in progress) */ | - offset = i_size & (PAGE_SIZE-1); | - if (page->index > end_index || (page->index == end_index && !offset)) { | - page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); | - goto out; | - } The concept in general looks good. Two things: (1) Since this is writepage_common, does it still need to do the check for the normal non-jdata writeback? Does it make sense to split this back up into two writepage functions again rather than the common one? After all, with your patch, the function does almost nothing anymore. (2) Just a nit, but can you eliminate the ugly goto around the return? In other words, rather than: if (current->journal_info) goto redirty; return 1; redirty: Just simply do: if (!current->journal_info) return 1; Regards, Bob Peterson Red Hat File Systems ^ permalink raw reply [flat|nested] 7+ messages in thread
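Concretely, with the patch above applied and suggestion (2) folded in, gfs2_writepage_common() would reduce to roughly the following sketch (illustrative only; the unlock_page() under the out: label is assumed from the existing function body, which the hunk does not show):

	static int gfs2_writepage_common(struct page *page,
					 struct writeback_control *wbc)
	{
		struct inode *inode = page->mapping->host;
		struct gfs2_inode *ip = GFS2_I(inode);
		struct gfs2_sbd *sdp = GFS2_SB(inode);

		if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
			goto out;
		if (!current->journal_info)
			return 1;
		redirty_page_for_writepage(wbc, page);
	out:
		unlock_page(page);
		return 0;
	}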
* Re: [Cluster-devel] [PATCH 2/2] gfs2: writeout truncated pages 2016-06-15 17:08 ` [Cluster-devel] " Bob Peterson @ 2016-06-15 18:45 ` Benjamin Marzinski 0 siblings, 0 replies; 7+ messages in thread From: Benjamin Marzinski @ 2016-06-15 18:45 UTC (permalink / raw) To: Bob Peterson; +Cc: linux-fsdevel, cluster-devel On Wed, Jun 15, 2016 at 01:08:02PM -0400, Bob Peterson wrote: > Hi Ben, > > Comments below. > > ----- Original Message ----- > | When gfs2 attempts to write a page to a file that is being truncated, > | and notices that the page is completely outside of the file size, it > | tries to invalidate it. However, this may require a transaction for > | journaled data files to revoke any buffers from the page on the active > | items list. Unfortunately, this can happen inside a log flush, where a > | transaction cannot be started. Also, gfs2 may need to be able to remove > | the buffer from the ail1 list before it can finish the log flush. > | > | To deal with this, gfs2 now skips the check to see if the write is > | outside the file size, and simply writes it anyway. This situation can > | only occur when the truncate code still has the file locked exclusively, > | and hasn't marked this block as free in the metadata (which happens > | later in truc_dealloc). After gfs2 writes this page out, the truncation > | code will shortly invalidate it and write out any revokes if necessary. > | > | To do this, gfs2 now implements its own version of block_write_full_page > | without the check, and calls the newly exported __block_write_full_page. > | The check still exists in nobh_writepage, which is gfs2 calls in the > | non-journaled case. But in this case the buffers will never be in the > | journal. Thus, no revokes will need to be issued and the write can > | safely be skipped without causing any possible journal replay issues. > | > | Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > | --- > | fs/gfs2/aops.c | 37 +++++++++++++++++++++++++++---------- > | 1 file changed, 27 insertions(+), 10 deletions(-) > | > | diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c > | index 37b7bc1..d3a7301 100644 > | --- a/fs/gfs2/aops.c > | +++ b/fs/gfs2/aops.c > | @@ -100,20 +100,11 @@ static int gfs2_writepage_common(struct page *page, > | struct inode *inode = page->mapping->host; > | struct gfs2_inode *ip = GFS2_I(inode); > | struct gfs2_sbd *sdp = GFS2_SB(inode); > | - loff_t i_size = i_size_read(inode); > | - pgoff_t end_index = i_size >> PAGE_SHIFT; > | - unsigned offset; > | > | if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) > | goto out; > | if (current->journal_info) > | goto redirty; > | - /* Is the page fully outside i_size? (truncate in progress) */ > | - offset = i_size & (PAGE_SIZE-1); > | - if (page->index > end_index || (page->index == end_index && !offset)) { > | - page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); > | - goto out; > | - } > > The concept in general looks good. Two things: > > (1) Since this is writepage_common, does it still need to do the > check for the normal non-jdata writeback? Does it makes sense to > split this back up into two writepage functions again rather than > the common one? After all, with your patch, the function does almost > nothing anymore. We do not need to worry about the non-jdata case. nobh_writepage(), which is called after gfs2_writepage_common in the non-jdata case, has the identical check. It doesn't invalidate the page, but that's fine, because the truncate code is about to do that anyway. 
This isn't even as big a change as it looks. In cases where we weren't doing this writepage during a log flush, there was nothing to guarantee that the truncate code wouldn't shrink the file size after the check in gfs2_writepage_common, and cause us to hit the check in nobh_writepage instead. Nate hit that scenario while testing the jdata case. In fact, there's nothing to guarantee that the i_size doesn't change after the check in nobh_writepage either, in which case we simply write out the page, like in the jdata case. > (2) Just a nit, but can you eliminate the ugly goto around the return? > In other words, rather than: > > if (current->journal_info) > goto redirty; > return 1; > redirty: > > Just simply do: > > if (!current->journal_info) > return 1; > Sure. I'll clean that up and respin the patches. -Ben > Regards, > > Bob Peterson > Red Hat File Systems ^ permalink raw reply [flat|nested] 7+ messages in thread
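Spelled out, the non-jdata interleaving being described looks something like this (an illustrative scenario, not a captured trace):

	writeback of a dirty page                truncate of the same file
	-------------------------                -------------------------
	gfs2_writepage_common():
	  page is still inside i_size,
	  so the (removed) check passes
	                                         i_size shrinks to below the page
	nobh_writepage():
	  page is now fully outside i_size,
	  the write is skipped and the page
	  is left for the truncate code to
	  invalidate

So even before this patch, the i_size check in gfs2_writepage_common() was only advisory for the non-jdata case; nobh_writepage() already had to handle the same condition on its own.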
* Re: [Cluster-devel] [PATCH 2/2] gfs2: writeout truncated pages 2016-06-14 22:02 ` [PATCH 2/2] gfs2: writeout truncated pages Benjamin Marzinski 2016-06-15 17:08 ` [Cluster-devel] " Bob Peterson @ 2016-06-16 15:56 ` Steven Whitehouse 2016-06-16 17:49 ` Benjamin Marzinski 1 sibling, 1 reply; 7+ messages in thread From: Steven Whitehouse @ 2016-06-16 15:56 UTC (permalink / raw) To: Benjamin Marzinski, linux-fsdevel; +Cc: cluster-devel Hi, On 14/06/16 23:02, Benjamin Marzinski wrote: > When gfs2 attempts to write a page to a file that is being truncated, > and notices that the page is completely outside of the file size, it > tries to invalidate it. However, this may require a transaction for > journaled data files to revoke any buffers from the page on the active > items list. Unfortunately, this can happen inside a log flush, where a > transaction cannot be started. Also, gfs2 may need to be able to remove > the buffer from the ail1 list before it can finish the log flush. > > To deal with this, gfs2 now skips the check to see if the write is > outside the file size, and simply writes it anyway. This situation can > only occur when the truncate code still has the file locked exclusively, > and hasn't marked this block as free in the metadata (which happens > later in truc_dealloc). After gfs2 writes this page out, the truncation > code will shortly invalidate it and write out any revokes if necessary. > > To do this, gfs2 now implements its own version of block_write_full_page > without the check, and calls the newly exported __block_write_full_page. > The check still exists in nobh_writepage, which is gfs2 calls in the > non-journaled case. But in this case the buffers will never be in the > journal. Thus, no revokes will need to be issued and the write can > safely be skipped without causing any possible journal replay issues. > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > --- > fs/gfs2/aops.c | 37 +++++++++++++++++++++++++++---------- > 1 file changed, 27 insertions(+), 10 deletions(-) > > diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c > index 37b7bc1..d3a7301 100644 > --- a/fs/gfs2/aops.c > +++ b/fs/gfs2/aops.c > @@ -100,20 +100,11 @@ static int gfs2_writepage_common(struct page *page, > struct inode *inode = page->mapping->host; > struct gfs2_inode *ip = GFS2_I(inode); > struct gfs2_sbd *sdp = GFS2_SB(inode); > - loff_t i_size = i_size_read(inode); > - pgoff_t end_index = i_size >> PAGE_SHIFT; > - unsigned offset; > > if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) > goto out; > if (current->journal_info) > goto redirty; > - /* Is the page fully outside i_size? (truncate in progress) */ > - offset = i_size & (PAGE_SIZE-1); > - if (page->index > end_index || (page->index == end_index && !offset)) { > - page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); > - goto out; > - } > return 1; I wonder whether it would make more sense to simply stop the jdata writepage calling the writepage_common function altogether and move the remaining checks to the jdata writepage function. That way the existing writepage would continue to do the invalidation, or does this check not buy us anything in the ordered/writeback cases either? Otherwise I think it all looks good, Steve. 
> redirty: > redirty_page_for_writepage(wbc, page); > @@ -140,6 +131,32 @@ static int gfs2_writepage(struct page *page, struct writeback_control *wbc) > return nobh_writepage(page, gfs2_get_block_noalloc, wbc); > } > > +/* This is the same as calling block_write_full_page, but it also > + * writes pages outside of i_size > + */ > +int gfs2_write_full_page(struct page *page, get_block_t *get_block, > + struct writeback_control *wbc) > +{ > + struct inode * const inode = page->mapping->host; > + loff_t i_size = i_size_read(inode); > + const pgoff_t end_index = i_size >> PAGE_SHIFT; > + unsigned offset; > + > + /* > + * The page straddles i_size. It must be zeroed out on each and every > + * writepage invocation because it may be mmapped. "A file is mapped > + * in multiples of the page size. For a file that is not a multiple of > + * the page size, the remaining memory is zeroed when mapped, and > + * writes to that region are not written out to the file." > + */ > + offset = i_size & (PAGE_SIZE-1); > + if (page->index == end_index && offset) > + zero_user_segment(page, offset, PAGE_SIZE); > + > + return __block_write_full_page(inode, page, get_block, wbc, > + end_buffer_async_write); > +} > + > /** > * __gfs2_jdata_writepage - The core of jdata writepage > * @page: The page to write > @@ -165,7 +182,7 @@ static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *w > } > gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1); > } > - return block_write_full_page(page, gfs2_get_block_noalloc, wbc); > + return gfs2_write_full_page(page, gfs2_get_block_noalloc, wbc); > } > > /** ^ permalink raw reply [flat|nested] 7+ messages in thread
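One possible shape for that alternative, purely as an illustration (not a proposed patch, and ignoring whatever else the real jdata writepage entry point has to do, such as its transaction handling): gfs2_writepage_common() keeps its i_size check and invalidation for the ordered/writeback paths, while the jdata entry point carries only the reduced checks itself:

	/* jdata entry point: no i_size check, pages past EOF get written out */
	static int gfs2_jdata_writepage(struct page *page,
					struct writeback_control *wbc)
	{
		struct inode *inode = page->mapping->host;
		struct gfs2_inode *ip = GFS2_I(inode);
		struct gfs2_sbd *sdp = GFS2_SB(inode);

		if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
			goto out;
		if (current->journal_info)
			goto redirty;
		return __gfs2_jdata_writepage(page, wbc);

	redirty:
		redirty_page_for_writepage(wbc, page);
	out:
		unlock_page(page);
		return 0;
	}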
* Re: [Cluster-devel] [PATCH 2/2] gfs2: writeout truncated pages 2016-06-16 15:56 ` Steven Whitehouse @ 2016-06-16 17:49 ` Benjamin Marzinski 0 siblings, 0 replies; 7+ messages in thread From: Benjamin Marzinski @ 2016-06-16 17:49 UTC (permalink / raw) To: Steven Whitehouse; +Cc: linux-fsdevel, cluster-devel On Thu, Jun 16, 2016 at 04:56:49PM +0100, Steven Whitehouse wrote: > Hi, > > On 14/06/16 23:02, Benjamin Marzinski wrote: > >When gfs2 attempts to write a page to a file that is being truncated, > >and notices that the page is completely outside of the file size, it > >tries to invalidate it. However, this may require a transaction for > >journaled data files to revoke any buffers from the page on the active > >items list. Unfortunately, this can happen inside a log flush, where a > >transaction cannot be started. Also, gfs2 may need to be able to remove > >the buffer from the ail1 list before it can finish the log flush. > > > >To deal with this, gfs2 now skips the check to see if the write is > >outside the file size, and simply writes it anyway. This situation can > >only occur when the truncate code still has the file locked exclusively, > >and hasn't marked this block as free in the metadata (which happens > >later in truc_dealloc). After gfs2 writes this page out, the truncation > >code will shortly invalidate it and write out any revokes if necessary. > > > >To do this, gfs2 now implements its own version of block_write_full_page > >without the check, and calls the newly exported __block_write_full_page. > >The check still exists in nobh_writepage, which is gfs2 calls in the > >non-journaled case. But in this case the buffers will never be in the > >journal. Thus, no revokes will need to be issued and the write can > >safely be skipped without causing any possible journal replay issues. > > > >Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > >--- > > fs/gfs2/aops.c | 37 +++++++++++++++++++++++++++---------- > > 1 file changed, 27 insertions(+), 10 deletions(-) > > > >diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c > >index 37b7bc1..d3a7301 100644 > >--- a/fs/gfs2/aops.c > >+++ b/fs/gfs2/aops.c > >@@ -100,20 +100,11 @@ static int gfs2_writepage_common(struct page *page, > > struct inode *inode = page->mapping->host; > > struct gfs2_inode *ip = GFS2_I(inode); > > struct gfs2_sbd *sdp = GFS2_SB(inode); > >- loff_t i_size = i_size_read(inode); > >- pgoff_t end_index = i_size >> PAGE_SHIFT; > >- unsigned offset; > > if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) > > goto out; > > if (current->journal_info) > > goto redirty; > >- /* Is the page fully outside i_size? (truncate in progress) */ > >- offset = i_size & (PAGE_SIZE-1); > >- if (page->index > end_index || (page->index == end_index && !offset)) { > >- page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); > >- goto out; > >- } > > return 1; > I wonder whether it would make more sense to simply stop the jdata writepage > calling the writepage_common function altogether and move the remaining > checks to the jdata writepage function. That way the existing writepage > would continue to do the invalidation, or does this check not buy us > anything in the ordered/writeback cases either? > > Otherwise I think it all looks good, Like I told Bob, we already run into cases where the file i_size is shrunk after we call gfs2_writepage_common, and we rely on nobh_writepage to do the check. So, from a correctness point of view, I don't think this changes much. 
We are invalidating the page sooner, so it's possible that there is some difference in performance, but I don't know that it would be measurable. If people would prefer, I'm fine following this suggestion, since it means that we aren't changing code that has been working fine all along, and doing it my way certainly won't improve our performance. -Ben > > Steve. > > > redirty: > > redirty_page_for_writepage(wbc, page); > >@@ -140,6 +131,32 @@ static int gfs2_writepage(struct page *page, struct writeback_control *wbc) > > return nobh_writepage(page, gfs2_get_block_noalloc, wbc); > > } > >+/* This is the same as calling block_write_full_page, but it also > >+ * writes pages outside of i_size > >+ */ > >+int gfs2_write_full_page(struct page *page, get_block_t *get_block, > >+ struct writeback_control *wbc) > >+{ > >+ struct inode * const inode = page->mapping->host; > >+ loff_t i_size = i_size_read(inode); > >+ const pgoff_t end_index = i_size >> PAGE_SHIFT; > >+ unsigned offset; > >+ > >+ /* > >+ * The page straddles i_size. It must be zeroed out on each and every > >+ * writepage invocation because it may be mmapped. "A file is mapped > >+ * in multiples of the page size. For a file that is not a multiple of > >+ * the page size, the remaining memory is zeroed when mapped, and > >+ * writes to that region are not written out to the file." > >+ */ > >+ offset = i_size & (PAGE_SIZE-1); > >+ if (page->index == end_index && offset) > >+ zero_user_segment(page, offset, PAGE_SIZE); > >+ > >+ return __block_write_full_page(inode, page, get_block, wbc, > >+ end_buffer_async_write); > >+} > >+ > > /** > > * __gfs2_jdata_writepage - The core of jdata writepage > > * @page: The page to write > >@@ -165,7 +182,7 @@ static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *w > > } > > gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1); > > } > >- return block_write_full_page(page, gfs2_get_block_noalloc, wbc); > >+ return gfs2_write_full_page(page, gfs2_get_block_noalloc, wbc); > > } > > /** ^ permalink raw reply [flat|nested] 7+ messages in thread