public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] btrfs: refactor how we prepare pages for btrfs_buffered_write()
@ 2020-08-07  7:27 Qu Wenruo
  2020-08-13 11:16 ` Qu Wenruo
  0 siblings, 1 reply; 3+ messages in thread
From: Qu Wenruo @ 2020-08-07  7:27 UTC (permalink / raw)
  To: linux-btrfs

When we prepare pages for a buffered write, we need to ensure that if
we're partially dirtying a page, it must be uptodate so that it contains
the old content read from disk.

For pages that are completely inside the write range, on the other hand,
we don't need to make them uptodate at all, so we can skip one page read.

The old method uses the @force_page_uptodate flag, but in fact it doesn't
make any sense at all.
prepare_uptodate_page() already has a check to ensure we always force a
partially written page to be uptodate.

So this patch refactors the mess to make it easier to read by:
- Removing the @force_page_uptodate flag
  It makes no difference at all

- Removing the @pos and @force_uptodate parameters of prepare_uptodate_page()
  Now the caller, prepare_pages(), determines whether the read is necessary.

- Adding more comments to prepare_pages()
  These explain why some pages don't need to be uptodate.

- Using page_offset() to be more reader-friendly
  After find_or_create_page(), page_offset() returns the offset inside
  the address space, so it can be used directly.

The new code uses the following method to do the check:

	full_dirty_page_start = round_up(pos, fs_info->sectorsize);
	full_dirty_page_end = round_down(pos + write_bytes,
					 fs_info->sectorsize) - 1;
	for (i = 0; i < num_pages; i++) {
		...
		if (!err && !(page_offset >= full_dirty_page_start &&
			      page_offset <= full_dirty_page_end))
			err = prepare_uptodate_page(inode, pages[i]);
	}

This can handle all the possible cases:
- Dirty range inside a page
  || |///| ||
  Then @full_dirty_page_start > @full_dirty_page_end, so the inner range
  check is always false and prepare_uptodate_page() is always called.

- Dirty range across one page boundary
  ||   |///||//|     ||
  Then @full_dirty_page_start == @full_dirty_page_end + 1, the same as
  the above case.

- Dirty range covering a full page
  ||   |///||////////||//|     ||
  Then only the 2nd page meets the condition and skips the
  prepare_uptodate_page() call.
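The range check above can be exercised with a small userspace sketch. This is a toy model, not code from the patch: it assumes sectorsize equals a 4 KiB page, round_up_u64()/round_down_u64() are stand-ins for the kernel's round_up()/round_down() macros, and page_fully_dirty() is a hypothetical helper wrapping the check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy assumption: sectorsize == page size == 4 KiB. */
#define TOY_SECTORSIZE 4096ULL

/* Userspace stand-ins for the kernel's round_up()/round_down() macros. */
static uint64_t round_up_u64(uint64_t x, uint64_t a)
{
	return (x + a - 1) / a * a;
}

static uint64_t round_down_u64(uint64_t x, uint64_t a)
{
	return x / a * a;
}

/*
 * Hypothetical helper: return true when the page starting at @page_offset
 * lies completely inside the dirty range [@pos, @pos + @write_bytes), i.e.
 * the prepare_uptodate_page() call can be skipped for it.
 */
static bool page_fully_dirty(uint64_t page_offset, uint64_t pos,
			     uint64_t write_bytes)
{
	uint64_t full_dirty_page_start = round_up_u64(pos, TOY_SECTORSIZE);
	uint64_t full_dirty_page_end =
		round_down_u64(pos + write_bytes, TOY_SECTORSIZE) - 1;

	return page_offset >= full_dirty_page_start &&
	       page_offset <= full_dirty_page_end;
}
```

For the three cases above: a write of [100, 300) makes the start bound 4096, so no page offset passes; a write of [4000, 4200) gives start 4096 and end 4095, again excluding both pages; a write of [100, 8300) covers page 4096 fully (start 4096, end 8191) while pages 0 and 8192 still get read.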

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 841c516079a9..705ebe709e8d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1378,13 +1378,11 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
  * on success we return a locked page and 0
  */
 static int prepare_uptodate_page(struct inode *inode,
-				 struct page *page, u64 pos,
-				 bool force_uptodate)
+				 struct page *page)
 {
 	int ret = 0;
 
-	if (((pos & (PAGE_SIZE - 1)) || force_uptodate) &&
-	    !PageUptodate(page)) {
+	if (!PageUptodate(page)) {
 		ret = btrfs_readpage(NULL, page);
 		if (ret)
 			return ret;
@@ -1406,14 +1404,29 @@ static int prepare_uptodate_page(struct inode *inode,
  */
 static noinline int prepare_pages(struct inode *inode, struct page **pages,
 				  size_t num_pages, loff_t pos,
-				  size_t write_bytes, bool force_uptodate)
+				  size_t write_bytes)
 {
-	int i;
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	unsigned long index = pos >> PAGE_SHIFT;
 	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+	u64 full_dirty_page_start;
+	u64 full_dirty_page_end;
 	int err = 0;
 	int faili;
+	int i;
 
+	/*
+	 * || = page boundary
+	 *
+	 *     @pos                   @pos + write_bytes
+	 * ||  |///||//////||/////||//|   ||
+	 *	    \		  /
+	 *           In this range, we don't need the pages to
+	 *           be uptodate at all.
+	 */
+	full_dirty_page_start = round_up(pos, fs_info->sectorsize);
+	full_dirty_page_end = round_down(pos + write_bytes,
+					 fs_info->sectorsize) - 1;
 	for (i = 0; i < num_pages; i++) {
 again:
 		pages[i] = find_or_create_page(inode->i_mapping, index + i,
@@ -1424,12 +1437,9 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 			goto fail;
 		}
 
-		if (i == 0)
-			err = prepare_uptodate_page(inode, pages[i], pos,
-						    force_uptodate);
-		if (!err && i == num_pages - 1)
-			err = prepare_uptodate_page(inode, pages[i],
-						    pos + write_bytes, false);
+		if (!err && !(page_offset(pages[i]) >= full_dirty_page_start &&
+			      page_offset(pages[i]) <= full_dirty_page_end))
+			err = prepare_uptodate_page(inode, pages[i]);
 		if (err) {
 			put_page(pages[i]);
 			if (err == -EAGAIN) {
@@ -1638,7 +1648,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 	int nrptrs;
 	int ret = 0;
 	bool only_release_metadata = false;
-	bool force_page_uptodate = false;
 
 	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
 			PAGE_SIZE / (sizeof(struct page *)));
@@ -1727,8 +1736,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 		 * contents of pages from loop to loop
 		 */
 		ret = prepare_pages(inode, pages, num_pages,
-				    pos, write_bytes,
-				    force_page_uptodate);
+				    pos, write_bytes);
 		if (ret) {
 			btrfs_delalloc_release_extents(BTRFS_I(inode),
 						       reserve_bytes);
@@ -1763,11 +1771,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 			nrptrs = 1;
 
 		if (copied == 0) {
-			force_page_uptodate = true;
 			dirty_sectors = 0;
 			dirty_pages = 0;
 		} else {
-			force_page_uptodate = false;
 			dirty_pages = DIV_ROUND_UP(copied + offset,
 						   PAGE_SIZE);
 		}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] btrfs: refactor how we prepare pages for btrfs_buffered_write()
  2020-08-07  7:27 [PATCH] btrfs: refactor how we prepare pages for btrfs_buffered_write() Qu Wenruo
@ 2020-08-13 11:16 ` Qu Wenruo
  2020-08-13 11:22   ` David Sterba
  0 siblings, 1 reply; 3+ messages in thread
From: Qu Wenruo @ 2020-08-13 11:16 UTC (permalink / raw)
  To: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 6820 bytes --]



On 2020/8/7 3:27 PM, Qu Wenruo wrote:
> When we prepare pages for a buffered write, we need to ensure that if
> we're partially dirtying a page, it must be uptodate so that it contains
> the old content read from disk.
> 
> For pages that are completely inside the write range, on the other hand,
> we don't need to make them uptodate at all, so we can skip one page read.
> 
> The old method uses the @force_page_uptodate flag, but in fact it doesn't
> make any sense at all.
> prepare_uptodate_page() already has a check to ensure we always force a
> partially written page to be uptodate.
> 
> So this patch refactors the mess to make it easier to read by:
> - Removing the @force_page_uptodate flag
>   It makes no difference at all
> 
> - Removing the @pos and @force_uptodate parameters of prepare_uptodate_page()
>   Now the caller, prepare_pages(), determines whether the read is necessary.
> 
> - Adding more comments to prepare_pages()
>   These explain why some pages don't need to be uptodate.
> 
> - Using page_offset() to be more reader-friendly
>   After find_or_create_page(), page_offset() returns the offset inside
>   the address space, so it can be used directly.
> 
> The new code uses the following method to do the check:
> 
> 	full_dirty_page_start = round_up(pos, fs_info->sectorsize);
> 	full_dirty_page_end = round_down(pos + write_bytes,
> 					 fs_info->sectorsize) - 1;
> 	for (i = 0; i < num_pages; i++) {
> 		...
> 		if (!err && !(page_offset >= full_dirty_page_start &&
> 			      page_offset <= full_dirty_page_end))
> 			err = prepare_uptodate_page(inode, pages[i]);
> 	}
> 
> This can handle all the possible cases:
> - Dirty range inside a page
>   || |///| ||
>   Then @full_dirty_page_start > @full_dirty_page_end, so the inner range
>   check is always false and prepare_uptodate_page() is always called.
> 
> - Dirty range across one page boundary
>   ||   |///||//|     ||
>   Then @full_dirty_page_start == @full_dirty_page_end + 1, the same as
>   the above case.
> 
> - Dirty range covering a full page
>   ||   |///||////////||//|     ||
>   Then only the 2nd page meets the condition and skips the
>   prepare_uptodate_page() call.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Please discard this patch too.

It turns out that some forced uptodate operations are still needed. The
reason is inlined below.

There is still something useful here, like the fully-dirtied range
check, but the patch breaks the original behavior with some very rare
iov combinations.

> ---
>  fs/btrfs/file.c | 40 +++++++++++++++++++++++-----------------
>  1 file changed, 23 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 841c516079a9..705ebe709e8d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1378,13 +1378,11 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>   * on success we return a locked page and 0
>   */
>  static int prepare_uptodate_page(struct inode *inode,
> -				 struct page *page, u64 pos,
> -				 bool force_uptodate)
> +				 struct page *page)
>  {
>  	int ret = 0;
>  
> -	if (((pos & (PAGE_SIZE - 1)) || force_uptodate) &&
> -	    !PageUptodate(page)) {
> +	if (!PageUptodate(page)) {
>  		ret = btrfs_readpage(NULL, page);
>  		if (ret)
>  			return ret;
> @@ -1406,14 +1404,29 @@ static int prepare_uptodate_page(struct inode *inode,
>   */
>  static noinline int prepare_pages(struct inode *inode, struct page **pages,
>  				  size_t num_pages, loff_t pos,
> -				  size_t write_bytes, bool force_uptodate)
> +				  size_t write_bytes)
>  {
> -	int i;
> +	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
>  	unsigned long index = pos >> PAGE_SHIFT;
>  	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
> +	u64 full_dirty_page_start;
> +	u64 full_dirty_page_end;
>  	int err = 0;
>  	int faili;
> +	int i;
>  
> +	/*
> +	 * || = page boundary
> +	 *
> +	 *     @pos                   @pos + write_bytes
> +	 * ||  |///||//////||/////||//|   ||
> +	 *	    \		  /
> +	 *           In this range, we don't need the pages to
> +	 *           be uptodate at all.
> +	 */
> +	full_dirty_page_start = round_up(pos, fs_info->sectorsize);
> +	full_dirty_page_end = round_down(pos + write_bytes,
> +					 fs_info->sectorsize) - 1;
>  	for (i = 0; i < num_pages; i++) {
>  again:
>  		pages[i] = find_or_create_page(inode->i_mapping, index + i,
> @@ -1424,12 +1437,9 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
>  			goto fail;
>  		}
>  
> -		if (i == 0)
> -			err = prepare_uptodate_page(inode, pages[i], pos,
> -						    force_uptodate);
> -		if (!err && i == num_pages - 1)
> -			err = prepare_uptodate_page(inode, pages[i],
> -						    pos + write_bytes, false);
> +		if (!err && !(page_offset(pages[i]) >= full_dirty_page_start &&
> +			      page_offset(pages[i]) <= full_dirty_page_end))
> +			err = prepare_uptodate_page(inode, pages[i]);
>  		if (err) {
>  			put_page(pages[i]);
>  			if (err == -EAGAIN) {
> @@ -1638,7 +1648,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>  	int nrptrs;
>  	int ret = 0;
>  	bool only_release_metadata = false;
> -	bool force_page_uptodate = false;
>  
>  	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>  			PAGE_SIZE / (sizeof(struct page *)));
> @@ -1727,8 +1736,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>  		 * contents of pages from loop to loop
>  		 */
>  		ret = prepare_pages(inode, pages, num_pages,
> -				    pos, write_bytes,
> -				    force_page_uptodate);
> +				    pos, write_bytes);
>  		if (ret) {
>  			btrfs_delalloc_release_extents(BTRFS_I(inode),
>  						       reserve_bytes);
> @@ -1763,11 +1771,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>  			nrptrs = 1;
>  
>  		if (copied == 0) {
> -			force_page_uptodate = true;

This is here to prevent the following problem:

One page is completely covered by the write range, but the write range
inside the page is split into two iovs.

In that case, if the later part fails to be copied, we hit a short
write.
Originally we would set force_page_uptodate, then fall back to
nrptrs = 1 with force_page_uptodate = true.

Then in the next loop, we force a read of that full page to avoid
garbage in the partially written page.

This is indeed super rare, thus none of my fstests runs have triggered
it.
But in theory, as long as we support iterators split into different
iovs, it should still be possible.
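The failure mode described above can be modeled with a tiny userspace sketch. Everything here is a made-up stand-in (TOY_PAGE_SIZE, struct toy_page, toy_readpage(), toy_copy()), not kernel API; toy_copy() mimics a user copy that faults part way through:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16 /* toy page size, not the real 4 KiB */

struct toy_page {
	char data[TOY_PAGE_SIZE];
	int uptodate;
};

/* The "disk" contents that a btrfs_readpage()-style read would bring in. */
static const char old_disk[TOY_PAGE_SIZE] = "OLDOLDOLDOLDOLD";

/* Stand-in for reading the page from disk and marking it uptodate. */
static void toy_readpage(struct toy_page *p)
{
	memcpy(p->data, old_disk, TOY_PAGE_SIZE);
	p->uptodate = 1;
}

/*
 * Stand-in for copying one iov into the page: copies at most @ok of the
 * @len requested bytes, mimicking a user copy that faults part way, and
 * returns the number of bytes actually copied.
 */
static size_t toy_copy(struct toy_page *p, size_t off, const char *src,
		       size_t len, size_t ok)
{
	size_t n = ok < len ? ok : len;

	memcpy(p->data + off, src, n);
	return n;
}
```

If the page was never read (not uptodate) and the second iov faults, the bytes past the copied prefix hold whatever was in the page before. Forcing a read first, which is what force_page_uptodate triggers on the retry loop, guarantees the tail holds the old disk content instead of garbage.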

Thanks,
Qu

>  			dirty_sectors = 0;
>  			dirty_pages = 0;
>  		} else {
> -			force_page_uptodate = false;
>  			dirty_pages = DIV_ROUND_UP(copied + offset,
>  						   PAGE_SIZE);
>  		}
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] btrfs: refactor how we prepare pages for btrfs_buffered_write()
  2020-08-13 11:16 ` Qu Wenruo
@ 2020-08-13 11:22   ` David Sterba
  0 siblings, 0 replies; 3+ messages in thread
From: David Sterba @ 2020-08-13 11:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Aug 13, 2020 at 07:16:48PM +0800, Qu Wenruo wrote:
> > @@ -1763,11 +1771,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> >  			nrptrs = 1;
> >  
> >  		if (copied == 0) {
> > -			force_page_uptodate = true;
> 
> This is here to prevent the following problem:
> 
> One page is completely covered by the write range, but the write range
> inside the page is split into two iovs.
> 
> In that case, if the later part fails to be copied, we hit a short
> write.
> Originally we would set force_page_uptodate, then fall back to
> nrptrs = 1 with force_page_uptodate = true.
> 
> Then in the next loop, we force a read of that full page to avoid
> garbage in the partially written page.
> 
> This is indeed super rare, thus none of my fstests runs have triggered
> it.
> But in theory, as long as we support iterators split into different
> iovs, it should still be possible.

I wouldn't count on fstests to trigger corner cases; for that we need
long-running stress tests. It's good that you can identify the rare
cases while the patch is still in development, as catching such bugs
later is much harder.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-08-13 11:23 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07  7:27 [PATCH] btrfs: refactor how we prepare pages for btrfs_buffered_write() Qu Wenruo
2020-08-13 11:16 ` Qu Wenruo
2020-08-13 11:22   ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox