Re: [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Aravinda Herle <araherle@in.ibm.com>
Subject: Re: [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance
Date: Thu, 2 Mar 2023 09:02:24 -0500	[thread overview]
Message-ID: <ZACscHnzmywRaXvu@bfoster> (raw)
In-Reply-To: <9650ef88e09c6227b99bb5793eef2b8e47994c7d.1677428795.git.ritesh.list@gmail.com>

On Mon, Feb 27, 2023 at 01:13:32AM +0530, Ritesh Harjani (IBM) wrote:
> On a 64k pagesize platforms (specially Power and/or aarch64) with 4k
> filesystem blocksize, this patch should improve the performance by doing
> only the subpage dirty data write.
> 
> This should also reduce the write amplification since we can now track
> subpage dirty status within state bitmaps. Earlier we had to
> write the entire 64k page even if only a part of it (e.g. 4k) was
> updated.
> 
> Performance testing of below fio workload reveals ~16x performance
> improvement on nvme with XFS (4k blocksize) on Power (64K pagesize)
> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> 
> 1. <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]
> 
> 2. Also our internal performance team reported that this patch improves there
>    database workload performance by around ~83% (with XFS on Power)
> 
> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/gfs2/aops.c         |   2 +-
>  fs/iomap/buffered-io.c | 104 +++++++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_aops.c      |   2 +-
>  fs/zonefs/super.c      |   2 +-
>  include/linux/iomap.h  |   1 +
>  5 files changed, 99 insertions(+), 12 deletions(-)
> 
...
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e0b0be16278e..fb55183c547f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
...
> @@ -1630,7 +1715,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
> +	struct iomap_page *iop = iomap_page_create(inode, folio, 0, true);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1646,7 +1731,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !iop_test_uptodate(iop, i, nblocks))
> +		if (iop && !iop_test_dirty(iop, i, nblocks))
>  			continue;
> 
>  		error = wpc->ops->map_blocks(wpc, inode, pos);

Hi Ritesh,

I'm not sure if you followed any of the discussion on the imap
revalidation series that landed in the last cycle or so, but the
associated delalloc punch error handling code has a subtle dependency on
current writeback behavior and thus left a bit of a landmine for this
work. For reference, more detailed discussion starts around here [1].
The context is basically that the use of mapping seek hole/data relies
on uptodate status, which means in certain error cases the filesystem
might allocate a delalloc block for a write, but not punch it out of the
associated write happens to fail and the underlying portion of the folio
was uptodate.

This doesn't cause a problem in current mainline because writeback maps
every uptodate block in a dirty folio, and so the delalloc block will
convert at writeback time even though it wasn't written. This no longer
occurs with the change above, which means there's a vector for a stale
delalloc block to be left around in the inode. This is a free space
accounting corruption issue on XFS. Here's a quick example [2] on a 1k
FSB XFS filesystem to show exactly what I mean:

# xfs_io -fc "truncate 4k" -c "mmap 0 4k" -c "mread 0 4k" -c "pwrite 0 1" -c "pwrite -f 2k 1" -c fsync /mnt/file
# xfs_io -c "fiemap -v" /mnt/file 
/mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
...
   2: [4..5]:          0..1                 2   0x7
...
(the above shows delalloc after an fsync)
# umount /mnt
  kernel:XFS: Assertion failed: xfs_is_shutdown(mp) || percpu_counter_sum(&mp->m_delalloc_blks) == 0, file: fs/xfs/xfs_super.c, line: 1068
# xfs_repair -n /dev/vdb2 
Phase 1 - find and verify superblock...
Phase 2 - using internal log
...
sb_fdblocks 20960187, counted 20960195
...
#

I suspect this means either the error handling code needs to be updated
to consider dirty state (i.e. punch delalloc if the block is !dirty), or
otherwise this needs to depend on a broader change in XFS to reclaim
delalloc blocks before inode eviction (along the lines of Dave's recent
proposal to do something like that for corrupted inodes). Of course the
caveat with the latter is that doesn't help for any other filesystems
(?) that might have similar expectations for delayed allocation and want
to use iomap.

Brian

[1] https://lore.kernel.org/linux-fsdevel/Y3TsPzd0XzXXIzQv@bfoster/

[2] This test case depends on a local xfs_io hack to co-opt the -f flag
into inducing a write failure. A POC patch for that is available here,
if you wanted to replicate:

https://lore.kernel.org/linux-xfs/20221123181322.3710820-1-bfoster@redhat.com/

next prev parent reply	other threads:[~2023-03-02 14:01 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-26 19:43 [RFCv3 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
2023-02-26 19:43 ` [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
2023-02-26 22:41   ` Dave Chinner
2023-02-28 17:55     ` Ritesh Harjani
2023-02-26 23:12   ` Matthew Wilcox
2023-02-28 18:33     ` Ritesh Harjani
2023-02-28 18:36       ` Matthew Wilcox
2023-03-02 18:59         ` Ritesh Harjani
2023-02-26 19:43 ` [RFCv3 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
2023-02-26 23:12   ` Dave Chinner
2023-02-26 23:23   ` Matthew Wilcox
2023-02-28 18:38     ` Ritesh Harjani
2023-02-26 19:43 ` [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
2023-02-26 23:48   ` Dave Chinner
2023-02-26 23:56     ` Matthew Wilcox
2023-02-27  0:10       ` Dave Chinner
2023-03-02 14:02   ` Brian Foster [this message]
2023-04-26  9:57     ` Ritesh Harjani

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZACscHnzmywRaXvu@bfoster \
    --to=bfoster@redhat.com \
    --cc=araherle@in.ibm.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ritesh.list@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).