From: "Darrick J. Wong" <djwong@kernel.org>
To: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Matthew Wilcox <willy@infradead.org>,
	Aravinda Herle <araherle@in.ibm.com>
Subject: Re: [RFC 2/2] iomap: Support subpage size dirty tracking to improve write performance
Date: Fri, 28 Oct 2022 10:01:19 -0700
Message-ID: <Y1wK3x7IketHl+DQ@magnolia>
In-Reply-To: <886076cfa6f547d22765c522177d33cf621013d2.1666928993.git.ritesh.list@gmail.com>

On Fri, Oct 28, 2022 at 10:00:33AM +0530, Ritesh Harjani (IBM) wrote:
> On 64k pagesize platforms (especially Power and/or aarch64) with a 4k
> filesystem blocksize, this patch should improve write performance by
> writing out only the dirty subpage blocks.
> 
> This should also reduce write amplification, since we can now track
> subpage dirty status within the state bitmap. Earlier we had to write
> the entire 64k page even if only a part of it (e.g. 4k) was updated.
> 
> Performance testing of the below fio workload reveals a ~16x
> performance improvement on NVMe with XFS (4k blocksize) on Power
> (64k pagesize). FIO-reported write bandwidth improved from ~28 MBps
> to ~452 MBps.
> 
> <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]
> 
> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 53 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 255f9f92668c..31ee80a996b2 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -58,7 +58,7 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>  
> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);
>  	if (iop) {
>  		spin_lock_init(&iop->state_lock);
> @@ -168,6 +168,48 @@ static void iomap_set_range_uptodate(struct folio *folio,
>  		folio_mark_uptodate(folio);
>  }
>  
> +static void iomap_iop_set_range_dirty(struct folio *folio,
> +		struct iomap_page *iop, size_t off, size_t len)
> +{
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> +	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
> +	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	bitmap_set(iop->state, first, last - first + 1);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_set_range_dirty(struct folio *folio,
> +		struct iomap_page *iop, size_t off, size_t len)
> +{
> +	if (iop)
> +		iomap_iop_set_range_dirty(folio, iop, off, len);
> +}
> +
> +static void iomap_iop_clear_range_dirty(struct folio *folio,
> +		struct iomap_page *iop, size_t off, size_t len)
> +{
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> +	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
> +	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	bitmap_clear(iop->state, first, last - first + 1);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_clear_range_dirty(struct folio *folio,
> +		struct iomap_page *iop, size_t off, size_t len)
> +{
> +	if (iop)
> +		iomap_iop_clear_range_dirty(folio, iop, off, len);
> +}
> +
>  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>  		size_t len, int error)
>  {
> @@ -665,6 +707,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
>  		return 0;
>  	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> +	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, pos), len);
>  	filemap_dirty_folio(inode->i_mapping, folio);
>  	return copied;
>  }
> @@ -979,6 +1022,8 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
>  		block_commit_write(&folio->page, 0, length);
>  	} else {
>  		WARN_ON_ONCE(!folio_test_uptodate(folio));
> +		iomap_set_range_dirty(folio, to_iomap_page(folio),
> +				offset_in_folio(folio, iter->pos), length);
>  		folio_mark_dirty(folio);
>  	}
>  
> @@ -1354,7 +1399,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !test_bit(i, iop->state))
> +		if (iop && (!test_bit(i, iop->state) ||
> +			    !test_bit(i + nblocks, iop->state)))

Hmm.  So I /think/ these two test_bit()s mean that we skip any folio
sub-block that is either not uptodate or not dirty?

I /think/ we only need to check the dirty status, right?  Like willy
said? :)
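
Something like this, maybe (totally untested sketch; this assumes the
dirty bits occupy the second half of iop->state, as laid out in this
patch):

		/*
		 * Skip blocks that aren't dirty.  A dirty block should
		 * always be uptodate, so also testing the uptodate bit
		 * here ought to be redundant.
		 */
		if (iop && !test_bit(i + nblocks, iop->state))
			continue;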

That said... somewhere we probably ought to check the consistency of the
two bits to ensure that they're not (dirty && !uptodate), given our
horrible history of getting things wrong with page and bufferhead state
bits.

Admittedly I'm not thrilled at the reintroduction of page and iop dirty
state that are updated in separate places, but OTOH the write
amplification here is demonstrably horrifying, as you point out, so
it's clearly necessary.

Maybe we need a debugging function that will check the page and iop
state, and call it every time we go in and out of critical iomap
functions (write, writeback, dropping pages, etc.).
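
Something along these lines, perhaps (a rough, untested sketch; the
name iomap_iop_check_dirty_uptodate is made up, and taking
iop->state_lock is elided for brevity):

static void iomap_iop_check_dirty_uptodate(struct folio *folio)
{
	struct iomap_page *iop = to_iomap_page(folio);
	struct inode *inode = folio->mapping->host;
	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
	unsigned int i;

	if (!iop)
		return;

	for (i = 0; i < nr_blocks; i++) {
		/* Dirty but not uptodate means we lost track of state. */
		WARN_ON_ONCE(test_bit(i + nr_blocks, iop->state) &&
			     !test_bit(i, iop->state));
	}
}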

--D

>  			continue;
>  
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> @@ -1397,6 +1443,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		}
>  	}
>  
> +	iomap_clear_range_dirty(folio, iop,
> +				offset_in_folio(folio, folio_pos(folio)),
> +				end_pos - folio_pos(folio));
>  	folio_start_writeback(folio);
>  	folio_unlock(folio);
>  
> -- 
> 2.37.3
> 
