From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Matthew Wilcox <willy@infradead.org>,
Dave Chinner <david@fromorbit.com>,
Brian Foster <bfoster@redhat.com>,
Christoph Hellwig <hch@infradead.org>,
Andreas Gruenbacher <agruenba@redhat.com>,
Ojaswin Mujoo <ojaswin@linux.ibm.com>,
Disha Goel <disgoel@linux.ibm.com>,
Aravinda Herle <araherle@in.ibm.com>
Subject: Re: [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance
Date: Wed, 07 Jun 2023 15:57:13 +0530 [thread overview]
Message-ID: <873533k1oe.fsf@doe.com> (raw)
In-Reply-To: <20230606150636.GL1325469@frogsfrogsfrogs>
"Darrick J. Wong" <djwong@kernel.org> writes:
> On Tue, Jun 06, 2023 at 05:13:52PM +0530, Ritesh Harjani (IBM) wrote:
>> When filesystem blocksize is less than folio size (either with
>> mapping_large_folio_support() or with blocksize < pagesize) and when the
>> folio is uptodate in pagecache, then even a byte write can cause
>> an entire folio to be written to disk during writeback. This happens
>> because we currently don't have a mechanism to track per-block dirty
>> state within struct iomap_folio. We currently only track uptodate state.
>>
>> This patch implements support for tracking per-block dirty state in
>> iomap_folio->state bitmap. This should help improve the filesystem write
>> performance and help reduce write amplification.
>>
>> Performance testing of below fio workload reveals ~16x performance
>> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
>> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>>
>> 1. <test_randwrite.fio>
>> [global]
>> ioengine=psync
>> rw=randwrite
>> overwrite=1
>> pre_read=1
>> direct=0
>> bs=4k
>> size=1G
>> dir=./
>> numjobs=8
>> fdatasync=1
>> runtime=60
>> iodepth=64
>> group_reporting=1
>>
>> [fio-run]
>>
>> 2. Also our internal performance team reported that this patch improves
>> their database workload performance by around ~83% (with XFS on Power)
>>
>> Reported-by: Aravinda Herle <araherle@in.ibm.com>
>> Reported-by: Brian Foster <bfoster@redhat.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>> fs/gfs2/aops.c | 2 +-
>> fs/iomap/buffered-io.c | 126 +++++++++++++++++++++++++++++++++++++++--
>> fs/xfs/xfs_aops.c | 2 +-
>> fs/zonefs/file.c | 2 +-
>> include/linux/iomap.h | 1 +
>> 5 files changed, 126 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
>> index a5f4be6b9213..75efec3c3b71 100644
>> --- a/fs/gfs2/aops.c
>> +++ b/fs/gfs2/aops.c
>> @@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
>> .writepages = gfs2_writepages,
>> .read_folio = gfs2_read_folio,
>> .readahead = gfs2_readahead,
>> - .dirty_folio = filemap_dirty_folio,
>> + .dirty_folio = iomap_dirty_folio,
>> .release_folio = iomap_release_folio,
>> .invalidate_folio = iomap_invalidate_folio,
>> .bmap = gfs2_bmap,
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 2b72ca3ba37a..b99d611f1cd5 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -84,6 +84,70 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>> folio_mark_uptodate(folio);
>> }
>>
>> +static inline bool iomap_iof_is_block_dirty(struct folio *folio,
>> + struct iomap_folio *iof, int block)
>> +{
>> + struct inode *inode = folio->mapping->host;
>> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +
>> + return test_bit(block + blks_per_folio, iof->state);
>> +}
>> +
>> +static void iomap_iof_clear_range_dirty(struct folio *folio,
>> + struct iomap_folio *iof, size_t off, size_t len)
>> +{
>> + struct inode *inode = folio->mapping->host;
>> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> + unsigned int first_blk = off >> inode->i_blkbits;
>> + unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> + unsigned int nr_blks = last_blk - first_blk + 1;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&iof->state_lock, flags);
>> + bitmap_clear(iof->state, first_blk + blks_per_folio, nr_blks);
>> + spin_unlock_irqrestore(&iof->state_lock, flags);
>> +}
>> +
>> +static void iomap_clear_range_dirty(struct folio *folio, size_t off, size_t len)
>> +{
>> + struct iomap_folio *iof = iomap_get_iof(folio);
>> +
>> + if (!iof)
>> + return;
>> + iomap_iof_clear_range_dirty(folio, iof, off, len);
>> +}
>> +
>> +static void iomap_iof_set_range_dirty(struct inode *inode, struct folio *folio,
>> + struct iomap_folio *iof, size_t off, size_t len)
>> +{
>> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> + unsigned int first_blk = off >> inode->i_blkbits;
>> + unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> + unsigned int nr_blks = last_blk - first_blk + 1;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&iof->state_lock, flags);
>> + bitmap_set(iof->state, first_blk + blks_per_folio, nr_blks);
>> + spin_unlock_irqrestore(&iof->state_lock, flags);
>> +}
>> +
>> +/*
>> + * The reason we pass inode explicitly here is to avoid any race with truncate
>> + * when iomap_set_range_dirty() gets called from iomap_dirty_folio().
>> + * Check filemap_dirty_folio() & __folio_mark_dirty() for more details.
>> + * Hence we explicitly pass mapping->host to iomap_set_range_dirty() from
>> + * iomap_dirty_folio().
>> + */
>> +static void iomap_set_range_dirty(struct inode *inode, struct folio *folio,
>> + size_t off, size_t len)
>> +{
>> + struct iomap_folio *iof = iomap_get_iof(folio);
>> +
>> + if (!iof)
>> + return;
>> + iomap_iof_set_range_dirty(inode, folio, iof, off, len);
>> +}
>> +
>> static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
>> struct folio *folio, unsigned int flags)
>> {
>> @@ -99,12 +163,20 @@ static struct iomap_folio *iomap_iof_alloc(struct inode *inode,
>> else
>> gfp = GFP_NOFS | __GFP_NOFAIL;
>>
>> - iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(nr_blocks)),
>> + /*
>> + * iof->state tracks two sets of state flags when the
>> + * filesystem block size is smaller than the folio size.
>> + * The first state tracks per-block uptodate and the
>> + * second tracks per-block dirty state.
>> + */
>> + iof = kzalloc(struct_size(iof, state, BITS_TO_LONGS(2 * nr_blocks)),
>> gfp);
>> if (iof) {
>> spin_lock_init(&iof->state_lock);
>> if (folio_test_uptodate(folio))
>> - bitmap_fill(iof->state, nr_blocks);
>> + bitmap_set(iof->state, 0, nr_blocks);
>> + if (folio_test_dirty(folio))
>> + bitmap_set(iof->state, nr_blocks, nr_blocks);
>> folio_attach_private(folio, iof);
>> }
>> return iof;
>> @@ -529,6 +601,17 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>> }
>> EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>>
>> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
>> +{
>> + struct inode *inode = mapping->host;
>> + size_t len = folio_size(folio);
>> +
>> + iomap_iof_alloc(inode, folio, 0);
>> + iomap_set_range_dirty(inode, folio, 0, len);
>> + return filemap_dirty_folio(mapping, folio);
>> +}
>> +EXPORT_SYMBOL_GPL(iomap_dirty_folio);
>> +
>> static void
>> iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>> {
>> @@ -733,6 +816,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>> if (unlikely(copied < len && !folio_test_uptodate(folio)))
>> return 0;
>> iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
>> + iomap_set_range_dirty(inode, folio, offset_in_folio(folio, pos),
>> + copied);
>> filemap_dirty_folio(inode->i_mapping, folio);
>> return copied;
>> }
>> @@ -902,6 +987,10 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>> int (*punch)(struct inode *inode, loff_t offset, loff_t length))
>> {
>> int ret = 0;
>> + struct iomap_folio *iof;
>> + unsigned int first_blk, last_blk, i;
>> + loff_t last_byte;
>> + u8 blkbits = inode->i_blkbits;
>>
>> if (!folio_test_dirty(folio))
>> return ret;
>> @@ -913,6 +1002,29 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>> if (ret)
>> goto out;
>> }
>> + /*
>> + * When we have per-block dirty tracking, there can be
>> + * blocks within a folio which are marked uptodate
>> + * but not dirty. In that case it is necessary to punch
>> + * out such blocks to avoid leaking any delalloc blocks.
>> + */
>> + iof = iomap_get_iof(folio);
>> + if (!iof)
>> + goto skip_iof_punch;
>> +
>> + last_byte = min_t(loff_t, end_byte - 1,
>> + (folio_next_index(folio) << PAGE_SHIFT) - 1);
>> + first_blk = offset_in_folio(folio, start_byte) >> blkbits;
>> + last_blk = offset_in_folio(folio, last_byte) >> blkbits;
>> + for (i = first_blk; i <= last_blk; i++) {
>> + if (!iomap_iof_is_block_dirty(folio, iof, i)) {
>> + ret = punch(inode, i << blkbits, 1 << blkbits);
>
> Isn't the second argument to punch() supposed to be the offset within
> the file, not the offset within the folio?
>
Thanks for spotting this. Somehow got missed.
Yes, it should be byte position within file. Will fix in the next rev.
punch(inode, folio_pos(folio) + (i << blkbits), 1 << blkbits);
> I /almost/ wonder if this chunk should be its own static inline
> iomap_iof_delalloc_punch function, but ... eh. My enthusiasm for
> slicing and dicing is weak today.
>
umm, I felt having all punch logic in one function would be better.
Hence iomap_write_delalloc_punch().
-ritesh
next prev parent reply other threads:[~2023-06-07 10:27 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-06 11:43 [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
2023-06-06 11:43 ` [PATCHv8 1/5] iomap: Rename iomap_page to iomap_folio and others Ritesh Harjani (IBM)
2023-06-07 6:49 ` Christoph Hellwig
2023-06-07 10:28 ` Ritesh Harjani
2023-06-06 11:43 ` [PATCHv8 2/5] iomap: Renames and refactor iomap_folio state bitmap handling Ritesh Harjani (IBM)
2023-06-06 15:13 ` Darrick J. Wong
2023-06-07 6:50 ` Christoph Hellwig
2023-06-07 10:44 ` Ritesh Harjani
2023-06-07 13:29 ` Christoph Hellwig
2023-06-06 11:43 ` [PATCHv8 3/5] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
2023-06-07 6:52 ` Christoph Hellwig
2023-06-07 10:55 ` Ritesh Harjani
2023-06-06 11:43 ` [PATCHv8 4/5] iomap: Allocate iof in ->write_begin() early Ritesh Harjani (IBM)
2023-06-07 6:53 ` Christoph Hellwig
2023-06-06 11:43 ` [PATCHv8 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
2023-06-06 15:06 ` Darrick J. Wong
2023-06-07 10:27 ` Ritesh Harjani [this message]
2023-06-07 7:04 ` Christoph Hellwig
2023-06-07 15:37 ` Ritesh Harjani
2023-06-08 5:33 ` Christoph Hellwig
2023-06-08 14:45 ` Darrick J. Wong
2023-06-06 12:37 ` [PATCHv8 0/5] iomap: Add support for per-block dirty state to improve write performance Matthew Wilcox
2023-06-06 13:00 ` Ritesh Harjani
2023-06-06 15:16 ` Darrick J. Wong
2023-06-07 6:54 ` Christoph Hellwig
2023-06-07 6:48 ` Christoph Hellwig
2023-06-07 10:22 ` Ritesh Harjani
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=873533k1oe.fsf@doe.com \
--to=ritesh.list@gmail.com \
--cc=agruenba@redhat.com \
--cc=araherle@in.ibm.com \
--cc=bfoster@redhat.com \
--cc=david@fromorbit.com \
--cc=disgoel@linux.ibm.com \
--cc=djwong@kernel.org \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.