From: Jonathan Derrick <jonathan.derrick@linux.dev>
To: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org, Sushma Kalakota <sushma.kalakota@intel.com>
Subject: Re: [PATCH] md: Use optimal I/O size for last bitmap page
Date: Thu, 16 Feb 2023 16:52:17 -0700 [thread overview]
Message-ID: <850848e9-77db-8c93-d921-ba0be3ba7c38@linux.dev> (raw)
In-Reply-To: <CAPhsuW65Qai4Dq-wzrcMbCR-s7J9astse1K8U8TwoAuxD4FyzQ@mail.gmail.com>
On 2/10/2023 10:32 AM, Song Liu wrote:
> Hi Jonathan,
>
> On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick
> <jonathan.derrick@linux.dev> wrote:
>>
>> Hi Song,
>>
>> Any thoughts on this?
>
> I am really sorry that I missed this patch.
>
>>
>> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
>>> From: Jon Derrick <jonathan.derrick@linux.dev>
>>>
>>> If the bitmap space has enough room, size the I/O for the last bitmap
>>> page write to the optimal I/O size for the storage device. The expanded
>>> write is checked to ensure it won't overrun the data or metadata.
>>>
>>> This change helps increase performance by preventing unnecessary
>>> device-side read-mod-writes due to non-atomic write unit sizes.
>>>
>>> Example biosnoop log. Device LBA size 512, optimal I/O size 4k:
>>> Before:
>>> Time      Process     PID   Device     LBA          Size  Lat
>>> 0.843734  md0_raid10  5267  nvme0n1 W  24           3584  1.17
>>> 0.843933  md0_raid10  5267  nvme1n1 W  24           3584  1.36
>>> 0.843968  md0_raid10  5267  nvme1n1 W  14207939968  4096  0.01
>>> 0.843979  md0_raid10  5267  nvme0n1 W  14207939968  4096  0.02
>>>
>>> After:
>>> Time       Process     PID   Device     LBA          Size  Lat
>>> 18.374244  md0_raid10  6559  nvme0n1 W  24           4096  0.01
>>> 18.374253  md0_raid10  6559  nvme1n1 W  24           4096  0.01
>>> 18.374300  md0_raid10  6559  nvme0n1 W  11020272296  4096  0.01
>>> 18.374306  md0_raid10  6559  nvme1n1 W  11020272296  4096  0.02
>
> Do we see significant improvements from io benchmarks?
Yes. With lbaf=512 and optimal I/O size=4k:
Without patch:
write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
With patch:
write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets
It's going to be different for different drives, but this was a drive where a
drive-side read-modify-write has a huge penalty.
>
> IIUC, fewer future HDDs will use a 512B LBA size. We probably don't need
> such optimizations in the future.
Maybe. But many drives still ship with 512B formatting as the default.
>
> Thanks,
> Song
>
>>>
>>> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
>>> ---
>>> drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>>> 1 file changed, 18 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
>>> index e7cc6ba1b657..569297ea9b99 100644
>>> --- a/drivers/md/md-bitmap.c
>>> +++ b/drivers/md/md-bitmap.c
>>> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>> rdev = NULL;
>>> while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>>> int size = PAGE_SIZE;
>>> + int optimal_size = PAGE_SIZE;
>>> loff_t offset = mddev->bitmap_info.offset;
>>>
>>> bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
>>> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>> int last_page_size = store->bytes & (PAGE_SIZE-1);
>>> if (last_page_size == 0)
>>> last_page_size = PAGE_SIZE;
>>> - size = roundup(last_page_size,
>>> - bdev_logical_block_size(bdev));
>>> + size = roundup(last_page_size, bdev_logical_block_size(bdev));
>>> + if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
>>> + optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
>>> + else
>>> + optimal_size = size;
>>> }
>>> +
>>> +
>>> /* Just make sure we aren't corrupting data or
>>> * metadata
>>> */
>>> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>> goto bad_alignment;
>>> } else if (offset < 0) {
>>> /* DATA BITMAP METADATA */
>>> - if (offset
>>> - + (long)(page->index * (PAGE_SIZE/512))
>>> - + size/512 > 0)
>>> + loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
>>> + if (size != optimal_size &&
>>> + off + optimal_size/512 <= 0)
>>> + size = optimal_size;
>>> + else if (off + size/512 > 0)
>>> /* bitmap runs in to metadata */
>>> goto bad_alignment;
>>> if (rdev->data_offset + mddev->dev_sectors
>>> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>> goto bad_alignment;
>>> } else if (rdev->sb_start < rdev->data_offset) {
>>> /* METADATA BITMAP DATA */
>>> - if (rdev->sb_start
>>> - + offset
>>> - + page->index*(PAGE_SIZE/512) + size/512
>>> - > rdev->data_offset)
>>> + loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
>>> + if (size != optimal_size &&
>>> + off + optimal_size/512 <= rdev->data_offset)
>>> + size = optimal_size;
>>> + else if (off + size/512 > rdev->data_offset)
>>> /* bitmap runs in to data */
>>> goto bad_alignment;
>>> } else {
Thread overview: 7+ messages
2023-01-18 0:53 [PATCH] md: Use optimal I/O size for last bitmap page Jonathan Derrick
2023-02-09 20:37 ` Jonathan Derrick
2023-02-10 17:32 ` Song Liu
2023-02-16 23:52 ` Jonathan Derrick [this message]
2023-02-17 13:21 ` Paul Menzel
2023-02-17 18:22 ` Jonathan Derrick
2023-02-21 5:27 ` Song Liu