* [PATCH] md: Use optimal I/O size for last bitmap page
@ 2023-01-18 0:53 Jonathan Derrick
From: Jonathan Derrick @ 2023-01-18 0:53 UTC
To: Song Liu, linux-raid; +Cc: Sushma Kalakota, Jon Derrick
From: Jon Derrick <jonathan.derrick@linux.dev>
If the bitmap space has enough room, size the I/O for the last bitmap
page write to the optimal I/O size for the storage device. The expanded
write is checked that it won't overrun the data or metadata.
This change helps increase performance by preventing unnecessary
device-side read-mod-writes due to non-atomic write unit sizes.
Ex biosnoop log. Device lba size 512, optimal size 4k:
Before:
Time Process PID Device LBA Size Lat
0.843734 md0_raid10 5267 nvme0n1 W 24 3584 1.17
0.843933 md0_raid10 5267 nvme1n1 W 24 3584 1.36
0.843968 md0_raid10 5267 nvme1n1 W 14207939968 4096 0.01
0.843979 md0_raid10 5267 nvme0n1 W 14207939968 4096 0.02
After:
Time Process PID Device LBA Size Lat
18.374244 md0_raid10 6559 nvme0n1 W 24 4096 0.01
18.374253 md0_raid10 6559 nvme1n1 W 24 4096 0.01
18.374300 md0_raid10 6559 nvme0n1 W 11020272296 4096 0.01
18.374306 md0_raid10 6559 nvme1n1 W 11020272296 4096 0.02
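The before/after sizes follow directly from the rounding arithmetic. As a
sanity check, a minimal user-space sketch (not the kernel code itself),
assuming the values from the traces above: 512-byte logical blocks, 4 KiB
optimal I/O, and a 3584-byte last bitmap page:

#include <stdio.h>

#define ROUNDUP(x, y) ((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	int last_page_size = 3584;  /* store->bytes & (PAGE_SIZE - 1) */
	int lbs = 512;              /* bdev_logical_block_size() */
	int io_opt = 4096;          /* bdev_io_opt() */

	int size = ROUNDUP(last_page_size, lbs);
	int optimal_size = io_opt > lbs ? ROUNDUP(last_page_size, io_opt) : size;

	/* Prints "size=3584 optimal_size=4096": the 3584-byte write incurs a
	 * device-side read-mod-write, the 4096-byte write does not. */
	printf("size=%d optimal_size=%d\n", size, optimal_size);
	return 0;
}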
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
---
drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index e7cc6ba1b657..569297ea9b99 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
rdev = NULL;
while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
int size = PAGE_SIZE;
+ int optimal_size = PAGE_SIZE;
loff_t offset = mddev->bitmap_info.offset;
bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
@@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
int last_page_size = store->bytes & (PAGE_SIZE-1);
if (last_page_size == 0)
last_page_size = PAGE_SIZE;
- size = roundup(last_page_size,
- bdev_logical_block_size(bdev));
+ size = roundup(last_page_size, bdev_logical_block_size(bdev));
+ if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
+ optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
+ else
+ optimal_size = size;
}
+
+
/* Just make sure we aren't corrupting data or
* metadata
*/
@@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
goto bad_alignment;
} else if (offset < 0) {
/* DATA BITMAP METADATA */
- if (offset
- + (long)(page->index * (PAGE_SIZE/512))
- + size/512 > 0)
+ loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
+ if (size != optimal_size &&
+ off + optimal_size/512 <= 0)
+ size = optimal_size;
+ else if (off + size/512 > 0)
/* bitmap runs in to metadata */
goto bad_alignment;
if (rdev->data_offset + mddev->dev_sectors
@@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
goto bad_alignment;
} else if (rdev->sb_start < rdev->data_offset) {
/* METADATA BITMAP DATA */
- if (rdev->sb_start
- + offset
- + page->index*(PAGE_SIZE/512) + size/512
- > rdev->data_offset)
+ loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
+ if (size != optimal_size &&
+ off + optimal_size/512 <= rdev->data_offset)
+ size = optimal_size;
+ else if (off + size/512 > rdev->data_offset)
/* bitmap runs in to data */
goto bad_alignment;
} else {
--
2.27.0
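For readers parsing the DATA BITMAP METADATA hunk, here is a compact
user-space restatement of the expanded-write bounds check. It is a sketch
only; choose_write_size is an illustrative name, not a function in
md-bitmap.c. 'off' is the start of the last bitmap page in 512-byte sectors
relative to the superblock, and is negative because the bitmap sits before
the metadata.

#include <stdio.h>

static long long choose_write_size(long long off, int size, int optimal_size)
{
	/* Expand to the optimal size only when the larger write still ends
	 * at or before the metadata (sector 0). */
	if (size != optimal_size && off + optimal_size / 512 <= 0)
		return optimal_size;
	if (off + size / 512 > 0)
		return -1;	/* bitmap would run into metadata: bad_alignment */
	return size;
}

int main(void)
{
	/* Page starts 8 sectors (4 KiB) below the superblock: room to expand. */
	printf("%lld\n", choose_write_size(-8, 3584, 4096));	/* 4096 */
	/* Only 7 sectors (3584 bytes) of room: keep the smaller write. */
	printf("%lld\n", choose_write_size(-7, 3584, 4096));	/* 3584 */
	return 0;
}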
* Re: [PATCH] md: Use optimal I/O size for last bitmap page
From: Jonathan Derrick @ 2023-02-09 20:37 UTC
To: Song Liu, linux-raid; +Cc: Sushma Kalakota
Hi Song,
Any thoughts on this?
On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
> From: Jon Derrick <jonathan.derrick@linux.dev>
>
> If the bitmap space has enough room, size the I/O for the last bitmap
> page write to the optimal I/O size for the storage device. The expanded
> write is checked that it won't overrun the data or metadata.
>
> This change helps increase performance by preventing unnecessary
> device-side read-mod-writes due to non-atomic write unit sizes.
>
> Ex biosnoop log. Device lba size 512, optimal size 4k:
> Before:
> Time Process PID Device LBA Size Lat
> 0.843734 md0_raid10 5267 nvme0n1 W 24 3584 1.17
> 0.843933 md0_raid10 5267 nvme1n1 W 24 3584 1.36
> 0.843968 md0_raid10 5267 nvme1n1 W 14207939968 4096 0.01
> 0.843979 md0_raid10 5267 nvme0n1 W 14207939968 4096 0.02
>
> After:
> Time Process PID Device LBA Size Lat
> 18.374244 md0_raid10 6559 nvme0n1 W 24 4096 0.01
> 18.374253 md0_raid10 6559 nvme1n1 W 24 4096 0.01
> 18.374300 md0_raid10 6559 nvme0n1 W 11020272296 4096 0.01
> 18.374306 md0_raid10 6559 nvme1n1 W 11020272296 4096 0.02
>
> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
> ---
> drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
> 1 file changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index e7cc6ba1b657..569297ea9b99 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> rdev = NULL;
> while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
> int size = PAGE_SIZE;
> + int optimal_size = PAGE_SIZE;
> loff_t offset = mddev->bitmap_info.offset;
>
> bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> int last_page_size = store->bytes & (PAGE_SIZE-1);
> if (last_page_size == 0)
> last_page_size = PAGE_SIZE;
> - size = roundup(last_page_size,
> - bdev_logical_block_size(bdev));
> + size = roundup(last_page_size, bdev_logical_block_size(bdev));
> + if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
> + optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
> + else
> + optimal_size = size;
> }
> +
> +
> /* Just make sure we aren't corrupting data or
> * metadata
> */
> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> goto bad_alignment;
> } else if (offset < 0) {
> /* DATA BITMAP METADATA */
> - if (offset
> - + (long)(page->index * (PAGE_SIZE/512))
> - + size/512 > 0)
> + loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
> + if (size != optimal_size &&
> + off + optimal_size/512 <= 0)
> + size = optimal_size;
> + else if (off + size/512 > 0)
> /* bitmap runs in to metadata */
> goto bad_alignment;
> if (rdev->data_offset + mddev->dev_sectors
> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> goto bad_alignment;
> } else if (rdev->sb_start < rdev->data_offset) {
> /* METADATA BITMAP DATA */
> - if (rdev->sb_start
> - + offset
> - + page->index*(PAGE_SIZE/512) + size/512
> - > rdev->data_offset)
> + loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
> + if (size != optimal_size &&
> + off + optimal_size/512 <= rdev->data_offset)
> + size = optimal_size;
> + else if (off + size/512 > rdev->data_offset)
> /* bitmap runs in to data */
> goto bad_alignment;
> } else {
* Re: [PATCH] md: Use optimal I/O size for last bitmap page
From: Song Liu @ 2023-02-10 17:32 UTC
To: Jonathan Derrick; +Cc: linux-raid, Sushma Kalakota
Hi Jonathan,
On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick
<jonathan.derrick@linux.dev> wrote:
>
> Hi Song,
>
> Any thoughts on this?
I am really sorry that I missed this patch.
>
> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
> > From: Jon Derrick <jonathan.derrick@linux.dev>
> >
> > If the bitmap space has enough room, size the I/O for the last bitmap
> > page write to the optimal I/O size for the storage device. The expanded
> > write is checked that it won't overrun the data or metadata.
> >
> > This change helps increase performance by preventing unnecessary
> > device-side read-mod-writes due to non-atomic write unit sizes.
> >
> > Ex biosnoop log. Device lba size 512, optimal size 4k:
> > Before:
> > Time Process PID Device LBA Size Lat
> > 0.843734 md0_raid10 5267 nvme0n1 W 24 3584 1.17
> > 0.843933 md0_raid10 5267 nvme1n1 W 24 3584 1.36
> > 0.843968 md0_raid10 5267 nvme1n1 W 14207939968 4096 0.01
> > 0.843979 md0_raid10 5267 nvme0n1 W 14207939968 4096 0.02
> >
> > After:
> > Time Process PID Device LBA Size Lat
> > 18.374244 md0_raid10 6559 nvme0n1 W 24 4096 0.01
> > 18.374253 md0_raid10 6559 nvme1n1 W 24 4096 0.01
> > 18.374300 md0_raid10 6559 nvme0n1 W 11020272296 4096 0.01
> > 18.374306 md0_raid10 6559 nvme1n1 W 11020272296 4096 0.02
Do we see significant improvements from io benchmarks?
IIUC, fewer future HDDs will use 512B LBA size. We probably don't need
such optimizations in the future.
Thanks,
Song
> >
> > Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
> > ---
> > drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
> > 1 file changed, 18 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> > index e7cc6ba1b657..569297ea9b99 100644
> > --- a/drivers/md/md-bitmap.c
> > +++ b/drivers/md/md-bitmap.c
> > @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> > rdev = NULL;
> > while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
> > int size = PAGE_SIZE;
> > + int optimal_size = PAGE_SIZE;
> > loff_t offset = mddev->bitmap_info.offset;
> >
> > bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
> > @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> > int last_page_size = store->bytes & (PAGE_SIZE-1);
> > if (last_page_size == 0)
> > last_page_size = PAGE_SIZE;
> > - size = roundup(last_page_size,
> > - bdev_logical_block_size(bdev));
> > + size = roundup(last_page_size, bdev_logical_block_size(bdev));
> > + if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
> > + optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
> > + else
> > + optimal_size = size;
> > }
> > +
> > +
> > /* Just make sure we aren't corrupting data or
> > * metadata
> > */
> > @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> > goto bad_alignment;
> > } else if (offset < 0) {
> > /* DATA BITMAP METADATA */
> > - if (offset
> > - + (long)(page->index * (PAGE_SIZE/512))
> > - + size/512 > 0)
> > + loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
> > + if (size != optimal_size &&
> > + off + optimal_size/512 <= 0)
> > + size = optimal_size;
> > + else if (off + size/512 > 0)
> > /* bitmap runs in to metadata */
> > goto bad_alignment;
> > if (rdev->data_offset + mddev->dev_sectors
> > @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> > goto bad_alignment;
> > } else if (rdev->sb_start < rdev->data_offset) {
> > /* METADATA BITMAP DATA */
> > - if (rdev->sb_start
> > - + offset
> > - + page->index*(PAGE_SIZE/512) + size/512
> > - > rdev->data_offset)
> > + loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
> > + if (size != optimal_size &&
> > + off + optimal_size/512 <= rdev->data_offset)
> > + size = optimal_size;
> > + else if (off + size/512 > rdev->data_offset)
> > /* bitmap runs in to data */
> > goto bad_alignment;
> > } else {
* Re: [PATCH] md: Use optimal I/O size for last bitmap page
From: Jonathan Derrick @ 2023-02-16 23:52 UTC
To: Song Liu; +Cc: linux-raid, Sushma Kalakota
On 2/10/2023 10:32 AM, Song Liu wrote:
> Hi Jonathan,
>
> On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick
> <jonathan.derrick@linux.dev> wrote:
>>
>> Hi Song,
>>
>> Any thoughts on this?
>
> I am really sorry that I missed this patch.
>
>>
>> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
>>> From: Jon Derrick <jonathan.derrick@linux.dev>
>>>
>>> If the bitmap space has enough room, size the I/O for the last bitmap
>>> page write to the optimal I/O size for the storage device. The expanded
>>> write is checked that it won't overrun the data or metadata.
>>>
>>> This change helps increase performance by preventing unnecessary
>>> device-side read-mod-writes due to non-atomic write unit sizes.
>>>
>>> Ex biosnoop log. Device lba size 512, optimal size 4k:
>>> Before:
>>> Time Process PID Device LBA Size Lat
>>> 0.843734 md0_raid10 5267 nvme0n1 W 24 3584 1.17
>>> 0.843933 md0_raid10 5267 nvme1n1 W 24 3584 1.36
>>> 0.843968 md0_raid10 5267 nvme1n1 W 14207939968 4096 0.01
>>> 0.843979 md0_raid10 5267 nvme0n1 W 14207939968 4096 0.02
>>>
>>> After:
>>> Time Process PID Device LBA Size Lat
>>> 18.374244 md0_raid10 6559 nvme0n1 W 24 4096 0.01
>>> 18.374253 md0_raid10 6559 nvme1n1 W 24 4096 0.01
>>> 18.374300 md0_raid10 6559 nvme0n1 W 11020272296 4096 0.01
>>> 18.374306 md0_raid10 6559 nvme1n1 W 11020272296 4096 0.02
>
> Do we see significant improvements from io benchmarks?
Yes. With lbaf=512, optimal i/o size=4k:
Without patch:
write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
With patch:
write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets
It's going to be different for different drives, but this was a drive where a
drive-side read-mod-write has a huge penalty.
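For intuition, a rough model of where that penalty comes from, assuming the
drive maps its 512-byte LBAs onto 4 KiB internal pages (an assumption about
the device, not something md knows):

#include <stdio.h>

int main(void)
{
	unsigned long long lba = 24;		/* bitmap write seen in the traces */
	unsigned int lbs = 512, page = 4096;
	unsigned int sizes[] = { 3584, 4096 };	/* without / with the patch */

	for (int i = 0; i < 2; i++) {
		unsigned long long start = lba * lbs;
		unsigned long long end = start + sizes[i];
		int aligned = (start % page == 0) && (end % page == 0);

		printf("%u-byte write at LBA %llu: %s\n", sizes[i], lba,
		       aligned ? "covers whole internal pages, no RMW"
			       : "partial internal page, drive must read-mod-write");
	}
	return 0;
}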
>
> IIUC, fewer future HDDs will use 512B LBA size. We probably don't need
> such optimizations in the future.
Maybe. But many drives still ship with 512B formatting as default.
>
> Thanks,
> Song
>
>>>
>>> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
>>> ---
>>> drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>>> 1 file changed, 18 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
>>> index e7cc6ba1b657..569297ea9b99 100644
>>> --- a/drivers/md/md-bitmap.c
>>> +++ b/drivers/md/md-bitmap.c
>>> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>> rdev = NULL;
>>> while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>>> int size = PAGE_SIZE;
>>> + int optimal_size = PAGE_SIZE;
>>> loff_t offset = mddev->bitmap_info.offset;
>>>
>>> bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
>>> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>> int last_page_size = store->bytes & (PAGE_SIZE-1);
>>> if (last_page_size == 0)
>>> last_page_size = PAGE_SIZE;
>>> - size = roundup(last_page_size,
>>> - bdev_logical_block_size(bdev));
>>> + size = roundup(last_page_size, bdev_logical_block_size(bdev));
>>> + if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
>>> + optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
>>> + else
>>> + optimal_size = size;
>>> }
>>> +
>>> +
>>> /* Just make sure we aren't corrupting data or
>>> * metadata
>>> */
>>> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>> goto bad_alignment;
>>> } else if (offset < 0) {
>>> /* DATA BITMAP METADATA */
>>> - if (offset
>>> - + (long)(page->index * (PAGE_SIZE/512))
>>> - + size/512 > 0)
>>> + loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
>>> + if (size != optimal_size &&
>>> + off + optimal_size/512 <= 0)
>>> + size = optimal_size;
>>> + else if (off + size/512 > 0)
>>> /* bitmap runs in to metadata */
>>> goto bad_alignment;
>>> if (rdev->data_offset + mddev->dev_sectors
>>> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>> goto bad_alignment;
>>> } else if (rdev->sb_start < rdev->data_offset) {
>>> /* METADATA BITMAP DATA */
>>> - if (rdev->sb_start
>>> - + offset
>>> - + page->index*(PAGE_SIZE/512) + size/512
>>> - > rdev->data_offset)
>>> + loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
>>> + if (size != optimal_size &&
>>> + off + optimal_size/512 <= rdev->data_offset)
>>> + size = optimal_size;
>>> + else if (off + size/512 > rdev->data_offset)
>>> /* bitmap runs in to data */
>>> goto bad_alignment;
>>> } else {
* Re: [PATCH] md: Use optimal I/O size for last bitmap page
From: Paul Menzel @ 2023-02-17 13:21 UTC
To: Jonathan Derrick; +Cc: Song Liu, linux-raid, Sushma Kalakota
Dear Jonathan,
Thank you for your patch.
On 17.02.23 at 00:52, Jonathan Derrick wrote:
> On 2/10/2023 10:32 AM, Song Liu wrote:
>> On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick wrote:
>>> Any thoughts on this?
>>
>> I am really sorry that I missed this patch.
>>
>>> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
>>>> From: Jon Derrick <jonathan.derrick@linux.dev>
>>>>
>>>> If the bitmap space has enough room, size the I/O for the last bitmap
>>>> page write to the optimal I/O size for the storage device. The expanded
>>>> write is checked that it won't overrun the data or metadata.
>>>>
>>>> This change helps increase performance by preventing unnecessary
>>>> device-side read-mod-writes due to non-atomic write unit sizes.
>>>>
>>>> Ex biosnoop log. Device lba size 512, optimal size 4k:
>>>> Before:
>>>> Time Process PID Device LBA Size Lat
>>>> 0.843734 md0_raid10 5267 nvme0n1 W 24 3584 1.17
>>>> 0.843933 md0_raid10 5267 nvme1n1 W 24 3584 1.36
>>>> 0.843968 md0_raid10 5267 nvme1n1 W 14207939968 4096 0.01
>>>> 0.843979 md0_raid10 5267 nvme0n1 W 14207939968 4096 0.02
>>>>
>>>> After:
>>>> Time Process PID Device LBA Size Lat
>>>> 18.374244 md0_raid10 6559 nvme0n1 W 24 4096 0.01
>>>> 18.374253 md0_raid10 6559 nvme1n1 W 24 4096 0.01
>>>> 18.374300 md0_raid10 6559 nvme0n1 W 11020272296 4096 0.01
>>>> 18.374306 md0_raid10 6559 nvme1n1 W 11020272296 4096 0.02
>>
>> Do we see significant improvements from io benchmarks?
>
> Yes. With lbaf=512, optimal i/o size=4k:
>
> Without patch:
> write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
> With patch:
> write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets
>
> It's going to be different for different drives, but this was a drive where a
> drive-side read-mod-write has a huge penalty.
Nice. Can you share the drive model used for benchmarking? Also, maybe add
that to the commit message?
>> IIUC, fewer future HDDs will use 512B LBA size. We probably don't need
>> such optimizations in the future.
>
> Maybe. But many drives still ship with 512B formatting as default.
Is that the problem on “desktop” HDDs, where `PHY-SEC` and `LOG-SEC` are
different?
$ lsblk -o NAME,MODEL,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,ROTA,SCHED,RQ-SIZE,RA,WSAME -d /dev/sda
NAME MODEL              ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
sda  TOSHIBA DT01ACA050         0   4096      0    4096     512    1 bfq        64 128    0B
(Though these are not used in a RAID here.)
Kind regards,
Paul
[1]:
https://toshiba.semicon-storage.com/content/dam/toshiba-ss-v3/master/en/storage/product/internal-specialty/cHDD-DT01ACAxxx-Product-Overview.pdf
>>>> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
>>>> ---
>>>> drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>>>> 1 file changed, 18 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
>>>> index e7cc6ba1b657..569297ea9b99 100644
>>>> --- a/drivers/md/md-bitmap.c
>>>> +++ b/drivers/md/md-bitmap.c
>>>> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>> rdev = NULL;
>>>> while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>>>> int size = PAGE_SIZE;
>>>> + int optimal_size = PAGE_SIZE;
>>>> loff_t offset = mddev->bitmap_info.offset;
>>>>
>>>> bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
>>>> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>> int last_page_size = store->bytes & (PAGE_SIZE-1);
>>>> if (last_page_size == 0)
>>>> last_page_size = PAGE_SIZE;
>>>> - size = roundup(last_page_size,
>>>> - bdev_logical_block_size(bdev));
>>>> + size = roundup(last_page_size, bdev_logical_block_size(bdev));
>>>> + if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
>>>> + optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
>>>> + else
>>>> + optimal_size = size;
>>>> }
>>>> +
>>>> +
>>>> /* Just make sure we aren't corrupting data or
>>>> * metadata
>>>> */
>>>> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>> goto bad_alignment;
>>>> } else if (offset < 0) {
>>>> /* DATA BITMAP METADATA */
>>>> - if (offset
>>>> - + (long)(page->index * (PAGE_SIZE/512))
>>>> - + size/512 > 0)
>>>> + loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
>>>> + if (size != optimal_size &&
>>>> + off + optimal_size/512 <= 0)
>>>> + size = optimal_size;
>>>> + else if (off + size/512 > 0)
>>>> /* bitmap runs in to metadata */
>>>> goto bad_alignment;
>>>> if (rdev->data_offset + mddev->dev_sectors
>>>> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>> goto bad_alignment;
>>>> } else if (rdev->sb_start < rdev->data_offset) {
>>>> /* METADATA BITMAP DATA */
>>>> - if (rdev->sb_start
>>>> - + offset
>>>> - + page->index*(PAGE_SIZE/512) + size/512
>>>> - > rdev->data_offset)
>>>> + loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
>>>> + if (size != optimal_size &&
>>>> + off + optimal_size/512 <= rdev->data_offset)
>>>> + size = optimal_size;
>>>> + else if (off + size/512 > rdev->data_offset)
>>>> /* bitmap runs in to data */
>>>> goto bad_alignment;
>>>> } else {
* Re: [PATCH] md: Use optimal I/O size for last bitmap page
From: Jonathan Derrick @ 2023-02-17 18:22 UTC
To: Paul Menzel; +Cc: Song Liu, linux-raid, Sushma Kalakota
On 2/17/2023 6:21 AM, Paul Menzel wrote:
> Dear Jonathan,
>
>
> Thank you for your patch.
>
> On 17.02.23 at 00:52, Jonathan Derrick wrote:
>
>> On 2/10/2023 10:32 AM, Song Liu wrote:
>
>>> On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick wrote:
>
>>>> Any thoughts on this?
>>>
>>> I am really sorry that I missed this patch.
>>>
>>>> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
>>>>> From: Jon Derrick <jonathan.derrick@linux.dev>
>>>>>
>>>>> If the bitmap space has enough room, size the I/O for the last bitmap
>>>>> page write to the optimal I/O size for the storage device. The expanded
>>>>> write is checked that it won't overrun the data or metadata.
>>>>>
>>>>> This change helps increase performance by preventing unnecessary
>>>>> device-side read-mod-writes due to non-atomic write unit sizes.
>>>>>
>>>>> Ex biosnoop log. Device lba size 512, optimal size 4k:
>>>>> Before:
>>>>> Time Process PID Device LBA Size Lat
>>>>> 0.843734 md0_raid10 5267 nvme0n1 W 24 3584 1.17
>>>>> 0.843933 md0_raid10 5267 nvme1n1 W 24 3584 1.36
>>>>> 0.843968 md0_raid10 5267 nvme1n1 W 14207939968 4096 0.01
>>>>> 0.843979 md0_raid10 5267 nvme0n1 W 14207939968 4096 0.02
>>>>>
>>>>> After:
>>>>> Time Process PID Device LBA Size Lat
>>>>> 18.374244 md0_raid10 6559 nvme0n1 W 24 4096 0.01
>>>>> 18.374253 md0_raid10 6559 nvme1n1 W 24 4096 0.01
>>>>> 18.374300 md0_raid10 6559 nvme0n1 W 11020272296 4096 0.01
>>>>> 18.374306 md0_raid10 6559 nvme1n1 W 11020272296 4096 0.02
>>>
>>> Do we see significant improvements from io benchmarks?
>>
>> Yes. With lbaf=512, optimal i/o size=4k:
>>
>> Without patch:
>> write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
>> With patch:
>> write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets
>>
>> It's going to be different for different drives, but this was a drive where a
>> drive-side read-mod-write has a huge penalty.
>
> Nice. Can you share the drive model used for benchmarking? Also, maybe add that to the commit message?
>
It's an Intel/Solidigm P5520, but other drives show similar results.
The mdraid configuration had 2 bitmap pages, which meant that the 'last' bitmap page
was exercised by the random workload roughly 50% of the time, leading to the
high-latency result.
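To see why roughly half of the random updates land on the last page, a
back-of-the-envelope sketch using the 57228-bit bitmap of the second system
below, and assuming the 256-byte bitmap superblock shares the first 4 KiB
page with the start of the bitmap (an assumption about the on-disk layout):

#include <stdio.h>

int main(void)
{
	unsigned int total_bits = 57228;		  /* bitmap-size reported below */
	unsigned int first_page_bits = (4096 - 256) * 8; /* 30720 bits */
	unsigned int last_page_bits = total_bits - first_page_bits;

	/* Prints roughly 46%, i.e. about half of uniformly random chunk
	 * updates dirty the last bitmap page. */
	printf("last page: %u of %u bits (%.0f%%)\n", last_page_bits,
	       total_bits, 100.0 * last_page_bits / total_bits);
	return 0;
}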
Here's a test on another system.
Raid10, Chunk-size 64M, bitmap-size 57228 bits
Without patch:
write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets
With patch:
write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets
Biosnoop:
Without patch:
Time Process PID Device LBA Size Lat
1.410377 md0_raid10 6900 nvme0n1 W 16 4096 0.02
1.410387 md0_raid10 6900 nvme2n1 W 16 4096 0.02
1.410374 md0_raid10 6900 nvme3n1 W 16 4096 0.01
1.410381 md0_raid10 6900 nvme1n1 W 16 4096 0.02
1.410411 md0_raid10 6900 nvme1n1 W 115346512 4096 0.01
1.410418 md0_raid10 6900 nvme0n1 W 115346512 4096 0.02
1.410915 md0_raid10 6900 nvme2n1 W 24 3584 0.43
1.410935 md0_raid10 6900 nvme3n1 W 24 3584 0.45
1.411124 md0_raid10 6900 nvme1n1 W 24 3584 0.64
1.411147 md0_raid10 6900 nvme0n1 W 24 3584 0.66
1.411176 md0_raid10 6900 nvme3n1 W 2019022184 4096 0.01
1.411189 md0_raid10 6900 nvme2n1 W 2019022184 4096 0.02
With patch:
Time Process PID Device LBA Size Lat
5.747193 md0_raid10 727 nvme0n1 W 16 4096 0.01
5.747192 md0_raid10 727 nvme1n1 W 16 4096 0.02
5.747195 md0_raid10 727 nvme3n1 W 16 4096 0.01
5.747202 md0_raid10 727 nvme2n1 W 16 4096 0.02
5.747229 md0_raid10 727 nvme3n1 W 1196223704 4096 0.02
5.747224 md0_raid10 727 nvme0n1 W 1196223704 4096 0.01
5.747279 md0_raid10 727 nvme0n1 W 24 4096 0.01
5.747279 md0_raid10 727 nvme1n1 W 24 4096 0.02
5.747284 md0_raid10 727 nvme3n1 W 24 4096 0.02
5.747291 md0_raid10 727 nvme2n1 W 24 4096 0.02
5.747314 md0_raid10 727 nvme2n1 W 2234636712 4096 0.01
5.747317 md0_raid10 727 nvme1n1 W 2234636712 4096 0.02
>>> IIUC, fewer future HDDs will use 512B LBA size. We probably don't need
>>> such optimizations in the future.
>>
>> Maybe. But many drives still ship with 512B formatting as default.
>
> Is that the problem on “desktop” HDDs, where `PHY-SEC` and `LOG-SEC` are different?
>
> $ lsblk -o NAME,MODEL,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,ROTA,SCHED,RQ-SIZE,RA,WSAME -d /dev/sda
> NAME MODEL ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
> sda TOSHIBA DT01ACA050 0 4096 0 4096 512 1 bfq 64 128 0B
>
> (Though these are not used in a RAID here.)
In the test scenario, both PHY-SEC and LOG-SEC are 512; however, MIN-IO and OPT-IO are 4k.
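For reference, a quick way to dump those values as the kernel exposes them.
This is a user-space sketch; nvme0n1 stands in for whichever member device
is being checked:

#include <stdio.h>

static void show(const char *attr)
{
	char path[128], buf[32];
	FILE *f;

	/* Standard block-layer sysfs attributes under /sys/block/<dev>/queue/ */
	snprintf(path, sizeof(path), "/sys/block/nvme0n1/queue/%s", attr);
	f = fopen(path, "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("%-20s %s", attr, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	show("logical_block_size");	/* LOG-SEC, bdev_logical_block_size() */
	show("physical_block_size");	/* PHY-SEC */
	show("minimum_io_size");	/* MIN-IO */
	show("optimal_io_size");	/* OPT-IO, bdev_io_opt() */
	return 0;
}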
>
>
> Kind regards,
>
> Paul
>
>
> [1]: https://toshiba.semicon-storage.com/content/dam/toshiba-ss-v3/master/en/storage/product/internal-specialty/cHDD-DT01ACAxxx-Product-Overview.pdf
>
>
>>>>> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
>>>>> ---
>>>>> drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>>>>> 1 file changed, 18 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
>>>>> index e7cc6ba1b657..569297ea9b99 100644
>>>>> --- a/drivers/md/md-bitmap.c
>>>>> +++ b/drivers/md/md-bitmap.c
>>>>> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>> rdev = NULL;
>>>>> while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>>>>> int size = PAGE_SIZE;
>>>>> + int optimal_size = PAGE_SIZE;
>>>>> loff_t offset = mddev->bitmap_info.offset;
>>>>>
>>>>> bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
>>>>> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>> int last_page_size = store->bytes & (PAGE_SIZE-1);
>>>>> if (last_page_size == 0)
>>>>> last_page_size = PAGE_SIZE;
>>>>> - size = roundup(last_page_size,
>>>>> - bdev_logical_block_size(bdev));
>>>>> + size = roundup(last_page_size, bdev_logical_block_size(bdev));
>>>>> + if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
>>>>> + optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
>>>>> + else
>>>>> + optimal_size = size;
>>>>> }
>>>>> +
>>>>> +
>>>>> /* Just make sure we aren't corrupting data or
>>>>> * metadata
>>>>> */
>>>>> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>> goto bad_alignment;
>>>>> } else if (offset < 0) {
>>>>> /* DATA BITMAP METADATA */
>>>>> - if (offset
>>>>> - + (long)(page->index * (PAGE_SIZE/512))
>>>>> - + size/512 > 0)
>>>>> + loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
>>>>> + if (size != optimal_size &&
>>>>> + off + optimal_size/512 <= 0)
>>>>> + size = optimal_size;
>>>>> + else if (off + size/512 > 0)
>>>>> /* bitmap runs in to metadata */
>>>>> goto bad_alignment;
>>>>> if (rdev->data_offset + mddev->dev_sectors
>>>>> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>> goto bad_alignment;
>>>>> } else if (rdev->sb_start < rdev->data_offset) {
>>>>> /* METADATA BITMAP DATA */
>>>>> - if (rdev->sb_start
>>>>> - + offset
>>>>> - + page->index*(PAGE_SIZE/512) + size/512
>>>>> - > rdev->data_offset)
>>>>> + loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
>>>>> + if (size != optimal_size &&
>>>>> + off + optimal_size/512 <= rdev->data_offset)
>>>>> + size = optimal_size;
>>>>> + else if (off + size/512 > rdev->data_offset)
>>>>> /* bitmap runs in to data */
>>>>> goto bad_alignment;
>>>>> } else {
* Re: [PATCH] md: Use optimal I/O size for last bitmap page
From: Song Liu @ 2023-02-21 5:27 UTC
To: Jonathan Derrick; +Cc: linux-raid, Sushma Kalakota
On Thu, Feb 16, 2023 at 3:52 PM Jonathan Derrick
<jonathan.derrick@linux.dev> wrote:
>
>
>
> On 2/10/2023 10:32 AM, Song Liu wrote:
> > Hi Jonathan,
> >
> > On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick
> > <jonathan.derrick@linux.dev> wrote:
> >>
> >> Hi Song,
> >>
> >> Any thoughts on this?
> >
> > I am really sorry that I missed this patch.
> >
> >>
> >> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
> >>> From: Jon Derrick <jonathan.derrick@linux.dev>
> >>>
> >>> If the bitmap space has enough room, size the I/O for the last bitmap
> >>> page write to the optimal I/O size for the storage device. The expanded
> >>> write is checked that it won't overrun the data or metadata.
> >>>
> >>> This change helps increase performance by preventing unnecessary
> >>> device-side read-mod-writes due to non-atomic write unit sizes.
> >>>
> >>> Ex biosnoop log. Device lba size 512, optimal size 4k:
> >>> Before:
> >>> Time Process PID Device LBA Size Lat
> >>> 0.843734 md0_raid10 5267 nvme0n1 W 24 3584 1.17
> >>> 0.843933 md0_raid10 5267 nvme1n1 W 24 3584 1.36
> >>> 0.843968 md0_raid10 5267 nvme1n1 W 14207939968 4096 0.01
> >>> 0.843979 md0_raid10 5267 nvme0n1 W 14207939968 4096 0.02
> >>>
> >>> After:
> >>> Time Process PID Device LBA Size Lat
> >>> 18.374244 md0_raid10 6559 nvme0n1 W 24 4096 0.01
> >>> 18.374253 md0_raid10 6559 nvme1n1 W 24 4096 0.01
> >>> 18.374300 md0_raid10 6559 nvme0n1 W 11020272296 4096 0.01
> >>> 18.374306 md0_raid10 6559 nvme1n1 W 11020272296 4096 0.02
> >
> > Do we see significant improvements from io benchmarks?
> Yes. With lbaf=512, optimal i/o size=4k:
>
> Without patch:
> write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
> With patch:
> write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets
The difference is much bigger than I expected. Given such a big improvement, I
think we should ship it. Unfortunately, we are too late for 6.3.
Let's plan to ship this in the 6.4 release. I am on vacation this week; I will
work on this afterwards.
Thanks,
Song