* Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-07 9:41 Sebastian Riemer
From: Sebastian Riemer @ 2012-05-07 9:41 UTC (permalink / raw)
To: linux-raid
Hi list,
I'm wondering why max_sectors_kb is set to 127 with md raid1.
This value is at 512 on normal HDDs.
If I do a file copy on ext4 on a normal HDD then I can see with
blktrace/blkparse that 128 KiB chunks are read and 512 KiB chunks are
written.
With md raid it looks like this: 124 KiB, 4 KiB, 124 KiB, 4 KiB, ... .
This looks very inefficient.
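For reference, this is roughly how I looked at the request sizes (just a
sketch, the device name is an example):

# run this while the copy is in flight and look at the sizes of the
# completion ("C") events
blktrace -d /dev/md0 -o - | blkparse -i -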
So, is there a particular reason that max_sectors_kb is set to 127?
Cheers,
Sebastian
--
Sebastian Riemer
Linux Kernel Developer
ProfitBricks GmbH
* Re: Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-07 11:18 ` NeilBrown
From: NeilBrown @ 2012-05-07 11:18 UTC (permalink / raw)
To: Sebastian Riemer; +Cc: linux-raid
On Mon, 07 May 2012 11:41:43 +0200 Sebastian Riemer
<sebastian.riemer@profitbricks.com> wrote:
> Hi list,
>
> I'm wondering why max_sectors_kb is set to 127 with md raid1.
> This value is at 512 on normal HDDs.
>
> If I do a file copy on ext4 on a normal HDD then I can see with
> blktrace/blkparse that 128 KiB chunks are read and 512 KiB chunks are
> written.
>
> With md raid it looks like this: 124 KiB, 4 KiB, 124 KiB, 4 KiB, ... .
> This looks very inefficient.
>
> So, is there a particular reason that max_sectors_kb is set to 127?
>
> Cheers,
> Sebastian
>
You didn't say which kernel you are running.
However md/raid1 bases all those settings on the minimum or maximum (as
appropriate) of the setting of the underlying devices, using blk_stack_limits
(in block/blk-settings.c).
So the likely answer is that one of your HDDs has a smaller max_sectors_kb?
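A quick way to compare the array with its members (just a sketch - md0 is
an example name, and it assumes whole-disk members rather than partitions):

cat /sys/block/md0/queue/max_sectors_kb
for m in /sys/block/md0/slaves/*; do
        echo "$m: $(cat $m/queue/max_sectors_kb)"
done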
NeilBrown
* Re: Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-07 11:34 ` Sebastian Riemer
From: Sebastian Riemer @ 2012-05-07 11:34 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 07/05/12 13:18, NeilBrown wrote:
> You didn't say which kernel you are running.
>
> However md/raid1 bases all those settings on the minimum or maximum (as
> appropriate) of the setting of the underlying devices, using blk_stack_limits
> (in block/blk-settings.c).
>
> So the likely answer is that one of your HDDs has a smaller max_sectors_kb?
>
> NeilBrown
Thanks for your answer! Kernel version is vanilla 3.2, but I've also
tested 2.6.32. There is no difference. Distribution: Debian Squeeze.
I can even reproduce this behaviour with RAM disks:
# modprobe brd rd_nr=2 rd_size=1048576
# cat /sys/block/ram0/queue/max_sectors_kb
512
# cat /sys/block/ram1/queue/max_sectors_kb
512
# mdadm -C /dev/md200 --force --assume-clean -n 2 -l raid1 -a md /dev/ram0 /dev/ram1
# cat /sys/block/md200/queue/max_sectors_kb
127
I'll have a look at that blk_stack_limits() function.
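If I read block/blk-settings.c correctly, a freshly created stacked queue
starts out at BLK_SAFE_MAX_SECTORS (255 sectors) and blk_stack_limits()
only ever takes the minimum, which would explain the 127. Quick check
(run in a kernel source tree):

grep -rn BLK_SAFE_MAX_SECTORS include/linux/blkdev.h block/blk-settings.c
echo $((255 * 512 / 1024))   # 255 sectors -> 127 KiB (rounded down)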
Cheers,
Sebastian
* Re: Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-07 14:04 ` Bernd Schubert
From: Bernd Schubert @ 2012-05-07 14:04 UTC (permalink / raw)
To: Sebastian Riemer, Linux RAID Mailing List
On 05/07/2012 01:34 PM, Sebastian Riemer wrote:
> On 07/05/12 13:18, NeilBrown wrote:
>> You didn't say which kernel you are running.
>>
>> However md/raid1 bases all those settings on the minimum or maximum (as
>> appropriate) of the setting of the underlying devices, using blk_stack_limits
>> (in block/blk-settings.c).
>>
>> So the likely answer is that one of your HDDs has a smaller max_sectors_kb?
>>
>> NeilBrown
>
> Thanks for your answer! Kernel version is vanilla 3.2, but I've also
> tested 2.6.32. There is no difference. Distribution: Debian Squeeze.
>
> I can even reproduce this behaviour with RAM disks:
>
> # modprobe brd rd_nr=2 rd_size=1048576
> # cat /sys/block/ram0/queue/max_sectors_kb
> 512
> # cat /sys/block/ram1/queue/max_sectors_kb
> 512
> # mdadm -C /dev/md200 --force --assume-clean -n 2 -l raid1 -a md /dev/ram0 /dev/ram1
> # cat /sys/block/md200/queue/max_sectors_kb
> 127
>
> I'll have a look at that blk_stack_limits() function.
>
I think you need something like this. I thought something along these lines
already went into a recent kernel; I'll check later today.
Index: 2.6.32.13/drivers/md/raid0.c
===================================================================
--- 2.6.32.13.orig/drivers/md/raid0.c
+++ 2.6.32.13/drivers/md/raid0.c
@@ -96,6 +96,7 @@ static int create_strip_zones(mddev_t *m
 	int cnt;
 	char b[BDEVNAME_SIZE];
 	raid0_conf_t *conf = kzalloc(sizeof(*conf), GFP_KERNEL);
+	unsigned int opt_io_size;
 
 	if (!conf)
 		return -ENOMEM;
@@ -256,9 +257,16 @@ static int create_strip_zones(mddev_t *m
 		goto abort;
 	}
 
+	/*
+	 * To send large IOs to the drives we need sufficient segments
+	 * for our own queue first.
+	 */
+	opt_io_size = (mddev->chunk_sectors << 9) * mddev->raid_disks;
+	blk_queue_max_phys_segments(mddev->queue, opt_io_size >> PAGE_SHIFT);
+	blk_queue_max_hw_segments(mddev->queue, opt_io_size >> PAGE_SHIFT);
+
 	blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
-	blk_queue_io_opt(mddev->queue,
-			 (mddev->chunk_sectors << 9) * mddev->raid_disks);
+	blk_queue_io_opt(mddev->queue, opt_io_size);
 
 	printk(KERN_INFO "raid0: done.\n");
 	mddev->private = conf;
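To try it, something like this from the top of the 2.6.32 tree should do
(the patch file name is made up):

patch -p1 < raid0-larger-segments.diff
# then rebuild the kernel / md modules as usual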
Cheers,
Bernd
* Re: Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-07 14:14 ` Sebastian Riemer
From: Sebastian Riemer @ 2012-05-07 14:14 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Linux RAID Mailing List
On 07/05/12 16:04, Bernd Schubert wrote:
> I think you need something like this. I thought something along these lines
> already went into a recent kernel; I'll check later today.
>
>
>
> Index: 2.6.32.13/drivers/md/raid0.c
> ===================================================================
> --- 2.6.32.13.orig/drivers/md/raid0.c
> +++ 2.6.32.13/drivers/md/raid0.c
> @@ -96,6 +96,7 @@ static int create_strip_zones(mddev_t *m
>  	int cnt;
>  	char b[BDEVNAME_SIZE];
>  	raid0_conf_t *conf = kzalloc(sizeof(*conf), GFP_KERNEL);
> +	unsigned int opt_io_size;
> 
>  	if (!conf)
>  		return -ENOMEM;
> @@ -256,9 +257,16 @@ static int create_strip_zones(mddev_t *m
>  		goto abort;
>  	}
> 
> +	/*
> +	 * To send large IOs to the drives we need sufficient segments
> +	 * for our own queue first.
> +	 */
> +	opt_io_size = (mddev->chunk_sectors << 9) * mddev->raid_disks;
> +	blk_queue_max_phys_segments(mddev->queue, opt_io_size >> PAGE_SHIFT);
> +	blk_queue_max_hw_segments(mddev->queue, opt_io_size >> PAGE_SHIFT);
> +
>  	blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
> -	blk_queue_io_opt(mddev->queue,
> -			 (mddev->chunk_sectors << 9) * mddev->raid_disks);
> +	blk_queue_io_opt(mddev->queue, opt_io_size);
> 
>  	printk(KERN_INFO "raid0: done.\n");
>  	mddev->private = conf;
>
>
> Cheers,
> Bernd
>
Thanks for your answer, Bernd!
I've tested 3.3 and 3.4-rc1 and there it works. Now I'll test the latest
linux-3.2.y from linux-stable. Perhaps I'll start a "git bisect" later on
if the fix commit isn't that obvious.
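If it comes to that, roughly this (untested sketch; md200 is the RAM disk
test array from my earlier mail):

# hunting the commit that FIXES it, so the usual labels are inverted:
# "bad"  = kernel where max_sectors_kb comes out as 512 (fixed)
# "good" = kernel where it is still 127 (broken)
git bisect start v3.3 v3.2
# after each build + reboot, recreate the array and check:
cat /sys/block/md200/queue/max_sectors_kb
git bisect bad       # if it shows 512
git bisect good      # otherwise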
Cheers,
Sebastian
* Re: Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-07 14:47 ` Bernd Schubert
From: Bernd Schubert @ 2012-05-07 14:47 UTC (permalink / raw)
To: Roberto Spadim; +Cc: Sebastian Riemer, Linux RAID Mailing List
On 05/07/2012 04:44 PM, Roberto Spadim wrote:
> Could you check if it has any performance increase? Maybe I'll consider
> upgrading my kernel to get more performance in raid1.
>
It depends on your hardware. Hardware that can handle small IO sizes well,
such as common hard disks, usually doesn't have a problem without 512 KB IOs.
But if you use software RAID on top of hardware RAID, large IOs might be
very important (again, it depends on the HW RAID vendor).
Cheers,
Bernd
* Re: Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-07 15:33 ` Sebastian Riemer
From: Sebastian Riemer @ 2012-05-07 15:33 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Roberto Spadim, Linux RAID Mailing List
On 07/05/12 16:47, Bernd Schubert wrote:
> On 05/07/2012 04:44 PM, Roberto Spadim wrote:
>> Could you check if it has any performance increase? Maybe I'll consider
>> upgrading my kernel to get more performance in raid1.
>>
>
> It depends on your hardware. Hardware that can handle small IO sizes well,
> such as common hard disks, usually doesn't have a problem without 512 KB IOs.
> But if you use software RAID on top of hardware RAID, large IOs might be
> very important (again, it depends on the HW RAID vendor).
>
>
> Cheers,
> Bernd
O.K., I've also tested 3.2.16 and there the problem still exists.
Bernd pointed me to commit b1bd055d397e09f99dcef9b138ed104ff1812fcb
(block: Introduce blk_set_stacking_limits function).
After cherry-picking it on 3.2.16 it worked. Tomorrow I'll test the
performance impact and verify it by block tracing.
Cheers,
Sebastian
* Re: Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-07 16:13 ` Bernd Schubert
From: Bernd Schubert @ 2012-05-07 16:13 UTC (permalink / raw)
To: Sebastian Riemer; +Cc: Roberto Spadim, Linux RAID Mailing List
On 05/07/2012 05:33 PM, Sebastian Riemer wrote:
> On 07/05/12 16:47, Bernd Schubert wrote:
>> On 05/07/2012 04:44 PM, Roberto Spadim wrote:
>>> Could you check if it has any performance increase? Maybe I'll consider
>>> upgrading my kernel to get more performance in raid1.
>>>
>>
>> It depends on your hardware. Hardware that can handle small IO sizes well,
>> such as common hard disks, usually doesn't have a problem without 512 KB IOs.
>> But if you use software RAID on top of hardware RAID, large IOs might be
>> very important (again, it depends on the HW RAID vendor).
>>
>>
>> Cheers,
>> Bernd
>
> O.K., I've also tested 3.2.16 and there the problem still exists.
> Bernd pointed me to commit b1bd055d397e09f99dcef9b138ed104ff1812fcb
> (block: Introduce blk_set_stacking_limits function).
>
> After cherry-picking it on 3.2.16 it worked. Tomorrow I'll test the
> performance impact and verify it by block tracing.
You might want to check IO sizes with my patched blkiomon (from blktrace).
I added a mode that makes the IO table more verbose about actual IO sizes.
I've always wanted to improve it further and to send the patches upstream,
but so far I haven't found the time.
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/blktrace/
Cheers,
Bernd
* Re: Reason for md raid1 max_sectors_kb limited to 127?
@ 2012-05-08 16:07 Sebastian Riemer
From: Sebastian Riemer @ 2012-05-08 16:07 UTC (permalink / raw)
To: Roberto Spadim; +Cc: Linux RAID Mailing List
On 07/05/12 17:33, Sebastian Riemer wrote:
> O.K., I've also tested 3.2.16 and there the problem still exists.
> Bernd pointed me to commit b1bd055d397e09f99dcef9b138ed104ff1812fcb
> (block: Introduce blk_set_stacking_limits function).
>
> After cherry-picking it on 3.2.16 it worked. Tomorrow I'll test the
> performance impact and verify it by block tracing.
>
> Cheers,
> Sebastian
>
I've measured the performance impact today.
This is my test system:
Supermicro H8DGi mainboard
2 x 8-core Opteron 6128, 2 GHz
32 GB RAM
LSI MegaRAID 9260-4i
16 x SEAGATE ST31000424SS nearline SAS
NFS root
Each HDD is exported by the HW RAID controller as write-through, direct,
no read-ahead.
These virtual drives only have a max_sectors_kb of 320, but that should be
O.K. for a first test.
I've got 8 x md raid1 with md raid0 on top, because we consider the md
raid10 driver to perform worse, especially with >= 24 HDDs on kernel 3.2.
I've also got LVM on top with a 50 GiB LV, and the LV has ext4 on it.
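Roughly how the stack is put together (sketch only - the disk names and
the VG/LV names are placeholders, /dev/md119 and /mnt/bench1 match the
copy script below):

disks=(sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq)
# 8 x raid1 pairs
for i in $(seq 0 7); do
        mdadm -C /dev/md10$i -l raid1 -n 2 \
                /dev/${disks[$((2*i))]} /dev/${disks[$((2*i+1))]}
done
# raid0 across the pairs, then LVM and ext4 on top
mdadm -C /dev/md119 -l raid0 -n 8 /dev/md10{0..7}
pvcreate /dev/md119
vgcreate vg_bench /dev/md119
lvcreate -L 50G -n lv_bench1 vg_bench
mkfs.ext4 /dev/vg_bench/lv_bench1
mount /dev/vg_bench/lv_bench1 /mnt/bench1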
First I tested the file copy on the unpatched 3.2.16 kernel:
*312 MB/s* on average
Now, patched with the fix:
*379 MB/s* on average
That's a clear improvement, because of the big chunks! With a SAS HBA
and max_sectors_kb 512 this could be even better.
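(If somebody wants to play with this: max_sectors_kb can be raised at
runtime, capped by max_hw_sectors_kb; sdb here is just a placeholder.)

cat /sys/block/sdb/queue/max_hw_sectors_kb    # what the hardware allows
cat /sys/block/sdb/queue/max_sectors_kb       # what is currently used
echo 512 > /sys/block/sdb/queue/max_sectors_kb    # rejected if above the hw limit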
My copy test creates the file to be copied with fio and direct IO in
order to get random data into the file and to bypass all caching. Here
is the simple code:
#!/bin/bash

SIZES="1G"
FILE="test"
FILE2="test2"
MOUNTPOINT="/mnt/bench1"

for size in $SIZES; do
    rm -f $MOUNTPOINT/$FILE $MOUNTPOINT/$FILE2

    # create the source file with random data, bypassing the page cache
    fio -name iops -rw=write -size="$size" -iodepth 1 \
        -filename $MOUNTPOINT/$FILE -ioengine libaio -direct=1 -bs=1M

    echo -e "\n*** Starting Copy Test ***"
#    blktrace /dev/md119 -b 4096 &
#    pid=$!
    time { cp $MOUNTPOINT/$FILE $MOUNTPOINT/$FILE2; sync; }
#    kill -2 $pid
done
Cheers,
Sebastian