From: NeilBrown <neilb@suse.de>
To: Mandar Joshi <mandar.joshi@calsoftinc.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Full stripe write in RAID6
Date: Wed, 6 Aug 2014 16:47:20 +1000 [thread overview]
Message-ID: <20140806164720.2aac2c5a@notabene.brown> (raw)
In-Reply-To: <01b101cfb0c9$e8fcf240$baf6d6c0$@calsoftinc.com>
On Tue, 5 Aug 2014 21:55:46 +0530 "Mandar Joshi"
<mandar.joshi@calsoftinc.com> wrote:
> Hi,
> If I am writing an entire stripe, does the RAID6 md driver need to
> read any blocks from the underlying devices?
>
> I have created a RAID6 device with the default (512K) chunk size
> across 6 devices. cat /sys/block/md127/queue/optimal_io_size =
> 2097152, which I believe is a full stripe (512K * 4 data disks).
> If I write 2MB of data, I expect to dirty the entire stripe, so I
> should not need to read any data or parity blocks, thus avoiding the
> RAID6 write penalty. Does the md/raid driver support full-stripe
> writes that avoid the RAID6 penalty?
>
> I also expected each of the 6 disks to receive a 512K write (4 data
> disks + 2 parity disks).
Your expectation is correct in theory, but it doesn't always quite work like
that in practice.
The write request will arrive at the raid6 driver in smaller chunks, and it
doesn't always decide correctly whether it should wait for more writes to
arrive or start reading immediately.
It would certainly be good to "fix" the scheduling in raid5/raid6, but no one
has worked out how yet.
NeilBrown
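The trade-off described here can be illustrated with back-of-the-envelope chunk counts (an assumed, simplified model of the write paths, not the actual md stripe-cache logic):

```shell
# n data disks plus 2 parity disks per stripe.
# Full-stripe (reconstruct) write: 0 reads, n+2 chunk writes.
# Partial write of k chunks via read-modify-write: k+2 reads, k+2 writes.
n=4; k=1
echo "full-stripe: 0 reads, $(( n + 2 )) writes"
echo "rmw (k=$k):  $(( k + 2 )) reads, $(( k + 2 )) writes"
```

This is why the scheduler's decision matters: waiting until all n data chunks of a stripe arrive avoids the reads entirely, while starting a read-modify-write early adds read traffic.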
>
> If I do I/O directly on the block device /dev/md127, I observe reads
> on the md device and on the underlying raid devices as well.
>
> #mdstat o/p:
> md127 : active raid6 sdah1[5] sdai1[4] sdaj1[3] sdcg1[2] sdch1[1] sdci1[0]
> 41926656 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6]
> [UUUUUU]
>
>
>
> # time (dd if=/dev/zero of=/dev/md127 bs=2M count=1 && sync)
>
> # iostat::
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> sdaj1 19.80 1.60 205.20 8 1026
> sdai1 18.20 0.00 205.20 0 1026
> sdah1 33.60 11.20 344.40 56 1722
> sdcg1 20.20 0.00 205.20 0 1026
> sdci1 31.00 3.20 344.40 16 1722
> sdch1 34.00 120.00 205.20 600 1026
> md127 119.20 134.40 819.20 672 4096
>
>
> So, to avoid any cache effects, I am using a raw device to perform
> the I/O. For a single stripe write I then observe no reads.
> At the same time I also see a few disks getting more writes than
> expected. I did not understand why.
>
> # raw -qa
> /dev/raw/raw1: bound to major 9, minor 127
>
> #time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=1 && sync)
>
> # iostat shows:
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> sdaj1 7.00 0.00 205.20 0 1026
> sdai1 6.20 0.00 205.20 0 1026
> sdah1 9.80 0.00 246.80 0 1234
> sdcg1 6.80 0.00 205.20 0 1026
> sdci1 9.60 0.00 246.80 0 1234
> sdch1 6.80 0.00 205.20 0 1026
> md127 0.80 0.00 819.20 0 4096
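The raw(8) interface is deprecated in current kernels; an alternative way to bypass the page cache is dd with oflag=direct. A sketch (the scratch file path is illustrative — substituting /dev/md127 tests the array itself but destroys its contents):

```shell
# O_DIRECT write of one full stripe (2 MiB), bypassing the page cache.
# Replace ./stripe_test.img with /dev/md127 to test the array (destructive!).
dd if=/dev/zero of=./stripe_test.img bs=2M count=1 oflag=direct
stat -c %s ./stripe_test.img   # should report 2097152 bytes
rm -f ./stripe_test.img
```

Note that O_DIRECT requires filesystem support when writing to a regular file; on a block device it applies directly.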
>
> I assume that if I perform writes in multiples of "optimal_io_size" I
> will be doing full-stripe writes, thus avoiding reads. But
> unfortunately, with two 2M writes I do see reads happening on some of
> these drives. The same is true for count=4 or 6 (equal to the number
> of data disks or total disks).
> # time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=2 && sync)
>
>
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> sdaj1 13.40 204.80 410.00 1024 2050
> sdai1 11.20 0.00 410.00 0 2050
> sdah1 15.80 0.00 464.40 0 2322
> sdcg1 13.20 204.80 410.00 1024 2050
> sdci1 16.60 0.00 464.40 0 2322
> sdch1 12.40 192.00 410.00 960 2050
> md127 1.60 0.00 1638.40 0 8192
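Full-stripe behaviour also requires each write to start on a stripe boundary and cover whole stripes; a quick arithmetic check (illustrative only; 2097152 is the stripe size reported above):

```shell
# Two 2M writes from offset 0: check stripe alignment of offset and length.
stripe=2097152; offset=0; length=4194304
if [ $(( offset % stripe )) -eq 0 ] && [ $(( length % stripe )) -eq 0 ]; then
    echo "stripe-aligned"
else
    echo "not stripe-aligned"
fi
```

An aligned request can still be split into smaller pieces on the way down the block layer, which is where the scheduling issue described earlier comes in.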
>
>
> I read about "/sys/block/md127/md/preread_bypass_threshold".
> I tried setting it to 0, as suggested somewhere, but that did not
> help.
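For completeness, these are the tunables usually suggested for this workload (paths per the md sysfs interface; requires root and a real array, so this is only a sketch — the value 8192 is an illustrative choice, not a recommendation):

```shell
# A larger stripe cache gives the scheduler more room to gather full stripes;
# preread_bypass_threshold=0 lets full-stripe writes bypass stripes that have
# prereads queued.
echo 8192 > /sys/block/md127/md/stripe_cache_size
echo 0    > /sys/block/md127/md/preread_bypass_threshold
```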
>
> I believe the RAID6 penalty applies to random writes, but for
> sequential writes, does it still exist in some other form in the
> Linux md/raid driver?
> My aim is to maximize the RAID6 write I/O rate for sequential writes
> without RAID6 penalties.
>
> Correct me wherever my assumptions are wrong, and let me know if any
> other configuration parameter (for the block device or the md device)
> is required to achieve this.
>
> --
> Mandar Joshi
2014-08-05 16:25 Full stripe write in RAID6 Mandar Joshi
2014-08-06 6:47 ` NeilBrown [this message]
2014-08-18 15:55 ` Mandar Joshi
2014-08-19 6:54 ` NeilBrown
2014-08-06 7:03 ` Roman Mamedov