From: NeilBrown
Subject: Re: Full stripe write in RAID6
Date: Tue, 19 Aug 2014 16:54:30 +1000
To: Mandar Joshi
Cc: linux-raid@vger.kernel.org

On Mon, 18 Aug 2014 21:25:25 +0530 "Mandar Joshi" wrote:

> Thanks Neil for the reply...
> Comments inline...
>
> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de]
> Sent: Wednesday, August 06, 2014 12:17 PM
> To: Mandar Joshi
> Cc: linux-raid@vger.kernel.org
> Subject: Re: Full stripe write in RAID6
>
> On Tue, 5 Aug 2014 21:55:46 +0530 "Mandar Joshi" wrote:
>
> > Hi,
> > If I am writing an entire stripe, does the RAID6 md driver need to
> > read any of the blocks from the underlying devices?
> >
> > I have created a RAID6 device with the default (512K) chunk size and
> > a total of 6 RAID devices.
> > cat /sys/block/md127/queue/optimal_io_size = 2097152
> > I believe this is a full stripe (512K * 4 data disks).
> > If I write 2MB of data, I expect to dirty the entire stripe, so I
> > believe I should not need to read any of the data or parity blocks,
> > thus avoiding the RAID6 penalties. Does the md/raid driver support
> > full stripe writes by avoiding the RAID6 penalties?
> >
> > I also expected each of the 6 disks to receive a 512K write (4 data
> > disks + 2 parity disks).
>
> Your expectation is correct in theory, but it doesn't always quite work
> like that in practice.
> The write request will arrive at the raid6 driver in smaller chunks and
> it doesn't always decide correctly whether it should wait for more
> writes to arrive, or if it should start reading now.
>
> It would certainly be good to "fix" the scheduling in raid5/raid6, but
> no one has worked out how yet.
>
> NeilBrown
>
> [Mandar] Tuning sysfs/.../md/stripe_cache_size=32768 significantly
> lowered pre-reads, as discussed above. As it does not force the queue to
> complete immediately, stripe handling gets time to dirty the next full
> stripes entirely, thus avoiding pre-reads. Still, some of the stripes
> did not benefit in the same way.
> Further tuning sysfs/.../md/preread_bypass_threshold=stripe_cache_size,
> i.e. 32768, almost eliminated pre-reads in my case.
> Neil mentioned that "the raid6 driver gets write requests in smaller
> chunks."
> Also correct me if my understanding below is wrong.
> Is it because the md/raid driver does not have its own IO scheduler
> which can merge requests? Can we not have an IO scheduler for md?

md/raid5 does have a scheduler and does merge requests. It is quite unlike
the IO scheduler for a SCSI (or similar) device because its goal is quite
different. The scheduler merges requests into a stripe rather than into a
sequence, because that is what benefits raid5.

raid5 sends single-page requests down to the underlying driver and expects
it to merge them into multi-page requests if it would benefit from that.
The problem is that the raid5 scheduler isn't very clever and gets it wrong
sometimes.
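
A minimal sketch of the tuning described above (md127 is the array from
this thread, and 32768 is simply the value you report, not a general
recommendation; the stripe cache costs roughly PAGE_SIZE * raid_disks bytes
per entry, so 32768 entries on a 6-disk array is on the order of 768MB, and
preread_bypass_threshold cannot be set higher than the current
stripe_cache_size, so set the cache size first):

  # echo 32768 > /sys/block/md127/md/stripe_cache_size
  # echo 32768 > /sys/block/md127/md/preread_bypass_threshold
  # grep . /sys/block/md127/md/stripe_cache_size \
           /sys/block/md127/md/preread_bypass_threshold
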
> When I do any buffered write request on md/raid6, I always get multiple
> 4K requests. I think, in the absence of an IO scheduler, this is because
> buffered IO writes (from the page cache) will always be in one-page units?

Yes.

> Is this the reason the md/raid6 driver was designed so that its internal
> stripe handling considers a stripe = 4K * noOfDisks?

Because that is easier.

> Why does the design not consider an internal stripe = chunk_size * noOfDisks?

That would either be very complex, or would require all IO to be in full
chunks, which is not ideal for small random IO.

> I think it would help file systems which can do submit_bio with larger sizes(?)
> Is there any config setting or patch to improve on this case?

No - apart from the config settings you have already found. (For the direct
IO question below, a sketch of an aligned O_DIRECT test is appended at the
end of this mail.)

NeilBrown

> In the case of direct IO, pages will be accumulated and then given to
> md/raid6, hence md/raid6 can receive requests larger than 4K.
> But again here, with direct IO I could not get a write request larger
> than the chunk size. Any specific reason?
>
> >
> > If I do IO directly on the block device /dev/md127, I do observe reads
> > happening on the md device and the underlying raid devices as well.
> >
> > # mdstat output:
> > md127 : active raid6 sdah1[5] sdai1[4] sdaj1[3] sdcg1[2] sdch1[1] sdci1[0]
> >       41926656 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
> >
> > # time (dd if=/dev/zero of=/dev/md127 bs=2M count=1 && sync)
> >
> > # iostat:
> > Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sdaj1         19.80         1.60       205.20          8       1026
> > sdai1         18.20         0.00       205.20          0       1026
> > sdah1         33.60        11.20       344.40         56       1722
> > sdcg1         20.20         0.00       205.20          0       1026
> > sdci1         31.00         3.20       344.40         16       1722
> > sdch1         34.00       120.00       205.20        600       1026
> > md127        119.20       134.40       819.20        672       4096
> >
> > So, to avoid any cache effects, I am using a raw device to perform IO.
> > Then, for a one-stripe write, I observe no reads happening.
> > At the same time I also see a few disks getting more writes than
> > expected. I did not understand why.
> >
> > # raw -qa
> > /dev/raw/raw1: bound to major 9, minor 127
> >
> > # time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=1 && sync)
> >
> > # iostat shows:
> > Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sdaj1          7.00         0.00       205.20          0       1026
> > sdai1          6.20         0.00       205.20          0       1026
> > sdah1          9.80         0.00       246.80          0       1234
> > sdcg1          6.80         0.00       205.20          0       1026
> > sdci1          9.60         0.00       246.80          0       1234
> > sdch1          6.80         0.00       205.20          0       1026
> > md127          0.80         0.00       819.20          0       4096
> >
> > I assume that if I perform writes in multiples of "optimal_io_size" I
> > would be doing full stripe writes, thus avoiding reads. But
> > unfortunately, with two 2M writes, I do see reads happening on some of
> > these drives. The same happens for count=4 or 6 (equal to the number
> > of data disks or total disks).
> > # time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=2 && sync)
> >
> > Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sdaj1         13.40       204.80       410.00       1024       2050
> > sdai1         11.20         0.00       410.00          0       2050
> > sdah1         15.80         0.00       464.40          0       2322
> > sdcg1         13.20       204.80       410.00       1024       2050
> > sdci1         16.60         0.00       464.40          0       2322
> > sdch1         12.40       192.00       410.00        960       2050
> > md127          1.60         0.00      1638.40          0       8192
> >
> > I read about "/sys/block/md127/md/preread_bypass_threshold".
> > I tried setting this to 0 as well, as suggested somewhere.
> > But no help.
> >
> > I believe the RAID6 penalties will exist if it is a random write, but
> > in the case of sequential writes, will they still exist in some other
> > form in the Linux md/raid driver?
> > My aim is to maximize the RAID6 write IO rate with sequential writes,
> > without the RAID6 penalties.
> >
> > Rectify me wherever my assumptions are wrong. Let me know if any other
> > configuration parameter (for the block device or md device) is required
> > to achieve the same.
> >
> > --
> > Mandar Joshi
>
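
P.S. A sketch of the sort of aligned write test discussed above, using
O_DIRECT rather than the legacy raw device (bs=2M matches the
optimal_io_size reported earlier; count=16 is an arbitrary choice; whether
each 2M write reaches raid6 in one piece depends on how the block layer
splits the bios, so some pre-reads may still appear):

  # cat /sys/block/md127/queue/optimal_io_size
  2097152
  # iostat -d                  (per-disk totals since boot, before the test)
  # time (dd if=/dev/zero of=/dev/md127 bs=2M count=16 oflag=direct && sync)
  # iostat -d                  (totals after; compare the Blk_read column)

Any increase in Blk_read on the member disks during the write indicates
that some stripes were still pre-read rather than written as full stripes.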