From: Doug Ledford
Subject: Re: kernel checksumming performance vs actual raid device performance
Date: Tue, 23 Aug 2016 15:10:59 -0400
Message-ID: <5416db5c-2d2b-8cc5-b477-604e8ccf0707@redhat.com>
To: doug@easyco.com
Cc: Matt Garman, Mdadm
List-Id: linux-raid.ids

On 8/23/2016 2:27 PM, Doug Dumitru wrote:
> Mr. Ledford,
>
> I think your explanation of RAID "dirty" read performance is a bit off.
>
> If you have 64KB chunks, this describes the layout.  I don't think
> this also requires 64K reads.  I know that this is true with RAID-5,
> and I am pretty sure it applies to RAID-6 as well.  So if you do 4K
> reads, you should see 4K reads to all the member drives.

Of course.  I didn't mean to imply otherwise.  The read size is the
read size.  But, since the OP's test case was to "read random files"
and not "read random blocks of random files", I took it to mean
sequential IO across a multitude of random files.  That assumption
might have been wrong, but I wrote my explanation with it in mind.

> You can verify this pretty easily with iostat.
>
> Mr. Garman,
>
> Your results are a lot worse than expected.  I always assumed that a
> raid "dirty" read would try to hit the disks hard.  This implies
> issuing the 22 read requests in parallel.  This is how "SSD" folks
> think.  It is possible that this code is old enough to be in an HDD
> "mindset" and that the requests are issued sequentially.  If so, then
> this is something to "fix" in the raid code (I use the term fix here
> loosely, as this is not really a bug).
>
> Can you run an iostat during your degraded test, and also a top run
> over 20+ seconds with kernel threads showing up?  Even better would be
> a perf capture, but you might not have all the tools installed.  You
> can always try:
>
> perf record -a sleep 20
>
> then
>
> perf report
>
> which should show you the top functions globally over the 20 second
> sample.  If you don't have perf loaded, you might (or might not) be
> able to install it from the distro.
>
> Doug
>
> On Tue, Aug 23, 2016 at 11:00 AM, Doug Ledford wrote:
>> On 8/23/2016 10:54 AM, Matt Garman wrote:
>>> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru wrote:
>>>> The RAID rebuild for a single bad drive "should" be an XOR and
>>>> should run at 200,000 KB/sec (the default speed_limit_max).  I
>>>> might be wrong on this and it might still need a full RAID-6
>>>> syndrome compute, but I don't think so.
>>>>
>>>> The rebuild might not hit 200MB/sec if the drive you replaced is
>>>> "conditioned".  Be sure to secure-erase any non-new drive before
>>>> you replace it.
>>>>
>>>> Your read IOPS will compete with the now-busy drives, which may
>>>> increase the IO latency a lot, and slow you down a lot.
>>>>
>>>> One out of 22 read ops will be to the bad drive, so each of those
>>>> will now take 22 reads to reconstruct the IO.  The reconstruction
>>>> is XOR, so pretty cheap from a CPU point of view.  Regardless,
>>>> your IOPS total will double.
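That doubling claim can be sanity-checked with a little arithmetic.  A minimal sketch, assuming the 24-drive RAID-6 geometry (22 data + 2 parity chunks per stripe) discussed in this thread; the numbers are purely illustrative:

```python
# Back-of-the-envelope check of "your IOPS total will double" for a
# degraded 24-drive RAID-6 (22 data + 2 parity chunks per stripe).
n_data = 22                  # data chunks per stripe (assumed geometry)
p_miss = 1 / n_data          # fraction of random reads landing on the dead drive
reads_per_miss = n_data      # 21 surviving data chunks + 1 parity chunk

# Average physical reads issued per logical read:
amplification = (1 - p_miss) * 1 + p_miss * reads_per_miss
print(f"read amplification: {amplification:.2f}x")  # -> read amplification: 1.95x
```

So 21/22 of reads cost one disk op each, and the remaining 1/22 cost 22 ops each, which lands almost exactly on 2x.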
>>>>
>>>> You can probably mitigate the amount of degradation by lowering the
>>>> rebuild speed, but this will make the rebuild take longer, so you
>>>> are messed up either way.  If the server has "down time" at night,
>>>> you might lower the rebuild speed to a really small value during
>>>> the day, and raise it at night.
>>>
>>> OK, right now I'm looking purely at performance in a degraded state,
>>> no rebuild taking place.
>>>
>>> We have designed a simple read load test to simulate the actual
>>> production workload.  (It's not perfect of course, but a reasonable
>>> approximation.  I can share it with the list if there's interest.)
>>> But basically it just runs multiple threads reading random files
>>> continuously.
>>>
>>> When the array is in a pristine state, we can achieve read
>>> throughput of 8000 MB/sec (at the array level, per iostat with
>>> 5 second samples).
>>>
>>> Now I failed a single drive.  Running the same test, read
>>> performance drops all the way down to 200 MB/sec.
>>>
>>> I understand that IOPS should double, which to me says we should
>>> expect a roughly 50% read performance drop (napkin math).  But this
>>> is a drop of over 95%.
>>>
>>> Again, this is with no rebuild taking place...
>>>
>>> Thoughts?
>>
>> This depends a lot on how you structured your raid array.  I didn't
>> see your earlier emails, so I'm inferring from the "one out of 22
>> reads will be to the bad drive" that you have a 24-disk raid6 array?
>> If so, then that's 22 data disks and 2 parity disks per stripe.  I'm
>> going to use that as the basis for my next statement even if it's
>> slightly wrong.
>>
>> Doug was right in that you will have to read 21 data disks and
>> 1 parity disk to reconstruct reads from the missing block of any
>> given stripe.  And while he is also correct that this doubles the IO
>> ops needed to get your read data, it doesn't address the XOR load to
>> get your data.
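To make the reconstruction step concrete, here is a scaled-down toy sketch (4 data chunks instead of 21, 16-byte chunks instead of 64k; purely illustrative) of recovering the missing chunk by XORing the survivors with parity:

```python
# Toy model of single-failure reconstruction: the missing data chunk
# is the XOR of the surviving data chunks and the P parity chunk.
import os
from functools import reduce

CHUNK = 16
data = [os.urandom(CHUNK) for _ in range(4)]   # stand-in for 21 data chunks

def xor(blocks):
    # Bytewise XOR across a list of equal-length chunks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor(data)                             # P = d0 ^ d1 ^ d2 ^ d3

failed = 2                                     # pretend this drive died
survivors = [c for i, c in enumerate(data) if i != failed]
rebuilt = xor(survivors + [parity])            # read survivors + parity, XOR
assert rebuilt == data[failed]
print("reconstructed OK")                      # -> reconstructed OK
```

Every byte of reconstructed data requires touching one byte from each of the other chunks in the stripe, which is where the XOR bandwidth below comes from.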
With 21 data
>> disks and 1 parity disk feeding each reconstruction, and say a 64k
>> chunk size, you have to XOR 22 64k blocks for 1 result.  If you are
>> getting 200MB/s, you are actually achieving more like 390MB/s of
>> reads, with roughly 191MB/s of it being direct reads, and then you
>> are running XOR over 200MB/s of data in order to generate the other
>> ~9MB/s of results.
>>
>> The question of why that performance is so bad is probably (and I say
>> probably because without actually testing it this is just a hand-wavy
>> explanation based upon what I've tested and found in the past, which
>> may not be true today) due to a couple of factors:
>>
>> 1) 200MB/s of XOR is not insignificant.  Due to our single-threaded
>> XOR routines, you can actually keep a CPU pretty busy with this.
>> Also, even though the XOR routines try to time their assembly 'just
>> so' so that they can use the cache-avoiding instructions, this fails
>> more often than not, so you end up blowing CPU caches while doing
>> this work, which of course affects the overall system.  Possible
>> fixes for this might include:
>>        a) Multi-threaded XOR becoming the default (last I knew it
>> wasn't, correct me if I'm wrong)
>>        b) Improved XOR routines that deal with cache more
>> intelligently
>>        c) Creating a consolidated page cache/stripe cache (if we can
>> read more of the blocks needed to get our data from cache instead of
>> disk, it helps reduce that IO ops issue)
>>        d) Rearchitecting your arrays into raid50 instead of one big
>> raid6 array
>>
>> 2) Even though we theoretically doubled IO ops, we haven't addressed
>> whether or not that doubling is done efficiently.  Testing would be
>> warranted here to make sure that our reads for reconstruction aren't
>> negatively impacting overall disk IO op capability.  We might be
>> doing something that we can fix, such as interfering with merges or
>> with ordering or with latency-sensitive commands.  A person would
>> need to do some deep inspection of how commands are being created and
>> sent to each device in order to see whether we are keeping the disks
>> busy, or whether our own latencies at the kernel level are leaving
>> the disks idle and killing our overall throughput (or, conversely,
>> whether random head seeks have gone so radically through the roof
>> that the problem here really is the time it takes the heads to travel
>> everywhere we are sending them).
>>
>> --
>> Doug Ledford
>> GPG Key ID: 0E572FDD

--
Doug Ledford
GPG Key ID: 0E572FDD