From: NeilBrown
Subject: Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
Date: Thu, 2 Apr 2015 11:19:41 +1100
Message-ID: <20150402111941.104d0633@notabene.brown>
References: <20150330222459.GA575371@devbig257.prn2.facebook.com> <20150402085312.5ea3d518@notabene.brown> <20150401234055.GA3375744@devbig257.prn2.facebook.com>
In-Reply-To: <20150401234055.GA3375744@devbig257.prn2.facebook.com>
To: Shaohua Li
Cc: dan.j.williams@intel.com, linux-raid@vger.kernel.org, songliubraving@fb.com, Kernel-team@fb.com

On Wed, 1 Apr 2015 16:40:57 -0700 Shaohua Li wrote:

> > Your code does avoid write-hole-protection for full-stripe-writes, and this
> > would greatly reduce the number of blocks that were written multiple times.
> > However I'm not convinced that is correct.
> > A reasonable goal is that if the system crashes while writing to a storage
> > device, then reads should return the old data or the new data, not anything
> > else.  A crash in the middle of a full-stripe-write to a degraded array
> > could result in some block in the stripe appearing to contain data that is
> > different to both the old and the new.  If you are going to close the hole,
> > I think it should be done properly.
>
> I can do it simply. But I don't think this assumption is true. If you
> write to a disk range and there is a failure, there is nothing to guarantee
> you can either read old data or new data.

If you write a range of blocks to a normal disk and crash during the write,
each block will contain either the old data or the new data.
If you write a range to a degraded RAID5 and crash during the write, you
cannot make that same guarantee.
I don't know how important this is, but then I don't really know how
important any of this is.

> > A combined log would "simply" involve writing every data block and every
> > computed parity block (with index information) to the log device.
> > Replaying the log would collect data blocks and flush out those in a stripe
> > once the parity block(s) for that stripe became available.
> >
> > I think this would actually turn into a fairly simple logging mechanism.
>
> It's not simple at all. It's unlikely we write data and parity
> continuously on disk and at the same time. This will make log checkpoint
> fairly complex.

I don't see any cause for complexity.

Let me be more explicit.

I imagine that all data remains in the stripe cache, in memory, until it is
finally written to the RAID5.  So the stripe cache will need to be quite a
bit bigger.

Every time we get a block that we want to write, either a new data block or
a computed parity block, we queue it to the log.

The log works like this:
 - take the first (e.g.) 256 blocks in the queue, create a header to describe
   them, write the header with FUA, then write all the data blocks.  If there
   are fewer than 256, just write what we have.
 - when the header write completes, all blocks written *previously* are now
   safe and we can call bio_end_io on data or unlock the stripe for parity.
 - loop back and write some more blocks.  If there are no blocks to write,
   write a header which describes an empty set of blocks, and wait for more
   blocks to appear.

Each stripe_head needs to track (roughly) where the relevant blocks were
written so it can release them when the stripe is written.
I would conceptually divide the log into 32 regions and keep a 32bit number
with each stripe.
When a block is assigned to a region in the log, the relevant bit is set
for the stripe, and a per-region counter is incremented.
When a stripe completes its write, the region counters for all the bits are
cleared.  The log cannot progress into a region which has a non-zero counter.

We choose the size of transactions so that the first block of each region is
a header block.  These contain a magic number, a sequence number, and a
checksum together with the addresses of the data/parity blocks.  On restart we
read all 32 of these to find out where the log starts and ends.  Then we
replay all the blocks into the stripe cache - discarding any that don't come
with the required parity blocks.

So it is a very simple log which is never read except on crash recovery.  It
commits everything ASAP so that the writeout to the array can be lazy and can
gather related blocks and sort addresses etc. with no impact on filesystem
latency.

Does that make sense?

NeilBrown
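
The region accounting described in this message (32 log regions, a 32-bit
mask per stripe, a counter per region) can be sketched in userspace C as
below.  This is only an illustration under stated assumptions, not code from
any patch: the names log_regions, stripe_log_ref, log_block_assigned,
stripe_write_done and log_region_free are all hypothetical.  It also assumes
one reading of the description: a region counter counts stripes that still
have blocks in that region, so it is bumped only when a stripe's bit is first
set, and "cleared" is taken to mean decremented once per set bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOG_REGIONS 32

/* Hypothetical per-log state: number of unwritten stripes pinning each region. */
struct log_regions {
	uint32_t pinned[LOG_REGIONS];
};

/* Hypothetical per-stripe state: bit N set means region N holds our blocks. */
struct stripe_log_ref {
	uint32_t region_mask;
};

/* A block of this stripe was placed in log region 'region'. */
static void log_block_assigned(struct log_regions *regions,
			       struct stripe_log_ref *sh, unsigned int region)
{
	uint32_t bit;

	assert(region < LOG_REGIONS);
	bit = UINT32_C(1) << region;

	/* Assumption: count each stripe at most once per region. */
	if (!(sh->region_mask & bit)) {
		sh->region_mask |= bit;
		regions->pinned[region]++;
	}
}

/* The stripe has been written to the array: release every region it pinned. */
static void stripe_write_done(struct log_regions *regions,
			      struct stripe_log_ref *sh)
{
	unsigned int region;

	for (region = 0; region < LOG_REGIONS; region++) {
		if (sh->region_mask & (UINT32_C(1) << region)) {
			assert(regions->pinned[region] > 0);
			regions->pinned[region]--;
		}
	}
	sh->region_mask = 0;
}

/* The log may only advance into a region whose counter is zero. */
static bool log_region_free(const struct log_regions *regions,
			    unsigned int region)
{
	assert(region < LOG_REGIONS);
	return regions->pinned[region] == 0;
}
```

With this reading, the log writer would check log_region_free() before
crossing into the next region and stall (or force stripe writeout) when it
returns false.  A real implementation would also need locking around the
counters, which is left out here.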