* Proper way to test RAID456?

From: Qu Wenruo @ 2022-01-07  2:30 UTC
To: Linux FS Devel, linux-block@vger.kernel.org, dm-devel@redhat.com

Hi,

Recently I'm working on refactoring btrfs raid56 (with the long-term
objective of adding a proper journal to solve the write hole), and the
coverage of the current fstests for btrfs RAID56 is not ideal.

Is there any project that tests dm/md RAID456 for things like
resilvering/write-hole problems?

And how do you dm guys test stacked RAID456?

I really hope to learn some tricks from the existing, tried-and-true
RAID456 implementations, and hopefully solve the known write-hole bugs
in btrfs.

Thanks,
Qu
* Re: [dm-devel] Proper way to test RAID456?

From: Lukas Straub @ 2022-01-08 19:52 UTC
To: Qu Wenruo
Cc: Linux FS Devel, linux-block@vger.kernel.org, dm-devel@redhat.com, linux-raid

CC'ing the linux-raid mailing list, where md raid development happens.
dm-raid is just a different interface to md raid.

On Fri, 7 Jan 2022 10:30:56 +0800
Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> Hi,
>
> Recently I'm working on refactoring btrfs raid56 (with the long-term
> objective of adding a proper journal to solve the write hole), and the
> coverage of the current fstests for btrfs RAID56 is not ideal.
>
> Is there any project that tests dm/md RAID456 for things like
> resilvering/write-hole problems?
>
> And how do you dm guys test stacked RAID456?
>
> I really hope to learn some tricks from the existing, tried-and-true
> RAID456 implementations, and hopefully solve the known write-hole bugs
> in btrfs.
>
> Thanks,
> Qu
* Re: [dm-devel] Proper way to test RAID456?

From: Lukas Straub @ 2022-01-08 20:29 UTC
To: Qu Wenruo
Cc: Linux FS Devel, linux-block@vger.kernel.org, dm-devel@redhat.com, linux-raid

On Sat, 8 Jan 2022 19:52:59 +0000
Lukas Straub <lukasstraub2@web.de> wrote:

> CC'ing the linux-raid mailing list, where md raid development happens.
> dm-raid is just a different interface to md raid.
>
> On Fri, 7 Jan 2022 10:30:56 +0800
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> > Hi,
> >
> > Recently I'm working on refactoring btrfs raid56 (with the long-term
> > objective of adding a proper journal to solve the write hole), and the
> > coverage of the current fstests for btrfs RAID56 is not ideal.
> >
> > Is there any project that tests dm/md RAID456 for things like
> > resilvering/write-hole problems?
> >
> > And how do you dm guys test stacked RAID456?
> >
> > I really hope to learn some tricks from the existing, tried-and-true
> > RAID456 implementations, and hopefully solve the known write-hole
> > bugs in btrfs.

Just some thoughts:

Besides the journal to mitigate the write hole, md raid has another
trick: the Partial Parity Log
https://www.kernel.org/doc/html/latest/driver-api/md/raid5-ppl.html

When a stripe is partially updated with new data, PPL ensures that the
old data in the stripe will not be corrupted by the write hole. The new
data, on the other hand, is still affected by the write hole, but for
btrfs that is no problem.

But there is an even simpler solution for btrfs: it could just not touch
stripes that already contain data. The big problem will be NOCOW files,
since a write to an already allocated extent will necessarily touch a
stripe with old data in it, and the new data also needs to be protected
from the write hole.

Regards,
Lukas Straub
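To make the write hole concrete, here is a minimal Python sketch (illustrative
chunk contents on a toy 2+1 RAID5 stripe; not md or btrfs code). It simulates a
partial-stripe update interrupted after the new data reaches the disk but
before the parity is updated; reconstructing the *untouched* chunk from the
stale parity then returns garbage, which is exactly the old-data corruption
that PPL is designed to prevent.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A consistent 2+1 RAID5 stripe: D0, D1 and their parity.
d0_old, d1_old = b"AAAAAAAA", b"BBBBBBBB"
parity = xor(d0_old, d1_old)

# Partial-stripe update (read-modify-write): only D0 is rewritten.
d0_new = b"CCCCCCCC"

# Crash window: the new data is on disk, the parity update never happened.
on_disk = {"D0": d0_new, "D1": d1_old, "P": parity}   # P is now stale

# Later the device holding the *unmodified* chunk D1 dies and is rebuilt
# from D0 ^ P -- using the stale parity.
d1_rebuilt = xor(on_disk["D0"], on_disk["P"])

print(d1_rebuilt)               # b'@@@@@@@@', not b'BBBBBBBB'
assert d1_rebuilt != d1_old     # old, untouched data has been corrupted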
* Re: [dm-devel] Proper way to test RAID456?

From: Qu Wenruo @ 2022-01-08 23:55 UTC
To: Lukas Straub
Cc: Linux FS Devel, linux-block@vger.kernel.org, dm-devel@redhat.com, linux-raid

On 2022/1/9 04:29, Lukas Straub wrote:
> On Sat, 8 Jan 2022 19:52:59 +0000
> Lukas Straub <lukasstraub2@web.de> wrote:
>
>> CC'ing the linux-raid mailing list, where md raid development happens.
>> dm-raid is just a different interface to md raid.
>>
>> On Fri, 7 Jan 2022 10:30:56 +0800
>> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>> Hi,
>>>
>>> Recently I'm working on refactoring btrfs raid56 (with the long-term
>>> objective of adding a proper journal to solve the write hole), and the
>>> coverage of the current fstests for btrfs RAID56 is not ideal.
>>>
>>> Is there any project that tests dm/md RAID456 for things like
>>> resilvering/write-hole problems?
>>>
>>> And how do you dm guys test stacked RAID456?
>>>
>>> I really hope to learn some tricks from the existing, tried-and-true
>>> RAID456 implementations, and hopefully solve the known write-hole
>>> bugs in btrfs.
>
> Just some thoughts:
>
> Besides the journal to mitigate the write hole, md raid has another
> trick: the Partial Parity Log
> https://www.kernel.org/doc/html/latest/driver-api/md/raid5-ppl.html
>
> When a stripe is partially updated with new data, PPL ensures that the
> old data in the stripe will not be corrupted by the write hole. The new
> data, on the other hand, is still affected by the write hole, but for
> btrfs that is no problem.
>
> But there is an even simpler solution for btrfs: it could just not touch
> stripes that already contain data.

That would waste a lot of space if the fs is fragmented.

Or we would have to write into data stripes when free space is low.

That's why I'm trying to implement a PPL-like journal for btrfs RAID56.

Thanks,
Qu

> The big problem will be NOCOW files, since a write to an already
> allocated extent will necessarily touch a stripe with old data in it,
> and the new data also needs to be protected from the write hole.
>
> Regards,
> Lukas Straub
* Re: [dm-devel] Proper way to test RAID456?

From: David Woodhouse @ 2022-01-09 10:04 UTC
To: Qu Wenruo, Lukas Straub
Cc: Linux FS Devel, linux-block@vger.kernel.org, dm-devel@redhat.com, linux-raid

On Sun, 2022-01-09 at 07:55 +0800, Qu Wenruo wrote:
> On 2022/1/9 04:29, Lukas Straub wrote:
> > But there is an even simpler solution for btrfs: it could just not touch
> > stripes that already contain data.
>
> That would waste a lot of space if the fs is fragmented.
>
> Or we would have to write into data stripes when free space is low.
>
> That's why I'm trying to implement a PPL-like journal for btrfs RAID56.

PPL writes the P/Q of the unmodified chunks from the stripe, doesn't
it?

An alternative in a true file system which can do its own block
allocation is to just calculate the P/Q of the final stripe after it's
been modified, and write those (and the updated data) out to newly
allocated blocks instead of overwriting the original.

Then the final step is to free the original data blocks and P/Q.

This means that your RAID stripes no longer have a fixed topology; you
need metadata to be able to *find* the component data and P/Q chunks...
it ends up being non-trivial, but it has attractive properties if we
can work it out.
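A rough Python sketch of the copy-on-write stripe scheme David outlines above
(the data structures and toy allocator are hypothetical, not btrfs code): the
modified data and the recomputed P/Q go to freshly allocated blocks, the
stripe's location map is committed, and only then are the old blocks freed, so
a crash leaves either the old, self-consistent stripe or the new one, never a
half-updated mix. The price is exactly what David notes: the map itself becomes
metadata that must be found and updated atomically.

from dataclasses import dataclass, field

@dataclass
class StripeMap:
    # logical stripe id -> {"D0": (dev, block), "D1": (dev, block), "P": (dev, block)}
    chunks: dict = field(default_factory=dict)

def alloc(free_blocks: dict, dev: int) -> tuple:
    """Pop a free physical block on the given device (toy allocator)."""
    return dev, free_blocks[dev].pop()

def cow_stripe_update(smap: StripeMap, free_blocks: dict,
                      stripe_id: int, new_chunks: dict) -> None:
    old_locs = smap.chunks.get(stripe_id, {})
    # 1. allocate new locations for every chunk of the stripe
    new_locs = {name: alloc(free_blocks, dev)
                for name, dev in (("D0", 0), ("D1", 1), ("P", 2))}
    # 2. write new_chunks (modified data plus recomputed parity) to new_locs ...
    # 3. commit the new locations in metadata (the atomic switch-over point)
    smap.chunks[stripe_id] = new_locs
    # 4. only now return the old blocks to the allocator
    for dev, block in old_locs.values():
        free_blocks[dev].append(block)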
* Re: [dm-devel] Proper way to test RAID456?

From: Qu Wenruo @ 2022-01-09 12:13 UTC
To: David Woodhouse, Lukas Straub
Cc: Linux FS Devel, linux-block@vger.kernel.org, dm-devel@redhat.com, linux-raid

On 2022/1/9 18:04, David Woodhouse wrote:
> On Sun, 2022-01-09 at 07:55 +0800, Qu Wenruo wrote:
>> On 2022/1/9 04:29, Lukas Straub wrote:
>>> But there is an even simpler solution for btrfs: it could just not touch
>>> stripes that already contain data.
>>
>> That would waste a lot of space if the fs is fragmented.
>>
>> Or we would have to write into data stripes when free space is low.
>>
>> That's why I'm trying to implement a PPL-like journal for btrfs RAID56.
>
> PPL writes the P/Q of the unmodified chunks from the stripe, doesn't
> it?

Did I miss something, or is PPL not what I thought it was?

I thought PPL does either:

a) Just write a metadata entry into the journal to indicate that a full
stripe (along with its location) is going to be written.

b) Write a metadata entry into the journal about a non-full stripe
write, then write the new data and new P/Q into the journal.

And this happens before we start any data/P/Q write.

After the related data/P/Q writes are finished, the corresponding
metadata and data entries are removed from the journal.

Or does PPL have an even better solution?

> An alternative in a true file system which can do its own block
> allocation is to just calculate the P/Q of the final stripe after it's
> been modified, and write those (and the updated data) out to newly
> allocated blocks instead of overwriting the original.

This is what Johannes is considering, but for a different purpose.
Johannes' idea is to support zoned devices, as the physical location of
a zone-append write is only known after it has been written.

So his idea is to maintain another mapping tree for zoned writes, so
that full-stripe updates will also happen in that tree.

But that idea is still in the future; on the other hand, I still prefer
some tried-and-true method, as I'm 100% sure there will be new
difficulties waiting for us with the new mapping-tree approach.

Thanks,
Qu

> Then the final step is to free the original data blocks and P/Q.
>
> This means that your RAID stripes no longer have a fixed topology; you
> need metadata to be able to *find* the component data and P/Q chunks...
> it ends up being non-trivial, but it has attractive properties if we
> can work it out.
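A minimal Python sketch of the journal scheme Qu describes above (a toy model,
not md's actual write-journal format nor any proposed btrfs on-disk layout):
the new data and P/Q are persisted in a journal entry before any in-place
write, and the entry is retired only after the in-place writes complete, so
crash recovery simply replays whatever entries are still present.

journal = {}    # entry id -> record; stands in for a dedicated journal area
array = {}      # (stripe id, chunk name) -> bytes; stands in for the disks
_next_id = 0

def journal_stripe_write(stripe: int, chunks: dict) -> int:
    """Step 1: persist the intent plus payload, then flush the journal."""
    global _next_id
    entry_id, _next_id = _next_id, _next_id + 1
    journal[entry_id] = {"stripe": stripe, "chunks": dict(chunks)}
    # a real implementation would issue a flush/FUA on the journal here
    return entry_id

def commit_stripe_write(entry_id: int) -> None:
    """Step 2: do the in-place data/P/Q writes, then drop the journal entry."""
    rec = journal[entry_id]
    for name, data in rec["chunks"].items():
        array[(rec["stripe"], name)] = data
    del journal[entry_id]       # only after the in-place writes are durable

def replay_journal() -> None:
    """Crash recovery: redo every write whose entry is still in the journal."""
    for rec in journal.values():
        for name, data in rec["chunks"].items():
            array[(rec["stripe"], name)] = data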
* Re: [dm-devel] Proper way to test RAID456?

From: Lukas Straub @ 2022-01-12 16:56 UTC
To: Qu Wenruo
Cc: David Woodhouse, Linux FS Devel, linux-block@vger.kernel.org, dm-devel@redhat.com, linux-raid

On Sun, 9 Jan 2022 20:13:36 +0800
Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> On 2022/1/9 18:04, David Woodhouse wrote:
> > On Sun, 2022-01-09 at 07:55 +0800, Qu Wenruo wrote:
> >> On 2022/1/9 04:29, Lukas Straub wrote:
> >>> But there is an even simpler solution for btrfs: it could just not touch
> >>> stripes that already contain data.
> >>
> >> That would waste a lot of space if the fs is fragmented.
> >>
> >> Or we would have to write into data stripes when free space is low.
> >>
> >> That's why I'm trying to implement a PPL-like journal for btrfs RAID56.
> >
> > PPL writes the P/Q of the unmodified chunks from the stripe, doesn't
> > it?
>
> Did I miss something, or is PPL not what I thought it was?
>
> I thought PPL does either:
>
> a) Just write a metadata entry into the journal to indicate that a full
> stripe (along with its location) is going to be written.
>
> b) Write a metadata entry into the journal about a non-full stripe
> write, then write the new data and new P/Q into the journal.
>
> And this happens before we start any data/P/Q write.
>
> After the related data/P/Q writes are finished, the corresponding
> metadata and data entries are removed from the journal.
>
> Or does PPL have an even better solution?

Yes, PPL is a bit better than a journal as you described it (md
supports both). A journal would need to be replicated to multiple
devices (raid1) in the array, while the PPL is only written to the
drive containing the parity for the particular stripe. And since the
parity is distributed across all drives, the PPL overhead is also
distributed across all drives. However, PPL only works for raid5, as
you'll see.

PPL works like this:

Before any data/parity write, either:

a) Just write a metadata entry into the PPL on the parity drive to
indicate that a full stripe (along with its location) is going to be
written.

b) Write a metadata entry into the PPL on the parity drive about a
non-full stripe write, including which data chunks are going to be
modified, then write the XOR of the chunks not modified by this write
into the PPL.

To recover an inconsistent array with a lost drive:

In case a), the stripe consists only of newly written data, so it will
be affected by the write hole (this is the trade-off that PPL makes);
just do standard parity recovery.

In case b), XOR what we wrote to the PPL (the XOR of the chunks not
modified) with the modified data chunks to get our new (consistent)
parity, then do standard parity recovery. This only works if we lost an
unmodified data chunk. If we lost a modified data chunk this is not
possible, so just do standard parity recovery from the beginning.
Again, the newly written data is affected by the write hole, but
existing data is not. If we lost the parity drive (containing the PPL),
there is no need to recover, since all the data chunks are present.

Of course, this was a simplified explanation; see drivers/md/raid5-ppl.c
for details (it has good comments with examples). It also covers the
case where a data chunk is only partially modified and the unmodified
part of the chunk also needs to be protected (by working on a per-block
basis instead of per-chunk).

The PPL is not possible for raid6 AFAIK, because there you could lose
both a modified data chunk and an unmodified data chunk.

Regards,
Lukas Straub

> > An alternative in a true file system which can do its own block
> > allocation is to just calculate the P/Q of the final stripe after it's
> > been modified, and write those (and the updated data) out to newly
> > allocated blocks instead of overwriting the original.
>
> This is what Johannes is considering, but for a different purpose.
> Johannes' idea is to support zoned devices, as the physical location of
> a zone-append write is only known after it has been written.
>
> So his idea is to maintain another mapping tree for zoned writes, so
> that full-stripe updates will also happen in that tree.
>
> But that idea is still in the future; on the other hand, I still prefer
> some tried-and-true method, as I'm 100% sure there will be new
> difficulties waiting for us with the new mapping-tree approach.
>
> Thanks,
> Qu
>
> > Then the final step is to free the original data blocks and P/Q.
> >
> > This means that your RAID stripes no longer have a fixed topology; you
> > need metadata to be able to *find* the component data and P/Q chunks...
> > it ends up being non-trivial, but it has attractive properties if we
> > can work it out.
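A minimal Python sketch of PPL case b) and its recovery path as described in
the message above (illustrative chunk contents on a toy 3+1 raid5 stripe; the
real logic lives in drivers/md/raid5-ppl.c): the XOR of the unmodified chunks
is logged before the partial write, and after a crash plus the loss of an
unmodified chunk it is combined with the modified data to rebuild a consistent
parity before the standard reconstruction.

def xor(*chunks: bytes) -> bytes:
    out = bytes(len(chunks[0]))
    for c in chunks:
        out = bytes(x ^ y for x, y in zip(out, c))
    return out

# Consistent 3+1 raid5 stripe.
d0, d1, d2 = b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"
parity = xor(d0, d1, d2)

# Partial write: only D0 changes.  PPL entry = XOR of the unmodified chunks.
d0_new = b"XXXXXXXX"
ppl = xor(d1, d2)

# Crash: D0 already holds the new data, but the parity is still the old one.
# Then the drive holding the *unmodified* chunk D2 fails.

# Naive reconstruction from the stale parity gives garbage:
assert xor(d0_new, d1, parity) != d2

# PPL recovery: rebuild a consistent parity from the PPL plus the modified
# chunks, then do the standard parity reconstruction.
parity_fixed = xor(ppl, d0_new)              # == d0_new ^ d1 ^ d2
d2_rebuilt = xor(parity_fixed, d0_new, d1)
assert d2_rebuilt == d2                      # the old data survives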
* Re: [dm-devel] Proper way to test RAID456?

From: Qu Wenruo @ 2022-01-13  1:30 UTC
To: Lukas Straub
Cc: Linux FS Devel, linux-block@vger.kernel.org, dm-devel@redhat.com, David Woodhouse, linux-raid

On 2022/1/13 00:56, Lukas Straub wrote:
> On Sun, 9 Jan 2022 20:13:36 +0800
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>> On 2022/1/9 18:04, David Woodhouse wrote:
>>> On Sun, 2022-01-09 at 07:55 +0800, Qu Wenruo wrote:
>>>> On 2022/1/9 04:29, Lukas Straub wrote:
>>>>> But there is an even simpler solution for btrfs: it could just not touch
>>>>> stripes that already contain data.
>>>>
>>>> That would waste a lot of space if the fs is fragmented.
>>>>
>>>> Or we would have to write into data stripes when free space is low.
>>>>
>>>> That's why I'm trying to implement a PPL-like journal for btrfs RAID56.
>>>
>>> PPL writes the P/Q of the unmodified chunks from the stripe, doesn't
>>> it?
>>
>> Did I miss something, or is PPL not what I thought it was?
>>
>> I thought PPL does either:
>>
>> a) Just write a metadata entry into the journal to indicate that a full
>> stripe (along with its location) is going to be written.
>>
>> b) Write a metadata entry into the journal about a non-full stripe
>> write, then write the new data and new P/Q into the journal.
>>
>> And this happens before we start any data/P/Q write.
>>
>> After the related data/P/Q writes are finished, the corresponding
>> metadata and data entries are removed from the journal.
>>
>> Or does PPL have an even better solution?
>
> Yes, PPL is a bit better than a journal as you described it (md
> supports both). A journal would need to be replicated to multiple
> devices (raid1) in the array, while the PPL is only written to the
> drive containing the parity for the particular stripe. And since the
> parity is distributed across all drives, the PPL overhead is also
> distributed across all drives. However, PPL only works for raid5, as
> you'll see.
>
> PPL works like this:
>
> Before any data/parity write, either:
>
> a) Just write a metadata entry into the PPL on the parity drive to
> indicate that a full stripe (along with its location) is going to be
> written.
>
> b) Write a metadata entry into the PPL on the parity drive about a
> non-full stripe write, including which data chunks are going to be
> modified, then write the XOR of the chunks not modified by this write
> into the PPL.

This is a little different from what I thought, and I guess that's why
RAID6 is not supported.

My original assumption was something like this for one RMW
(X = modified data, | | = unmodified data):

Data 1: |XXXXXXXXX|         |        |
Data 2: |         |         |XXXXXXXX|
P(1+2): |XXXXXXXXX|         |XXXXXXXX|

In that case, the modified parts of Data 1 and Data 2 would get logged
into the PPL of the corresponding disks. Then for P(1+2), only the two
modified parts would be logged into that device.

I'm wondering, if we went with this solution, wouldn't it be able to
handle RAID6 too?

Even if we lost two disks, the remaining parts in the PPL should still
be enough to recover whatever is lost, as long as the unmodified
sectors are really unmodified on-disk.

It would make the PPL management much harder, though, as different
devices will have different PPL data usage.

> To recover an inconsistent array with a lost drive:
>
> In case a), the stripe consists only of newly written data, so it will
> be affected by the write hole (this is the trade-off that PPL makes);
> just do standard parity recovery.
>
> In case b), XOR what we wrote to the PPL (the XOR of the chunks not
> modified) with the modified data chunks to get our new (consistent)
> parity, then do standard parity recovery. This only works if we lost an
> unmodified data chunk. If we lost a modified data chunk this is not
> possible, so just do standard parity recovery from the beginning.
> Again, the newly written data is affected by the write hole, but
> existing data is not. If we lost the parity drive (containing the PPL),
> there is no need to recover, since all the data chunks are present.
>
> Of course, this was a simplified explanation; see drivers/md/raid5-ppl.c
> for details (it has good comments with examples). It also covers the
> case where a data chunk is only partially modified and the unmodified
> part of the chunk also needs to be protected (by working on a per-block
> basis instead of per-chunk).

Thanks for the detailed explanation.

Qu

> The PPL is not possible for raid6 AFAIK, because there you could lose
> both a modified data chunk and an unmodified data chunk.
>
> Regards,
> Lukas Straub
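A rough Python sketch of the per-device logging variant Qu floats above (purely
illustrative: three 8-byte columns as in the diagram, hypothetical device
names): each device would log the (offset, new bytes) regions of its own chunk
before the in-place write, so Data 1 logs column 0, Data 2 logs column 2, and
the parity device logs both columns, which is also why, as Qu notes, the PPL
space consumed would differ from device to device.

COL = 8  # bytes per column in the diagram

ppl = {
    "dev0 (Data 1)": [(0 * COL, b"X" * COL)],
    "dev1 (Data 2)": [(2 * COL, b"X" * COL)],
    "dev2 (P(1+2))": [(0 * COL, b"P" * COL), (2 * COL, b"P" * COL)],
}

for dev, entries in ppl.items():
    logged = sum(len(data) for _, data in entries)
    print(f"{dev}: {len(entries)} log entries, {logged} bytes of PPL data")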