* Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-05-11  9:09 UTC
To: linux-btrfs@vger.kernel.org
Cc: Chris Mason, Josef Bacik, David Sterba

Hi,

I've noticed some time ago that our device replace implementation is unreliable. Basically, under several situations it ends up not copying extents (both data and metadata) from the old device to the new device (I briefly talked about some of the problems with Josef at LSF).

Essentially it operates by iterating all the extents of the old device using the device tree's commit root and, for each device extent found, doing a direct copy of it into the new device (it does not use the regular write paths). While it's doing this, no extents can be allocated from the new device; that only becomes possible once it finishes copying all device extents and replaces the old device with the new device in the list of devices available for allocations.

There are 4 main flaws with the implementation:

1) It processes each device extent only once. Clearly, after it processes (copies) a device extent it's possible for future data and metadata writes to allocate extents from the corresponding block group, so any new extents are not copied into the new device.

2) Before it copies a device extent, it sets the corresponding block group to RO mode (as of commit [1], which sadly does not really fix an issue nor explain how the problem it claims to fix happens) to prevent new allocations from that block group while we are processing it, and once we finish processing it, it sets the block group back to RW mode. This is flawed too, because when it sets the block group to RO mode, some concurrent task might have just allocated an extent from the block group and may end up writing to it while or after we process the device extent.

3) Since it iterates over the device tree in an ascending fashion, it never considers the possibility of new device extents, with an offset lower than the current search offset, being allocated for new block groups (just take a look at the while loop in scrub_enumerate_chunks() to see how flawed it is). So it totally ignores that device holes can be reused while we are doing the replace, and it never looks back to check for any such new device extents and copy them into the new device. This is much more likely to happen nowadays since empty block groups are automatically removed by the cleaner kthread, while in the past this could only happen through balance operations triggered from userspace.

4) When we finish iterating "all" device extents, we call btrfs_dev_replace_finishing(), which flushes all delalloc, commits the current transaction, replaces the old device in all extent mappings with the new device, removes the old device from the list of devices available for allocation and adds the new device to that list. So when flushing delalloc, more extents are allocated from existing block groups and new block groups might be allocated, with all the corresponding bios writing to the old device and not the new one, and after this flushing we surely don't copy all new or changed extents to the new device.

These are the 4 major issues I see with device replace. Issue number 2 is trivial to solve, but it's pointless to do so without addressing the other issues.

Solving this isn't easy, not without some trade-offs at least.
We can keep in-memory structures to track which new block groups are allocated while we are doing the replace and which new extents are allocated from existing block groups. Besides the needed synchronization between the allocator and the device replace code (which would bring yet more complexity), this can imply using a lot of memory if there's a significant number of writes happening all the time, and we could end up in an endless copy loop. Maybe some sort of write throttling would work out. There's also a bit of a chicken-and-egg problem, since while the replace operation is ongoing we update a progress item in the device tree (btrfs_run_dev_replace()), which in turn results in allocating new metadata extents.

Anyway, I'm just hoping someone tells me that I'm completely wrong and misunderstood the current implementation we have. If not, and given how serious this is (affecting both data and metadata, and defeating the main purpose of raid), shouldn't we disable this feature, like we did for snapshot-aware defrag a couple of years ago, until we have these issues solved?

thanks

[1] - https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=55e3a601c81cdca4497bf855fa4d331f8e830744
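To make the copy-once and RO/RW-window problems concrete, here is a minimal user-space model of the copy loop described above. It only sketches the shape of the logic in scrub_enumerate_chunks(); the struct, the helper find_next_dev_extent() and the printed actions are invented for illustration and are not the kernel code.

    #include <stddef.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct dev_extent {
        uint64_t offset;   /* physical offset on the old device */
        uint64_t length;
    };

    /* A toy "device tree": device extents of the old device, sorted by offset. */
    static const struct dev_extent extents[] = {
        { .offset = 1ULL << 30, .length = 1ULL << 30 },
        { .offset = 2ULL << 30, .length = 1ULL << 30 },
        { .offset = 3ULL << 30, .length = 1ULL << 30 },
    };

    /* Stand-in for the commit-root search the real loop performs. */
    static bool find_next_dev_extent(uint64_t search_offset, struct dev_extent *de)
    {
        for (size_t i = 0; i < sizeof(extents) / sizeof(extents[0]); i++) {
            if (extents[i].offset >= search_offset) {
                *de = extents[i];
                return true;
            }
        }
        return false;
    }

    int main(void)
    {
        uint64_t search_offset = 0;
        struct dev_extent de;

        while (find_next_dev_extent(search_offset, &de)) {
            /*
             * Flaw 2: a task that allocated an extent from this block group
             * just before the RO flip may still be writing to it while (or
             * after) we copy the device extent.
             */
            printf("set block group at %llu RO\n",
                   (unsigned long long)de.offset);

            /* Raw copy of the device extent, outside the normal write paths. */
            printf("copy [%llu, %llu) to the new device\n",
                   (unsigned long long)de.offset,
                   (unsigned long long)(de.offset + de.length));

            /*
             * Flaw 1: after flipping back to RW this block group is never
             * visited again, so extents allocated from it later are only
             * ever written to the old device.
             */
            printf("set block group at %llu RW\n",
                   (unsigned long long)de.offset);

            /* Flaw 3: the search offset only ever moves forward. */
            search_offset = de.offset + de.length;
        }
        return 0;
    }

Each block group is handled exactly once and the search offset never decreases, which is precisely what lets later writes, and holes reused at lower offsets, be missed.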
* Re: Device replace issues and disabling it until they are solved
From: Qu Wenruo @ 2016-05-11  9:29 UTC
To: Filipe Manana, linux-btrfs@vger.kernel.org
Cc: Chris Mason, Josef Bacik, David Sterba

Filipe Manana wrote on 2016/05/11 10:09 +0100:
> [...]
>
> 1) It processes each device extent only once. Clearly, after it processes (copies) a device extent it's possible for future data and metadata writes to allocate extents from the corresponding block group, so any new extents are not copied into the new device.

I'm not quite familiar with dev replace, but IIRC the current dev replace will commit the current transaction and mark that block group read-only to prevent any incoming write.

So that's to say, the read-only setup and commit_transaction() have a race window that allows new writes into the RO block group.

Or did I miss something?

> 2) Before it copies a device extent, it sets the corresponding block group to RO mode (as of commit [1], which sadly does not really fix an issue nor explain how the problem it claims to fix happens) to prevent new allocations from that block group while we are processing it, and once we finish processing it, it sets the block group back to RW mode. This is flawed too, because when it sets the block group to RO mode, some concurrent task might have just allocated an extent from the block group and may end up writing to it while or after we process the device extent.

So the problem is the timing of commit_transaction() and setting the block group RO.

Or is there any possibility of allocating an extent that can still escape the control of commit_transaction()?

> 3) Since it iterates over the device tree in an ascending fashion, it never considers the possibility of new device extents, with an offset lower than the current search offset, being allocated for new block groups (just take a look at the while loop in scrub_enumerate_chunks() to see how flawed it is). So it totally ignores that device holes can be reused while we are doing the replace, and it never looks back to check for any such new device extents and copy them into the new device. This is much more likely to happen nowadays since empty block groups are automatically removed by the cleaner kthread, while in the past this could only happen through balance operations triggered from userspace.

It seems that this problem is still a result of problems 1 and 2: still the commit_transaction() and set-block-group-RO problem.
Just as stated, I'm still not quite familiar with dev replace, so it would be quite nice if you could point out any misunderstanding on my part.

Thanks,
Qu
* Re: Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-05-11  9:39 UTC
To: Qu Wenruo
Cc: linux-btrfs@vger.kernel.org, Chris Mason, Josef Bacik, David Sterba

On Wed, May 11, 2016 at 10:29 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> Filipe Manana wrote on 2016/05/11 10:09 +0100:
>> 1) It processes each device extent only once. [...]
>
> I'm not quite familiar with dev replace, but IIRC the current dev replace will commit the current transaction and mark that block group read-only to prevent any incoming write.
>
> So that's to say, the read-only setup and commit_transaction() have a race window that allows new writes into the RO block group.
>
> Or did I miss something?

You misunderstood my statement. What I'm saying is that before we process a device extent, we set the block group to RO, but once we finish processing it, we turn it back to RW. After that we don't process that device extent/block group anymore, which is a problem because new extents might have been allocated from it.

>> 2) Before it copies a device extent, it sets the corresponding block group to RO mode [...] and once we finish processing it, it sets the block group back to RW mode. [...]
>
> So the problem is the timing of commit_transaction() and setting the block group RO.
>
> Or is there any possibility of allocating an extent that can still escape the control of commit_transaction()?

The transaction commit is irrelevant here. That's not the problem. See the answer above.

>> 3) Since it iterates over the device tree in an ascending fashion, it never considers the possibility of new device extents, with an offset lower than the current search offset, being allocated for new block groups (just take a look at the while loop in scrub_enumerate_chunks() to see how flawed it is).
>> So it totally ignores that device holes can be reused while we are doing the replace, and it never looks back to check for any such new device extents and copy them into the new device. This is much more likely to happen nowadays since empty block groups are automatically removed by the cleaner kthread, while in the past this could only happen through balance operations triggered from userspace.
>
> It seems that this problem is still a result of problems 1 and 2: still the commit_transaction() and set-block-group-RO problem.

No, problems 1 and 2 are very different. This is about space that was previously not allocated becoming allocated to a new block group, and the loop at scrub_enumerate_chunks() totally ignoring that possibility. The loop keeps searching for device extents using a key whose offset is advanced to "offset of current extent + length of current extent" and never goes back.

> Just as stated, I'm still not quite familiar with dev replace, so it would be quite nice if you could point out any misunderstanding on my part.

Well, I gave a high-level overview of the problems in the algorithm. Now you'll have to read it yourself and confirm.

thanks
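To illustrate the point about the search key only ever moving forward, here is a tiny stand-alone example; the offsets and the scenario are made up and this is not btrfs code.

    #include <stdint.h>
    #include <stdio.h>

    #define GiB (1ULL << 30)

    int main(void)
    {
        /* The copy loop just finished the device extent at [1 GiB, 2 GiB),
         * so the next search starts at its end offset. */
        uint64_t search_offset = 1 * GiB + 1 * GiB;

        /* Meanwhile the cleaner kthread removed that (now empty) block group
         * and a new chunk allocation reused the hole, creating a device
         * extent at offset 1 GiB again. */
        uint64_t new_extent_offset = 1 * GiB;

        if (new_extent_offset < search_offset)
            printf("new extent at %llu GiB is below the search offset "
                   "(%llu GiB), so it is never copied to the new device\n",
                   (unsigned long long)(new_extent_offset / GiB),
                   (unsigned long long)(search_offset / GiB));
        return 0;
    }

The same applies to any device hole below the current search offset that gets reused for a new block group while the replace is running.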
* Re: Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-05-11 15:13 UTC
To: linux-btrfs@vger.kernel.org
Cc: Chris Mason, Josef Bacik, David Sterba

On Wed, May 11, 2016 at 10:09 AM, Filipe Manana <fdmanana@kernel.org> wrote:
> [...]
>
> Anyway, I'm just hoping someone tells me that I'm completely wrong and misunderstood the current implementation we have. If not, and given how serious this is (affecting both data and metadata, and defeating the main purpose of raid), shouldn't we disable this feature, like we did for snapshot-aware defrag a couple of years ago, until we have these issues solved?

On IRC, Stefan Behrens just pointed out to me that for locations of the old device behind the current left cursor, we make all writes go to both the old and new devices (volumes.c:__btrfs_map_block()). This solves problems 1 and 3, which are really the hard problems. The remaining ones are easy to fix and I'll address them soon.

Thanks.
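Below is a rough user-space sketch of the behaviour described above: while a replace is running, a write that maps to the source device at a physical offset below the left cursor is duplicated to the replace target. The structure and the helper needs_duplicate_write() are simplified stand-ins, not the actual definitions in volumes.c.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct replace_state {
        bool     running;
        uint64_t srcdev_id;      /* device being replaced */
        uint64_t tgtdev_id;      /* replacement device receiving duplicates */
        uint64_t cursor_left;    /* everything below this offset was copied */
    };

    struct write_stripe {
        uint64_t devid;
        uint64_t physical;
    };

    /*
     * Decide whether a write mapped to (devid, physical) must also be sent
     * to the replace target so the already-copied region stays in sync.
     */
    static bool needs_duplicate_write(const struct replace_state *rs,
                                      const struct write_stripe *ws)
    {
        return rs->running &&
               ws->devid == rs->srcdev_id &&
               ws->physical < rs->cursor_left;
    }

    int main(void)
    {
        struct replace_state rs = {
            .running = true, .srcdev_id = 1, .tgtdev_id = 2,
            .cursor_left = 4ULL << 30,            /* 4 GiB already copied */
        };
        struct write_stripe behind = { .devid = 1, .physical = 1ULL << 30 };
        struct write_stripe ahead  = { .devid = 1, .physical = 8ULL << 30 };

        printf("write at 1 GiB duplicated to target: %d\n",
               needs_duplicate_write(&rs, &behind));  /* 1: goes to both devices */
        printf("write at 8 GiB duplicated to target: %d\n",
               needs_duplicate_write(&rs, &ahead));   /* 0: copied later by the loop */
        return 0;
    }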
* Re: Device replace issues and disabling it until they are solved
From: Scott Talbert @ 2016-05-26 19:09 UTC
Cc: linux-btrfs@vger.kernel.org

On Wed, 11 May 2016, Filipe Manana wrote:
>> [...]
>
> On IRC, Stefan Behrens just pointed out to me that for locations of the old device behind the current left cursor, we make all writes go to both the old and new devices (volumes.c:__btrfs_map_block()). This solves problems 1 and 3, which are really the hard problems. The remaining ones are easy to fix and I'll address them soon.

Hi Filipe,

Does your recent patch set (from May 20) address all of these issues?

Scott
* Re: Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-05-27  9:43 UTC
To: Scott Talbert
Cc: linux-btrfs@vger.kernel.org

On Thu, May 26, 2016 at 8:09 PM, Scott Talbert <scott.talbert@hgst.com> wrote:
> On Wed, 11 May 2016, Filipe Manana wrote:
>>> [...]
>>
>> On IRC, Stefan Behrens just pointed out to me that for locations of the old device behind the current left cursor, we make all writes go to both the old and new devices (volumes.c:__btrfs_map_block()). This solves problems 1 and 3, which are really the hard problems. The remaining ones are easy to fix and I'll address them soon.
>
> Hi Filipe,

Hi Scott,

> Does your recent patch set (from May 20) address all of these issues?

Yes.

> Scott

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
* Re: Device replace issues and disabling it until they are solved
From: Yauhen Kharuzhy @ 2016-06-02 10:03 UTC
To: Filipe Manana
Cc: Scott Talbert, linux-btrfs@vger.kernel.org

On Fri, May 27, 2016 at 10:43:47AM +0100, Filipe Manana wrote:
>> Does your recent patch set (from May 20) address all of these issues?
>
> Yes.

Tested: RAID5/6 still produces plenty of "failed to rebuild valid logical NNNNNN" messages after two consecutive device replaces. So replace is still not usable for RAID5/6. And it is very slow in comparison with the 'device add && balance device remove missing' sequence (4x slower).

--
Yauhen Kharuzhy
* Re: Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-06-02 10:05 UTC
To: Yauhen Kharuzhy
Cc: Scott Talbert, linux-btrfs@vger.kernel.org

On Thu, Jun 2, 2016 at 11:03 AM, Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> wrote:
> Tested: RAID5/6 still produces plenty of "failed to rebuild valid logical NNNNNN" messages after two consecutive device replaces. So replace is still not usable for RAID5/6. And it is very slow in comparison with the 'device add && balance device remove missing' sequence (4x slower).

Right. There's missing code for raid5/6, I believe. I didn't care about that, nor will I in the near future, at least. The set of problems I tried to solve was generic and unrelated to any specific raid mode.

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
* Re: Device replace issues and disabling it until they are solved
From: Scott Talbert @ 2016-06-02 13:58 UTC
To: Filipe Manana
Cc: Yauhen Kharuzhy, Scott Talbert, linux-btrfs@vger.kernel.org

On Thu, 2 Jun 2016, Filipe Manana wrote:
>> Tested: RAID5/6 still produces plenty of "failed to rebuild valid logical NNNNNN" messages after two consecutive device replaces. So replace is still not usable for RAID5/6. And it is very slow in comparison with the 'device add && balance device remove missing' sequence (4x slower).
>
> Right. There's missing code for raid5/6, I believe. I didn't care about that, nor will I in the near future, at least. The set of problems I tried to solve was generic and unrelated to any specific raid mode.

Back to your original question, then... should replace be disabled for raid5/6 until it is fixed?

Scott