* Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-05-11  9:09 UTC
To: linux-btrfs@vger.kernel.org
Cc: Chris Mason, Josef Bacik, David Sterba

Hi,

I've noticed some time ago that our device replace implementation is unreliable. Basically, under several situations it ends up not copying extents (both data and metadata) from the old device to the new device (I briefly talked about some of the problems with Josef at LSF).

Essentially it operates by iterating all the extents of the old device using the device tree's commit root and, for each device extent found, doing a direct copy of it into the new device (it does not use the regular write paths). While it's doing this, no extents can be allocated from the new device; that only becomes possible once it finishes copying all device extents and replaces the old device with the new device in the list of devices available for allocations.

There are 4 main flaws with the implementation:

1) It processes each device extent only once. Clearly, after it processes (copies) a device extent it's possible for future data and metadata writes to allocate extents from the corresponding block group, so any new extents are not copied into the new device.

2) Before it copies a device extent, it sets the corresponding block group to RO mode (as of commit [1], which sadly does not really fix an issue nor explain how the problem it claims to fix happens) to prevent new allocations from that block group while we are processing it, and once we finish processing it, it sets the block group back to RW mode. This is flawed too, because when it sets the block group to RO mode, some concurrent task might have just allocated an extent from the block group and may end up writing to it while or after we process the device extent.

3) Since it iterates over the device tree in an ascending fashion, it never considers the possibility of new device extents, with an offset lower than the current search offset, being allocated for new block groups (just take a look at the while loop in scrub_enumerate_chunks() to see how flawed it is). So it totally ignores that device holes can be reused while we are doing the replace, and it never looks back to check for any such new device extents and copy them into the new device. This is much more likely to happen nowadays since empty block groups are automatically removed by the cleaner kthread, while in the past this could only happen through balance operations triggered from userspace.

4) When we finish iterating "all" device extents, we call btrfs_dev_replace_finishing(), which flushes all delalloc, commits the current transaction, replaces the old device in all extent mappings with the new device, removes the old device from the list of devices available for allocation and adds the new device to that list. So when flushing delalloc, more extents are allocated from existing block groups and new block groups might be allocated, with all the corresponding bios writing to the old device and not the new one, and after this flushing we surely don't copy all new or changed extents to the new device.

These are the 4 major issues I see with device replace. Issue number 2 is trivial to solve, but it's pointless to do so without addressing the other issues.

Solving this isn't easy, not without some trade-offs at least.
We can keep in-memory structures to track which new block groups are allocated while we are doing the replace and which new extents are allocated from existing block groups. Besides the needed synchronization between the allocator and the device replace code (which would bring yet more complexity), this can imply using a lot of memory if there's a significant number of writes happening all the time, and we could end up in an endless copy loop. Maybe some sort of write throttling would work out. There's also a bit of a chicken-and-egg problem, since while the replace operation is ongoing we update a progress item in the device tree (btrfs_run_dev_replace()), which in turn results in allocating new metadata extents.

Anyway, I'm just hoping someone tells me that I'm completely wrong and misunderstood the current implementation we have. If not, and given how serious this is (affecting both data and metadata, and defeating the main purpose of raid), shouldn't we disable this feature, like we did for snapshot-aware defrag a couple of years ago, until we have these issues solved?

thanks

[1] - https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=55e3a601c81cdca4497bf855fa4d331f8e830744
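To make the copy-once and RO/RW-window problems concrete, here is a minimal user-space model of the copy loop described above. It only sketches the shape of the logic in scrub_enumerate_chunks(); the struct, the helper find_next_dev_extent() and the printed actions are invented for illustration and are not the kernel code.

    #include <stddef.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct dev_extent {
        uint64_t offset;   /* physical offset on the old device */
        uint64_t length;
    };

    /* A toy "device tree": device extents of the old device, sorted by offset. */
    static const struct dev_extent extents[] = {
        { .offset = 1ULL << 30, .length = 1ULL << 30 },
        { .offset = 2ULL << 30, .length = 1ULL << 30 },
        { .offset = 3ULL << 30, .length = 1ULL << 30 },
    };

    /* Stand-in for the commit-root search the real loop performs. */
    static bool find_next_dev_extent(uint64_t search_offset, struct dev_extent *de)
    {
        for (size_t i = 0; i < sizeof(extents) / sizeof(extents[0]); i++) {
            if (extents[i].offset >= search_offset) {
                *de = extents[i];
                return true;
            }
        }
        return false;
    }

    int main(void)
    {
        uint64_t search_offset = 0;
        struct dev_extent de;

        while (find_next_dev_extent(search_offset, &de)) {
            /*
             * Flaw 2: a task that allocated an extent from this block group
             * just before the RO flip may still be writing to it while (or
             * after) we copy the device extent.
             */
            printf("set block group at %llu RO\n",
                   (unsigned long long)de.offset);

            /* Raw copy of the device extent, outside the normal write paths. */
            printf("copy [%llu, %llu) to the new device\n",
                   (unsigned long long)de.offset,
                   (unsigned long long)(de.offset + de.length));

            /*
             * Flaw 1: after flipping back to RW this block group is never
             * visited again, so extents allocated from it later are only
             * ever written to the old device.
             */
            printf("set block group at %llu RW\n",
                   (unsigned long long)de.offset);

            /* Flaw 3: the search offset only ever moves forward. */
            search_offset = de.offset + de.length;
        }
        return 0;
    }

Each block group is handled exactly once and the search offset never decreases, which is precisely what lets later writes, and holes reused at lower offsets, be missed.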
* Re: Device replace issues and disabling it until they are solved
From: Qu Wenruo @ 2016-05-11  9:29 UTC
To: Filipe Manana, linux-btrfs@vger.kernel.org
Cc: Chris Mason, Josef Bacik, David Sterba

Filipe Manana wrote on 2016/05/11 10:09 +0100:
> [...]
>
> 1) It processes each device extent only once. Clearly, after it processes (copies) a device extent it's possible for future data and metadata writes to allocate extents from the corresponding block group, so any new extents are not copied into the new device.

I'm not quite familiar with dev replace, but IIRC the current dev replace will commit the current transaction and mark that block group read-only to prevent any incoming write.

So that's to say, the read-only setup and commit_transaction() have a race window that allows new writes into the RO block group.

Or did I miss something?

> 2) Before it copies a device extent, it sets the corresponding block group to RO mode (as of commit [1], which sadly does not really fix an issue nor explain how the problem it claims to fix happens) to prevent new allocations from that block group while we are processing it, and once we finish processing it, it sets the block group back to RW mode. This is flawed too, because when it sets the block group to RO mode, some concurrent task might have just allocated an extent from the block group and may end up writing to it while or after we process the device extent.

So the problem is the timing of commit_transaction() and setting the block group RO.

Or is there any possibility of allocating an extent that can still escape the control of commit_transaction()?

> 3) Since it iterates over the device tree in an ascending fashion, it never considers the possibility of new device extents, with an offset lower than the current search offset, being allocated for new block groups (just take a look at the while loop in scrub_enumerate_chunks() to see how flawed it is). So it totally ignores that device holes can be reused while we are doing the replace, and it never looks back to check for any such new device extents and copy them into the new device. This is much more likely to happen nowadays since empty block groups are automatically removed by the cleaner kthread, while in the past this could only happen through balance operations triggered from userspace.

It seems that this problem is still a result of problems 1 and 2: still the commit_transaction() and set-block-group-RO problem.
Just as stated, I'm still not quite familiar with dev replace, so it would be quite nice if you could point out any misunderstanding on my part.

Thanks,
Qu
* Re: Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-05-11  9:39 UTC
To: Qu Wenruo
Cc: linux-btrfs@vger.kernel.org, Chris Mason, Josef Bacik, David Sterba

On Wed, May 11, 2016 at 10:29 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> Filipe Manana wrote on 2016/05/11 10:09 +0100:
>> 1) It processes each device extent only once. [...]
>
> I'm not quite familiar with dev replace, but IIRC the current dev replace will commit the current transaction and mark that block group read-only to prevent any incoming write.
>
> So that's to say, the read-only setup and commit_transaction() have a race window that allows new writes into the RO block group.
>
> Or did I miss something?

You misunderstood my statement. What I'm saying is that before we process a device extent, we set the block group to RO, but once we finish processing it, we turn it back to RW. After that we don't process that device extent/block group anymore, which is a problem because new extents might have been allocated from it.

>> 2) Before it copies a device extent, it sets the corresponding block group to RO mode [...] and once we finish processing it, it sets the block group back to RW mode. [...]
>
> So the problem is the timing of commit_transaction() and setting the block group RO.
>
> Or is there any possibility of allocating an extent that can still escape the control of commit_transaction()?

The transaction commit is irrelevant here. That's not the problem. See the answer above.

>> 3) Since it iterates over the device tree in an ascending fashion, it never considers the possibility of new device extents, with an offset lower than the current search offset, being allocated for new block groups (just take a look at the while loop in scrub_enumerate_chunks() to see how flawed it is).
>> So it totally ignores that device holes can be reused while we are doing the replace, and it never looks back to check for any such new device extents and copy them into the new device. This is much more likely to happen nowadays since empty block groups are automatically removed by the cleaner kthread, while in the past this could only happen through balance operations triggered from userspace.
>
> It seems that this problem is still a result of problems 1 and 2: still the commit_transaction() and set-block-group-RO problem.

No, problems 1 and 2 are very different. This is about space that was previously not allocated becoming allocated to a new block group, and the loop at scrub_enumerate_chunks() totally ignoring that possibility. The loop keeps searching for device extents using a key whose offset is advanced to "offset of current extent + length of current extent" and never goes back.

> Just as stated, I'm still not quite familiar with dev replace, so it would be quite nice if you could point out any misunderstanding on my part.

Well, I gave a high-level overview of the problems in the algorithm. Now you'll have to read it yourself and confirm.

thanks
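To illustrate the point about the search key only ever moving forward, here is a tiny stand-alone example; the offsets and the scenario are made up and this is not btrfs code.

    #include <stdint.h>
    #include <stdio.h>

    #define GiB (1ULL << 30)

    int main(void)
    {
        /* The copy loop just finished the device extent at [1 GiB, 2 GiB),
         * so the next search starts at its end offset. */
        uint64_t search_offset = 1 * GiB + 1 * GiB;

        /* Meanwhile the cleaner kthread removed that (now empty) block group
         * and a new chunk allocation reused the hole, creating a device
         * extent at offset 1 GiB again. */
        uint64_t new_extent_offset = 1 * GiB;

        if (new_extent_offset < search_offset)
            printf("new extent at %llu GiB is below the search offset "
                   "(%llu GiB), so it is never copied to the new device\n",
                   (unsigned long long)(new_extent_offset / GiB),
                   (unsigned long long)(search_offset / GiB));
        return 0;
    }

The same applies to any device hole below the current search offset that gets reused for a new block group while the replace is running.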
* Re: Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-05-11 15:13 UTC
To: linux-btrfs@vger.kernel.org
Cc: Chris Mason, Josef Bacik, David Sterba

On Wed, May 11, 2016 at 10:09 AM, Filipe Manana <fdmanana@kernel.org> wrote:
> [...]
>
> Anyway, I'm just hoping someone tells me that I'm completely wrong and misunderstood the current implementation we have. If not, and given how serious this is (affecting both data and metadata, and defeating the main purpose of raid), shouldn't we disable this feature, like we did for snapshot-aware defrag a couple of years ago, until we have these issues solved?

On IRC, Stefan Behrens just pointed out to me that for locations of the old device behind the current left cursor, we make all writes go to both the old and new devices (volumes.c:__btrfs_map_block()). This solves problems 1 and 3, which are really the hard problems. The remaining ones are easy to fix and I'll address them soon.

Thanks.
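Below is a rough user-space sketch of the behaviour described above: while a replace is running, a write that maps to the source device at a physical offset below the left cursor is duplicated to the replace target. The structure and the helper needs_duplicate_write() are simplified stand-ins, not the actual definitions in volumes.c.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct replace_state {
        bool     running;
        uint64_t srcdev_id;      /* device being replaced */
        uint64_t tgtdev_id;      /* replacement device receiving duplicates */
        uint64_t cursor_left;    /* everything below this offset was copied */
    };

    struct write_stripe {
        uint64_t devid;
        uint64_t physical;
    };

    /*
     * Decide whether a write mapped to (devid, physical) must also be sent
     * to the replace target so the already-copied region stays in sync.
     */
    static bool needs_duplicate_write(const struct replace_state *rs,
                                      const struct write_stripe *ws)
    {
        return rs->running &&
               ws->devid == rs->srcdev_id &&
               ws->physical < rs->cursor_left;
    }

    int main(void)
    {
        struct replace_state rs = {
            .running = true, .srcdev_id = 1, .tgtdev_id = 2,
            .cursor_left = 4ULL << 30,            /* 4 GiB already copied */
        };
        struct write_stripe behind = { .devid = 1, .physical = 1ULL << 30 };
        struct write_stripe ahead  = { .devid = 1, .physical = 8ULL << 30 };

        printf("write at 1 GiB duplicated to target: %d\n",
               needs_duplicate_write(&rs, &behind));  /* 1: goes to both devices */
        printf("write at 8 GiB duplicated to target: %d\n",
               needs_duplicate_write(&rs, &ahead));   /* 0: copied later by the loop */
        return 0;
    }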
* Re: Device replace issues and disabling it until they are solved
From: Scott Talbert @ 2016-05-26 19:09 UTC
Cc: linux-btrfs@vger.kernel.org

On Wed, 11 May 2016, Filipe Manana wrote:
>> [...]
>
> On IRC, Stefan Behrens just pointed out to me that for locations of the old device behind the current left cursor, we make all writes go to both the old and new devices (volumes.c:__btrfs_map_block()). This solves problems 1 and 3, which are really the hard problems. The remaining ones are easy to fix and I'll address them soon.

Hi Filipe,

Does your recent patch set (from May 20) address all of these issues?

Scott
* Re: Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-05-27  9:43 UTC
To: Scott Talbert
Cc: linux-btrfs@vger.kernel.org

On Thu, May 26, 2016 at 8:09 PM, Scott Talbert <scott.talbert@hgst.com> wrote:
> On Wed, 11 May 2016, Filipe Manana wrote:
>>> [...]
>>
>> On IRC, Stefan Behrens just pointed out to me that for locations of the old device behind the current left cursor, we make all writes go to both the old and new devices (volumes.c:__btrfs_map_block()). This solves problems 1 and 3, which are really the hard problems. The remaining ones are easy to fix and I'll address them soon.
>
> Hi Filipe,

Hi Scott,

> Does your recent patch set (from May 20) address all of these issues?

Yes.

> Scott

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
* Re: Device replace issues and disabling it until they are solved
From: Yauhen Kharuzhy @ 2016-06-02 10:03 UTC
To: Filipe Manana
Cc: Scott Talbert, linux-btrfs@vger.kernel.org

On Fri, May 27, 2016 at 10:43:47AM +0100, Filipe Manana wrote:
>> Does your recent patch set (from May 20) address all of these issues?
>
> Yes.

Tested: RAID5/6 still produces plenty of "failed to rebuild valid logical NNNNNN" messages after two consecutive device replaces. So replace is still not usable for RAID5/6. And it is very slow in comparison with the 'device add && balance device remove missing' sequence (4x slower).

--
Yauhen Kharuzhy
* Re: Device replace issues and disabling it until they are solved
From: Filipe Manana @ 2016-06-02 10:05 UTC
To: Yauhen Kharuzhy
Cc: Scott Talbert, linux-btrfs@vger.kernel.org

On Thu, Jun 2, 2016 at 11:03 AM, Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> wrote:
> Tested: RAID5/6 still produces plenty of "failed to rebuild valid logical NNNNNN" messages after two consecutive device replaces. So replace is still not usable for RAID5/6. And it is very slow in comparison with the 'device add && balance device remove missing' sequence (4x slower).

Right. There's missing code for raid5/6, I believe. I didn't care about that, nor will I in the near future, at least. The set of problems I tried to solve was generic and unrelated to any specific raid mode.

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
* Re: Device replace issues and disabling it until they are solved
From: Scott Talbert @ 2016-06-02 13:58 UTC
To: Filipe Manana
Cc: Yauhen Kharuzhy, Scott Talbert, linux-btrfs@vger.kernel.org

On Thu, 2 Jun 2016, Filipe Manana wrote:
>> Tested: RAID5/6 still produces plenty of "failed to rebuild valid logical NNNNNN" messages after two consecutive device replaces. So replace is still not usable for RAID5/6. And it is very slow in comparison with the 'device add && balance device remove missing' sequence (4x slower).
>
> Right. There's missing code for raid5/6, I believe. I didn't care about that, nor will I in the near future, at least. The set of problems I tried to solve was generic and unrelated to any specific raid mode.

Back to your original question, then... should replace be disabled for raid5/6 until it is fixed?

Scott