linux-raid.vger.kernel.org archive mirror
* Enable skip_copy can cause data integrity issue in some storage stack
@ 2017-09-01  7:26 alexwu
  2017-09-06 15:57 ` Shaohua Li
  0 siblings, 1 reply; 4+ messages in thread
From: alexwu @ 2017-09-01  7:26 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-block

Hi,

Recently we found a data integrity issue related to skip_copy. We are able
to reproduce it and have tracked down the root cause. The issue can occur
when there are other block layers between the file system and raid5.

[How to Reproduce]

1. Create a raid5 array named md0 (with skip_copy enabled), and wait for
    the md0 resync to finish, which ensures that all data and parity are
    synchronized
2. Use the lvm tools to create a logical volume named lv-md0 on top of md0
3. Format an ext4 file system on lv-md0 and mount it on /mnt
4. Do some db operations (e.g. sqlite inserts) to write data through /mnt
5. When those db operations have finished, run
    "echo check > /sys/block/md0/md/sync_action" and check mismatch_cnt;
    it is very likely that mismatch_cnt != 0 when the check finishes

[Root Cause]

After tracing the code and running more experiments, it is more accurate to
say that this is a problem with backing_dev_info (bdi) rather than a bug in
skip_copy.

We notice that:
    1. skip_copy relies on BDI_CAP_STABLE_WRITES to ensure that a bio's
       pages will not be modified before raid5 completes the I/O, so raid5
       can skip copying pages from the bio into the stripe cache
    2. The ext4 file system calls wait_for_stable_page() to ask whether the
       bdi it is mapped to requires stable writes

The data integrity issue happens because:
    1. When raid5 enables skip_copy, it only sets BDI_CAP_STABLE_WRITES on
       its own bdi; this information does not propagate to the other bdis
       between the file system and md
    2. When the ext4 file system checks the stable-writes requirement by
       calling wait_for_stable_page(), it can only see the capability of
       the bdi directly below it and cannot see all the related bdis

Thus, skip_copy works fine if we put the file system directly on md, but
the data integrity issue appears when there are other block layers (e.g. dm)
between the file system and md.
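
To make this concrete, here is roughly what the two sides look like in the
kernel code we traced (simplified and paraphrased from drivers/md/raid5.c
and mm/page-writeback.c, not exact copies):

    /* raid5 side (paraphrased): toggling skip_copy only touches md's own
     * bdi */
    if (skip_copy)
        mddev->queue->backing_dev_info->capabilities |=
            BDI_CAP_STABLE_WRITES;
    else
        mddev->queue->backing_dev_info->capabilities &=
            ~BDI_CAP_STABLE_WRITES;

    /* ext4/mm side (paraphrased): only the bdi the inode is mapped to is
     * consulted, and on our stack that is dm's bdi, not md's */
    void wait_for_stable_page(struct page *page)
    {
        if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
            wait_on_page_writeback(page);
    }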

[Result]

We ran more tests on different storage stacks; here are the results.

The following configurations pass the test thousands of times:
    1. raid5 with skip_copy enabled + ext4
    2. raid5 with skip_copy disabled + ext4
    3. raid5 with skip_copy disabled + lvm + ext4

The following configuration fails the test within 10 rounds:
    1. raid5 with skip_copy enabled + lvm + ext4

I think the solution might be to let the bdis communicate across the
different block layers, so that the BDI_CAP_STABLE_WRITES information can
be passed along when we enable skip_copy. But the current bdi structure
does not allow us to do that.

What would you suggest we do if we want to make skip_copy more reliable?

Best Regards,
Alex



* Re: Enable skip_copy can cause data integrity issue in some storage stack
  2017-09-01  7:26 Enable skip_copy can cause data integrity issue in some storage stack alexwu
@ 2017-09-06 15:57 ` Shaohua Li
  2017-09-07  1:11   ` NeilBrown
  0 siblings, 1 reply; 4+ messages in thread
From: Shaohua Li @ 2017-09-06 15:57 UTC (permalink / raw)
  To: alexwu; +Cc: linux-raid, linux-block, axboe, neilb

On Fri, Sep 01, 2017 at 03:26:41PM +0800, alexwu wrote:
> Hi,
> 
> Recently a data integrity issue about skip_copy was found. We are able
> to reproduce it and found the root cause. This data integrity issue
> might happen if there are other layers between file system and raid5.
> 
> [How to Reproduce]
> 
> 1. Create a raid5 named md0 first (with skip_copy enable), and wait md0
>    resync done which ensures that all data and parity are synchronized
> 2. Use lvm tools to create a logical volume named lv-md0 over md0
> 3. Format an ext4 file system on lv-md0 and mount on /mnt
> 4. Do some db operations (e.g. sqlite insert) to write data through /mnt
> 5. When those db operations finished, we do the following command
>    "echo check > /sys/block/md0/md/sync_action" to check mismatch_cnt,
>    it is very likely that we got mismatch_cnt != 0 when check finished
> 
> [Root Cause]
> 
> After tracing code and more experiments, it is more proper to say that
> it's a problem about backing_dev_info (bdi) instead of a bug about
> skip_copy.
> 
> We notice that:
>    1. skip_copy counts on BDI_CAP_STABLE_WRITES to ensure that bio's page
>       will not be modified before raid5 completes I/O. Thus we can skip copy
>       page from bio to stripe cache
>    2. The ext4 file system will call wait_for_stable_page() to ask whether
>       the mapped bdi requires stable writes
> 
> Data integrity happens because:
>    1. When raid5 enable skip_copy, it will only set it's own bdi required
>       BDI_CAP_STABLE_WRITES, but this information will not propagate to
>       other bdi between file system and md
>    2. When ext4 file system check stable writes requirement by calling
>       wait_for_stable_page(), it can only see the underlying bdi's
> capability
>       and cannot see all related bdi
> 
> Thus, skip_copy works fine if we format file system directly on md.
> But data integrity issue happens if there are some other block layers (e.g.
> dm)
> between file system and md.
> 
> [Result]
> 
> We do more tests on different storage stacks, here are the results.
> 
> The following settings can pass the test thousand times:
>    1. raid5 with skip_copy enable + ext4
>    2. raid5 with skip_copy disable + ext4
>    3. raid5 with skip_copy disable + lvm + ext4
> 
> The following setting will fail the test in 10 rounds:
>    1. raid5 with skip_copy enable + lvm + ext4
> 
> I think the solution might be let all bdi can communicate through different
> block layers,
> then we can pass BDI_CAP_STABLE_WRITES information if we enable skip_copy.
> But the current bdi structure is not allowed us to do that.
> 
> What would you suggest to do if we want to make skip_copy more reliable ?

Very interesting problem. It looks like this isn't raid5 specific; a
stacked device with block integrity enabled could expose the same issue.

I'm cc'ing Jens and Neil to see if they have a good idea. Basically the
problem is that for a stacked disk, the lower-layer disk has
BDI_CAP_STABLE_WRITES enabled but the upper layer doesn't enable it. The
lower-layer disk could be a raid5 with skip_copy enabled, or any disk with
block integrity enabled (which will enable BDI_CAP_STABLE_WRITES).
Currently we don't propagate the flag to the upper-layer disk.

One solution is to have blk_set_stacking_limits propagate the flag at
queue setup time, and then disallow changing the flag at runtime. The
raid5 skip_copy interface changes the flag at runtime, which probably
isn't safe.
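
Something along these lines, perhaps (just a sketch: blk_set_stacking_limits
only deals with struct queue_limits today, so the helper below and its call
site are hypothetical):

    /* Hypothetical helper, called while a stacking driver adds a component
     * device: inherit BDI_CAP_STABLE_WRITES from any lower queue. */
    static void blk_stack_stable_writes(struct request_queue *top,
                                        struct request_queue *bottom)
    {
        if (bottom->backing_dev_info->capabilities & BDI_CAP_STABLE_WRITES)
            top->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
    }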

Thanks,
Shaohua


* Re: Enable skip_copy can cause data integrity issue in some storage stack
  2017-09-06 15:57 ` Shaohua Li
@ 2017-09-07  1:11   ` NeilBrown
  2017-09-07 22:16     ` BDI_CAP_STABLE_WRITES for stacked device (Re: Enable skip_copy can cause data integrity issue in some storage stack) Shaohua Li
  0 siblings, 1 reply; 4+ messages in thread
From: NeilBrown @ 2017-09-07  1:11 UTC (permalink / raw)
  To: Shaohua Li, alexwu; +Cc: linux-raid, linux-block, axboe


On Wed, Sep 06 2017, Shaohua Li wrote:

> On Fri, Sep 01, 2017 at 03:26:41PM +0800, alexwu wrote:
>> Hi,
>> 
>> Recently a data integrity issue about skip_copy was found. We are able
>> to reproduce it and found the root cause. This data integrity issue
>> might happen if there are other layers between file system and raid5.
>> 
>> [How to Reproduce]
>> 
>> 1. Create a raid5 named md0 first (with skip_copy enable), and wait md0
>>    resync done which ensures that all data and parity are synchronized
>> 2. Use lvm tools to create a logical volume named lv-md0 over md0
>> 3. Format an ext4 file system on lv-md0 and mount on /mnt
>> 4. Do some db operations (e.g. sqlite insert) to write data through /mnt
>> 5. When those db operations finished, we do the following command
>>    "echo check > /sys/block/md0/md/sync_action" to check mismatch_cnt,
>>    it is very likely that we got mismatch_cnt != 0 when check finished
>> 
>> [Root Cause]
>> 
>> After tracing code and more experiments, it is more proper to say that
>> it's a problem about backing_dev_info (bdi) instead of a bug about
>> skip_copy.
>> 
>> We notice that:
>>    1. skip_copy counts on BDI_CAP_STABLE_WRITES to ensure that bio's page
>>       will not be modified before raid5 completes I/O. Thus we can skip copy
>>       page from bio to stripe cache
>>    2. The ext4 file system will call wait_for_stable_page() to ask whether
>>       the mapped bdi requires stable writes
>> 
>> Data integrity happens because:
>>    1. When raid5 enable skip_copy, it will only set it's own bdi required
>>       BDI_CAP_STABLE_WRITES, but this information will not propagate to
>>       other bdi between file system and md
>>    2. When ext4 file system check stable writes requirement by calling
>>       wait_for_stable_page(), it can only see the underlying bdi's
>> capability
>>       and cannot see all related bdi
>> 
>> Thus, skip_copy works fine if we format file system directly on md.
>> But data integrity issue happens if there are some other block layers (e.g.
>> dm)
>> between file system and md.
>> 
>> [Result]
>> 
>> We do more tests on different storage stacks, here are the results.
>> 
>> The following settings can pass the test thousand times:
>>    1. raid5 with skip_copy enable + ext4
>>    2. raid5 with skip_copy disable + ext4
>>    3. raid5 with skip_copy disable + lvm + ext4
>> 
>> The following setting will fail the test in 10 rounds:
>>    1. raid5 with skip_copy enable + lvm + ext4
>> 
>> I think the solution might be let all bdi can communicate through different
>> block layers,
>> then we can pass BDI_CAP_STABLE_WRITES information if we enable skip_copy.
>> But the current bdi structure is not allowed us to do that.
>> 
>> What would you suggest to do if we want to make skip_copy more reliable ?
>
> Very interesting problem. Looks this isn't raid5 specific. stacked device with
> block integrity enabled could expose the same issue.
>
> I'm cc Jens and Neil to check if they have good idea. Basically the problem is
> for stacked disk, the under layer disk has BDI_CAP_STABLE_WRITES enabled, but
> upper layer doesn't enable it. The under layer disk could be a raid5 with
> skip_copy enabled or could be any disk with block integrity enabled (which will
> enable BDI_CAP_STABLE_WRITES). Currently we don't propagate the flag to upper
> layer disk.
>
> A solution is blk_set_stacking_limits propagate the flag at queue setup, we
> then don't allow the flag changing in runtime later. The raid5 skip_copy
> interface changes the flag in runtime which probably isn't safe.
>
> Thanks,
> Shaohua

It has always been messy to handle dependencies from the bottom of the
stack affecting things closer to the filesystem.
We used to have issues with alignments and request sizes, and got rid of
those problems by requiring all devices to accept all bio sizes, and
providing support for them to split as necessary.
This is a similar sort of problem, but also quite different.

It is different because the STABLE_WRITES requirement doesn't affect the
sort of bio that is passed down, but instead affects the way it is waited
for.

I have thought a few times that it would be useful if the "bd_claim"
functionality included a callback so the claimed device could notify the
claimer that things had changed.
Then raid5 could notice that it has been claimed, and call the callback
when it changes BDI_CAP_STABLE_WRITES.  If the claimer is a stacked
device, it can re-run blk_set_stacking_limits and call its own claimer.

There are probably some interesting locking issues around this but, to
me, it seems like the most likely path to a robust solution.
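
For concreteness, the shape I have in mind is something like this (entirely
hypothetical: bd_claim/blkdev_get take no such ops today, and the names
below are made up):

    /* Hypothetical "holder ops" supplied when a device is claimed. */
    struct bd_holder_ops {
        /* the claimed device calls this when e.g. BDI_CAP_STABLE_WRITES
         * is toggled */
        void (*capabilities_changed)(struct block_device *claimed,
                                     void *holder);
    };

    /* raid5, after toggling skip_copy, would then do something like: */
    if (bdev->bd_holder_ops && bdev->bd_holder_ops->capabilities_changed)
        bdev->bd_holder_ops->capabilities_changed(bdev, bdev->bd_holder);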

NeilBrown



* BDI_CAP_STABLE_WRITES for stacked device (Re: Enable skip_copy can cause data integrity issue in some storage stack)
  2017-09-07  1:11   ` NeilBrown
@ 2017-09-07 22:16     ` Shaohua Li
  0 siblings, 0 replies; 4+ messages in thread
From: Shaohua Li @ 2017-09-07 22:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: alexwu, linux-raid, linux-block, axboe

On Thu, Sep 07, 2017 at 11:11:24AM +1000, Neil Brown wrote:
> On Wed, Sep 06 2017, Shaohua Li wrote:
> 
> > On Fri, Sep 01, 2017 at 03:26:41PM +0800, alexwu wrote:
> >> Hi,
> >> 
> >> Recently a data integrity issue about skip_copy was found. We are able
> >> to reproduce it and found the root cause. This data integrity issue
> >> might happen if there are other layers between file system and raid5.
> >> 
> >> [How to Reproduce]
> >> 
> >> 1. Create a raid5 named md0 first (with skip_copy enable), and wait md0
> >>    resync done which ensures that all data and parity are synchronized
> >> 2. Use lvm tools to create a logical volume named lv-md0 over md0
> >> 3. Format an ext4 file system on lv-md0 and mount on /mnt
> >> 4. Do some db operations (e.g. sqlite insert) to write data through /mnt
> >> 5. When those db operations finished, we do the following command
> >>    "echo check > /sys/block/md0/md/sync_action" to check mismatch_cnt,
> >>    it is very likely that we got mismatch_cnt != 0 when check finished
> >> 
> >> [Root Cause]
> >> 
> >> After tracing code and more experiments, it is more proper to say that
> >> it's a problem about backing_dev_info (bdi) instead of a bug about
> >> skip_copy.
> >> 
> >> We notice that:
> >>    1. skip_copy counts on BDI_CAP_STABLE_WRITES to ensure that bio's page
> >>       will not be modified before raid5 completes I/O. Thus we can skip copy
> >>       page from bio to stripe cache
> >>    2. The ext4 file system will call wait_for_stable_page() to ask whether
> >>       the mapped bdi requires stable writes
> >> 
> >> Data integrity happens because:
> >>    1. When raid5 enable skip_copy, it will only set it's own bdi required
> >>       BDI_CAP_STABLE_WRITES, but this information will not propagate to
> >>       other bdi between file system and md
> >>    2. When ext4 file system check stable writes requirement by calling
> >>       wait_for_stable_page(), it can only see the underlying bdi's
> >> capability
> >>       and cannot see all related bdi
> >> 
> >> Thus, skip_copy works fine if we format file system directly on md.
> >> But data integrity issue happens if there are some other block layers (e.g.
> >> dm)
> >> between file system and md.
> >> 
> >> [Result]
> >> 
> >> We do more tests on different storage stacks, here are the results.
> >> 
> >> The following settings can pass the test thousand times:
> >>    1. raid5 with skip_copy enable + ext4
> >>    2. raid5 with skip_copy disable + ext4
> >>    3. raid5 with skip_copy disable + lvm + ext4
> >> 
> >> The following setting will fail the test in 10 rounds:
> >>    1. raid5 with skip_copy enable + lvm + ext4
> >> 
> >> I think the solution might be let all bdi can communicate through different
> >> block layers,
> >> then we can pass BDI_CAP_STABLE_WRITES information if we enable skip_copy.
> >> But the current bdi structure is not allowed us to do that.
> >> 
> >> What would you suggest to do if we want to make skip_copy more reliable ?
> >
> > Very interesting problem. Looks this isn't raid5 specific. stacked device with
> > block integrity enabled could expose the same issue.
> >
> > I'm cc Jens and Neil to check if they have good idea. Basically the problem is
> > for stacked disk, the under layer disk has BDI_CAP_STABLE_WRITES enabled, but
> > upper layer doesn't enable it. The under layer disk could be a raid5 with
> > skip_copy enabled or could be any disk with block integrity enabled (which will
> > enable BDI_CAP_STABLE_WRITES). Currently we don't propagate the flag to upper
> > layer disk.
> >
> > A solution is blk_set_stacking_limits propagate the flag at queue setup, we
> > then don't allow the flag changing in runtime later. The raid5 skip_copy
> > interface changes the flag in runtime which probably isn't safe.
> >
> > Thanks,
> > Shaohua
> 
> It has always been messy to handle dependencies from the bottom of the
> stack affecting things closer to the filesystem.
> We used to have issues with alignments and request sizes, and got rid of
> those problems by requiring all devices to accept all bio sizes, and
> providing support for them to split as necessary.
> This is a similar sort of problem, but also quite different.
> 
> It is different because the STABLE_WRITES requirement doesn't affect the
> sort of bio that is passed down, but instead affect the way it is waited
> for.
> 
> I have thought a few times that it would be useful if the "bd_claim"
> functionality included a callback so the claimed device could notify the
> claimer that things had changed.
> Then raid5 could notice that it has been claimed, and call the callback
> when it changes BDI_CAP_STABLE_WRITES.  If the claimer is a stacked
> device, it can re-run blk_set_stacking_limits and call its own claimer.
> 
> There are probably some interesting locking issues around this but, to
> me, it seems like the most likely path to a robust solution.

Fixing blk_set_stacking_limits is the first step for sure. The callback
approach definitely looks like the right way if we want to change
BDI_CAP_STABLE_WRITES at runtime. I'm not sure a runtime change can be
allowed, though, because there is a race condition here. The I/O for a
dirty page may not have reached the block layer yet, for whatever reason,
right before wait_for_stable_page. At that point, if BDI_CAP_STABLE_WRITES
isn't set, wait_for_stable_page does nothing. After wait_for_stable_page
returns, BDI_CAP_STABLE_WRITES gets set and the I/O is on its way, but the
fs could still change the page, which breaks the semantics of stable pages.
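
To make the window concrete (wait_for_stable_page paraphrased from
mm/page-writeback.c, with the race annotated in comments):

    void wait_for_stable_page(struct page *page)
    {
        /* (1) the flag is not set yet, so we return without waiting */
        if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
            wait_on_page_writeback(page);
        /* (2) skip_copy is enabled now and the flag propagates up, but
         * the fs already has permission to modify the page while its
         * earlier writeback bio may still be in flight */
    }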

Thanks,
Shaohua

