From: alexwu <alexwu@synology.com>
To: linux-raid@vger.kernel.org
Cc: linux-block@vger.kernel.org
Subject: Enable skip_copy can cause data integrity issue in some storage stack
Date: Fri, 01 Sep 2017 15:26:41 +0800
Message-ID: <73ed28991837a9884824309137f1621c@synology.com>
Hi,
We recently found a data integrity issue related to skip_copy. We can
reproduce it and have found the root cause. The issue can occur when
there are other block layers between the file system and raid5.
[How to Reproduce]
1. Create a raid5 array named md0 with skip_copy enabled, and wait for
the md0 resync to finish so that all data and parity are in sync
2. Use the lvm tools to create a logical volume named lv-md0 on top of
md0
3. Create an ext4 file system on lv-md0 and mount it on /mnt
4. Run some db operations (e.g. sqlite inserts) that write data
through /mnt
5. When those db operations have finished, run
"echo check > /sys/block/md0/md/sync_action" and look at mismatch_cnt;
it is very likely to be non-zero when the check finishes
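To make the steps above concrete, here is a rough dry-run sketch that
only prints the commands involved. The member disks, the volume group
name, the test.db file and the sqlite loop are assumptions for
illustration and are not the exact setup we used.

```python
# Dry-run sketch of the reproduction steps: it builds the command list and
# prints it, nothing is executed. Disk names, the volume group name and the
# sqlite workload are hypothetical placeholders.
def repro_commands(disks=("/dev/sdb", "/dev/sdc", "/dev/sdd"), vg="vg0"):
    md = "/dev/md0"
    lv = f"/dev/{vg}/lv-md0"
    wait_idle = ('while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; '
                 'do sleep 5; done')
    return [
        # 1. raid5 with skip_copy enabled, then wait for the initial resync
        f"mdadm --create {md} --level=5 --raid-devices={len(disks)} "
        + " ".join(disks),
        "echo 1 > /sys/block/md0/md/skip_copy",
        f"mdadm --wait {md}",
        # 2. logical volume lv-md0 on top of md0
        f"pvcreate {md}",
        f"vgcreate {vg} {md}",
        f"lvcreate -n lv-md0 -l 100%FREE {vg}",
        # 3. ext4 on the logical volume, mounted on /mnt
        f"mkfs.ext4 {lv}",
        f"mount {lv} /mnt",
        # 4. some db writes through /mnt (placeholder workload)
        'sqlite3 /mnt/test.db "CREATE TABLE IF NOT EXISTS t(x);"',
        'for i in $(seq 1 10000); do '
        'sqlite3 /mnt/test.db "INSERT INTO t VALUES ($i);"; done',
        # 5. run a check, wait for it to finish, read mismatch_cnt
        "echo check > /sys/block/md0/md/sync_action",
        wait_idle,
        "cat /sys/block/md0/md/mismatch_cnt",
    ]


if __name__ == "__main__":
    print("\n".join(repro_commands()))
```

A non-zero value printed by the last command corresponds to the
mismatch described in step 5.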
[Root Cause]
After tracing the code and running more experiments, we think it is
more accurate to call this a backing_dev_info (bdi) problem than a
skip_copy bug.
We notice that:
1. skip_copy relies on BDI_CAP_STABLE_WRITES to ensure that a bio's
page will not be modified before raid5 completes the I/O. That is why
it can skip copying the page from the bio to the stripe cache
2. The ext4 file system calls wait_for_stable_page() to find out
whether the mapped bdi requires stable writes
The data integrity issue happens because:
1. When raid5 enables skip_copy, it sets BDI_CAP_STABLE_WRITES only on
its own bdi; this information does not propagate to the other bdis
between the file system and md
2. When the ext4 file system checks the stable writes requirement by
calling wait_for_stable_page(), it sees only the capability of the bdi
directly underneath it, not those of all related bdis
Thus, skip_copy works fine if we format the file system directly on md.
But the data integrity issue occurs if there are other block layers
(e.g. dm) between the file system and md.
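The visibility problem can be illustrated with a small user-space
model. This is only a sketch, not kernel code: the BlockDev class and
fs_waits_for_stable_page() are hypothetical names, and the boolean
stands in for BDI_CAP_STABLE_WRITES on each device's bdi.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockDev:
    """Toy stand-in for a block device and its bdi (hypothetical model)."""
    name: str
    stable_writes: bool = False          # models BDI_CAP_STABLE_WRITES
    lower: Optional["BlockDev"] = None   # device this one is stacked on


def fs_waits_for_stable_page(top: BlockDev) -> bool:
    # The file system only consults the bdi of the device it sits on;
    # it never looks at the devices further down the stack.
    return top.stable_writes


md0 = BlockDev("md0", stable_writes=True)   # raid5 with skip_copy enabled
lv = BlockDev("lv-md0", lower=md0)          # dm device, flag not inherited

print(fs_waits_for_stable_page(md0))  # ext4 directly on md0   -> True
print(fs_waits_for_stable_page(lv))   # ext4 on lvm on md0     -> False
```

In the second case the file system does not wait, so a page can change
while raid5 still uses it without copying.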
[Result]
We ran more tests on different storage stacks; here are the results.
The following setups pass the test thousands of times:
1. raid5 with skip_copy enabled + ext4
2. raid5 with skip_copy disabled + ext4
3. raid5 with skip_copy disabled + lvm + ext4
The following setup fails the test within 10 rounds:
1. raid5 with skip_copy enabled + lvm + ext4
I think the solution might be to let the bdis of the different block
layers communicate with each other, so that the BDI_CAP_STABLE_WRITES
information can be passed along when skip_copy is enabled. But the
current bdi structure does not allow us to do that.
What would you suggest we do to make skip_copy more reliable?
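As a rough illustration of that idea (a user-space sketch only, not a
proposed patch; BlockDev and effective_stable_writes() are hypothetical
names), the requirement could be resolved by walking the whole stack
instead of looking at a single bdi:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockDev:
    """Toy stand-in for a block device and its bdi (hypothetical model)."""
    name: str
    stable_writes: bool = False                 # models BDI_CAP_STABLE_WRITES
    lowers: List["BlockDev"] = field(default_factory=list)


def effective_stable_writes(dev: BlockDev) -> bool:
    # Stable writes are required if this device or any device below it
    # requires them. Walking at query time also covers the case where
    # skip_copy is toggled after the stack has been built.
    return dev.stable_writes or any(
        effective_stable_writes(lower) for lower in dev.lowers
    )


md0 = BlockDev("md0")
lv = BlockDev("lv-md0", lowers=[md0])

before = effective_stable_writes(lv)   # skip_copy still disabled
md0.stable_writes = True               # enable skip_copy on md0 afterwards
after = effective_stable_writes(lv)
print(before, after)                   # False True
```

How such a walk (or an equivalent push of the flag upwards) would map
onto the real bdi and request_queue structures is exactly the open
question.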
Best Regards,
Alex