From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: <dsterba@suse.cz>, Chris Mason <clm@fb.com>,
<linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH RFC] btrfs: csum: Introduce partial csum for tree block.
Date: Tue, 16 Jun 2015 10:39:52 +0800 [thread overview]
Message-ID: <557F8C78.7080304@cn.fujitsu.com> (raw)
In-Reply-To: <557F7A5F.5010206@cn.fujitsu.com>
Qu Wenruo wrote on 2015/06/16 09:22 +0800:
>
>
> David Sterba wrote on 2015/06/15 15:15 +0200:
>> On Mon, Jun 15, 2015 at 04:02:49PM +0800, Qu Wenruo wrote:
>>> In the following case of corruption, RAID1 or DUP will fail to recover
>>> it(Use 16K as leafsize)
>>> 0 4K 8K 12K 16K
>>> Mirror 0:
>>> |<-OK---------->|<----ERROR---->|<-----------------OK------------->|
>>>
>>> Mirror 1:
>>> |<----------------------------OK--------------->|<------Error----->|
>>>
>>> Since the CRC32 stored in header is calculated for the whole leaf,
>>> so both will fail the CRC32 check.
>>>
>>> But the corruption are in different position, in fact, if we know where
>>> the corruption is (no need to be so accurate), we can recover the tree
>>> block by using the current part.
>>>
>>> In above example, we can just use the correct 0~12K from mirror 1
>>> and then 12K~16K from mirror 0.
>>
>> If the mirror 0 copy is intact, you can use it entirely. Your
>> improvement could help if each mirror is partially broken but we can
>> find good copies of all 4k blocks among all mirrors.
>>
>> The natural question is how often this happens and if it's worth adding
>> the code complexity and what's the estimated speed drop.
>>
>> I think the conditions are very rare and that we could add minimal code
>> to attempt to build the metadata block from the available copies without
>> the separate block checksums. This is an immediate idea so I could have
>> missed something:
>>
>> * if a metadata-block checksum mismatches, do a direct comparison of the
>> metadata-blocks in all available mirrors
>> * if they match and checksums match, no help, it's damaged
>> * if there's a good copy (ie the original checksum or data were
>> corrupted), use it
>> * otherwise attempt to rebuild the metadata block from what's
>> available
>>
>> * by direct comparisons of the 4k blocks, find the first where the
>> metadataA and mirror1 blocks mismatch, offset N
>> * try to compute the checksum from metadataA[0..N-1] + mirror1 block N +
>> rest of metadataA
>> * if it's ok, use it
>> * if not: the block N is corrupted in mirror1 (we've skipped it in
>> metadataA)
>> then repeat with metadataA[0..N] + mirror1[N+1..end]
> The method sounds good, but the codes will be even more complex than
> mine.
> Also, due to the nature of CRC32 and RAID1/Dup case, things will be
> quite complex like the following case using your method.
>
> 0 4K 8K 12K 16K
> Mirror 0
> | | X | | |
> Mirror 1
> | | | X | |
>
> If using your method:
> [0~4K]
> CRC32 of 0~4K are the same. No problem.
>
> [4~8K]
> CRC32 of 0~8K differs.
> Now we know there is something wrong with 4~8K, but we still don't know
> which is the good copy.
>
> We can continue, keep 2 different CRC32 seed for later checksum.
> One seed using the 4~8 from mirror 0 and one from mirror1.
>
> Note, the crc32 is for the whole tree block, so until we calculate all
> the tree block, we can know which one is correct.
Sorry here, "can" -> "can't"
Thanks,
Qu
>
> [8~12K]
> Now crc32 mismatches again.
> We still don't know which part is correct.
>
> We can still continue by recording different CRC32 seed for them.
> But don't forget the previous 2 seeds we recorded.
> So we records 4 crc32 seeds for 0~12K.(Mi0, Mi0) (Mi0, Mi1) (Mi1, Mi0)
> (Mi1, Mi1).
>
> [12~16K]
> Continue calculate the crc32 with above 4 seeds.
> Finally, we found the crc32 matches with the tree block is using the
> combination of (Mi1, Mi0). And we can restore the tree block.
>
> Yes, with this behavior, we can restore the tree block even 3 parts are
> corrupted, but in that case, we need to try 8 times.
>
> So, I don't consider this is more easy to implement than my idea.
>
> [ROOT CAUSE]
> The root cause of the complex is:
> 1) Checksum algorithm depends on previous input(seed)
> Almost all checksum algorithm depends on previous input.
> And you won't know the data is correct or not until all data is calculated.
>
> 2) Only one extra duplication for RAID1/DUP
> We don't have extra duplication to judge which block is correct as there
> are only two mirror.
>
> So my partial checksum solves the 2 root cause at the same time.
> With partial checksum, we don't depend on previous data to calculate
> checksum.
> And we have extra reference if mirror differs with each other, we use
> checksum to judge which copy is correct.
>
>>
>> That's a rough idea that I hope will cover most of the cases when it
>> happens. With some more exhaustive attempts to rebuild the metadata
>> block we can try to repair 2 damaged blocks.
>>
>> As this is completely independent, we can test it separately, and also
>> add it as a rescue feature to the userspace tools.
>>
>>> Yes, this corruption case may be minor enough, since even corruption in
>>> one mirror is rare enough.
>>> So I didn't introduce a new CRC32 checksum, but use the extra 32-4 bytes
>>> to store the partial CRC32 to keep the backward compatibility.
>>
>> The above would work with any checksums, without the need to store the
>> per-block checksums which become impossible with strongher algorithms.
>>
> [FURTHER CSUM DESIGN]
> As you mentioned in later part, yes, stronger checksum like SHA256 won't
> be able to do such partial checksum as it uses all the space.
>
> But that's already a trade-off, right?
>
> If btrfs was using partial csum from the beginning, pros and cons of
> strong checksum should be quite obvious:
>
> Stronger checksum:
> (Pros)
> 1. Less confliction
> Which means more security.
>
> (Cons)
> 1. More space usage for data.
> 4Bytes/4K vs 32Bytes/4K. Obvious.
>
> 2. No/less partial checksum for metadata.
> Harder to detect/repair partial error.
>
> 3. (generally) More CPU time
> Normally, stronger checksums are cryptographic checksum, which uses
> more CPU time.
>
> So with this patch, I also want to inspire other developers about the
> trade off between different checksums.
> Especially the fact, we can make better use of the spare metadata csum
> space if the csum length is less than 32bytes.
>
> Thanks,
> Qu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2015-06-16 2:39 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-06-12 3:00 [PATCH RFC] btrfs: csum: Introduce partial csum for tree block Qu Wenruo
2015-06-12 14:10 ` Liu Bo
2015-06-12 16:23 ` Chris Mason
2015-06-15 8:02 ` Qu Wenruo
2015-06-15 13:15 ` David Sterba
2015-06-16 1:22 ` Qu Wenruo
2015-06-16 2:39 ` Qu Wenruo [this message]
2015-06-18 1:34 ` Qu Wenruo
2015-06-18 15:57 ` Facebook
2015-06-18 17:06 ` David Sterba
2015-06-19 1:26 ` Qu Wenruo
2015-06-25 15:31 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=557F8C78.7080304@cn.fujitsu.com \
--to=quwenruo@cn.fujitsu.com \
--cc=clm@fb.com \
--cc=dsterba@suse.cz \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox