From: Wang Yugui <wangyugui@e16-tech.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
Date: Thu, 07 Jul 2022 12:40:20 +0800
Message-ID: <20220707124019.98D5.409509F4@e16-tech.com>
In-Reply-To: <ebdc96e3-56f8-37e3-dba1-79622c860e2a@gmx.com>
Hi,
> On 2022/7/7 12:24, Wang Yugui wrote:
> > Hi,
> >
> >> [BACKGROUND]
> >> Unlike md-raid, btrfs RAID56 has nothing to sync its devices when power
> >> loss happens.
> >>
> >> For pure mirror based profiles it's fine as btrfs can utilize its csums
> >> to find the correct mirror to repair the bad ones.
> >>
> >> But for RAID56, the repair itself needs the data from other devices,
> >> thus any out-of-sync data can degrade the tolerance.
> >>
> >> Even worse, incorrect RMW can use the stale data to generate P/Q,
> >> removing the possibility of recovering the data.
> >>
> >>
> >> md-raid uses a write-intent bitmap to do faster resilver, and a
> >> journal (partial parity log for RAID5) to ensure it can even
> >> survive a power loss plus a device loss.
> >>
> >> [OBJECTIVE]
> >>
> >> This patchset will introduce a btrfs specific write-intent bitmap.
> >>
> >> The bitmap will be located at physical offset 1MiB of each device, and
> >> the content is the same on all devices.
> >>
> >> When there is a RAID56 write (currently all RAID56 write, including full
> >> stripe write), before submitting all the real bios to disks,
> >> write-intent bitmap will be updated and flushed to all writeable
> >> devices.
> >>
> >> So even if a power loss happens, at the next mount time we know which
> >> full stripes need to be checked, and can start a scrub for those involved
> >> logical bytenr ranges.
> >>
> >> [NO RECOVERY CODE YET]
> >>
> >> Unfortunately, this patchset only implements the write-intent bitmap
> >> code, the recovery part is still a place holder, as we need some scrub
> >> refactor to make it only scrub a logical bytenr range.
> >>
> >> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
> >>
> >> Since btrfs can utilize csum for its metadata and CoWed data, unlike
> >> dm-bitmap which can only be used for faster re-silver, we can fully
> >> rebuild the full stripe, as long as:
> >>
> >> 1) There is no missing device
> >> For missing device case, we still need to go full journal.
> >>
> >> 2) Untouched data stays untouched
> >> This should be mostly sane for sane hardware.
> >>
> >> And since the btrfs specific write-intent bitmaps are pretty small (4KiB
> >> in size), the overhead is much lower than a full journal.
> >>
> >> In the future, we may allow users to choose between just bitmaps or full
> >> journal to meet their requirement.
> >>
> >> [BITMAPS DESIGN]
> >>
> >> The bitmaps on-disk format looks like this:
> >>
> >> [ super ][ entry 1 ][ entry 2 ] ... [entry N]
> >> |<--------- super::size (4K) ------------->|
> >>
> >> Super block contains how many entries are in use.
> >>
> >> Each entry is 128 bits (16 bytes) in size, containing one u64 for
> >> bytenr, and u64 for one bitmap.
> >>
> >> And all utilized entries will be sorted in their bytenr order, and no
> >> bit can overlap.
> >>
> >> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
> >> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
> >
> > Can we skip the write-intent bitmap log if we already logged it in the
> > last N records (logrotate aware) to improve write performance? HDD sync
> > IOPS are very low.
>
> I'm not aware of the logrotate idea you mentioned, mind explaining it
> more?
>
> But the overall idea of journal/write-intent bitmaps is to always ensure
> there is something recording the full write or the write-intention
> before the real IO is submitted.
>
> So I'm afraid such behavior can not be changed much.
The basic idea is that we recover the data by scrub, not from a full log,
so log *1 and log *2 are equivalent for recovery, but log *2 needs fewer
sync IOPS?
log *1
    write(0-64K) log
    write(64K-128K) log
    write(128K-192K) log
    write(192K-256K) log
log *2
    write(0-256K) log
    already write(0-256K), skip
    already write(0-256K), skip
    already write(0-256K), skip
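To make the IOPS comparison concrete, here is a minimal Python sketch. It is my own illustration, not code from the patchset, and it assumes log *2 simply marks a coarser 256K region dirty on the first write. count_flushes and its granularity parameter are hypothetical names.

```python
KIB = 1024

def count_flushes(writes, granularity):
    """Count synchronous bitmap flushes for a list of (offset, length) writes.

    A flush is only needed when a write touches a region that is not yet
    marked dirty. `granularity` is the size of one logged region (assumed).
    """
    dirty = set()   # indexes of regions already logged as dirty
    flushes = 0
    for offset, length in writes:
        first = offset // granularity
        last = (offset + length - 1) // granularity
        needed = set(range(first, last + 1)) - dirty
        if needed:
            dirty |= needed
            flushes += 1   # one bitmap update before the real IO is submitted
    return flushes

# Four sequential 64K writes covering 0-256K, as in the example above.
writes = [(i * 64 * KIB, 64 * KIB) for i in range(4)]
print(count_flushes(writes, 64 * KIB))    # log *1: every write is logged
print(count_flushes(writes, 256 * KIB))   # log *2: only the first write is logged
```

Under these assumptions the four writes cost four bitmap flushes with 64K logging and one flush with 256K logging. The trade-off is that a later scrub would cover the whole 256K region even if only part of it was written.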
We can search the entry we are currently using, but we can not search the
previous entries, because they may have been logrotated?
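The check against the current entry could look roughly like the following Python sketch. It only relies on the entry layout quoted above (one u64 bytenr plus one u64 bitmap, one bit per 64K block). range_mask and already_logged are hypothetical helpers, not functions from the patchset.

```python
STRIPE_LEN = 64 * 1024                    # BTRFS_STRIPE_LEN, per the cover letter
BITS_PER_ENTRY = 64                       # one u64 bitmap per entry
ENTRY_SPAN = STRIPE_LEN * BITS_PER_ENTRY  # 4MiB covered by one entry

def range_mask(entry_bytenr, start, length):
    """Bits that [start, start + length) occupies inside one entry.

    Assumes length > 0 and that the range lies fully inside the entry.
    """
    first = (start - entry_bytenr) // STRIPE_LEN
    last = (start + length - 1 - entry_bytenr) // STRIPE_LEN
    mask = 0
    for bit in range(first, last + 1):
        mask |= 1 << bit
    return mask

def already_logged(entry_bytenr, entry_bitmap, start, length):
    """True if every block of the range is already dirty in this entry."""
    end = start + length
    if start < entry_bytenr or end > entry_bytenr + ENTRY_SPAN:
        return False   # range is not fully inside this entry
    mask = range_mask(entry_bytenr, start, length)
    return (entry_bitmap & mask) == mask

# Entry at bytenr 0 with 0-256K already marked dirty (low four bits set).
current_bitmap = range_mask(0, 0, 256 * 1024)
print(already_logged(0, current_bitmap, 64 * 1024, 64 * 1024))    # inside 0-256K
print(already_logged(0, current_bitmap, 256 * 1024, 64 * 1024))   # next 64K block
```

If already_logged returns True the bitmap flush could be skipped. Otherwise the bits are set and flushed as in the current patchset.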
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/07/07