From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Andrei Borzenkov <arvidjaar@gmail.com>
Cc: saranyag@cdac.in, linux-btrfs@vger.kernel.org
Subject: Re: Logical to Physical Address Mapping/Translation in Btrfs
Date: Wed, 20 Dec 2023 19:21:00 +1030 [thread overview]
Message-ID: <4b4089fd-9ea5-4b09-aafb-ed3cd6d51505@gmx.com> (raw)
In-Reply-To: <CAA91j0XoJ9JTqvFpeSJ+SKbhHc=QrX49SNJ0yJo+j8TjzGTsRw@mail.gmail.com>
On 2023/12/20 17:54, Andrei Borzenkov wrote:
> On Wed, Dec 20, 2023 at 1:20 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2023/12/19 21:59, saranyag@cdac.in wrote:
>>> Hi,
>>>
>>> May I know how the logical address is translated to the physical address in
>>> Btrfs?
>>
>> This is documented in btrfs-dev-docs/chunks.txt:
>>
>> https://github.com/btrfs/btrfs-dev-docs/blob/master/chunks.txt
>>
>
> I tried to read it three times and I still do not understand it.
>
> It starts with showing two chunks - metadata and data
>
> --><--
> item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 4593811456) itemoff 15863 itemsize 112
> length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
> io_align 65536 io_width 65536 sector_size 4096
> num_stripes 2 sub_stripes 1
> stripe 0 devid 2 offset 2425356288
> dev_uuid a7963b67-1277-49ff-bb1d-9d81c5605f1b
> stripe 1 devid 1 offset 2446327808
> dev_uuid 5f8b54f0-2a35-4330-a06b-9c8fd935bc36
>
> item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2446327808) itemoff 15975 itemsize 112
> length 2147483648 owner 2 stripe_len 65536 type DATA|RAID0
> io_align 65536 io_width 65536 sector_size 4096
> num_stripes 2 sub_stripes 1
> stripe 0 devid 2 offset 1351614464
> dev_uuid a7963b67-1277-49ff-bb1d-9d81c5605f1b
> stripe 1 devid 1 offset 1372585984
> dev_uuid 5f8b54f0-2a35-4330-a06b-9c8fd935bc36
> --><--
>
> Then it apparently talks about writing into metadata chunk, judging by
> the logical address
First, btrfs uses the terms "chunk" and "block group" interchangeably.
"Block group" is used more inside the extent tree, where we have
BLOCK_GROUP_ITEM, focusing on used/free space.
Meanwhile for logical -> physical mapping we use "chunk" more frequently,
as that's the name of the chunk tree and of CHUNK_ITEM.
>
> --><--
> Consider we want to write 2m at at 4596957184 - that's 3m past the start of
> the data chunk in the previous example. In order to see where in the physical
> stripe this write will go into we need to derive the following values:
>
> block group offset = [address within block group] - [start address of
> block group]
> block_group_offset = 4596957184 - 4593811456 = 3145728 => 3m
> --><--
>
> The chunk start 4593811456 is metadata. But when talking about
> physical location it suddenly takes address of the device extent of
> the data chunk
>
> --><--
> physical_address = [physical stripe start] + [logical_stripe] *
> [logical stripe_size]
> physical_address = 1351614464 + 48 * 64k = 1351614464 + 3145728 = 1354760192
> --><--
Damn it, this part is for RAID0 and is not correct, since our write
should land in the RAID1 METADATA chunk.
In that case things are pretty simple: for RAID1* every mirror is a full
copy of the others, with no striping/rotation.
Thus for that write at block group offset 3M, we write into both
mirrors:

physical_address = [physical stripe start] + [offset inside bg]

Thus the result should be:
Mirror 1 (devid 2) physical address = 2425356288 + 3M
Mirror 2 (devid 1) physical address = 2446327808 + 3M
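The RAID1 rule above can be sketched in a few lines of Python (the
function name is mine, not a btrfs API; the stripe list is copied from
the METADATA|RAID1 CHUNK_ITEM in the dump above):

```python
def map_raid1(bg_offset, stripes):
    """RAID1: every stripe is a full copy, so the same block-group
    offset is simply added to each dev extent start."""
    return [(devid, phys + bg_offset) for devid, phys in stripes]

# Stripes of the METADATA|RAID1 chunk: (devid, dev extent physical start).
mirrors = [(2, 2425356288), (1, 2446327808)]

# A write at 3M into the block group lands on both mirrors:
print(map_raid1(3 * 1024 * 1024, mirrors))
# -> [(2, 2428502016), (1, 2449473536)]
```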
>
> 1351614464 is the address of the stripe 0 of the data chunk. It is
> completely unclear whether it is intentional or not.
>
> Nor does it explain how the device extent (physical stripe) is
> selected and how it jumps from the block_group_offset to the
> (physical) stripe number.
For RAID0 the situation is a little more complex, but still easy to
understand.
First, a RAID0 chunk is split into 64K stripes, and the stripes are
spread across the devices in turn.
If we have a RAID0 chunk with 2 devices, like the data chunk example:

Off inside bg:  0          +64K       +128K      +192K
                | Stripe 0 | Stripe 1 | Stripe 2 | ...

Then Stripe 0 is in the first dev extent of that data chunk, at offset 0
into the dev extent (devid 2, physical 1351614464 + 0).
Stripe 1 is in the second dev extent, at offset 0 into the dev extent
(devid 1, physical 1372585984 + 0).
And Stripe 2 is in the first dev extent again, but at offset 64K into
the dev extent.
So a write at bg offset 1M in the RAID0 data chunk goes to:

1) stripe_nr = bg_offset / stripe_len
   1M / 64K = 16

2) Choose which dev extent to write into:
   stripe_index = stripe_nr % nr_dev
   16 % 2 = 0
   Thus we choose dev stripe 0 (devid 2, offset 1351614464) of that
   data chunk.

3) Final physical offset:
   physical_off = dev_extent_off + offset_in_stripe + skipped_physical
                = dev_extent_off + (bg_off & stripe_mask) +
                  stripe_nr / nr_dev * stripe_len
                = 1351614464 + (1M & (64K - 1)) + 16 / 2 * 64K
                = 1351614464 + 0 + 8 * 64K
Remember, all these calculations are the same as for regular RAID0.
For RAID10, it's RAID1 on top of each RAID0 stripe; just change the
nr_dev above to (nr_dev / 2).
In btrfs we use sub_stripes to distinguish the two: for RAID10
sub_stripes is always 2, while RAID0 always has sub_stripes 1.
Thus nr_dev above can be replaced by (nr_dev / sub_stripes), which
covers both RAID0 and RAID10.
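The three steps above, with the sub_stripes generalization, can be
sketched like this (a sketch under my reading of the layout, not kernel
code; it assumes the CHUNK_ITEM stripe list groups the sub_stripes
mirrors of one column as consecutive entries, and the stripe list for
the DATA|RAID0 chunk is taken from the dump above):

```python
def map_striped(bg_offset, stripes, stripe_len=64 * 1024, sub_stripes=1):
    """Map an offset inside a RAID0/RAID10 block group to the
    (devid, physical) targets.  stripes is the CHUNK_ITEM stripe list:
    [(devid, dev_extent_physical_start), ...]."""
    nr_data = len(stripes) // sub_stripes      # independent stripe columns
    stripe_nr = bg_offset // stripe_len        # which 64K logical stripe
    # First stripe entry of the chosen column (mirrors are consecutive):
    stripe_index = (stripe_nr % nr_data) * sub_stripes
    # Offset into the chosen dev extent: full rows already consumed by
    # this column, plus the offset inside the current stripe.
    dev_extent_off = (stripe_nr // nr_data) * stripe_len \
                     + bg_offset % stripe_len
    return [(devid, phys + dev_extent_off)
            for devid, phys in stripes[stripe_index:stripe_index + sub_stripes]]

# The DATA|RAID0 chunk from the dump: two stripes, no mirroring.
data_chunk = [(2, 1351614464), (1, 1372585984)]
print(map_striped(1024 * 1024, data_chunk))
# -> [(2, 1352138752)], i.e. devid 2, 1351614464 + 8 * 64K
```

For RAID10, calling it with sub_stripes=2 on a 4-stripe chunk returns
two targets per write, one for each mirror of the chosen column.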
>
> Intermixing "chunk" and "block group" does not help in understanding it either.
>
> And I suspect RAID5/6 is something entirely different ...
RAID5/6 is mostly based on RAID0, but with extra rotation involved, thus
more complex.
I can explain RAID56 in more detail once you have grasped the RAID0,
RAID1 and RAID10 parts.
Note that the RAID0 behavior is shared between LVM striping, dm-raid0
and btrfs RAID0 (IIRC), and the same goes for RAID1 among LVM
mirroring, dm-raid1 and btrfs RAID1.
Thanks,
Qu
>