* [PATCH v2 1/4] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace()
2023-02-07 4:26 [PATCH v2 0/4] btrfs: reduce the memory usage for btrfs_io_context, and reduce its variable sized members Qu Wenruo
@ 2023-02-07 4:26 ` Qu Wenruo
2023-02-07 4:26 ` [PATCH v2 2/4] btrfs: small improvement for btrfs_io_context structure Qu Wenruo
` (3 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-02-07 4:26 UTC (permalink / raw)
To: linux-btrfs; +Cc: Johannes Thumshirn, Anand Jain
There is no memory re-allocation in handle_ops_on_dev_replace(), thus
we don't need to pass a struct btrfs_io_context ** pointer.
A regular struct btrfs_io_context * pointer is enough.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/volumes.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5dcc6204fa4d..b5a6262092d1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6166,12 +6166,11 @@ static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
}
static void handle_ops_on_dev_replace(enum btrfs_map_op op,
- struct btrfs_io_context **bioc_ret,
+ struct btrfs_io_context *bioc,
struct btrfs_dev_replace *dev_replace,
u64 logical,
int *num_stripes_ret, int *max_errors_ret)
{
- struct btrfs_io_context *bioc = *bioc_ret;
u64 srcdev_devid = dev_replace->srcdev->devid;
int tgtdev_indexes = 0;
int num_stripes = *num_stripes_ret;
@@ -6260,7 +6259,6 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
*num_stripes_ret = num_stripes;
*max_errors_ret = max_errors;
bioc->num_tgtdevs = tgtdev_indexes;
- *bioc_ret = bioc;
}
static bool need_full_stripe(enum btrfs_map_op op)
@@ -6605,7 +6603,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
need_full_stripe(op)) {
- handle_ops_on_dev_replace(op, &bioc, dev_replace, logical,
+ handle_ops_on_dev_replace(op, bioc, dev_replace, logical,
&num_stripes, &max_errors);
}
--
2.39.1
* [PATCH v2 2/4] btrfs: small improvement for btrfs_io_context structure
2023-02-07 4:26 [PATCH v2 0/4] btrfs: reduce the memory usage for btrfs_io_context, and reduce its variable sized members Qu Wenruo
2023-02-07 4:26 ` [PATCH v2 1/4] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace() Qu Wenruo
@ 2023-02-07 4:26 ` Qu Wenruo
2023-02-15 20:02 ` David Sterba
2023-02-07 4:26 ` [PATCH v2 3/4] btrfs: use a more space efficient way to represent the source of duplicated stripes Qu Wenruo
` (2 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2023-02-07 4:26 UTC (permalink / raw)
To: linux-btrfs
That structure is our ultimate objective for all the __btrfs_map_block()
related functions.
We have some hard-to-understand members, like tgtdev_map, without
any comments.
This patch will improve the situation by:
- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
tgtdev_map[]
Especially for the last two members, add a dedicated (thus very long)
comment for them, with an example to explain it.
- Shrink those int members to u16.
In fact our on-disk format only uses u16 for num_stripes, thus
there is no need for int at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/volumes.c | 16 ++++++++------
fs/btrfs/volumes.h | 53 +++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 57 insertions(+), 12 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b5a6262092d1..ebdb4261bf12 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5884,16 +5884,18 @@ static void sort_parity_stripes(struct btrfs_io_context *bioc, int num_stripes)
}
static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
- int total_stripes,
- int real_stripes)
+ u16 total_stripes,
+ u16 real_stripes)
{
- struct btrfs_io_context *bioc = kzalloc(
+ struct btrfs_io_context *bioc;
+
+ bioc = kzalloc(
/* The size of btrfs_io_context */
sizeof(struct btrfs_io_context) +
/* Plus the variable array for the stripes */
sizeof(struct btrfs_io_stripe) * (total_stripes) +
/* Plus the variable array for the tgt dev */
- sizeof(int) * (real_stripes) +
+ sizeof(u16) * (real_stripes) +
/*
* Plus the raid_map, which includes both the tgt dev
* and the stripes.
@@ -5907,7 +5909,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
refcount_set(&bioc->refs, 1);
bioc->fs_info = fs_info;
- bioc->tgtdev_map = (int *)(bioc->stripes + total_stripes);
+ bioc->tgtdev_map = (u16 *)(bioc->stripes + total_stripes);
bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
return bioc;
@@ -6377,12 +6379,12 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
int mirror_num = (mirror_num_ret ? *mirror_num_ret : 0);
int num_stripes;
int max_errors = 0;
- int tgtdev_indexes = 0;
struct btrfs_io_context *bioc = NULL;
struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
int dev_replace_is_ongoing = 0;
- int num_alloc_stripes;
int patch_the_first_stripe_for_dev_replace = 0;
+ u16 num_alloc_stripes;
+ u16 tgtdev_indexes = 0;
u64 physical_to_patch_in_first_stripe = 0;
u64 raid56_full_stripe_start = (u64)-1;
struct btrfs_io_geometry geom;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 46f33b66d85e..bf7e7c08d8f7 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -422,11 +422,54 @@ struct btrfs_io_context {
u64 map_type; /* get from map_lookup->type */
struct bio *orig_bio;
atomic_t error;
- int max_errors;
- int num_stripes;
- int mirror_num;
- int num_tgtdevs;
- int *tgtdev_map;
+ u16 max_errors;
+
+ /*
+ * The total number of stripes, including the extra duplicated
+ * stripe for replace.
+ */
+ u16 num_stripes;
+
+ /*
+ * The mirror_num of this bioc.
+ *
+ * This is for reads which uses 0 as mirror_num, thus we should
+ * return a valid mirror_num (>0) for the reader.
+ */
+ u16 mirror_num;
+
+ /*
+ * The following two members are for dev-replace case only.
+ *
+ * @num_tgtdevs: Number of duplicated stripes which needs to be
+ * written to replace target.
+ * Should be <= 2 (2 for DUP, otherwise <= 1).
+ * @tgtdev_map: The array indicates where the duplicated stripes
+ * are from. The size is the number of original
+ * stripes (num_stripes - num_tgtdevs).
+ *
+ * The @tgtdev_map[] array is mostly for RAID56 cases.
+ * As non-RAID56 stripes share the same contents for the mapped range,
+ * thus no need to bother where the duplicated ones are from.
+ *
+ * But for RAID56 case, all stripes contain different contents, thus
+ * must need a way to know the mapping.
+ *
+ * There is an example for the two members, using a RAID5 write:
+ * num_stripes: 4 (3 + 1 duplicated write)
+ * stripes[0]: dev = devid 1, physical = X
+ * stripes[1]: dev = devid 2, physical = Y
+ * stripes[2]: dev = devid 3, physical = Z
+ * stripes[3]: dev = devid 0, physical = Y
+ *
+ * num_tgtdevs = 1
+ * tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
+ * tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
+ * and it's duplicated to stripes[3].
+ * tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
+ */
+ u16 num_tgtdevs;
+ u16 *tgtdev_map;
/*
* logical block numbers for the start of each stripe
* The last one or two are p/q. These are sorted,
--
2.39.1
* Re: [PATCH v2 2/4] btrfs: small improvement for btrfs_io_context structure
2023-02-07 4:26 ` [PATCH v2 2/4] btrfs: small improvement for btrfs_io_context structure Qu Wenruo
@ 2023-02-15 20:02 ` David Sterba
2023-02-17 5:03 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2023-02-15 20:02 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Feb 07, 2023 at 12:26:13PM +0800, Qu Wenruo wrote:
> That structure is our ultimate objective for all __btrfs_map_block()
> related functions.
>
> We have some hard to understand members, like tgtdev_map, but without
> any comments.
>
> This patch will improve the situation by:
>
> - Add extra comments for num_stripes, mirror_num, num_tgtdevs and
> tgtdev_map[]
> Especially for the last two members, add a dedicated (thus very long)
> comments for them, with example to explain it.
>
> - Shrink those int members to u16.
> In fact our on-disk format is only using u16 for num_stripes, thus
> no need to go int at all.
Note that u16 may be good for space savings in structures, but otherwise
it's a type that's a bit clumsy for the CPU to handle. Using int for
passing it around does not require masking or other sorts of conversions
when arithmetic is done. This can be cleaned up later with close
inspection of the effects; so far we have u16 for fs_info::csum_type and
some item lengths.
* Re: [PATCH v2 2/4] btrfs: small improvement for btrfs_io_context structure
2023-02-15 20:02 ` David Sterba
@ 2023-02-17 5:03 ` Qu Wenruo
2023-02-23 20:55 ` David Sterba
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2023-02-17 5:03 UTC (permalink / raw)
To: dsterba, Qu Wenruo; +Cc: linux-btrfs
On 2023/2/16 04:02, David Sterba wrote:
> On Tue, Feb 07, 2023 at 12:26:13PM +0800, Qu Wenruo wrote:
>> That structure is our ultimate objective for all __btrfs_map_block()
>> related functions.
>>
>> We have some hard to understand members, like tgtdev_map, but without
>> any comments.
>>
>> This patch will improve the situation by:
>>
>> - Add extra comments for num_stripes, mirror_num, num_tgtdevs and
>> tgtdev_map[]
>> Especially for the last two members, add a dedicated (thus very long)
>> comments for them, with example to explain it.
>>
>> - Shrink those int members to u16.
>> In fact our on-disk format is only using u16 for num_stripes, thus
>> no need to go int at all.
>
> Note that u16 is maybe good for space saving in structures but otherwise
> it's a type that's a bit clumsy for CPU to handle. Int for passing it
> around does not require masking or other sorts of conversions when
> there's arithmetic done. This can be cleaned up later with close
> inspection of the effects, so far we have u16 for fs_info::csum_type or
> some item lengths.
I'm not sure whether the extra masking on the CPU is a big deal.
If the compiler chooses to do extra padding and we're not saving much
space, I'll still stick to u16.
From my recent learning of Rust, one of its best practices is the strong
size limit on variables.
For a lot of usages we don't really need 32-bit values, and in those
cases I strongly prefer narrower variables.
If there is a better way to let the compiler do 32-bit operations while
still enforcing a u16 size limit, I'm happy to follow it.
Otherwise I think the extra CPU time (if any) is still worth it for a
more explicit size limit.
BTW, I'll refresh this series along with its dependency patchset "btrfs:
reduce div64 calls for __btrfs_map_block() and its variants".
I guess you didn't apply the dependency first because of quite a few
conflicts with HCH's io_geometry cleanup series.
I'll use the patches from the for-next branch to keep all your fixes.
Thanks,
Qu
* Re: [PATCH v2 2/4] btrfs: small improvement for btrfs_io_context structure
2023-02-17 5:03 ` Qu Wenruo
@ 2023-02-23 20:55 ` David Sterba
0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2023-02-23 20:55 UTC (permalink / raw)
To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs
On Fri, Feb 17, 2023 at 01:03:28PM +0800, Qu Wenruo wrote:
>
>
> On 2023/2/16 04:02, David Sterba wrote:
> > On Tue, Feb 07, 2023 at 12:26:13PM +0800, Qu Wenruo wrote:
> >> That structure is our ultimate objective for all __btrfs_map_block()
> >> related functions.
> >>
> >> We have some hard to understand members, like tgtdev_map, but without
> >> any comments.
> >>
> >> This patch will improve the situation by:
> >>
> >> - Add extra comments for num_stripes, mirror_num, num_tgtdevs and
> >> tgtdev_map[]
> >> Especially for the last two members, add a dedicated (thus very long)
> >> comments for them, with example to explain it.
> >>
> >> - Shrink those int members to u16.
> >> In fact our on-disk format is only using u16 for num_stripes, thus
> >> no need to go int at all.
> >
> > Note that u16 is maybe good for space saving in structures but otherwise
> > it's a type that's a bit clumsy for CPU to handle. Int for passing it
> > around does not require masking or other sorts of conversions when
> > there's arithmetic done. This can be cleaned up later with close
> > inspection of the effects, so far we have u16 for fs_info::csum_type or
> > some item lengths.
>
> I'm not sure if the extra masking for CPU is a big deal.
It's a microoptimization, so it's always a question whether it's worth
it or not. More instructions to do the same thing fill the i-cache, and
on hot paths that counts. We have a lot of IO in btrfs code so we are
not focused on such things. In this case I don't want to let us use u16
freely, as it's not a native type and any operation on the data will
lead to asm code bloat here and there.
I measure the size of the final .ko object for many patches to see the
effects of various changes, and the size tendency is upward. Over time
this makes the code less efficient regarding CPU.
> If the compiler choose to do extra padding and not saving much space,
> I'll still stick to u16.
>
> From my recent learning on Rust, one of the best practice is the strong
> size limits on their variables.
> For a lot of usages, we shouldn't really need to 32bit values, and in
> that case, I strongly prefer narrower variables.
I'd prefer to keep using native CPU types, so that the narrow types
live in structures (either in memory or on disk) but the calculations
are done in natural types and then checked for eventual overflows,
instead of hoping that silent overflows won't hurt.
> If there is a better way to let compiler do 32bit operations while still
> do a u16 size limits, I'm pretty happy to follow.
read u16 -> int/u32 -> do the calculations -> check/assert -> write back
to u16
We have to do that explicitly for validating input; that Rust does
something else won't help us.
> Otherwise I think the extra CPU time (if any) is still worthy for a more
> explicit size limit.
Yeah, it's not directly measurable and would depend on other things like
cachelines and internal CPU state at the time of execution, but my point
is to think about such changes and not let inefficiencies silently creep
in.
* [PATCH v2 3/4] btrfs: use a more space efficient way to represent the source of duplicated stripes
2023-02-07 4:26 [PATCH v2 0/4] btrfs: reduce the memory usage for btrfs_io_context, and reduce its variable sized members Qu Wenruo
2023-02-07 4:26 ` [PATCH v2 1/4] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace() Qu Wenruo
2023-02-07 4:26 ` [PATCH v2 2/4] btrfs: small improvement for btrfs_io_context structure Qu Wenruo
@ 2023-02-07 4:26 ` Qu Wenruo
2023-02-07 4:26 ` [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value Qu Wenruo
2023-02-15 20:07 ` [PATCH v2 0/4] btrfs: reduce the memory usage for btrfs_io_context, and reduce its variable sized members David Sterba
4 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-02-07 4:26 UTC (permalink / raw)
To: linux-btrfs
For btrfs dev-replace, we have to duplicate the writes going to the
source device onto the target device.
For non-RAID56, all writes into the same mapped range share the same
content, thus we don't need to track anything extra.
(E.g. in btrfs_submit_bio(), for a non-RAID56 range we just submit the
same write to all involved devices.)
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex scheme, the tgtdev_map[] array, e.g.:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this wastes some space, and it ignores one important fact about
dev-replace: there is at most ONE running replace at any time.
Thus we can change it to two fixed members to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space in bioc for RAID56 chunks with many
devices.
And we get rid of one variable-sized array in bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes, while the
mapping representation is completely changed.
- Add extra ASSERT()s for the RAID56 code.
- Only add two extra stripes for dev-replace cases, as we have an
upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace().
Previously handle_ops_on_dev_replace() went down two different paths
for WRITE and GET_READ_MIRRORS.
Now unify them by always going down the WRITE path first (with at most
2 replace stripes), then, if we're doing GET_READ_MIRRORS and have 2
extra stripes, dropping one of them.
- Remove the @real_stripes argument from alloc_btrfs_io_context(),
as we no longer need the old variable-length array.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/raid56.c | 39 ++++++++---
fs/btrfs/scrub.c | 4 +-
fs/btrfs/volumes.c | 170 ++++++++++++++++++++-------------------------
fs/btrfs/volumes.h | 26 +++----
4 files changed, 124 insertions(+), 115 deletions(-)
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d095c07a152d..4332a8e165a7 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -912,7 +912,7 @@ static struct sector_ptr *sector_in_rbio(struct btrfs_raid_bio *rbio,
static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
struct btrfs_io_context *bioc)
{
- const unsigned int real_stripes = bioc->num_stripes - bioc->num_tgtdevs;
+ const unsigned int real_stripes = bioc->num_stripes - bioc->replace_nr_stripes;
const unsigned int stripe_npages = BTRFS_STRIPE_LEN >> PAGE_SHIFT;
const unsigned int num_pages = stripe_npages * real_stripes;
const unsigned int stripe_nsectors =
@@ -1275,10 +1275,17 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
goto error;
}
- if (likely(!rbio->bioc->num_tgtdevs))
+ if (likely(!rbio->bioc->replace_nr_stripes))
return 0;
- /* Make a copy for the replace target device. */
+ /*
+ * Make a copy for the replace target device.
+ *
+ * Thus the source stripe number (in replace_stripe_src)
+ * should be valid.
+ */
+ ASSERT(rbio->bioc->replace_stripe_src >= 0);
+
for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
total_sector_nr++) {
struct sector_ptr *sector;
@@ -1286,7 +1293,12 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
stripe = total_sector_nr / rbio->stripe_nsectors;
sectornr = total_sector_nr % rbio->stripe_nsectors;
- if (!rbio->bioc->tgtdev_map[stripe]) {
+ /*
+ * For RAID56, there is only one device that can be replaced,
+ * and replace_stripe_src[0] indicates the stripe number
+ * we need to copy from.
+ */
+ if (stripe != rbio->bioc->replace_stripe_src) {
/*
* We can skip the whole stripe completely, note
* total_sector_nr will be increased by one anyway.
@@ -1309,7 +1321,7 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
}
ret = rbio_add_io_sector(rbio, bio_list, sector,
- rbio->bioc->tgtdev_map[stripe],
+ rbio->real_stripes,
sectornr, REQ_OP_WRITE);
if (ret)
goto error;
@@ -2518,7 +2530,12 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
else
BUG();
- if (bioc->num_tgtdevs && bioc->tgtdev_map[rbio->scrubp]) {
+ /*
+ * Replace is running and our P/Q stripe is being replaced, then
+ * we need to duplicated the final write to replace target.
+ */
+ if (bioc->replace_nr_stripes &&
+ bioc->replace_stripe_src == rbio->scrubp) {
is_replace = 1;
bitmap_copy(pbitmap, &rbio->dbitmap, rbio->stripe_nsectors);
}
@@ -2620,13 +2637,19 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
if (!is_replace)
goto submit_write;
+ /*
+ * Replace is running and our parity stripe needs to be duplicated
+ * to the target device.
+ * Checks we have valid source stripe number.
+ */
+ ASSERT(rbio->bioc->replace_stripe_src >= 0);
for_each_set_bit(sectornr, pbitmap, rbio->stripe_nsectors) {
struct sector_ptr *sector;
sector = rbio_stripe_sector(rbio, rbio->scrubp, sectornr);
ret = rbio_add_io_sector(rbio, &bio_list, sector,
- bioc->tgtdev_map[rbio->scrubp],
- sectornr, REQ_OP_WRITE);
+ rbio->real_stripes,
+ sectornr, REQ_OP_WRITE);
if (ret)
goto cleanup;
}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3b4e621e822f..83de9fecc7b6 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1230,7 +1230,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
sblock_other = sblocks_for_recheck[mirror_index];
} else {
struct scrub_recover *r = sblock_bad->sectors[0]->recover;
- int max_allowed = r->bioc->num_stripes - r->bioc->num_tgtdevs;
+ int max_allowed = r->bioc->num_stripes - r->bioc->replace_nr_stripes;
if (mirror_index >= max_allowed)
break;
@@ -1540,7 +1540,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
bioc->map_type,
bioc->raid_map,
bioc->num_stripes -
- bioc->num_tgtdevs,
+ bioc->replace_nr_stripes,
mirror_index,
&stripe_index,
&stripe_offset);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ebdb4261bf12..860bb2709778 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5884,8 +5884,7 @@ static void sort_parity_stripes(struct btrfs_io_context *bioc, int num_stripes)
}
static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
- u16 total_stripes,
- u16 real_stripes)
+ u16 total_stripes)
{
struct btrfs_io_context *bioc;
@@ -5894,8 +5893,6 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
sizeof(struct btrfs_io_context) +
/* Plus the variable array for the stripes */
sizeof(struct btrfs_io_stripe) * (total_stripes) +
- /* Plus the variable array for the tgt dev */
- sizeof(u16) * (real_stripes) +
/*
* Plus the raid_map, which includes both the tgt dev
* and the stripes.
@@ -5909,8 +5906,8 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
refcount_set(&bioc->refs, 1);
bioc->fs_info = fs_info;
- bioc->tgtdev_map = (u16 *)(bioc->stripes + total_stripes);
- bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
+ bioc->raid_map = (u64 *)(bioc->stripes + total_stripes);
+ bioc->replace_stripe_src = -1;
return bioc;
}
@@ -6174,93 +6171,78 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
int *num_stripes_ret, int *max_errors_ret)
{
u64 srcdev_devid = dev_replace->srcdev->devid;
- int tgtdev_indexes = 0;
+ /*
+ * At this stage, num_stripes is still the real number of stripes,
+ * excluding the duplicated stripes.
+ */
int num_stripes = *num_stripes_ret;
+ int nr_extra_stripes = 0;
int max_errors = *max_errors_ret;
int i;
- if (op == BTRFS_MAP_WRITE) {
- int index_where_to_add;
+ /*
+ * A block group which has "to_copy" set will eventually be
+ * copied by the dev-replace process. We can avoid cloning IO here.
+ */
+ if (is_block_group_to_copy(dev_replace->srcdev->fs_info, logical))
+ return;
+
+ /*
+ * Duplicate the write operations while the dev replace
+ * procedure is running. Since the copying of the old disk to
+ * the new disk takes place at run time while the filesystem is
+ * mounted writable, the regular write operations to the old
+ * disk have to be duplicated to go to the new disk as well.
+ *
+ * Note that device->missing is handled by the caller, and that
+ * the write to the old disk is already set up in the stripes
+ * array.
+ */
+ for (i = 0; i < num_stripes; i++) {
+ struct btrfs_io_stripe *old = &bioc->stripes[i];
+ struct btrfs_io_stripe *new = &bioc->stripes[
+ num_stripes + nr_extra_stripes];
+
+ if (old->dev->devid != srcdev_devid)
+ continue;
+
+ new->physical = old->physical;
+ new->dev = dev_replace->tgtdev;
+ if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+ bioc->replace_stripe_src = i;
+ nr_extra_stripes++;
+ }
+
+ /* We can only have at most 2 extra nr_stripes (for DUP). */
+ ASSERT(nr_extra_stripes <= 2);
+ /*
+ * For GET_READ_MIRRORS, we can only return at most 1 extra stripes for
+ * replace.
+ * If we have 2 extra stripes, only choose the one with smaller physical.
+ */
+ if (op == BTRFS_MAP_GET_READ_MIRRORS && nr_extra_stripes == 2) {
+ struct btrfs_io_stripe *first = &bioc->stripes[num_stripes];
+ struct btrfs_io_stripe *second = &bioc->stripes[num_stripes + 1];
+
+ /* Only DUP can have two extra stripes. */
+ ASSERT(bioc->map_type & BTRFS_BLOCK_GROUP_DUP);
/*
- * A block group which have "to_copy" set will eventually
- * copied by dev-replace process. We can avoid cloning IO here.
+ * Swap the last stripe stripes and just reduce
+ * @nr_extra_stripes.
+ * The extra stripe would still be there, but won't be
+ * accessed.
*/
- if (is_block_group_to_copy(dev_replace->srcdev->fs_info, logical))
- return;
-
- /*
- * duplicate the write operations while the dev replace
- * procedure is running. Since the copying of the old disk to
- * the new disk takes place at run time while the filesystem is
- * mounted writable, the regular write operations to the old
- * disk have to be duplicated to go to the new disk as well.
- *
- * Note that device->missing is handled by the caller, and that
- * the write to the old disk is already set up in the stripes
- * array.
- */
- index_where_to_add = num_stripes;
- for (i = 0; i < num_stripes; i++) {
- if (bioc->stripes[i].dev->devid == srcdev_devid) {
- /* write to new disk, too */
- struct btrfs_io_stripe *new =
- bioc->stripes + index_where_to_add;
- struct btrfs_io_stripe *old =
- bioc->stripes + i;
-
- new->physical = old->physical;
- new->dev = dev_replace->tgtdev;
- bioc->tgtdev_map[i] = index_where_to_add;
- index_where_to_add++;
- max_errors++;
- tgtdev_indexes++;
- }
- }
- num_stripes = index_where_to_add;
- } else if (op == BTRFS_MAP_GET_READ_MIRRORS) {
- int index_srcdev = 0;
- int found = 0;
- u64 physical_of_found = 0;
-
- /*
- * During the dev-replace procedure, the target drive can also
- * be used to read data in case it is needed to repair a corrupt
- * block elsewhere. This is possible if the requested area is
- * left of the left cursor. In this area, the target drive is a
- * full copy of the source drive.
- */
- for (i = 0; i < num_stripes; i++) {
- if (bioc->stripes[i].dev->devid == srcdev_devid) {
- /*
- * In case of DUP, in order to keep it simple,
- * only add the mirror with the lowest physical
- * address
- */
- if (found &&
- physical_of_found <= bioc->stripes[i].physical)
- continue;
- index_srcdev = i;
- found = 1;
- physical_of_found = bioc->stripes[i].physical;
- }
- }
- if (found) {
- struct btrfs_io_stripe *tgtdev_stripe =
- bioc->stripes + num_stripes;
-
- tgtdev_stripe->physical = physical_of_found;
- tgtdev_stripe->dev = dev_replace->tgtdev;
- bioc->tgtdev_map[index_srcdev] = num_stripes;
-
- tgtdev_indexes++;
- num_stripes++;
+ if (first->physical > second->physical) {
+ swap(second->physical, first->physical);
+ swap(second->dev, first->dev);
+ nr_extra_stripes--;
}
}
- *num_stripes_ret = num_stripes;
- *max_errors_ret = max_errors;
- bioc->num_tgtdevs = tgtdev_indexes;
+ *num_stripes_ret = num_stripes + nr_extra_stripes;
+ *max_errors_ret = max_errors + nr_extra_stripes;
+ bioc->replace_nr_stripes = nr_extra_stripes;
}
static bool need_full_stripe(enum btrfs_map_op op)
@@ -6384,7 +6366,6 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
int dev_replace_is_ongoing = 0;
int patch_the_first_stripe_for_dev_replace = 0;
u16 num_alloc_stripes;
- u16 tgtdev_indexes = 0;
u64 physical_to_patch_in_first_stripe = 0;
u64 raid56_full_stripe_start = (u64)-1;
struct btrfs_io_geometry geom;
@@ -6534,13 +6515,16 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
}
num_alloc_stripes = num_stripes;
- if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL) {
- if (op == BTRFS_MAP_WRITE)
- num_alloc_stripes <<= 1;
- if (op == BTRFS_MAP_GET_READ_MIRRORS)
- num_alloc_stripes++;
- tgtdev_indexes = num_stripes;
- }
+ if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
+ op != BTRFS_MAP_READ)
+ /*
+ * For replace case, we need to add extra stripes for extra
+ * duplicated stripes.
+ *
+ * For both WRITE and GET_READ_MIRRORS, we may have
+ * at most 2 more stripes (DUP types, otherwise 1).
+ */
+ num_alloc_stripes += 2;
/*
* If this I/O maps to a single device, try to return the device and
@@ -6565,11 +6549,12 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
goto out;
}
- bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes, tgtdev_indexes);
+ bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes);
if (!bioc) {
ret = -ENOMEM;
goto out;
}
+ bioc->map_type = map->type;
for (i = 0; i < num_stripes; i++) {
set_io_stripe(&bioc->stripes[i], map, stripe_index, stripe_offset,
@@ -6610,7 +6595,6 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
}
*bioc_ret = bioc;
- bioc->map_type = map->type;
bioc->num_stripes = num_stripes;
bioc->max_errors = max_errors;
bioc->mirror_num = mirror_num;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index bf7e7c08d8f7..f35db441ea9c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -441,14 +441,13 @@ struct btrfs_io_context {
/*
* The following two members are for dev-replace case only.
*
- * @num_tgtdevs: Number of duplicated stripes which needs to be
+ * @replace_nr_stripes: Number of duplicated stripes which needs to be
* written to replace target.
* Should be <= 2 (2 for DUP, otherwise <= 1).
- * @tgtdev_map: The array indicates where the duplicated stripes
- * are from. The size is the number of original
- * stripes (num_stripes - num_tgtdevs).
+ * @replace_stripe_src: The array indicates where the duplicated stripes
+ * are from.
*
- * The @tgtdev_map[] array is mostly for RAID56 cases.
+ * The @replace_stripe_src[] array is mostly for RAID56 cases.
* As non-RAID56 stripes share the same contents for the mapped range,
* thus no need to bother where the duplicated ones are from.
*
@@ -462,14 +461,17 @@ struct btrfs_io_context {
* stripes[2]: dev = devid 3, physical = Z
* stripes[3]: dev = devid 0, physical = Y
*
- * num_tgtdevs = 1
- * tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
- * tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
- * and it's duplicated to stripes[3].
- * tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
+ * replace_nr_stripes = 1
+ * replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
+ * The duplicated stripe index would be
+ * (@num_stripes - 1).
+ *
+ * Note that, we can still have cases replace_nr_stripes = 2 for DUP.
+ * In that case, all stripes share the same content, thus we don't
+ * need to bother @replace_stripe_src value at all.
*/
- u16 num_tgtdevs;
- u16 *tgtdev_map;
+ u16 replace_nr_stripes;
+ s16 replace_stripe_src;
/*
* logical block numbers for the start of each stripe
* The last one or two are p/q. These are sorted,
--
2.39.1
* [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value
2023-02-07 4:26 [PATCH v2 0/4] btrfs: reduce the memory usage for btrfs_io_context, and reduce its variable sized members Qu Wenruo
` (2 preceding siblings ...)
2023-02-07 4:26 ` [PATCH v2 3/4] btrfs: use a more space efficient way to represent the source of duplicated stripes Qu Wenruo
@ 2023-02-07 4:26 ` Qu Wenruo
2023-02-15 20:07 ` David Sterba
2023-02-20 8:53 ` Geert Uytterhoeven
2023-02-15 20:07 ` [PATCH v2 0/4] btrfs: reduce the memory usage for btrfs_io_context, and reduce its variable sized members David Sterba
4 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-02-07 4:26 UTC (permalink / raw)
To: linux-btrfs
In the btrfs_io_context structure, we have a pointer, raid_map, which
indicates the logical bytenr of each stripe.
But since we always call sort_parity_stripes(), the resulting
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.
So why waste the space, and the time spent sorting, on raid_map[]?
This patch replaces btrfs_io_context::raid_map with a single u64
number, full_stripe_logical, by:
- Replacing btrfs_io_context::raid_map with full_stripe_logical
- Converting call sites using raid_map[0] to use full_stripe_logical
- Converting call sites using raid_map[i] to compare against
nr_data_stripes instead
The benefits are:
- Less memory wasted on raid_map
It's sizeof(u64) * num_stripes vs sizeof(u64).
It always saves at least one u64, and the benefit grows larger with
num_stripes.
- No more weird alloc_btrfs_io_context() behavior
As there is now only one fixed-size struct plus one variable-length
array.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/raid56.c | 32 ++++++-------
fs/btrfs/scrub.c | 26 ++++++-----
fs/btrfs/volumes.c | 87 +++++++++++++++---------------------
fs/btrfs/volumes.h | 18 ++++++--
include/trace/events/btrfs.h | 2 +-
5 files changed, 81 insertions(+), 84 deletions(-)
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 4332a8e165a7..febde169b68a 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -202,7 +202,7 @@ static void cache_rbio_pages(struct btrfs_raid_bio *rbio)
*/
static int rbio_bucket(struct btrfs_raid_bio *rbio)
{
- u64 num = rbio->bioc->raid_map[0];
+ u64 num = rbio->bioc->full_stripe_logical;
/*
* we shift down quite a bit. We're using byte
@@ -571,7 +571,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
test_bit(RBIO_CACHE_BIT, &cur->flags))
return 0;
- if (last->bioc->raid_map[0] != cur->bioc->raid_map[0])
+ if (last->bioc->full_stripe_logical != cur->bioc->full_stripe_logical)
return 0;
/* we can't merge with different operations */
@@ -666,7 +666,8 @@ static noinline int lock_stripe_add(struct btrfs_raid_bio *rbio)
spin_lock_irqsave(&h->lock, flags);
list_for_each_entry(cur, &h->hash_list, hash_list) {
- if (cur->bioc->raid_map[0] != rbio->bioc->raid_map[0])
+ if (cur->bioc->full_stripe_logical !=
+ rbio->bioc->full_stripe_logical)
continue;
spin_lock(&cur->bio_list_lock);
@@ -1119,7 +1120,7 @@ static void index_one_bio(struct btrfs_raid_bio *rbio, struct bio *bio)
struct bio_vec bvec;
struct bvec_iter iter;
u32 offset = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
- rbio->bioc->raid_map[0];
+ rbio->bioc->full_stripe_logical;
bio_for_each_segment(bvec, bio, iter) {
u32 bvec_offset;
@@ -1338,7 +1339,7 @@ static void set_rbio_range_error(struct btrfs_raid_bio *rbio, struct bio *bio)
{
struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
u32 offset = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
- rbio->bioc->raid_map[0];
+ rbio->bioc->full_stripe_logical;
int total_nr_sector = offset >> fs_info->sectorsize_bits;
ASSERT(total_nr_sector < rbio->nr_data * rbio->stripe_nsectors);
@@ -1648,7 +1649,7 @@ static void rbio_add_bio(struct btrfs_raid_bio *rbio, struct bio *orig_bio)
{
const struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
const u64 orig_logical = orig_bio->bi_iter.bi_sector << SECTOR_SHIFT;
- const u64 full_stripe_start = rbio->bioc->raid_map[0];
+ const u64 full_stripe_start = rbio->bioc->full_stripe_logical;
const u32 orig_len = orig_bio->bi_iter.bi_size;
const u32 sectorsize = fs_info->sectorsize;
u64 cur_logical;
@@ -1842,9 +1843,8 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
* here due to a crc mismatch and we can't give them the
* data they want.
*/
- if (rbio->bioc->raid_map[failb] == RAID6_Q_STRIPE) {
- if (rbio->bioc->raid_map[faila] ==
- RAID5_P_STRIPE)
+ if (failb == rbio->real_stripes - 1) {
+ if (faila == rbio->real_stripes - 2)
/*
* Only P and Q are corrupted.
* We only care about data stripes recovery,
@@ -1858,7 +1858,7 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
goto pstripe;
}
- if (rbio->bioc->raid_map[failb] == RAID5_P_STRIPE) {
+ if (failb == rbio->real_stripes - 2) {
raid6_datap_recov(rbio->real_stripes, sectorsize,
faila, pointers);
} else {
@@ -2156,8 +2156,8 @@ static void fill_data_csums(struct btrfs_raid_bio *rbio)
{
struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
struct btrfs_root *csum_root = btrfs_csum_root(fs_info,
- rbio->bioc->raid_map[0]);
- const u64 start = rbio->bioc->raid_map[0];
+ rbio->bioc->full_stripe_logical);
+ const u64 start = rbio->bioc->full_stripe_logical;
const u32 len = (rbio->nr_data * rbio->stripe_nsectors) <<
fs_info->sectorsize_bits;
int ret;
@@ -2205,7 +2205,7 @@ static void fill_data_csums(struct btrfs_raid_bio *rbio)
*/
btrfs_warn_rl(fs_info,
"sub-stripe write for full stripe %llu is not safe, failed to get csum: %d",
- rbio->bioc->raid_map[0], ret);
+ rbio->bioc->full_stripe_logical, ret);
no_csum:
kfree(rbio->csum_buf);
bitmap_free(rbio->csum_bitmap);
@@ -2467,10 +2467,10 @@ void raid56_add_scrub_pages(struct btrfs_raid_bio *rbio, struct page *page,
int stripe_offset;
int index;
- ASSERT(logical >= rbio->bioc->raid_map[0]);
- ASSERT(logical + sectorsize <= rbio->bioc->raid_map[0] +
+ ASSERT(logical >= rbio->bioc->full_stripe_logical);
+ ASSERT(logical + sectorsize <= rbio->bioc->full_stripe_logical +
BTRFS_STRIPE_LEN * rbio->nr_data);
- stripe_offset = (int)(logical - rbio->bioc->raid_map[0]);
+ stripe_offset = (int)(logical - rbio->bioc->full_stripe_logical);
index = stripe_offset / sectorsize;
rbio->bio_sectors[index].page = page;
rbio->bio_sectors[index].pgoff = pgoff;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 83de9fecc7b6..aacf1dd9297c 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1430,7 +1430,7 @@ static inline int scrub_nr_raid_mirrors(struct btrfs_io_context *bioc)
}
static inline void scrub_stripe_index_and_offset(u64 logical, u64 map_type,
- u64 *raid_map,
+ u64 full_stripe_logical,
int nstripes, int mirror,
int *stripe_index,
u64 *stripe_offset)
@@ -1438,19 +1438,21 @@ static inline void scrub_stripe_index_and_offset(u64 logical, u64 map_type,
int i;
if (map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
- /* RAID5/6 */
- for (i = 0; i < nstripes; i++) {
- if (raid_map[i] == RAID6_Q_STRIPE ||
- raid_map[i] == RAID5_P_STRIPE)
- continue;
+ const int nr_data_stripes = (map_type & BTRFS_BLOCK_GROUP_RAID5) ?
+ nstripes - 1 : nstripes - 2;
- if (logical >= raid_map[i] &&
- logical < raid_map[i] + BTRFS_STRIPE_LEN)
+ /* RAID5/6 */
+ for (i = 0; i < nr_data_stripes; i++) {
+ const u64 data_stripe_start = full_stripe_logical +
+ (i << BTRFS_STRIPE_LEN_SHIFT);
+
+ if (logical >= data_stripe_start &&
+ logical < data_stripe_start + BTRFS_STRIPE_LEN)
break;
}
*stripe_index = i;
- *stripe_offset = logical - raid_map[i];
+ *stripe_offset = logical - full_stripe_logical;
} else {
/* The other RAID type */
*stripe_index = mirror;
@@ -1538,7 +1540,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
scrub_stripe_index_and_offset(logical,
bioc->map_type,
- bioc->raid_map,
+ bioc->full_stripe_logical,
bioc->num_stripes -
bioc->replace_nr_stripes,
mirror_index,
@@ -2398,7 +2400,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
btrfs_bio_counter_inc_blocked(fs_info);
ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
&length, &bioc);
- if (ret || !bioc || !bioc->raid_map)
+ if (ret || !bioc)
goto bioc_out;
if (WARN_ON(!sctx->is_dev_replace ||
@@ -3008,7 +3010,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
btrfs_bio_counter_inc_blocked(fs_info);
ret = btrfs_map_sblock(fs_info, BTRFS_MAP_WRITE, sparity->logic_start,
&length, &bioc);
- if (ret || !bioc || !bioc->raid_map)
+ if (ret || !bioc)
goto bioc_out;
bio = bio_alloc(NULL, BIO_MAX_VECS, REQ_OP_READ, GFP_NOFS);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 860bb2709778..3a2a256fa9cd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5864,25 +5864,6 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
return preferred_mirror;
}
-/* Bubble-sort the stripe set to put the parity/syndrome stripes last */
-static void sort_parity_stripes(struct btrfs_io_context *bioc, int num_stripes)
-{
- int i;
- int again = 1;
-
- while (again) {
- again = 0;
- for (i = 0; i < num_stripes - 1; i++) {
- /* Swap if parity is on a smaller index */
- if (bioc->raid_map[i] > bioc->raid_map[i + 1]) {
- swap(bioc->stripes[i], bioc->stripes[i + 1]);
- swap(bioc->raid_map[i], bioc->raid_map[i + 1]);
- again = 1;
- }
- }
- }
-}
-
static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
u16 total_stripes)
{
@@ -5892,12 +5873,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
/* The size of btrfs_io_context */
sizeof(struct btrfs_io_context) +
/* Plus the variable array for the stripes */
- sizeof(struct btrfs_io_stripe) * (total_stripes) +
- /*
- * Plus the raid_map, which includes both the tgt dev
- * and the stripes.
- */
- sizeof(u64) * (total_stripes),
+ sizeof(struct btrfs_io_stripe) * (total_stripes),
GFP_NOFS);
if (!bioc)
@@ -5906,8 +5882,8 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
refcount_set(&bioc->refs, 1);
bioc->fs_info = fs_info;
- bioc->raid_map = (u64 *)(bioc->stripes + total_stripes);
bioc->replace_stripe_src = -1;
+ bioc->full_stripe_logical = (u64)-1;
return bioc;
}
@@ -6556,35 +6532,44 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
}
bioc->map_type = map->type;
- for (i = 0; i < num_stripes; i++) {
- set_io_stripe(&bioc->stripes[i], map, stripe_index, stripe_offset,
- stripe_nr);
- stripe_index++;
- }
-
- /* Build raid_map */
+ /*
+ * For a RAID56 full map, we need to make sure the stripes[] follow
+ * the rule that data stripes are all ordered, then followed by P
+ * and Q (if present).
+ *
+ * It's still mostly the same as other profiles, just with an extra
+ * rotation.
+ */
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
(need_full_stripe(op) || mirror_num > 1)) {
- u64 tmp;
- unsigned rot;
-
- /* Work out the disk rotation on this stripe-set */
- rot = stripe_nr % num_stripes;
-
- /* Fill in the logical address of each stripe */
- tmp = stripe_nr * data_stripes;
- for (i = 0; i < data_stripes; i++)
- bioc->raid_map[(i + rot) % num_stripes] =
- em->start + ((tmp + i) << BTRFS_STRIPE_LEN_SHIFT);
-
- bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
- if (map->type & BTRFS_BLOCK_GROUP_RAID6)
- bioc->raid_map[(i + rot + 1) % num_stripes] =
- RAID6_Q_STRIPE;
-
- sort_parity_stripes(bioc, num_stripes);
+ /*
+ * For RAID56, @stripe_nr is already the number of full stripes
+ * before us, which is also the rotation value (it needs a modulo
+ * by num_stripes).
+ *
+ * In this case, we just add @i to @stripe_nr, then do the
+ * modulo, saving one modulo call.
+ */
+ bioc->full_stripe_logical = em->start +
+ ((stripe_nr * data_stripes) << BTRFS_STRIPE_LEN_SHIFT);
+ for (i = 0; i < num_stripes; i++) {
+ set_io_stripe(&bioc->stripes[i], map,
+ (i + stripe_nr) % num_stripes,
+ stripe_offset, stripe_nr);
+ }
+ } else {
+ /*
+ * For all other non-RAID56 profiles, just copy the target
+ * stripe into the bioc.
+ */
+ for (i = 0; i < num_stripes; i++) {
+ set_io_stripe(&bioc->stripes[i], map, stripe_index,
+ stripe_offset, stripe_nr);
+ stripe_index++;
+ }
}
+
if (need_full_stripe(op))
max_errors = btrfs_chunk_max_errors(map);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f35db441ea9c..0fe21ece1491 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -473,11 +473,21 @@ struct btrfs_io_context {
u16 replace_nr_stripes;
s16 replace_stripe_src;
/*
- * logical block numbers for the start of each stripe
- * The last one or two are p/q. These are sorted,
- * so raid_map[0] is the start of our full stripe
+ * Logical bytenr of the full stripe start, only for RAID56 cases.
+ *
+ * When this value is set to other than (u64)-1, the stripes[] should
+ * follow the following pattern:
+ * (real_stripes = num_stripes - replace_nr_stripes)
+ * (data_stripes = (is_raid6) ? (real_stripes - 2) : (real_stripes - 1))
+ *
+ * stripes[0]: The first data stripe
+ * stripes[1]: The second data stripe
+ * ...
+ * stripes[data_stripes - 1]: The last data stripe
+ * stripes[data_stripes]: The P stripe
+ * stripes[data_stripes + 1]: The Q stripe (only for RAID6).
*/
- u64 *raid_map;
+ u64 full_stripe_logical;
struct btrfs_io_stripe stripes[];
};
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 75d7d22c3a27..8ea9cea9bfeb 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2422,7 +2422,7 @@ DECLARE_EVENT_CLASS(btrfs_raid56_bio,
),
TP_fast_assign_btrfs(rbio->bioc->fs_info,
- __entry->full_stripe = rbio->bioc->raid_map[0];
+ __entry->full_stripe = rbio->bioc->full_stripe_logical;
__entry->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
__entry->len = bio->bi_iter.bi_size;
__entry->opf = bio_op(bio);
--
2.39.1
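The new RAID56 mapping in this patch can be modeled with a small userspace sketch. This is not the kernel implementation: the shift of 16 assumes BTRFS_STRIPE_LEN is 64KiB (as in the kernel at the time), and the function names here are hypothetical; only the arithmetic mirrors the patch. The first two helpers correspond to the `__btrfs_map_block()` hunk (full-stripe start plus rotated device index), the third to the rewritten `scrub_stripe_index_and_offset()` loop.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: BTRFS_STRIPE_LEN is 64KiB, hence a shift of 16. */
#define STRIPE_LEN_SHIFT 16
#define STRIPE_LEN (1u << STRIPE_LEN_SHIFT)

/* Logical start of the full stripe: no raid_map[] needed anymore. */
static uint64_t full_stripe_logical(uint64_t em_start, uint32_t stripe_nr,
				    uint32_t data_stripes)
{
	return em_start +
	       ((uint64_t)(stripe_nr * data_stripes) << STRIPE_LEN_SHIFT);
}

/*
 * stripes[i] now holds the device stripe rotated by @stripe_nr, so that
 * stripes[0 .. data_stripes - 1] are the data stripes in logical order,
 * followed by P (and Q for RAID6).
 */
static uint32_t dev_stripe_index(uint32_t i, uint32_t stripe_nr,
				 uint32_t num_stripes)
{
	return (i + stripe_nr) % num_stripes;
}

/* Scrub-style lookup: which data stripe covers @logical, and the offset
 * of @logical inside the full stripe (as in the patched scrub helper). */
static void stripe_index_and_offset(uint64_t logical, uint64_t full_stripe,
				    uint32_t nr_data_stripes,
				    uint32_t *index, uint64_t *offset)
{
	uint32_t i;

	for (i = 0; i < nr_data_stripes; i++) {
		uint64_t start = full_stripe +
				 ((uint64_t)i << STRIPE_LEN_SHIFT);

		if (logical >= start && logical < start + STRIPE_LEN)
			break;
	}
	*index = i;
	*offset = logical - full_stripe;
}
```

For example, on a 3-device RAID5 (2 data + 1 parity) at the second full stripe (stripe_nr = 1), logical data stripe 0 sits on device stripe 1 and the parity is rotated onto device stripe 0.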
* Re: [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value
2023-02-07 4:26 ` [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value Qu Wenruo
@ 2023-02-15 20:07 ` David Sterba
2023-02-16 0:23 ` Qu Wenruo
2023-02-20 8:53 ` Geert Uytterhoeven
1 sibling, 1 reply; 15+ messages in thread
From: David Sterba @ 2023-02-15 20:07 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Feb 07, 2023 at 12:26:15PM +0800, Qu Wenruo wrote:
> + /* RAID5/6 */
> + for (i = 0; i < nr_data_stripes; i++) {
> + const u64 data_stripe_start = full_stripe_logical +
> + (i << BTRFS_STRIPE_LEN_SHIFT);
BTRFS_STRIPE_LEN_SHIFT is from another patchset and there's no mention of
that dependency in the cover letter. I've converted that back to
(i * BTRFS_STRIPE_LEN).
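The conversion above is behavior-preserving because the stripe length is a power of two, so the shift and the multiply compute the same value. A minimal sketch, assuming BTRFS_STRIPE_LEN is 64KiB (shift of 16) as in the kernel at the time; note that in either form the operand must be widened to 64 bits first if the product can exceed 32 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: BTRFS_STRIPE_LEN is 64KiB, so the shift is 16. */
#define BTRFS_STRIPE_LEN       (64u * 1024)
#define BTRFS_STRIPE_LEN_SHIFT 16

/* Shifting by the log2 of a power-of-two length... */
static uint64_t stripe_bytes_shift(uint32_t i)
{
	return (uint64_t)i << BTRFS_STRIPE_LEN_SHIFT;
}

/* ...agrees with multiplying by the length itself. */
static uint64_t stripe_bytes_mul(uint32_t i)
{
	return (uint64_t)i * BTRFS_STRIPE_LEN;
}
```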
* Re: [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value
2023-02-15 20:07 ` David Sterba
@ 2023-02-16 0:23 ` Qu Wenruo
0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-02-16 0:23 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs
On 2023/2/16 04:07, David Sterba wrote:
> On Tue, Feb 07, 2023 at 12:26:15PM +0800, Qu Wenruo wrote:
>> + /* RAID5/6 */
>> + for (i = 0; i < nr_data_stripes; i++) {
>> + const u64 data_stripe_start = full_stripe_logical +
>> + (i << BTRFS_STRIPE_LEN_SHIFT);
>
> BTRFS_STRIPE_LEN_SHIFT is from another patchset and there's no mention of
> that dependency in the cover letter. I've converted that back to
> (i * BTRFS_STRIPE_LEN).
Forgot to mention this series is based on the series: btrfs: reduce
div64 calls for __btrfs_map_block() and its variants
https://patchwork.kernel.org/project/linux-btrfs/list/?series=719082
I strongly recommend getting them merged in the proper order.
If needed, I can re-send with both series merged with proper patch order.
Thanks,
Qu
* Re: [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value
2023-02-07 4:26 ` [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value Qu Wenruo
2023-02-15 20:07 ` David Sterba
@ 2023-02-20 8:53 ` Geert Uytterhoeven
2023-02-20 11:50 ` Qu Wenruo
1 sibling, 1 reply; 15+ messages in thread
From: Geert Uytterhoeven @ 2023-02-20 8:53 UTC (permalink / raw)
To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, linux-next, linux-kernel
Hi Qu,
On Tue, 7 Feb 2023, Qu Wenruo wrote:
> In btrfs_io_context structure, we have a pointer raid_map, which is to
> indicate the logical bytenr for each stripe.
>
> But considering we always call sort_parity_stripes(), the result
> raid_map[] is always sorted, thus raid_map[0] is always the logical
> bytenr of the full stripe.
>
> So why we waste the space and time (for sorting) for raid_map[]?
>
> This patch will replace btrfs_io_context::raid_map with a single u64
> number, full_stripe_start, by:
>
> - Replace btrfs_io_context::raid_map with full_stripe_start
>
> - Replace call sites using raid_map[0] to use full_stripe_start
>
> - Replace call sites using raid_map[i] to compare with nr_data_stripes.
>
> The benefits are:
>
> - Less memory wasted on raid_map
> It's sizeof(u64) * num_stripes vs size(u64).
> It's always a save for at least one u64, and the benefit grows larger
> with num_stripes.
>
> - No more weird alloc_btrfs_io_context() behavior
> As there is only one fixed size + one variable length array.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Thanks for your patch, which is now commit 4a8c6e8a6dc8ae4c ("btrfs:
replace btrfs_io_context::raid_map with a fixed u64 value") in
next-20230220.
noreply@ellerman.id.au reported several build failures when
building for 32-bit platforms:
ERROR: modpost: "__umoddi3" [fs/btrfs/btrfs.ko] undefined!
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6556,35 +6532,44 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> }
> bioc->map_type = map->type;
>
> - for (i = 0; i < num_stripes; i++) {
> - set_io_stripe(&bioc->stripes[i], map, stripe_index, stripe_offset,
> - stripe_nr);
> - stripe_index++;
> - }
> -
> - /* Build raid_map */
> + /*
> + * For RAID56 full map, we need to make sure the stripes[] follows
> + * the rule that data stripes are all ordered, then followed with P
> + * and Q (if we have).
> + *
> + * It's still mostly the same as other profiles, just with extra
> + * rotataion.
> + */
> if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
> (need_full_stripe(op) || mirror_num > 1)) {
> - u64 tmp;
> - unsigned rot;
> -
> - /* Work out the disk rotation on this stripe-set */
> - rot = stripe_nr % num_stripes;
> -
> - /* Fill in the logical address of each stripe */
> - tmp = stripe_nr * data_stripes;
> - for (i = 0; i < data_stripes; i++)
> - bioc->raid_map[(i + rot) % num_stripes] =
> - em->start + ((tmp + i) << BTRFS_STRIPE_LEN_SHIFT);
> -
> - bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
> - if (map->type & BTRFS_BLOCK_GROUP_RAID6)
> - bioc->raid_map[(i + rot + 1) % num_stripes] =
> - RAID6_Q_STRIPE;
> -
> - sort_parity_stripes(bioc, num_stripes);
> + /*
> + * For RAID56 @stripe_nr is already the number of full stripes
> + * before us, which is also the rotation value (needs to modulo
> + * with num_stripes).
> + *
> + * In this case, we just add @stripe_nr with @i, then do the
> + * modulo, to reduce one modulo call.
> + */
> + bioc->full_stripe_logical = em->start +
> + ((stripe_nr * data_stripes) << BTRFS_STRIPE_LEN_SHIFT);
> + for (i = 0; i < num_stripes; i++) {
> + set_io_stripe(&bioc->stripes[i], map,
> + (i + stripe_nr) % num_stripes,
As stripe_nr is u64, this is a 64-by-32 modulo operation, which
should be implemented using a helper from include/linux/math64.h
instead.
> + stripe_offset, stripe_nr);
> + }
> + } else {
> + /*
> + * For all other non-RAID56 profiles, just copy the target
> + * stripe into the bioc.
> + */
> + for (i = 0; i < num_stripes; i++) {
> + set_io_stripe(&bioc->stripes[i], map, stripe_index,
> + stripe_offset, stripe_nr);
> + stripe_index++;
> + }
> }
>
> +
> if (need_full_stripe(op))
> max_errors = btrfs_chunk_max_errors(map);
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
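Geert's point can be illustrated in userspace. On 32-bit targets, a plain `u64 % u32` is lowered by the compiler to the libgcc routine `__umoddi3`, which the kernel does not provide; kernel code instead uses helpers such as `div_u64_rem()` from include/linux/math64.h, or narrows the dividend to u32 (as Qu's later revision does). A sketch, where `sketch_div_u64_rem()` only mimics the kernel helper's contract:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the 32-bit issue reported above.  The kernel's
 * div_u64_rem() exists precisely so that 64-by-32 division/modulo never
 * emits __umoddi3/__udivdi3; this function just mimics its contract.
 */
static uint64_t sketch_div_u64_rem(uint64_t dividend, uint32_t divisor,
				   uint32_t *remainder)
{
	/* Fine in userspace, where libgcc supplies the helpers. */
	*remainder = (uint32_t)(dividend % divisor);
	return dividend / divisor;
}

/*
 * The fix eventually taken: keep @stripe_nr as u32, so the rotation
 * modulo is a native 32-bit operation on every architecture.
 */
static uint32_t rotated_index(uint32_t i, uint32_t stripe_nr,
			      uint32_t num_stripes)
{
	return (i + stripe_nr) % num_stripes;	/* pure 32-bit math */
}
```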
* Re: [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value
2023-02-20 8:53 ` Geert Uytterhoeven
@ 2023-02-20 11:50 ` Qu Wenruo
2023-02-20 12:14 ` Geert Uytterhoeven
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2023-02-20 11:50 UTC (permalink / raw)
To: Geert Uytterhoeven, Qu Wenruo
Cc: David Sterba, linux-btrfs, linux-next, linux-kernel
On 2023/2/20 16:53, Geert Uytterhoeven wrote:
> Hi Qu,
>
> On Tue, 7 Feb 2023, Qu Wenruo wrote:
>> In btrfs_io_context structure, we have a pointer raid_map, which is to
>> indicate the logical bytenr for each stripe.
>>
>> But considering we always call sort_parity_stripes(), the result
>> raid_map[] is always sorted, thus raid_map[0] is always the logical
>> bytenr of the full stripe.
>>
>> So why we waste the space and time (for sorting) for raid_map[]?
>>
>> This patch will replace btrfs_io_context::raid_map with a single u64
>> number, full_stripe_start, by:
>>
>> - Replace btrfs_io_context::raid_map with full_stripe_start
>>
>> - Replace call sites using raid_map[0] to use full_stripe_start
>>
>> - Replace call sites using raid_map[i] to compare with nr_data_stripes.
>>
>> The benefits are:
>>
>> - Less memory wasted on raid_map
>> It's sizeof(u64) * num_stripes vs size(u64).
>> It's always a save for at least one u64, and the benefit grows larger
>> with num_stripes.
>>
>> - No more weird alloc_btrfs_io_context() behavior
>> As there is only one fixed size + one variable length array.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> Thanks for your patch, which is now commit 4a8c6e8a6dc8ae4c ("btrfs:
> replace btrfs_io_context::raid_map with a fixed u64 value") in
> next-20230220.
>
> noreply@ellerman.id.au reported several build failures when
> building for 32-bit platforms:
>
> ERROR: modpost: "__umoddi3" [fs/btrfs/btrfs.ko] undefined!
>
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6556,35 +6532,44 @@ int __btrfs_map_block(struct btrfs_fs_info
>> *fs_info, enum btrfs_map_op op,
>> }
>> bioc->map_type = map->type;
>>
>> - for (i = 0; i < num_stripes; i++) {
>> - set_io_stripe(&bioc->stripes[i], map, stripe_index,
>> stripe_offset,
>> - stripe_nr);
>> - stripe_index++;
>> - }
>> -
>> - /* Build raid_map */
>> + /*
>> + * For RAID56 full map, we need to make sure the stripes[] follows
>> + * the rule that data stripes are all ordered, then followed with P
>> + * and Q (if we have).
>> + *
>> + * It's still mostly the same as other profiles, just with extra
>> + * rotataion.
>> + */
>> if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
>> (need_full_stripe(op) || mirror_num > 1)) {
>> - u64 tmp;
>> - unsigned rot;
>> -
>> - /* Work out the disk rotation on this stripe-set */
>> - rot = stripe_nr % num_stripes;
>> -
>> - /* Fill in the logical address of each stripe */
>> - tmp = stripe_nr * data_stripes;
>> - for (i = 0; i < data_stripes; i++)
>> - bioc->raid_map[(i + rot) % num_stripes] =
>> - em->start + ((tmp + i) << BTRFS_STRIPE_LEN_SHIFT);
>> -
>> - bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
>> - if (map->type & BTRFS_BLOCK_GROUP_RAID6)
>> - bioc->raid_map[(i + rot + 1) % num_stripes] =
>> - RAID6_Q_STRIPE;
>> -
>> - sort_parity_stripes(bioc, num_stripes);
>> + /*
>> + * For RAID56 @stripe_nr is already the number of full stripes
>> + * before us, which is also the rotation value (needs to modulo
>> + * with num_stripes).
>> + *
>> + * In this case, we just add @stripe_nr with @i, then do the
>> + * modulo, to reduce one modulo call.
>> + */
>> + bioc->full_stripe_logical = em->start +
>> + ((stripe_nr * data_stripes) << BTRFS_STRIPE_LEN_SHIFT);
>> + for (i = 0; i < num_stripes; i++) {
>> + set_io_stripe(&bioc->stripes[i], map,
>> + (i + stripe_nr) % num_stripes,
>
> As stripe_nr is u64, this is a 64-by-32 modulo operation, which
> should be implemented using a helper from include/linux/math64.h
> instead.
This is an older version; in the latest version, the @stripe_nr variable
is also u32. I tried compiling the latest branch for i686, and it
doesn't cause any u64 division problems anymore.
You can find the latest branch either on GitHub or from the mailing list:
https://github.com/adam900710/linux/tree/map_block_refactor
https://lore.kernel.org/linux-btrfs/cover.1676611535.git.wqu@suse.com/
Thanks,
Qu
>
>> + stripe_offset, stripe_nr);
>> + }
>> + } else {
>> + /*
>> + * For all other non-RAID56 profiles, just copy the target
>> + * stripe into the bioc.
>> + */
>> + for (i = 0; i < num_stripes; i++) {
>> + set_io_stripe(&bioc->stripes[i], map, stripe_index,
>> + stripe_offset, stripe_nr);
>> + stripe_index++;
>> + }
>> }
>>
>> +
>> if (need_full_stripe(op))
>> max_errors = btrfs_chunk_max_errors(map);
>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 --
> geert@linux-m68k.org
>
> In personal conversations with technical people, I call myself a hacker.
> But
> when I'm talking to journalists I just say "programmer" or something
> like that.
> -- Linus Torvalds
* Re: [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value
2023-02-20 11:50 ` Qu Wenruo
@ 2023-02-20 12:14 ` Geert Uytterhoeven
2023-02-21 0:09 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Geert Uytterhoeven @ 2023-02-20 12:14 UTC (permalink / raw)
To: quwenruo.btrfs
Cc: Qu Wenruo, David Sterba, linux-btrfs, linux-next, linux-kernel
Hi Qu,
On Mon, Feb 20, 2023 at 12:50 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> On 2023/2/20 16:53, Geert Uytterhoeven wrote:
> > On Tue, 7 Feb 2023, Qu Wenruo wrote:
> >> In btrfs_io_context structure, we have a pointer raid_map, which is to
> >> indicate the logical bytenr for each stripe.
> >>
> >> But considering we always call sort_parity_stripes(), the result
> >> raid_map[] is always sorted, thus raid_map[0] is always the logical
> >> bytenr of the full stripe.
> >>
> >> So why we waste the space and time (for sorting) for raid_map[]?
> >>
> >> This patch will replace btrfs_io_context::raid_map with a single u64
> >> number, full_stripe_start, by:
> >>
> >> - Replace btrfs_io_context::raid_map with full_stripe_start
> >>
> >> - Replace call sites using raid_map[0] to use full_stripe_start
> >>
> >> - Replace call sites using raid_map[i] to compare with nr_data_stripes.
> >>
> >> The benefits are:
> >>
> >> - Less memory wasted on raid_map
> >> It's sizeof(u64) * num_stripes vs size(u64).
> >> It's always a save for at least one u64, and the benefit grows larger
> >> with num_stripes.
> >>
> >> - No more weird alloc_btrfs_io_context() behavior
> >> As there is only one fixed size + one variable length array.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >
> > Thanks for your patch, which is now commit 4a8c6e8a6dc8ae4c ("btrfs:
> > replace btrfs_io_context::raid_map with a fixed u64 value") in
> > next-20230220.
> >
> > noreply@ellerman.id.au reported several build failures when
> > building for 32-bit platforms:
> >
> > ERROR: modpost: "__umoddi3" [fs/btrfs/btrfs.ko] undefined!
> >
> >> --- a/fs/btrfs/volumes.c
> >> +++ b/fs/btrfs/volumes.c
> >> @@ -6556,35 +6532,44 @@ int __btrfs_map_block(struct btrfs_fs_info
> >> *fs_info, enum btrfs_map_op op,
> >> }
> >> bioc->map_type = map->type;
> >>
> >> - for (i = 0; i < num_stripes; i++) {
> >> - set_io_stripe(&bioc->stripes[i], map, stripe_index,
> >> stripe_offset,
> >> - stripe_nr);
> >> - stripe_index++;
> >> - }
> >> -
> >> - /* Build raid_map */
> >> + /*
> >> + * For RAID56 full map, we need to make sure the stripes[] follows
> >> + * the rule that data stripes are all ordered, then followed with P
> >> + * and Q (if we have).
> >> + *
> >> + * It's still mostly the same as other profiles, just with extra
> >> + * rotataion.
> >> + */
> >> if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
> >> (need_full_stripe(op) || mirror_num > 1)) {
> >> - u64 tmp;
> >> - unsigned rot;
> >> -
> >> - /* Work out the disk rotation on this stripe-set */
> >> - rot = stripe_nr % num_stripes;
> >> -
> >> - /* Fill in the logical address of each stripe */
> >> - tmp = stripe_nr * data_stripes;
> >> - for (i = 0; i < data_stripes; i++)
> >> - bioc->raid_map[(i + rot) % num_stripes] =
> >> - em->start + ((tmp + i) << BTRFS_STRIPE_LEN_SHIFT);
> >> -
> >> - bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
> >> - if (map->type & BTRFS_BLOCK_GROUP_RAID6)
> >> - bioc->raid_map[(i + rot + 1) % num_stripes] =
> >> - RAID6_Q_STRIPE;
> >> -
> >> - sort_parity_stripes(bioc, num_stripes);
> >> + /*
> >> + * For RAID56 @stripe_nr is already the number of full stripes
> >> + * before us, which is also the rotation value (needs to modulo
> >> + * with num_stripes).
> >> + *
> >> + * In this case, we just add @stripe_nr with @i, then do the
> >> + * modulo, to reduce one modulo call.
> >> + */
> >> + bioc->full_stripe_logical = em->start +
> >> + ((stripe_nr * data_stripes) << BTRFS_STRIPE_LEN_SHIFT);
> >> + for (i = 0; i < num_stripes; i++) {
> >> + set_io_stripe(&bioc->stripes[i], map,
> >> + (i + stripe_nr) % num_stripes,
> >
> > As stripe_nr is u64, this is a 64-by-32 modulo operation, which
> > should be implemented using a helper from include/linux/math64.h
> > instead.
>
> This is an older version, in the latest version, the @stripe_nr variable
> is also u32, and I tried compiling the latest branch with i686, it
> doesn't cause any u64 division problems anymore.
>
> You can find the latest branch in either github or from the mailling list:
>
> https://github.com/adam900710/linux/tree/map_block_refactor
>
> https://lore.kernel.org/linux-btrfs/cover.1676611535.git.wqu@suse.com/
So the older version was "v2", and the latest version had no
version indicator, nor changelog, thus assuming v1?
No surprise people end up applying the wrong version...
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* Re: [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value
2023-02-20 12:14 ` Geert Uytterhoeven
@ 2023-02-21 0:09 ` Qu Wenruo
0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-02-21 0:09 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Qu Wenruo, David Sterba, linux-btrfs, linux-next, linux-kernel
On 2023/2/20 20:14, Geert Uytterhoeven wrote:
> Hi Qu,
>
> On Mon, Feb 20, 2023 at 12:50 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2023/2/20 16:53, Geert Uytterhoeven wrote:
>>> On Tue, 7 Feb 2023, Qu Wenruo wrote:
>>>> In btrfs_io_context structure, we have a pointer raid_map, which is to
>>>> indicate the logical bytenr for each stripe.
>>>>
>>>> But considering we always call sort_parity_stripes(), the result
>>>> raid_map[] is always sorted, thus raid_map[0] is always the logical
>>>> bytenr of the full stripe.
>>>>
>>>> So why we waste the space and time (for sorting) for raid_map[]?
>>>>
>>>> This patch will replace btrfs_io_context::raid_map with a single u64
>>>> number, full_stripe_start, by:
>>>>
>>>> - Replace btrfs_io_context::raid_map with full_stripe_start
>>>>
>>>> - Replace call sites using raid_map[0] to use full_stripe_start
>>>>
>>>> - Replace call sites using raid_map[i] to compare with nr_data_stripes.
>>>>
>>>> The benefits are:
>>>>
>>>> - Less memory wasted on raid_map
>>>> It's sizeof(u64) * num_stripes vs size(u64).
>>>> It's always a save for at least one u64, and the benefit grows larger
>>>> with num_stripes.
>>>>
>>>> - No more weird alloc_btrfs_io_context() behavior
>>>> As there is only one fixed size + one variable length array.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>
>>> Thanks for your patch, which is now commit 4a8c6e8a6dc8ae4c ("btrfs:
>>> replace btrfs_io_context::raid_map with a fixed u64 value") in
>>> next-20230220.
>>>
>>> noreply@ellerman.id.au reported several build failures when
>>> building for 32-bit platforms:
>>>
>>> ERROR: modpost: "__umoddi3" [fs/btrfs/btrfs.ko] undefined!
>>>
>>>> --- a/fs/btrfs/volumes.c
>>>> +++ b/fs/btrfs/volumes.c
>>>> @@ -6556,35 +6532,44 @@ int __btrfs_map_block(struct btrfs_fs_info
>>>> *fs_info, enum btrfs_map_op op,
>>>> }
>>>> bioc->map_type = map->type;
>>>>
>>>> - for (i = 0; i < num_stripes; i++) {
>>>> - set_io_stripe(&bioc->stripes[i], map, stripe_index,
>>>> stripe_offset,
>>>> - stripe_nr);
>>>> - stripe_index++;
>>>> - }
>>>> -
>>>> - /* Build raid_map */
>>>> + /*
>>>> + * For RAID56 full map, we need to make sure the stripes[] follows
>>>> + * the rule that data stripes are all ordered, then followed with P
>>>> + * and Q (if we have).
>>>> + *
>>>> + * It's still mostly the same as other profiles, just with extra
>>>> + * rotation.
>>>> + */
>>>> if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
>>>> (need_full_stripe(op) || mirror_num > 1)) {
>>>> - u64 tmp;
>>>> - unsigned rot;
>>>> -
>>>> - /* Work out the disk rotation on this stripe-set */
>>>> - rot = stripe_nr % num_stripes;
>>>> -
>>>> - /* Fill in the logical address of each stripe */
>>>> - tmp = stripe_nr * data_stripes;
>>>> - for (i = 0; i < data_stripes; i++)
>>>> - bioc->raid_map[(i + rot) % num_stripes] =
>>>> - em->start + ((tmp + i) << BTRFS_STRIPE_LEN_SHIFT);
>>>> -
>>>> - bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
>>>> - if (map->type & BTRFS_BLOCK_GROUP_RAID6)
>>>> - bioc->raid_map[(i + rot + 1) % num_stripes] =
>>>> - RAID6_Q_STRIPE;
>>>> -
>>>> - sort_parity_stripes(bioc, num_stripes);
>>>> + /*
>>>> + * For RAID56 @stripe_nr is already the number of full stripes
>>>> + * before us, which is also the rotation value (needs to modulo
>>>> + * with num_stripes).
>>>> + *
>>>> + * In this case, we just add @stripe_nr with @i, then do the
>>>> + * modulo, to reduce one modulo call.
>>>> + */
>>>> + bioc->full_stripe_logical = em->start +
>>>> + ((stripe_nr * data_stripes) << BTRFS_STRIPE_LEN_SHIFT);
>>>> + for (i = 0; i < num_stripes; i++) {
>>>> + set_io_stripe(&bioc->stripes[i], map,
>>>> + (i + stripe_nr) % num_stripes,
>>>
>>> As stripe_nr is u64, this is a 64-by-32 modulo operation, which
>>> should be implemented using a helper from include/linux/math64.h
>>> instead.
>>
>> This is an older version; in the latest version, the @stripe_nr
>> variable is also u32. I tried compiling the latest branch for i686,
>> and it no longer causes any u64 division problems.
>>
>> You can find the latest branch either on github or in the mailing list:
>>
>> https://github.com/adam900710/linux/tree/map_block_refactor
>>
>> https://lore.kernel.org/linux-btrfs/cover.1676611535.git.wqu@suse.com/
>
> So the older version was "v2", and the latest version had no
> version indicator, nor changelog, thus assuming v1?
> No surprise people end up applying the wrong version...
The previous version was two separate patchsets; the new one merges
them.

I sent the merged version because of the dependency problems and
conflicts, and since it's a merged series, there is no changelog
relative to the previous version.
Thanks,
Qu
>
> Gr{oetje,eeting}s,
>
> Geert
>
* Re: [PATCH v2 0/4] btrfs: reduce the memory usage for btrfs_io_context, and reduce its variable sized members
2023-02-07 4:26 [PATCH v2 0/4] btrfs: reduce the memory usage for btrfs_io_context, and reduce its variable sized members Qu Wenruo
` (3 preceding siblings ...)
2023-02-07 4:26 ` [PATCH v2 4/4] btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value Qu Wenruo
@ 2023-02-15 20:07 ` David Sterba
4 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2023-02-15 20:07 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Feb 07, 2023 at 12:26:11PM +0800, Qu Wenruo wrote:
> Changelog:
> v2:
> - Address the comments on various grammar errors
>
> - Further reduce the memory used for replace
> Now instead of two s16, it's just one s16 now.
>
> - Replace raid_map with a single u64
> This not only reduces the memory usage, but also makes btrfs_io_context
> have only one variable-sized member (stripes), simplifying the
> allocation function.
>
> This also removes the need to bubble sort the stripes.
>
>
> In btrfs_io_context, we have two members dedicated to dev-replace, and
> one extra array for RAID56:
>
> - num_tgtdevs
> This is straightforward: just the number of extra stripes for replace
> usage.
>
> - tgtdev_map[]
> This is a little more complex: it represents the mapping between the
> original stripes and the dev-replace stripes.
>
> This is mostly for RAID56, as only in RAID56 do the stripes contain
> different contents, so it's important to know the mapping.
>
> It goes like this:
>
> num_stripes = 4 (3 + 1 for replace)
> stripes[0]: dev = devid 1, physical = X
> stripes[1]: dev = devid 2, physical = Y
> stripes[2]: dev = devid 3, physical = Z
> stripes[3]: dev = devid 0, physical = Y
>
> num_tgtdevs = 1
> tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
> tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
> and it's duplicated to stripes[3].
> tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
>
> Thus most of the space is wasted, and the more devices in the array,
> the more space is wasted.
>
> - raid_map[]
> A sorted array where the first element is always the logical bytenr
> of the full stripe.
>
> But the truth is, since it's always sorted, we don't need it at all.
> We can use a single u64 to indicate the full stripe start.
>
> Currently we're reusing the array mostly to re-order our stripes for
> RAID56, which is not ideal, because we can get the order right in
> one go.
>
> Both the tgtdev_map[] and raid_map[] designs waste quite some space,
> and make alloc_btrfs_io_context() do very complex and dangerous
> pointer juggling.
>
> This patch will replace those members by:
>
> - num_tgtdevs -> replace_nr_stripes
> Just a rename
>
> - tgtdev_map[] -> replace_stripe_src
> It's changed to a single s16 indicating where the source stripe is.
> A single s16 is enough for RAID56. DUP doesn't care about the source,
> as all its stripes share the same content.
>
> - raid_map[] -> full_stripe_logical
> We remove sort_parity_stripes() and get the stripe selection done
> correctly in the RAID56 routines.
>
> So we only need to record the logical bytenr of the full stripe
> start. Existing call sites that check the type of a stripe can
> compare its index against the number of data stripes instead.
>
> This not only saves some space in the btrfs_io_context structure, but
> also allows the following cleanups:
>
> - Streamline handle_ops_on_dev_replace()
> We take a common path for both WRITE and GET_READ_MIRRORS, and only
> for DUP with GET_READ_MIRRORS do we shrink the bioc to keep the same
> old behavior.
>
> - Remove some unnecessary variables
>
> - Remove variable sized members
> Now there is only one variable sized member, stripes.
>
> Although the series still increases the line count, the net increase
> mostly comes from comments; around 100 lines of comments are added
> around the replace-related members.
>
> Qu Wenruo (4):
> btrfs: simplify the @bioc argument for handle_ops_on_dev_replace()
> btrfs: small improvement for btrfs_io_context structure
> btrfs: use a more space efficient way to represent the source of
> duplicated stripes
> btrfs: replace btrfs_io_context::raid_map[] with a fixed u64 value
With some fixups added to for-next as a topic branch, thanks.