* [PATCH 2/4] md/raid5-cache: use ring add to prevent overflow
From: JackieLiu @ 2016-11-28 8:19 UTC (permalink / raw)
To: songliubraving; +Cc: liuzhengyuan, linux-raid, JackieLiu
In-Reply-To: <20161128081921.5641-1-liuyun01@kylinos.cn>
'write_pos' must be protected with 'r5l_ring_add', or it may overflow
Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
drivers/md/raid5-cache.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 576c88d..6d2b1da 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2086,7 +2086,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
ctx->pos, ctx->seq);
mb = page_address(page);
offset = le32_to_cpu(mb->meta_size);
- write_pos = ctx->pos + BLOCK_SECTORS;
+ write_pos = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
for (i = sh->disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
--
2.7.4
^ permalink raw reply related
* [PATCH 1/4] md/raid5-cache: remove unnecessary function parameters
From: JackieLiu @ 2016-11-28 8:19 UTC (permalink / raw)
To: songliubraving; +Cc: liuzhengyuan, linux-raid, JackieLiu
The function parameter 'recovery_list' is not used in
body, we can delete it
Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
drivers/md/raid5-cache.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5f817bd..576c88d 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -67,7 +67,7 @@ static char *r5c_journal_mode_str[] = {"write-through",
/*
* raid5 cache state machine
*
- * With rhe RAID cache, each stripe works in two phases:
+ * With the RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
@@ -1674,7 +1674,6 @@ r5l_recovery_replay_one_stripe(struct r5conf *conf,
static struct stripe_head *
r5c_recovery_alloc_stripe(struct r5conf *conf,
- struct list_head *recovery_list,
sector_t stripe_sect,
sector_t log_start)
{
@@ -1855,8 +1854,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
stripe_sect);
if (!sh) {
- sh = r5c_recovery_alloc_stripe(conf, cached_stripe_list,
- stripe_sect, ctx->pos);
+ sh = r5c_recovery_alloc_stripe(conf, stripe_sect, ctx->pos);
/*
* cannot get stripe from raid5_get_active_stripe
* try replay some stripes
@@ -1865,8 +1863,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
r5c_recovery_replay_stripes(
cached_stripe_list, ctx);
sh = r5c_recovery_alloc_stripe(
- conf, cached_stripe_list,
- stripe_sect, ctx->pos);
+ conf, stripe_sect, ctx->pos);
}
if (!sh) {
pr_debug("md/raid:%s: Increasing stripe cache size to %d to recovery data on journal.\n",
@@ -1875,8 +1872,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
raid5_set_cache_size(mddev,
conf->min_nr_stripes * 2);
sh = r5c_recovery_alloc_stripe(
- conf, cached_stripe_list, stripe_sect,
- ctx->pos);
+ conf, stripe_sect, ctx->pos);
}
if (!sh) {
pr_err("md/raid:%s: Cannot get enough stripes due to memory pressure. Recovery failed.\n",
--
2.7.4
^ permalink raw reply related
* Re: raid0 vs. mkfs
From: Avi Kivity @ 2016-11-28 7:38 UTC (permalink / raw)
To: NeilBrown, linux-raid
In-Reply-To: <87oa108a1x.fsf@notabene.neil.brown.name>
On 11/28/2016 07:09 AM, NeilBrown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>> inappropriate) for the entire partition. As far as I can tell, md
>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>> requests, one per chunk-size worth of disk, and issues them to the RAID
>> components individually.
>>
>>
>> It seems to me that md can convert the large TRIM/DISCARD (erase
>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>> whichever is inappropriate) per RAID component, converting an O(disk
>> size / chunk size) operation into an O(number of RAID components)
>> operation, which is much faster.
>>
>>
>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>> the operation taking about a quarter of an hour, continuously pushing
>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>> to the disk. Linux 4.1.12.
> Surely it is the task of the underlying driver, or the queuing
> infrastructure, to merge small requests into large requests.
Here's a blkparse of that run. As can be seen, there is no concurrency,
so nobody down the stack has any chance of merging anything.
259,1 10 1090 0.379688898 4801 Q D 3238067200 + 1024
[mkfs.xfs]
259,1 10 1091 0.379689222 4801 G D 3238067200 + 1024
[mkfs.xfs]
259,1 10 1092 0.379690304 4801 I D 3238067200 + 1024
[mkfs.xfs]
259,1 10 1093 0.379703110 2307 D D 3238067200 + 1024
[kworker/10:1H]
259,1 1 589 0.379718918 0 C D 3231849472 + 1024 [0]
259,1 10 1094 0.379735215 4801 Q D 3238068224 + 1024
[mkfs.xfs]
259,1 10 1095 0.379735548 4801 G D 3238068224 + 1024
[mkfs.xfs]
259,1 10 1096 0.379736598 4801 I D 3238068224 + 1024
[mkfs.xfs]
259,1 10 1097 0.379753077 2307 D D 3238068224 + 1024
[kworker/10:1H]
259,1 1 590 0.379782139 0 C D 3231850496 + 1024 [0]
259,1 10 1098 0.379785399 4801 Q D 3238069248 + 1024
[mkfs.xfs]
259,1 10 1099 0.379785657 4801 G D 3238069248 + 1024
[mkfs.xfs]
259,1 10 1100 0.379786562 4801 I D 3238069248 + 1024
[mkfs.xfs]
259,1 10 1101 0.379800116 2307 D D 3238069248 + 1024
[kworker/10:1H]
259,1 10 1102 0.379829822 4801 Q D 3238070272 + 1024
[mkfs.xfs]
259,1 10 1103 0.379830156 4801 G D 3238070272 + 1024
[mkfs.xfs]
259,1 10 1104 0.379831015 4801 I D 3238070272 + 1024
[mkfs.xfs]
259,1 10 1105 0.379844120 2307 D D 3238070272 + 1024
[kworker/10:1H]
259,1 10 1106 0.379877825 4801 Q D 3238071296 + 1024
[mkfs.xfs]
259,1 10 1107 0.379878173 4801 G D 3238071296 + 1024
[mkfs.xfs]
259,1 10 1108 0.379879028 4801 I D 3238071296 + 1024
[mkfs.xfs]
259,1 1 591 0.379886451 0 C D 3231851520 + 1024 [0]
259,1 10 1109 0.379898178 2307 D D 3238071296 + 1024
[kworker/10:1H]
259,1 10 1110 0.379923982 4801 Q D 3238072320 + 1024
[mkfs.xfs]
259,1 10 1111 0.379924229 4801 G D 3238072320 + 1024
[mkfs.xfs]
259,1 10 1112 0.379925054 4801 I D 3238072320 + 1024
[mkfs.xfs]
259,1 10 1113 0.379937716 2307 D D 3238072320 + 1024
[kworker/10:1H]
259,1 1 592 0.379954380 0 C D 3231852544 + 1024 [0]
259,1 10 1114 0.379970091 4801 Q D 3238073344 + 1024
[mkfs.xfs]
259,1 10 1115 0.379970341 4801 G D 3238073344 + 1024
[mkfs.xfs]
No merging was happening. This is an NVMe drive, so running with the
noop scheduler (which should still merge). Does the queuing layer
merge trims?
I don't think it's the queuing layer's job, though. At the I/O
scheduler you can merge to clean up sloppy patterns from the upper
layer, but each layer should try to generate the best pattern it can.
Large merges mean increased latency for the first request in the chain,
forcing the I/O scheduler to make a decision which can harm the
workload. By generating merged requests in the first place, the upper
layer removes the need to make that tradeoff (splitting the requests
removes information: "we are interested only in when all of the range is
trimmed, not any particular request").
^ permalink raw reply
* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2016-11-28 7:33 UTC (permalink / raw)
To: Guoqing Jiang, linux-raid; +Cc: Shaohua Li, Neil Brown, Johannes Thumshirn
In-Reply-To: <58369819.1070203@suse.com>
On 2016/11/24 下午3:34, Guoqing Jiang wrote:
> Hi Coly,
>
> Please see below comments, just FYI.
>
> On 11/22/2016 05:54 AM, Coly Li wrote:
>> - In raid1_make_request(), wait_barrier() is replaced by,
>> a) wait_read_barrier(): wait barrier in regular read I/O code path
>> b) wait_barrier(): wait barrier in regular write I/O code path
>> The differnece is wait_read_barrier() only waits if array is
>> frozen, I
>> am not able to combile them into one function, because they must
>> receive
>> differnet data types in their arguments list.
>
> Maybe it is possible to add a parameter to distinguish read and write, then
> the two functions can be unified.
Hi Guoqing,
read_balance() will make sure a regular read won't access a non-synced
disk, therefore regular read only needs to wait when the whole raid1 is
frozen. That's why wait_read_barrier() only waits for
!conf->array_frozen, but _wait_barrier() will waits for both
!conf->array_frozen and !conf->barrier[idx].
Combine them is possible, but the code won't be simpler, because,
- inside the function I still need to wait two wait_event_lock_irq() for
different wait condition, and if I use sector_t as the function
parameter, in wait_all_barrier() I have to send bogus sector number
(calculated by bucket index) into _wait_barrier().
that's why I choose current implementation.
>
>> - align_to_barrier_unit_end() is called to make sure both regular and
>> resync I/O won't go across the border of a barrier unit size.
>> Open question:
>> - Need review from md clustring developer, I don't touch related
>> code now.
>
> I don't find problems with some tests so far.
In raid1_sync_request(), I see,
conf->cluster_sync_low = mddev->curr_resync_completed;
conf->cluster_sync_high = conf->cluster_sync_low +
CLUSTER_RESYNC_WINDOW_SECTORS;
Is it possible that LBA range [conf->cluster_sync_low,
conf->cluster_sync_high] goes across the border of a barrier unit size ?
>
>> static void reschedule_retry(struct r1bio *r1_bio)
>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>> unsigned long flags;
>> struct mddev *mddev = r1_bio->mddev;
>> struct r1conf *conf = mddev->private;
>> + sector_t sector_nr;
>> + long idx;
>> +
>> + sector_nr = r1_bio->sector;
>> + idx = get_barrier_bucket_idx(sector_nr);
>
> Isn't "idx = get_barrier_bucket_idx(r1_bio->sector)" enough here?
Nice catch! I will modify this in next version.
>
>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>> if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>> bio->bi_error = -EIO;
>> - if (done) {
>> + if (done)
>> bio_endio(bio);
>> - /*
>> - * Wake up any possible resync thread that waits for the device
>> - * to go idle.
>> - */
>> - allow_barrier(conf, start_next_window, bi_sector);
>> - }
>> }
>> static void raid_end_bio_io(struct r1bio *r1_bio)
>> {
>> struct bio *bio = r1_bio->master_bio;
>> + struct r1conf *conf = r1_bio->mddev->private;
>> /* if nobody has done the final endio yet, do it now */
>> if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>> call_bio_endio(r1_bio);
>> }
>> +
>> + /*
>> + * Wake up any possible resync thread that waits for the device
>> + * to go idle.
>> + */
>> + allow_barrier(conf, r1_bio->sector);
>> free_r1bio(r1_bio);
>> }
>
> I am not sure it is safe for above changes since call_bio_endio is not only
> called within raid_end_bio_io.
>
This is safe. I move it here because location of wait_barrier() is
changed and replaced by wait_read_barrier() for READ I/O, and
wait_barrier() for WRITE I/O. In this patch, conf->nr_pending[idx] is
raised for each ri_bio, so allow_barrier() should be called for each
r1_bio destroy to decrease corresponding conf->nr_pending[idx].
>> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>> return mirror;
>> }
>> +/* bi_end_io callback for regular READ bio */
>
> Not related to the patch itself, it would be better to make the similar
> changes in other patches.
>
OK, I will move it into another separated patch.
>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>> +/* A regular I/O should wait when,
>> + * - The whole array is frozen (both READ and WRITE)
>> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
>> + */
>> +static void _wait_barrier(struct r1conf *conf, long idx)
>> {
>> - bool wait = false;
>> -
>> - if (conf->array_frozen || !bio)
>> - wait = true;
>> - else if (conf->barrier && bio_data_dir(bio) == WRITE) {
>> - if ((conf->mddev->curr_resync_completed
>> - >= bio_end_sector(bio)) ||
>> - (conf->next_resync + NEXT_NORMALIO_DISTANCE
>> - <= bio->bi_iter.bi_sector))
>> - wait = false;
>> - else
>> - wait = true;
>> + spin_lock_irq(&conf->resync_lock);
>> + if (conf->array_frozen || conf->barrier[idx]) {
>> + conf->nr_waiting[idx]++;
>> + /* Wait for the barrier to drop. */
>> + wait_event_lock_irq(
>> + conf->wait_barrier,
>> + !conf->array_frozen && !conf->barrier[idx],
>> + conf->resync_lock);
>> + conf->nr_waiting[idx]--;
>> }
>> - return wait;
>> + conf->nr_pending[idx]++;
>> + spin_unlock_irq(&conf->resync_lock);
>> }
>> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>> {
>> - sector_t sector = 0;
>> + long idx = get_barrier_bucket_idx(sector_nr);
>> spin_lock_irq(&conf->resync_lock);
>> - if (need_to_wait_for_sync(conf, bio)) {
>> - conf->nr_waiting++;
>> - /* Wait for the barrier to drop.
>> - * However if there are already pending
>> - * requests (preventing the barrier from
>> - * rising completely), and the
>> - * per-process bio queue isn't empty,
>> - * then don't wait, as we need to empty
>> - * that queue to allow conf->start_next_window
>> - * to increase.
>> - */
>> - wait_event_lock_irq(conf->wait_barrier,
>> - !conf->array_frozen &&
>> - (!conf->barrier ||
>> - ((conf->start_next_window <
>> - conf->next_resync + RESYNC_SECTORS) &&
>> - current->bio_list &&
>> - !bio_list_empty(current->bio_list))),
>> - conf->resync_lock);
>> - conf->nr_waiting--;
>> - }
>> -
>> - if (bio && bio_data_dir(bio) == WRITE) {
>> - if (bio->bi_iter.bi_sector >= conf->next_resync) {
>> - if (conf->start_next_window == MaxSector)
>> - conf->start_next_window =
>> - conf->next_resync +
>> - NEXT_NORMALIO_DISTANCE;
>> -
>> - if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
>> - <= bio->bi_iter.bi_sector)
>> - conf->next_window_requests++;
>> - else
>> - conf->current_window_requests++;
>> - sector = conf->start_next_window;
>> - }
>> + if (conf->array_frozen) {
>> + conf->nr_waiting[idx]++;
>> + /* Wait for array to unfreeze */
>> + wait_event_lock_irq(
>> + conf->wait_barrier,
>> + !conf->array_frozen,
>> + conf->resync_lock);
>> + conf->nr_waiting[idx]--;
>> }
>> -
>> - conf->nr_pending++;
>> + conf->nr_pending[idx]++;
>> spin_unlock_irq(&conf->resync_lock);
>> - return sector;
>> }
>> -static void allow_barrier(struct r1conf *conf, sector_t
>> start_next_window,
>> - sector_t bi_sector)
>> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>> +{
>> + long idx = get_barrier_bucket_idx(sector_nr);
>> +
>> + _wait_barrier(conf, idx);
>> +}
>> +
>
> I personally prefer to use one wait_barrier to cover both read and
> write, something like:
>
> wait_barrier(struct r1conf *conf, long idx, int direction)
>
As I explain previous, combine them together won't make the code
simpler, maybe more complexed. Let me try to see, if there is any better
solution to make current patch more simpler.
> BTW: there are some unnecessary changes inside the patch, maybe you can
> remove it
> or introduce other patches for them.
Yes, great suggestion, I will do this in next version.
Thanks for your review !
Coly
^ permalink raw reply
* Re: raid0 vs. mkfs
From: Avi Kivity @ 2016-11-28 7:33 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux-RAID
In-Reply-To: <9ce71838-0a1d-e4d8-5786-9ab0422688af@scylladb.com>
On 11/28/2016 09:28 AM, Avi Kivity wrote:
> 259,1 10 1140 0.380238333 4801 I D 3238079488 + 1024
> [mkfs.xfs]
> 259,1 10 1141 0.380252580 2307 D D 3238079488 + 1024
> [kworker/10:1H]
> 259,1 2 173 0.380260605 0 C D 3218709504 + 1024 [0]
> 259,1 10 1142 0.380283800 4801 Q D 3238080512 + 1024
> [mkfs.xfs]
> 259,1 10 1143 0.380284158 4801 G D 3238080512 + 1024
> [mkfs.xfs]
> 259,1 10 1144 0.380285150 4801 I D 3238080512 + 1024
> [mkfs.xfs]
> 259,1 10 1145 0.380297127 2307 D D 3238080512 + 1024
> [kworker/10:1H]
> 259,1 10 1146 0.380324340 4801 Q D 3238081536 + 1024
> [mkfs.xfs]
> 259,1 10 1147 0.380324648 4801 G D 3238081536 + 1024
> [mkfs.xfs]
> 259,1 10 1148 0.380325663 4801 I D 3238081536 + 1024
> [mkfs.xfs]
> 259,1 2 174 0.380328083 0 C D 3218710528 + 1024 [0]
>
>
> So we see these one-megabyte requests; moreover, they are issued
> sequentially.
>
Correction, they are issued in parallel, but not merged. The "C"
entries correspond to requests issued far in the past.
^ permalink raw reply
* Re: raid0 vs. mkfs
From: Avi Kivity @ 2016-11-28 7:28 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux-RAID
In-Reply-To: <CAJCQCtR+PnJXq9k37V1pXVO4UmW0dbhvAZYo=Pius2eCz9X_kg@mail.gmail.com>
On 11/28/2016 06:11 AM, Chris Murphy wrote:
> On Sun, Nov 27, 2016 at 8:24 AM, Avi Kivity <avi@scylladb.com> wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large disk
>> that supports TRIM/DISCARD (erase whichever is inappropriate)
> Trim is the appropriate term. Term discard refers to a specific mount
> time implementation of FITRIM ioctl, and fstrim refers to a user space
> tool that does the same and can be scheduled or issued manually.
That's good to know.
>
>
> That is
>> because mkfs issues a TRIM/DISCARD (erase whichever is inappropriate) for
>> the entire partition. As far as I can tell, md converts the large
>> TRIM/DISCARD (erase whichever is inappropriate) into a large number of
>> TRIM/DISCARD (erase whichever is inappropriate) requests, one per chunk-size
>> worth of disk, and issues them to the RAID components individually.
> You could strace the mkfs command.
I did, and saw that it was running a single syscall for the entire run.
I verified in the sources that mkfs.xfs issues a single BLKDISCARD (?!)
ioctl spanning the entire device.
> Each filesystem is doing it a
> little differently the last time I compared mkfs.xfs and mkfs.btrfs;
> but I can't qualify the differences relative to how the device is
> going to react to those commands.
>
> It's also possible to enable block device tracing and see the actual
> SCSI or ATA commands sent to a drive.
I did, and saw a ton of half-megabyte TRIMs. It's an NVMe device so not
SCSI or SATA.
Here's a sample (I only blktraced one of the members):
259,1 10 1090 0.379688898 4801 Q D 3238067200 + 1024
[mkfs.xfs]
259,1 10 1091 0.379689222 4801 G D 3238067200 + 1024
[mkfs.xfs]
259,1 10 1092 0.379690304 4801 I D 3238067200 + 1024
[mkfs.xfs]
259,1 10 1093 0.379703110 2307 D D 3238067200 + 1024
[kworker/10:1H]
259,1 1 589 0.379718918 0 C D 3231849472 + 1024 [0]
259,1 10 1094 0.379735215 4801 Q D 3238068224 + 1024
[mkfs.xfs]
259,1 10 1095 0.379735548 4801 G D 3238068224 + 1024
[mkfs.xfs]
259,1 10 1096 0.379736598 4801 I D 3238068224 + 1024
[mkfs.xfs]
259,1 10 1097 0.379753077 2307 D D 3238068224 + 1024
[kworker/10:1H]
259,1 1 590 0.379782139 0 C D 3231850496 + 1024 [0]
259,1 10 1098 0.379785399 4801 Q D 3238069248 + 1024
[mkfs.xfs]
259,1 10 1099 0.379785657 4801 G D 3238069248 + 1024
[mkfs.xfs]
259,1 10 1100 0.379786562 4801 I D 3238069248 + 1024
[mkfs.xfs]
259,1 10 1101 0.379800116 2307 D D 3238069248 + 1024
[kworker/10:1H]
259,1 10 1102 0.379829822 4801 Q D 3238070272 + 1024
[mkfs.xfs]
259,1 10 1103 0.379830156 4801 G D 3238070272 + 1024
[mkfs.xfs]
259,1 10 1104 0.379831015 4801 I D 3238070272 + 1024
[mkfs.xfs]
259,1 10 1105 0.379844120 2307 D D 3238070272 + 1024
[kworker/10:1H]
259,1 10 1106 0.379877825 4801 Q D 3238071296 + 1024
[mkfs.xfs]
259,1 10 1107 0.379878173 4801 G D 3238071296 + 1024
[mkfs.xfs]
259,1 10 1108 0.379879028 4801 I D 3238071296 + 1024
[mkfs.xfs]
259,1 1 591 0.379886451 0 C D 3231851520 + 1024 [0]
259,1 10 1109 0.379898178 2307 D D 3238071296 + 1024
[kworker/10:1H]
259,1 10 1110 0.379923982 4801 Q D 3238072320 + 1024
[mkfs.xfs]
259,1 10 1111 0.379924229 4801 G D 3238072320 + 1024
[mkfs.xfs]
259,1 10 1112 0.379925054 4801 I D 3238072320 + 1024
[mkfs.xfs]
259,1 10 1113 0.379937716 2307 D D 3238072320 + 1024
[kworker/10:1H]
259,1 1 592 0.379954380 0 C D 3231852544 + 1024 [0]
259,1 10 1114 0.379970091 4801 Q D 3238073344 + 1024
[mkfs.xfs]
259,1 10 1115 0.379970341 4801 G D 3238073344 + 1024
[mkfs.xfs]
259,1 10 1116 0.379971260 4801 I D 3238073344 + 1024
[mkfs.xfs]
259,1 10 1117 0.379984303 2307 D D 3238073344 + 1024
[kworker/10:1H]
259,1 10 1118 0.380014754 4801 Q D 3238074368 + 1024
[mkfs.xfs]
259,1 10 1119 0.380015075 4801 G D 3238074368 + 1024
[mkfs.xfs]
259,1 10 1120 0.380015903 4801 I D 3238074368 + 1024
[mkfs.xfs]
259,1 10 1121 0.380028655 2307 D D 3238074368 + 1024
[kworker/10:1H]
259,1 2 170 0.380054279 0 C D 3218706432 + 1024 [0]
259,1 10 1122 0.380060773 4801 Q D 3238075392 + 1024
[mkfs.xfs]
259,1 10 1123 0.380061024 4801 G D 3238075392 + 1024
[mkfs.xfs]
259,1 10 1124 0.380062093 4801 I D 3238075392 + 1024
[mkfs.xfs]
259,1 10 1125 0.380072940 2307 D D 3238075392 + 1024
[kworker/10:1H]
259,1 10 1126 0.380107437 4801 Q D 3238076416 + 1024
[mkfs.xfs]
259,1 10 1127 0.380107882 4801 G D 3238076416 + 1024
[mkfs.xfs]
259,1 10 1128 0.380109258 4801 I D 3238076416 + 1024
[mkfs.xfs]
259,1 10 1129 0.380123914 2307 D D 3238076416 + 1024
[kworker/10:1H]
259,1 2 171 0.380130823 0 C D 3218707456 + 1024 [0]
259,1 10 1130 0.380156971 4801 Q D 3238077440 + 1024
[mkfs.xfs]
259,1 10 1131 0.380157308 4801 G D 3238077440 + 1024
[mkfs.xfs]
259,1 10 1132 0.380158354 4801 I D 3238077440 + 1024
[mkfs.xfs]
259,1 10 1133 0.380168948 2307 D D 3238077440 + 1024
[kworker/10:1H]
259,1 2 172 0.380186647 0 C D 3218708480 + 1024 [0]
259,1 10 1134 0.380197495 4801 Q D 3238078464 + 1024
[mkfs.xfs]
259,1 10 1135 0.380197848 4801 G D 3238078464 + 1024
[mkfs.xfs]
259,1 10 1136 0.380198724 4801 I D 3238078464 + 1024
[mkfs.xfs]
259,1 10 1137 0.380202964 2307 D D 3238078464 + 1024
[kworker/10:1H]
259,1 10 1138 0.380237133 4801 Q D 3238079488 + 1024
[mkfs.xfs]
259,1 10 1139 0.380237393 4801 G D 3238079488 + 1024
[mkfs.xfs]
259,1 10 1140 0.380238333 4801 I D 3238079488 + 1024
[mkfs.xfs]
259,1 10 1141 0.380252580 2307 D D 3238079488 + 1024
[kworker/10:1H]
259,1 2 173 0.380260605 0 C D 3218709504 + 1024 [0]
259,1 10 1142 0.380283800 4801 Q D 3238080512 + 1024
[mkfs.xfs]
259,1 10 1143 0.380284158 4801 G D 3238080512 + 1024
[mkfs.xfs]
259,1 10 1144 0.380285150 4801 I D 3238080512 + 1024
[mkfs.xfs]
259,1 10 1145 0.380297127 2307 D D 3238080512 + 1024
[kworker/10:1H]
259,1 10 1146 0.380324340 4801 Q D 3238081536 + 1024
[mkfs.xfs]
259,1 10 1147 0.380324648 4801 G D 3238081536 + 1024
[mkfs.xfs]
259,1 10 1148 0.380325663 4801 I D 3238081536 + 1024
[mkfs.xfs]
259,1 2 174 0.380328083 0 C D 3218710528 + 1024 [0]
So we see these one-megabyte requests; moreover, they are issued
sequentially.
> There's a metric f tonne of bugs in this area so before anything I'd
> consider researching if there's a firmware update for your hardware
> and applying that and retesting.
I don't have access to that machine any more (I could get some with a
bit of trouble). But I think it's clear from the traces that the
problem is in the RAID layer?
> And then also after testing your
> ideal deployed version, use something much close to upstream (Arch or
> Fedora) and see if the problem is reproducible.
I'm hoping the RAID maintainers can confirm at a glance whether the
problem exists or not, it doesn't look like a minor glitch but simply
that this code path doesn't take the issue into account.
^ permalink raw reply
* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2016-11-28 6:59 UTC (permalink / raw)
To: NeilBrown
Cc: Shaohua Li, linux-raid, Shaohua Li, Neil Brown,
Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <871sy1bfd9.fsf@notabene.neil.brown.name>
On 2016/11/24 下午1:45, NeilBrown wrote:
> On Wed, Nov 23 2016, Shaohua Li wrote:
>>
>> Thanks! The idea is awesome and does makes the code easier to
>> understand.
>
> This is a key point - simpler code!!
>
>>
>> For hash conflict, I'm worrying about one case. resync does
>> sequential access. Say we have a sequential normal IO. If the
>> first normal IO and resync IO have conflict, it's possible they
>> will have conflict subsequently, since both are doing sequential
>> access. We can change the hash to something like this:
>>
>> for the first 64M * 512, the hash is 0, 1, ... 511 For the second
>> 64M * 512, the hash is 1, 3, 5 ... The third 64M * 512, the hash
>> is 2, 5, 8 ...
>
> This would happen when the write and the resync are at different
> addresses which happen to collide in the hash. They would only
> remain in sync if the two proceeded at the same speed. Normally
> resync is throttled when there is competing IO, so that is fairly
> unlikely.
>
> If this really was important, it might make sense to just use
> hash_long() to make the relevant bits of the sector number into
> and index. (i.e. keep the code simple)
>
>>>
>>> If user has a (realy) large raid1 device, for example 10PB
>>> size, we may just increase the buckets number
>>> BARRIER_BUCKETS_NR. Now this is a macro, it is possible to be a
>>> raid1-created-time-defined variable in future.
>
> I don't think the size of the array is very relevant. No matter
> how big the array is, resync can be expected to directly interfere
> with 0.2% of all writes. So, assuming a fairly random distribution
> over times, most writes will most often not be blocked by resync at
> all.
>
> I wouldn't give any thought to any problems like this until they
> are actually measured.
>
Yes, I agree with you. If we do have to use more buckets in future,
current patch makes future modification easier.
>>> static void reschedule_retry(struct r1bio *r1_bio) @@ -215,10
>>> +214,15 @@ static void reschedule_retry(struct r1bi unsigned
>>> long flags; struct mddev *mddev = r1_bio->mddev; struct r1conf
>>> *conf = mddev->private; + sector_t sector_nr; + long idx;
>
> I wonder about the use of "long" for "idx". As that is an array
> index, it would have to be a REALLY big array before "long" is
> better than "int".
>
Good suggestion! I will modify it to int in next version.
>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>> if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) bio->bi_error =
>>> -EIO;
>>>
>>> - if (done) { + if (done) bio_endio(bio); - /* - * Wake up
>>> any possible resync thread that waits for the device - * to
>>> go idle. - */ - allow_barrier(conf, start_next_window,
>>> bi_sector); - } }
>>>
>>> static void raid_end_bio_io(struct r1bio *r1_bio) { struct bio
>>> *bio = r1_bio->master_bio; + struct r1conf *conf =
>>> r1_bio->mddev->private;
>>>
>>> /* if nobody has done the final endio yet, do it now */ if
>>> (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) { @@ -278,6
>>> +275,12 @@ static void raid_end_bio_io(struct r1bio
>>>
>>> call_bio_endio(r1_bio); } + + /* + * Wake up any possible
>>> resync thread that waits for the device + * to go idle. + */
>>> + allow_barrier(conf, r1_bio->sector);
>> Why this change?
>
> I wondered too. I think it may be correct, but it should be in a
> separate patch. When you have a write-mostly device, I think the
> current code will allow_barrier() before the writes to the
> write-mostly devices have completed.
>
I explain this in the reply to Shaohua's email. Current upstream code
increases/decreases conf->nr_pending for each bio (a.k.a
r1_bio->master_bio) received by raid1_make_request(). In this patch,
conf->nr_pending[idx] increases/decreases for each r1_bio, this is the
reason why allow_barrier() moves to raid1_end_bio_io(), you may also
find out wait_barrier() also changes its location and is replaced by
wait_barrier() and wait_read_barrier() in raid1_make_request().
>>>
>>> +static sector_t align_to_barrier_unit_end(sector_t
>>> start_sector, + sector_t sectors) +{ + sector_t len; + +
>>> WARN_ON(sectors == 0); + /* len is the number of sectors from
>>> start_sector to end of the + * barrier unit which start_sector
>>> belongs to. + */ + len = ((start_sector + sectors +
>>> (1<<BARRIER_UNIT_SECTOR_BITS) - 1) & +
>>> (~(BARRIER_UNIT_SECTOR_SIZE - 1))) - + start_sector; + +
>>> if (len > sectors) + len = sectors; + + return len; +}
>>
>> I'd prefer calling bio_split at the early of raid1_make_request
>> and split the bio if it crosses the bucket boundary. please see
>> raid10_make_request for example. resync will not cross the
>> boundary. So the code will not need to worry about the boundary.
>> I believe this will make the code simpiler (all the
>> align_to_barrier_unit_end calling can removed) and easy to
>> understand.
>
> align_to_barrier_unit_end() is only called twice, once for writes
> and once for resync/recovery. If bio_split() were used, you would
> still need the same number of calls.
>
> Changing raid1.c to use bio_split() more may well be a valuable
> simplification, but I think it is a separate issue.
>
To use bio_split() won't be simpler, indeed it is a little bit more
complexed, because when r1_bio->master_bio will be the split bio of
master bio, not the original master bio, that's why I decide to not
use it.
>>> wait_event_lock_irq(conf->wait_barrier, -
>>> !conf->array_frozen && - conf->barrier < RESYNC_DEPTH &&
>>> - conf->current_window_requests == 0 && -
>>> (conf->start_next_window >= - conf->next_resync +
>>> RESYNC_SECTORS), + !conf->array_frozen &&
>>> !conf->nr_pending[idx] && + conf->barrier[idx] <
>>> RESYNC_DEPTH, conf->resync_lock);
>>
>> So there is a slight behavior change. Originally we guarautee no
>> more than RESYNC_DEPTH sync. Now this only checks
>> 'conf->barrier[idx] < RESYNC_DEPTH'. How about barrier[idx-1],
>> barrier[idx-2]...? It's possible conf->barrier[idx] <
>> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH.
>> Not sure how important this is, but at least we can check
>> barrier[idx] + barrier[idx-1].
>
> I confess that I'm not entirely sure the point of RESYNC_DEPTH - it
> was already in the code when I started working on it. My best guess
> is that it limits the number of requests we need to wait for before
> regular writes can be permitted in the resync region. In that case,
> the current code is good enough.
>
> Your "barrier[idx] + barrier[idx-1]" idea is interesting, but would
> only make a difference 1/1024 of the time (bucket is 64Meg, the
> RESYNC_BLOCK_SIZE is 64K).
>
>
IMHO RESYNC_DEPTH is for better resync throughput, since now we have
linear resync, and only one resync window, I doubt whether it takes
effect or not. Now only one resync sliding window, therefore it is
very similar for testing the resync depth within the sliding window,
or within a single barrier bucket. A resync I/O won't hit more then
one barrier bucket now.
In future when parallel resync implemented, the modification here may
increase whole raid1 device resync depth to BARRIER_BUCKETS_NR *
RESYNC_DEPTH, which results better resync I/O when regular I/O is idle.
>>> + +static void wait_all_barriers(struct r1conf *conf) +{ + long
>>> idx; + + for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++) +
>>> _wait_barrier(conf, idx);
>>
>> This is racy. Since resync is sequential, idealy we only need to
>> check the bucket sequentially. The compiler could do
>> _wait_barrier(conf, 1) and then _wait_barrier(conf, 0). It's
>> possible the bucket 1 has barrier right after the check of bucket
>> 0. Even the compiler doesn't reorder, there is case the bucket
>> 511 should be check before bucket 0 if the resync is just
>> crossing 512 buckets. It would be better we check the resync
>> position and adjust the bucket wait order according to the
>> position.
>
> I cannot see how it is racy. No new sync requests will be issued
> at this time, so once ->barrier[idx] reaches 0, it will stay at
> zero.
>
Yes, this is what I mean.
>>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>> if (!conf) goto abort;
>>>
>>> + conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL); + if
>>> (!conf->nr_pending) + goto abort;
>>
>> This allocation is a little wierd. I'd define a count and uses
>> sizeof(nr_pending) * count to do the allocation. There is nothing
>> related to PAGE_SIZE. Alternatively, just make nr_pending an
>> array in r1conf.
>
> The thing about PAGE_SIZE is that the allocation will succeed and
> won't waste space. If you make all the arrays simple members of
> r1conf, r1conf will be about 4pages larger, so will require an
> 8-page allocation, which is more likely to fail. I agree that it
> seems a little weird, but I think it results in the best use of
> memory.
Yes, this is what I mean.
Thanks for your comments!
Coly
^ permalink raw reply
* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2016-11-28 6:42 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
Guoqing Jiang
In-Reply-To: <20161122213541.btgw4cpoly5j4jpc@kernel.org>
On 2016/11/23 上午5:35, Shaohua Li wrote:
> On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> I/O barriers to happen only inside a slidingresync window, for regular
>> I/Os out of this resync window they don't need to wait for barrier any
>> more. On large raid1 device, it helps a lot to improve parallel writing
>> I/O throughput when there are background resync I/Os performing at
>> same time.
>>
>> The idea of sliding resync widow is awesome, but there are several
>> challenges are very difficult to solve,
>> - code complexity
>> Sliding resync window requires several veriables to work collectively,
>> this is complexed and very hard to make it work correctly. Just grep
>> "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>> the original resync window patch. This is not the end, any further
>> related modification may easily introduce more regreassion.
>> - multiple sliding resync windows
>> Currently raid1 code only has a single sliding resync window, we cannot
>> do parallel resync with current I/O barrier implementation.
>> Implementing multiple resync windows are much more complexed, and very
>> hard to make it correctly.
>>
>> Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> removing resync window code, I believe life will be much easier.
>>
>> The brief idea of the simpler barrier is,
>> - Do not maintain a logbal unique resync window
>> - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>> I/O only has to wait for a resync I/O when both them have same barrier
>> bucket index, vice versa.
>> - I/O barrier can be recuded to an acceptable number if there are enought
>> barrier buckets
>>
>> Here I explain how the barrier buckets are designed,
>> - BARRIER_UNIT_SECTOR_SIZE
>> The whole LBA address space of a raid1 device is divided into multiple
>> barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>> Bio request won't go across border of barrier unit size, that means
>> maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
>> - BARRIER_BUCKETS_NR
>> There are BARRIER_BUCKETS_NR buckets in total, if multiple I/O requests
>> hit different barrier units, they only need to compete I/O barrier with
>> other I/Os which hit the same barrier bucket index with each other. The
>> index of a barrier bucket which a bio should look for is calculated by
>> get_barrier_bucket_idx(),
>> (sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>> sector is the start sector number of a bio. align_to_barrier_unit_end()
>> will make sure the finall bio sent into generic_make_request() won't
>> exceed border of the barrier unit size.
>> - RRIER_BUCKETS_NR
>> Number of barrier buckets is defined by,
>> #define BARRIER_BUCKETS_NR (PAGE_SIZE/sizeof(long))
>> For 4KB page size, there are 512 buckets for each raid1 device. That
>> means the propobility of full random I/O barrier confliction may be
>> reduced down to 1/512.
>
> Thanks! The idea is awesome and does makes the code easier to understand.
>
> For hash conflict, I'm worrying about one case. resync does sequential access.
> Say we have a sequential normal IO. If the first normal IO and resync IO have
> conflict, it's possible they will have conflict subsequently, since both are
> doing sequential access. We can change the hash to something like this:
>
> for the first 64M * 512, the hash is 0, 1, ... 511
> For the second 64M * 512, the hash is 1, 3, 5 ...
> The third 64M * 512, the hash is 2, 5, 8 ...
>
> It should be very easy to implement something like this, and this should reduce
> the conflict of sequential access.
>
Hi Shaohua,
What I concerned was memory footprint. For very fast sequential I/O,
lineal mapping buckets means each (64 bytes) cache line contains 8 long
integer, it is very friendly for CPU caching,
- sequential writing 8 barrier units only requires 1 memory fetching
for each barrier variables (barrier[], nr_pending[], nr_waiting[],
nr_queued[]).
- memory prefetch may have positive effect in sequential I/O.
It will be very rare that resync I/O and regular I/O are always
conflicted in same barrier bucket: resync I/O throughput usually is
slower than regular I/O, it is very hard to keep them always conflicting
in same bucket. If this condition really happens, current sliding resync
window implementation may also have a similar conflict (regular I/O
always hits resync window). So the barrier buckets don't make things worse.
If we use a non-continuous mapping, memory fingerprint of the buckets
will be quite distributed, and occupies more cache lines for a
sequential I/O, which will increase the probability of more cache bounce
if there are multiple sequential I/O (on different offset) happen on
different CPU cores.
This is why I use a very simple linear buckets hashing.
>> Comparing to single sliding resync window,
>> - Currently resync I/O grows linearly, therefore regular and resync I/O
>> will have confliction within a single barrier units. So it is similar to
>> single sliding resync window.
>> - But a barrier unit bucket is shared by all barrier units with identical
>> barrier uinit index, the probability of confliction might be higher
>> than single sliding resync window, in condition that writing I/Os
>> always hit barrier units which have identical barrier bucket index with
>> the resync I/Os. This is a very rare condition in real I/O work loads,
>> I cannot imagine how it could happen in practice.
>> - Therefore we can achieve a good enough low confliction rate with much
>> simpler barrier algorithm and implementation.
>>
>> If user has a (realy) large raid1 device, for example 10PB size, we may
>> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
>> it is possible to be a raid1-created-time-defined variable in future.
>>
>> There are two changes should be noticed,
>> - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>> single loop, it looks like this,
>> spin_lock_irqsave(&conf->device_lock, flags);
>> conf->nr_queued[idx]--;
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> This change generates more spin lock operations, but in next patch of
>> this patch set, it will be replaced by a single line code,
>> atomic_dec(conf->nr_queueud[idx]);
>> So we don't need to worry about spin lock cost here.
>> - In raid1_make_request(), wait_barrier() is replaced by,
>> a) wait_read_barrier(): wait barrier in regular read I/O code path
>> b) wait_barrier(): wait barrier in regular write I/O code path
>> The differnece is wait_read_barrier() only waits if array is frozen, I
>> am not able to combile them into one function, because they must receive
>> differnet data types in their arguments list.
>> - align_to_barrier_unit_end() is called to make sure both regular and
>> resync I/O won't go across the border of a barrier unit size.
>>
>> Open question:
>> - Need review from md clustring developer, I don't touch related code now.
>
> Don't think it matters, but please open eyes, Guoqing!
>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Shaohua Li <shli@fb.com>
>> Cc: Neil Brown <neilb@suse.de>
>> Cc: Johannes Thumshirn <jthumshirn@suse.de>
>> Cc: Guoqing Jiang <gqjiang@suse.com>
>> ---
>> drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
>> drivers/md/raid1.h | 42 +++++-------
>> 2 files changed, 242 insertions(+), 189 deletions(-)
>>
>> Index: linux-raid1/drivers/md/raid1.c
>> ===================================================================
>> --- linux-raid1.orig/drivers/md/raid1.c
>> +++ linux-raid1/drivers/md/raid1.c
>> @@ -66,9 +66,8 @@
>> */
>> static int max_queued_requests = 1024;
>>
>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>> - sector_t bi_sector);
>> -static void lower_barrier(struct r1conf *conf);
>> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>
>> static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
>> {
>> @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
>> #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>> #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>> #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>>
>> static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>> {
>> @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
>> {
>> struct r1conf *conf = r1_bio->mddev->private;
>> int i;
>> + sector_t sector_nr = r1_bio->sector;
>>
>> for (i = 0; i < conf->raid_disks * 2; i++) {
>> struct bio *bio = r1_bio->bios[i];
>> @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
>>
>> mempool_free(r1_bio, conf->r1buf_pool);
>>
>> - lower_barrier(conf);
>> + lower_barrier(conf, sector_nr);
>> }
>>
>> static void reschedule_retry(struct r1bio *r1_bio)
>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>> unsigned long flags;
>> struct mddev *mddev = r1_bio->mddev;
>> struct r1conf *conf = mddev->private;
>> + sector_t sector_nr;
>> + long idx;
>> +
>> + sector_nr = r1_bio->sector;
>> + idx = get_barrier_bucket_idx(sector_nr);
>>
>> spin_lock_irqsave(&conf->device_lock, flags);
>> list_add(&r1_bio->retry_list, &conf->retry_list);
>> - conf->nr_queued ++;
>> + conf->nr_queued[idx]++;
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>> wake_up(&conf->wait_barrier);
>> @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
>> struct bio *bio = r1_bio->master_bio;
>> int done;
>> struct r1conf *conf = r1_bio->mddev->private;
>> - sector_t start_next_window = r1_bio->start_next_window;
>> - sector_t bi_sector = bio->bi_iter.bi_sector;
>>
>> if (bio->bi_phys_segments) {
>> unsigned long flags;
>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>> if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>> bio->bi_error = -EIO;
>>
>> - if (done) {
>> + if (done)
>> bio_endio(bio);
>> - /*
>> - * Wake up any possible resync thread that waits for the device
>> - * to go idle.
>> - */
>> - allow_barrier(conf, start_next_window, bi_sector);
>> - }
>> }
>>
>> static void raid_end_bio_io(struct r1bio *r1_bio)
>> {
>> struct bio *bio = r1_bio->master_bio;
>> + struct r1conf *conf = r1_bio->mddev->private;
>>
>> /* if nobody has done the final endio yet, do it now */
>> if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>
>> call_bio_endio(r1_bio);
>> }
>> +
>> + /*
>> + * Wake up any possible resync thread that waits for the device
>> + * to go idle.
>> + */
>> + allow_barrier(conf, r1_bio->sector);
> Why this change?
>
I move allow_barrier() here on purpose. Because current raid1 code
raises nr_pending for each master bio, the barrier buckets patch raises
nr_pending[idx] for each r1_bio.
Current code increases nr_pending for each master bio (the bio structure
received by raid1_make_request()). A master bio may be split by multiple
r1_bio structures, when all r1_bio completes, nr_pending is decreased
for the original master bio.
Since the master bio may go across multiple barrier units, in this patch
master bio will be split into multiple r1_bio structures for each
barrier units. In this case, we need to call wait_barrer() for each
r1_bio, and call allow_barrier() on each r1_bio completion. This is why
allow_barrier() is moved form master bio completion time to r1_bio
completion time.
A master bio may also be split into multiple r1_bio if bad blocks
encountered, and these r1_bio may stay in same barrier unit. In this
case the different r1_bio does increase nr_pending[] for same bucket
index, we should also call wait_barrier() for each r1_bio and call
allow_barrier() at its completion time.
>> free_r1bio(r1_bio);
>> }
>>
>> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>> return mirror;
>> }
>>
>> +/* bi_end_io callback for regular READ bio */
>> static void raid1_end_read_request(struct bio *bio)
>> {
>> int uptodate = !bio->bi_error;
>> @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
>> bio_put(to_put);
>> }
>>
>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>> + sector_t sectors)
>> +{
>> + sector_t len;
>> +
>> + WARN_ON(sectors == 0);
>> + /* len is the number of sectors from start_sector to end of the
>> + * barrier unit which start_sector belongs to.
>> + */
>> + len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>> + (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>> + start_sector;
>> +
>> + if (len > sectors)
>> + len = sectors;
>> +
>> + return len;
>> +}
>
> I'd prefer calling bio_split at the early of raid1_make_request and split the
> bio if it crosses the bucket boundary. please see raid10_make_request for
> example. resync will not cross the boundary. So the code will not need to worry
> about the boundary. I believe this will make the code simpiler (all the
> align_to_barrier_unit_end calling can removed) and easy to understand.
>
Indeed, my first implementation uses bio_split(). The reasons why I
don't use it later are,
- indeed I need to write more code.
- I can't simply use existing r1_bio completion code to call
bio_end_io() to the master bio when bio->bi_phys_segments is zero (see
call_bio_endio()). Because r1_bio->master_bio is the split bio, not the
original master bio received by raid1_make_request()
Therefore finally I decide to use align_to_barrier_unit_end() to
generate more r1_bio structures if the original master bio goes across
barrier unit size, it makes less modification in this patch.
>> +
>> /*
>> * This routine returns the disk from which the requested read should
>> * be done. There is a per-array 'next expected sequential IO' sector
>> @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
>> conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>> }
>> rcu_read_unlock();
>> + sectors = align_to_barrier_unit_end(this_sector, sectors);
>> *max_sectors = sectors;
>>
>> return best_disk;
>> @@ -779,168 +803,174 @@ static void flush_pending_writes(struct
>> * there is no normal IO happeing. It must arrange to call
>> * lower_barrier when the particular background IO completes.
>> */
>> +
>> static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>> {
>> + long idx = get_barrier_bucket_idx(sector_nr);
>> +
>> spin_lock_irq(&conf->resync_lock);
>>
>> /* Wait until no block IO is waiting */
>> - wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
>> + wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>> conf->resync_lock);
>>
>> /* block any new IO from starting */
>> - conf->barrier++;
>> - conf->next_resync = sector_nr;
>> + conf->barrier[idx]++;
>>
>> /* For these conditions we must wait:
>> * A: while the array is in frozen state
>> - * B: while barrier >= RESYNC_DEPTH, meaning resync reach
>> - * the max count which allowed.
>> - * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
>> - * next resync will reach to the window which normal bios are
>> - * handling.
>> - * D: while there are any active requests in the current window.
>> + * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
>> + * existing in sector number ranges corresponding to idx.
>> + * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
>> + * the max count which allowed in sector number ranges
>> + * conrresponding to idx.
>> */
>> wait_event_lock_irq(conf->wait_barrier,
>> - !conf->array_frozen &&
>> - conf->barrier < RESYNC_DEPTH &&
>> - conf->current_window_requests == 0 &&
>> - (conf->start_next_window >=
>> - conf->next_resync + RESYNC_SECTORS),
>> + !conf->array_frozen && !conf->nr_pending[idx] &&
>> + conf->barrier[idx] < RESYNC_DEPTH,
>> conf->resync_lock);
>
> So there is a slight behavior change. Originally we guarautee no more than
> RESYNC_DEPTH sync. Now this only checks 'conf->barrier[idx] < RESYNC_DEPTH'.
> How about barrier[idx-1], barrier[idx-2]...? It's possible conf->barrier[idx] <
> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH. Not sure how
> important this is, but at least we can check barrier[idx] + barrier[idx-1].
The original code wants to rise more multiple barriers, but less then
RESYNC_DEPTH. It helps resync I/O throughput with less resync I/O and
regular I/O conflict. Considering conf->barrier is a global barrier,
large RESYNC_DEPTH will starve regular I/O, it is only set to 32 for
whole raid1 device.
In the barrier buckets patch, conf->barrier[idx] only takes effect in a
single bucket. More barriers raised in this barrier buckets won't
interfere regular I/O in other barrier buckets, therefore we can have
much more resync I/O barriers to raise, which is good for resync I/O
throughput without starve more regular I/Os.
If we have 512 barrier buckets, that means on the whole raid1 device
there are 512*RESYNC_DEPTH barriers can be raised, and allows more
parallel resync I/O.
Before implementing parallel resync, I keep RESYNC_DEPTH as 32 in this
patch, because we only have a resync I/O hits one barrier buckets, same
RESYNC_DEPTH won't change barrier behavior now. Latter when we have
parallel resync I/O, let's see whether we need to modify this value,
also still keep it as 32.
>> -
>> - conf->nr_pending++;
>> + conf->nr_pending[idx]++;
>> spin_unlock_irq(&conf->resync_lock);
>> }
>>
>> -static void lower_barrier(struct r1conf *conf)
>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>> {
>> unsigned long flags;
>> - BUG_ON(conf->barrier <= 0);
>> + long idx = get_barrier_bucket_idx(sector_nr);
>> +
>> + BUG_ON(conf->barrier[idx] <= 0);
>> spin_lock_irqsave(&conf->resync_lock, flags);
>> - conf->barrier--;
>> - conf->nr_pending--;
>> + conf->barrier[idx]--;
>> + conf->nr_pending[idx]--;
>> spin_unlock_irqrestore(&conf->resync_lock, flags);
>> wake_up(&conf->wait_barrier);
>> }
>>
>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>> +/* A regular I/O should wait when,
>> + * - The whole array is frozen (both READ and WRITE)
>> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
>> + */
>> +static void _wait_barrier(struct r1conf *conf, long idx)
>> {
>> - bool wait = false;
>> -
>> - if (conf->array_frozen || !bio)
>> - wait = true;
>> - else if (conf->barrier && bio_data_dir(bio) == WRITE) {
>> - if ((conf->mddev->curr_resync_completed
>> - >= bio_end_sector(bio)) ||
>> - (conf->next_resync + NEXT_NORMALIO_DISTANCE
>> - <= bio->bi_iter.bi_sector))
>> - wait = false;
>> - else
>> - wait = true;
>> + spin_lock_irq(&conf->resync_lock);
>> + if (conf->array_frozen || conf->barrier[idx]) {
>> + conf->nr_waiting[idx]++;
>> + /* Wait for the barrier to drop. */
>> + wait_event_lock_irq(
>> + conf->wait_barrier,
>> + !conf->array_frozen && !conf->barrier[idx],
>> + conf->resync_lock);
>> + conf->nr_waiting[idx]--;
>> }
>>
>> - return wait;
>> + conf->nr_pending[idx]++;
>> + spin_unlock_irq(&conf->resync_lock);
>> }
>>
>> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>> {
>> - sector_t sector = 0;
>> + long idx = get_barrier_bucket_idx(sector_nr);
>>
>> spin_lock_irq(&conf->resync_lock);
>> - if (need_to_wait_for_sync(conf, bio)) {
>> - conf->nr_waiting++;
>> - /* Wait for the barrier to drop.
>> - * However if there are already pending
>> - * requests (preventing the barrier from
>> - * rising completely), and the
>> - * per-process bio queue isn't empty,
>> - * then don't wait, as we need to empty
>> - * that queue to allow conf->start_next_window
>> - * to increase.
>> - */
>> - wait_event_lock_irq(conf->wait_barrier,
>> - !conf->array_frozen &&
>> - (!conf->barrier ||
>> - ((conf->start_next_window <
>> - conf->next_resync + RESYNC_SECTORS) &&
>> - current->bio_list &&
>> - !bio_list_empty(current->bio_list))),
>> - conf->resync_lock);
>> - conf->nr_waiting--;
>> - }
>> -
>> - if (bio && bio_data_dir(bio) == WRITE) {
>> - if (bio->bi_iter.bi_sector >= conf->next_resync) {
>> - if (conf->start_next_window == MaxSector)
>> - conf->start_next_window =
>> - conf->next_resync +
>> - NEXT_NORMALIO_DISTANCE;
>> -
>> - if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
>> - <= bio->bi_iter.bi_sector)
>> - conf->next_window_requests++;
>> - else
>> - conf->current_window_requests++;
>> - sector = conf->start_next_window;
>> - }
>> + if (conf->array_frozen) {
>> + conf->nr_waiting[idx]++;
>> + /* Wait for array to unfreeze */
>> + wait_event_lock_irq(
>> + conf->wait_barrier,
>> + !conf->array_frozen,
>> + conf->resync_lock);
>> + conf->nr_waiting[idx]--;
>> }
>> -
>> - conf->nr_pending++;
>> + conf->nr_pending[idx]++;
>> spin_unlock_irq(&conf->resync_lock);
>> - return sector;
>> }
>>
>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>> - sector_t bi_sector)
>> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>> +{
>> + long idx = get_barrier_bucket_idx(sector_nr);
>> +
>> + _wait_barrier(conf, idx);
>> +}
>> +
>> +static void wait_all_barriers(struct r1conf *conf)
>> +{
>> + long idx;
>> +
>> + for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>> + _wait_barrier(conf, idx);
>
> This is racy. Since resync is sequential, idealy we only need to check the
> bucket sequentially. The compiler could do _wait_barrier(conf, 1) and then
> _wait_barrier(conf, 0). It's possible the bucket 1 has barrier right after the
> check of bucket 0. Even the compiler doesn't reorder, there is case the bucket
> 511 should be check before bucket 0 if the resync is just crossing 512 buckets.
> It would be better we check the resync position and adjust the bucket wait
> order according to the position.
>
wait_all_barriers() and allow_all_barriers() are used in close_sync() only,
static void close_sync(struct r1conf *conf)
{
- wait_barrier(conf, NULL);
- allow_barrier(conf, 0, 0);
+ wait_all_barriers(conf);
+ allow_all_barriers(conf);
[snip]
}
wait_barrier() and allow_barrier() called here is for a synchronization
purpose, to make sure before close_sync() returns there is no resync I/O
existing.
close_sync() is called in raid1_sync_request() when resync I/O exceeds
the size of raid1 device, and resync should be closed. In this
condition, there won't be any new resync I/O generated. If a
conf->barrier[idx] reaches 0, it won't increase before a new resync
restarts. Therefore it is safe to call _wait_barrier() one by one, until
every conf->barrier[idx] reaches to 0.
This is a usage of wait/allow_barrier() which is not mentioned in the
code. They are not for regular I/O waiting for resync I/O, they are used
as a synchronization to make sure all on-flying resync I/O to complete.
>> int good_sectors = RESYNC_SECTORS;
>> int min_bad = 0; /* number of sectors that are bad in all devices */
>> + long idx = get_barrier_bucket_idx(sector_nr);
>>
>> if (!conf->r1buf_pool)
>> if (init_resync(conf))
>> @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
>> * If there is non-resync activity waiting for a turn, then let it
>> * though before starting on this new sync request.
>> */
>> - if (conf->nr_waiting)
>> + if (conf->nr_waiting[idx])
>> schedule_timeout_uninterruptible(1);
>>
>> /* we are incrementing sector_nr below. To be safe, we check against
>> @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
>> r1_bio->sector = sector_nr;
>> r1_bio->state = 0;
>> set_bit(R1BIO_IsSync, &r1_bio->state);
>> + /* make sure good_sectors won't go across barrier unit border */
>> + good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>>
>> for (i = 0; i < conf->raid_disks * 2; i++) {
>> struct md_rdev *rdev;
>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>> if (!conf)
>> goto abort;
>>
>> + conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> + if (!conf->nr_pending)
>> + goto abort;
>
> This allocation is a little wierd. I'd define a count and uses
> sizeof(nr_pending) * count to do the allocation. There is nothing related to
> PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.
This is just for better memory usage, r1conf is allocated by slab
allocator, large not aligned size may cause internal memory
fragmentation inside slab's pages. call kzalloc() with PAGE_SIZE will
avoid such issue.
Coly
^ permalink raw reply
* Re: raid0 vs. mkfs
From: Shaohua Li @ 2016-11-28 6:08 UTC (permalink / raw)
To: NeilBrown; +Cc: Avi Kivity, linux-raid
In-Reply-To: <87oa108a1x.fsf@notabene.neil.brown.name>
On Mon, Nov 28, 2016 at 04:09:46PM +1100, Neil Brown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>
> > mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
> > disk that supports TRIM/DISCARD (erase whichever is inappropriate).
> > That is because mkfs issues a TRIM/DISCARD (erase whichever is
> > inappropriate) for the entire partition. As far as I can tell, md
> > converts the large TRIM/DISCARD (erase whichever is inappropriate) into
> > a large number of TRIM/DISCARD (erase whichever is inappropriate)
> > requests, one per chunk-size worth of disk, and issues them to the RAID
> > components individually.
> >
> >
> > It seems to me that md can convert the large TRIM/DISCARD (erase
> > whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
> > whichever is inappropriate) per RAID component, converting an O(disk
> > size / chunk size) operation into an O(number of RAID components)
> > operation, which is much faster.
> >
> >
> > I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
> > the operation taking about a quarter of an hour, continuously pushing
> > half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
> > to the disk. Linux 4.1.12.
>
> Surely it is the task of the underlying driver, or the queuing
> infrastructure, to merge small requests into large requests.
> Why should this have anything do to with RAID0?
That's what I'm wondering too. The block layer should merge the small requests
into larget one, and it does, IIRC. There is one bug related to this (commit
9c573de3283af007e), but it only applies to 4.3+.
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH] md/raid5: limit request size according to implementation limits
From: Konstantin Khlebnikov @ 2016-11-28 6:06 UTC (permalink / raw)
To: Coly Li
Cc: Konstantin Khlebnikov, Shaohua Li, Neil Brown, linux-raid,
Linux Kernel Mailing List, Stable
In-Reply-To: <e4f231db-6e4b-3d7f-e7a0-362dc4441823@suse.de>
On Mon, Nov 28, 2016 at 7:40 AM, Coly Li <colyli@suse.de> wrote:
> On 2016/11/28 上午12:32, Konstantin Khlebnikov wrote:
>> Current implementation employ 16bit counter of active stripes in lower
>> bits of bio->bi_phys_segments. If request is big enough to overflow
>> this counter bio will be completed and freed too early.
>>
>> Fortunately this not happens in default configuration because several
>> other limits prevent that: stripe_cache_size * nr_disks effectively
>> limits count of active stripes. And small max_sectors_kb at lower
>> disks prevent that during normal read/write operations.
>>
>> Overflow easily happens in discard if it's enabled by module parameter
>> "devices_handle_discard_safely" and stripe_cache_size is set big enough.
>>
>> This patch limits requests size with 256Mb - 8Kb to prevent overflows.
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Neil Brown <neilb@suse.com>
>> Cc: stable@vger.kernel.org
>> ---
>> drivers/md/raid5.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 92ac251e91e6..cce6057b9aca 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6984,6 +6984,15 @@ static int raid5_run(struct mddev *mddev)
>> stripe = (stripe | (stripe-1)) + 1;
>> mddev->queue->limits.discard_alignment = stripe;
>> mddev->queue->limits.discard_granularity = stripe;
>> +
>> + /*
>> + * We use 16-bit counter of active stripes in bi_phys_segments
>> + * (minus one for over-loaded initialization)
>> + */
>> + blk_queue_max_hw_sectors(mddev->queue, 0xfffe * STRIPE_SECTORS);
>> + blk_queue_max_discard_sectors(mddev->queue,
>> + 0xfffe * STRIPE_SECTORS);
>> +
>
> Could you please to explain why use 0xfffe * STRIPE_SECTORS here ?
This code send individual bio to lower device for each STRIPE_SECTORS (8)
and count them in 16-bit counter 0xffff max (you could find this
constant above in this file)
but counter initialized with 1 to prevent hitting zero during generation
thus maximum is 0xfffe stripes which is 256Mb - 8Kb in bytes
>
> Thanks.
>
> Coly
>
^ permalink raw reply
* Re: [patch] md/r5cache: enable IRQs on error path
From: Shaohua Li @ 2016-11-28 5:37 UTC (permalink / raw)
To: Dan Carpenter; +Cc: Song Liu, linux-raid, kernel-janitors
In-Reply-To: <20161124111304.GI17225@mwanda>
On Thu, Nov 24, 2016 at 02:13:04PM +0300, Dan Carpenter wrote:
> We need to re-enable the IRQs here before returning.
>
> Fixes: a39f7afde358 ("md/r5cache: write-out phase and reclaim support")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Applied, thanks!
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 2a60ce4..6610134 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1029,7 +1029,7 @@ static sector_t r5c_calculate_new_cp(struct r5conf *conf)
> spin_lock_irqsave(&log->stripe_in_journal_lock, flags);
> if (list_empty(&conf->log->stripe_in_journal_list)) {
> /* all stripes flushed */
> - spin_unlock(&log->stripe_in_journal_lock);
> + spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags);
> return log->next_checkpoint;
> }
> sh = list_first_entry(&conf->log->stripe_in_journal_list,
^ permalink raw reply
* Re: [PATCH v5] md/r5cache: handle alloc_page failure
From: Shaohua Li @ 2016-11-28 5:34 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161124065039.2151784-1-songliubraving@fb.com>
On Wed, Nov 23, 2016 at 10:50:39PM -0800, Song Liu wrote:
> RMW of r5c write back cache uses an extra page to store old data for
> prexor. handle_stripe_dirtying() allocates this page by calling
> alloc_page(). However, alloc_page() may fail.
>
> To handle alloc_page() failures, this patch adds an extra page to
> disk_info. When alloc_page fails, handle_stripe() trys to use these
> pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
> the stripe is added to delayed_list.
Thanks, Applied!
^ permalink raw reply
* Re: raid0 vs. mkfs
From: NeilBrown @ 2016-11-28 5:09 UTC (permalink / raw)
To: Avi Kivity, linux-raid
In-Reply-To: <56c83c4e-d451-07e5-88e2-40b085d8681c@scylladb.com>
[-- Attachment #1: Type: text/plain, Size: 1357 bytes --]
On Mon, Nov 28 2016, Avi Kivity wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
> That is because mkfs issues a TRIM/DISCARD (erase whichever is
> inappropriate) for the entire partition. As far as I can tell, md
> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
> a large number of TRIM/DISCARD (erase whichever is inappropriate)
> requests, one per chunk-size worth of disk, and issues them to the RAID
> components individually.
>
>
> It seems to me that md can convert the large TRIM/DISCARD (erase
> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
> whichever is inappropriate) per RAID component, converting an O(disk
> size / chunk size) operation into an O(number of RAID components)
> operation, which is much faster.
>
>
> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
> the operation taking about a quarter of an hour, continuously pushing
> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
> to the disk. Linux 4.1.12.
Surely it is the task of the underlying driver, or the queuing
infrastructure, to merge small requests into large requests.
Why should this have anything do to with RAID0?
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28 4:47 UTC (permalink / raw)
To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey
In-Reply-To: <CAMGffE=L_XziWyV+wCJFqKj+v_ahOPNSugZ3uY1q5ixd4iv2fQ@mail.gmail.com>
On 2016/11/25 下午9:59, Jinpu Wang wrote:
> On Fri, Nov 25, 2016 at 2:30 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
>> Hi,
>>
>> I'm hitting hung task in mdx_raid1 when running test, I can reproduce
>> it easily with my tests below:
>>
>> I create one md with one local loop device and one remote scsi
>> exported by SRP. running fio with mix rw on top of md, force_close
>> session on storage side. mdx_raid1 is wait on free_array in D state,
>> and a lot of fio also in D state in wait_barrier.
>>
>> [ 335.154711] blk_update_request: I/O error, dev sdb, sector 8
>> [ 335.154855] md: super_written gets error=-5
>> [ 335.154999] md/raid1:md1: Disk failure on sdb, disabling device.
>> md/raid1:md1: Operation continuing on 1 devices.
>> [ 335.155258] sd 1:0:0:0: rejecting I/O to offline device
>> [ 335.155402] blk_update_request: I/O error, dev sdb, sector 80
>> [ 335.155547] md: super_written gets error=-5
>> [ 340.158828] scsi host1: ib_srp: reconnect succeeded
>> [ 373.017608] md/raid1:md1: redirecting sector 616617 to other mirror: loop1
>> [ 373.110527] md/raid1:md1: redirecting sector 1320893 to other mirror: loop1
>> [ 373.117230] md/raid1:md1: redirecting sector 1564499 to other mirror: loop1
>> [ 373.127652] md/raid1:md1: redirecting sector 104034 to other mirror: loop1
>> [ 373.135665] md/raid1:md1: redirecting sector 1209765 to other mirror: loop1
>> [ 373.145634] md/raid1:md1: redirecting sector 51200 to other mirror: loop1
>> [ 373.158824] md/raid1:md1: redirecting sector 755750 to other mirror: loop1
>> [ 373.169964] md/raid1:md1: redirecting sector 1681631 to other mirror: loop1
>> [ 373.178619] md/raid1:md1: redirecting sector 1894296 to other mirror: loop1
>> [ 373.186153] md/raid1:md1: redirecting sector 1905016 to other mirror: loop1
>> [ 374.364370] RAID1 conf printout:
>> [ 374.364377] --- wd:1 rd:2
>> [ 374.364379] disk 0, wo:1, o:0, dev:sdb
>> [ 374.364381] disk 1, wo:0, o:1, dev:loop1
>> [ 374.437099] RAID1 conf printout:
>> [ 374.437103] --- wd:1 rd:2
>> snip
>>
>>
>> [ 810.266112] sysrq: SysRq : Show Blocked State
>> [ 810.266235] task PC stack pid father
>> [ 810.266362] md1_raid1 D ffff88022d927c48 0 4022 2 0x00000000
>> [ 810.266487] ffff88022d927c48 ffff8802351a0000 ffff8800b91bc100
>> 000000008010000e
>> [ 810.266747] ffff88022d927c30 ffff88022d928000 0000000000000001
>> ffff880233b49b70
>> [ 810.266975] ffff880233b49b88 ffff8802325d5a40 ffff88022d927c60
>> ffffffff81810600
>> [ 810.267203] Call Trace:
>> [ 810.267322] [<ffffffff81810600>] schedule+0x30/0x80
>> [ 810.267437] [<ffffffffa01342c1>] freeze_array+0x71/0xc0 [raid1]
>> [ 810.267555] [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
>> [ 810.267669] [<ffffffffa013578b>] handle_read_error+0x3b/0x570 [raid1]
>> [ 810.267816] [<ffffffff81185783>] ? kmem_cache_free+0x183/0x190
>> [ 810.267929] [<ffffffff81094e36>] ? __wake_up+0x46/0x60
>> [ 810.268045] [<ffffffffa0136dcd>] raid1d+0x20d/0xfc0 [raid1]
>> [ 810.268159] [<ffffffff81813043>] ? schedule_timeout+0x1a3/0x230
>> [ 810.268274] [<ffffffff8180fe77>] ? __schedule+0x2e7/0xa40
>> [ 810.268391] [<ffffffffa0211839>] md_thread+0x119/0x120 [md_mod]
>> [ 810.268508] [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
>> [ 810.268624] [<ffffffffa0211720>] ? find_pers+0x70/0x70 [md_mod]
>> [ 810.268741] [<ffffffff81075614>] kthread+0xc4/0xe0
>> [ 810.268853] [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
>> [ 810.268970] [<ffffffff8181415f>] ret_from_fork+0x3f/0x70
>> [ 810.269114] [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
>> [ 810.269227] fio D ffff8802325137a0 0 4212 4206 0x00000000
>> [ 810.269347] ffff8802325137a0 ffff88022de3db00 ffff8800ba7bb400
>> 0000000000000000
>> [ 810.269574] ffff880233b49b00 ffff880232513788 ffff880232514000
>> ffff880233b49b88
>> [ 810.269801] ffff880233b49b70 ffff8800ba7bb400 ffff8800b5f5db00
>> ffff8802325137b8
>> [ 810.270028] Call Trace:
>> [ 810.270138] [<ffffffff81810600>] schedule+0x30/0x80
>> [ 810.270282] [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
>> [ 810.270396] [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
>> [ 810.270513] [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
>> [ 810.270628] [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
>> [ 810.270746] [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
>> [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
>> [ 810.270976] [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
>> [ 810.271090] [<ffffffff813fc362>] submit_bio+0x62/0x140
>> [ 810.271209] [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
>> [ 810.271323] [<ffffffff81810600>] ? schedule+0x30/0x80
>> [ 810.271468] [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
>> [ 810.271580] [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
>> [ 810.271696] [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
>> [ 810.271828] [<ffffffff81132cbf>] generic_file_read_iter+0x44f/0x570
>> [ 810.271949] [<ffffffff811ceaa0>] ? blkdev_write_iter+0x110/0x110
>> [ 810.272062] [<ffffffff811cead0>] blkdev_read_iter+0x30/0x40
>> [ 810.272179] [<ffffffff811de5a6>] aio_run_iocb+0x126/0x2b0
>> [ 810.272291] [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
>> [ 810.272407] [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
>> [ 810.272521] [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
>> [ 810.272665] [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
>> [ 810.272778] [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
>> [ 810.272895] [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
>> [ 810.273007] [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66
>> [ 810.273130] fio D ffff88022fa6f730 0 4213 4206 0x00000000
>> [ 810.273247] ffff88022fa6f730 ffff8800b549a700 ffff8800af703400
>> 0000000002011200
>> [ 810.273475] ffff880236001700 ffff88022fa6f718 ffff88022fa70000
>> ffff880233b49b88
>> [ 810.273702] ffff880233b49b70 ffff8800af703400 ffff88022f843700
>> ffff88022fa6f748
>> [ 810.273958] Call Trace:
>> [ 810.274070] [<ffffffff81810600>] schedule+0x30/0x80
>> [ 810.274183] [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
>> [ 810.274300] [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
>> [ 810.274413] [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
>> [ 810.274537] [<ffffffff81408f15>] ? __bt_get.isra.7+0xd5/0x1b0
>> [ 810.274650] [<ffffffff81094feb>] ? finish_wait+0x5b/0x80
>> [ 810.274766] [<ffffffff8140917f>] ? bt_get+0x18f/0x1b0
>> [ 810.274881] [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
>> [ 810.274998] [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
>> [ 810.275144] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
>> [ 810.275257] [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
>> [ 810.275373] [<ffffffff813fc362>] submit_bio+0x62/0x140
>> [ 810.275486] [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
>> [ 810.275607] [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
>> [ 810.275721] [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
>> [ 810.275843] [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
>> [ 810.275956] [<ffffffff81132e8c>] generic_file_direct_write+0xac/0x170
>> [ 810.276073] [<ffffffff8113301d>] __generic_file_write_iter+0xcd/0x1f0
>> [ 810.276187] [<ffffffff811ce990>] ? blkdev_close+0x30/0x30
>> [ 810.276332] [<ffffffff811cea17>] blkdev_write_iter+0x87/0x110
>> [ 810.276445] [<ffffffff811de6d0>] aio_run_iocb+0x250/0x2b0
>> [ 810.276560] [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
>> [ 810.276673] [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
>> [ 810.276786] [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
>> [ 810.276902] [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
>> [ 810.277015] [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
>> [ 810.277131] [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
>> [ 810.277244] [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66
>> I dump r1conf in crash:
>> struct r1conf {
>> mddev = 0xffff88022d761800,
>> mirrors = 0xffff88023456a180,
>> raid_disks = 2,
>> next_resync = 18446744073709527039,
>> start_next_window = 18446744073709551615,
>> current_window_requests = 0,
>> next_window_requests = 0,
>> device_lock = {
>> {
>> rlock = {
>> raw_lock = {
>> val = {
>> counter = 0
>> }
>> }
>> }
>> }
>> },
>> retry_list = {
>> next = 0xffff8800b5fe3b40,
>> prev = 0xffff8800b50164c0
>> },
>> bio_end_io_list = {
>> next = 0xffff88022fcd45c0,
>> prev = 0xffff8800b53d57c0
>> },
>> pending_bio_list = {
>> head = 0x0,
>> tail = 0x0
>> },
>> pending_count = 0,
>> wait_barrier = {
>> lock = {
>> {
>> rlock = {
>> raw_lock = {
>> val = {
>> counter = 0
>> }
>> }
>> }
>> }
>> },
>> task_list = {
>> next = 0xffff8800b51d37e0,
>> prev = 0xffff88022fbbb770
>> }
>> },
>> resync_lock = {
>> {
>> rlock = {
>> raw_lock = {
>> val = {
>> counter = 0
>> }
>> }
>> }
>> }
>> },
>> nr_pending = 406,
>> nr_waiting = 100,
>> nr_queued = 404,
>> barrier = 0,
>> array_frozen = 1,
>> fullsync = 0,
>> recovery_disabled = 1,
>> poolinfo = 0xffff88022d829bb0,
>> r1bio_pool = 0xffff88022b4512a0,
>> r1buf_pool = 0x0,
>> tmppage = 0xffffea0008c97b00,
>> thread = 0x0,
>> cluster_sync_low = 0,
>> cluster_sync_high = 0
>> }
>>
>> every time nr_pending is 1 bigger then (nr_queued + 1), so seems we
>> forgot to increase nr_queued somewhere?
>>
>> I've noticed (commit ccfc7bf1f09d61)raid1: include bio_end_io_list in
>> nr_queued to prevent freeze_array hang. Seems it fixed similar bug.
>>
>> Could you give your suggestion?
>>
> Sorry, forgot to mention kernel version is 4.4.28
This commit is Cced to stable@vger.kernel.org for v4.3+, do you use a
stable kernel or a distribution with 4.4.28 kernel ?
Coly
^ permalink raw reply
* Re: [PATCH] md/raid5: limit request size according to implementation limits
From: Coly Li @ 2016-11-28 4:40 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: Shaohua Li, Neil Brown, linux-raid, linux-kernel, stable
In-Reply-To: <148026435214.19980.7956943898609877817.stgit@buzz>
On 2016/11/28 上午12:32, Konstantin Khlebnikov wrote:
> Current implementation employ 16bit counter of active stripes in lower
> bits of bio->bi_phys_segments. If request is big enough to overflow
> this counter bio will be completed and freed too early.
>
> Fortunately this not happens in default configuration because several
> other limits prevent that: stripe_cache_size * nr_disks effectively
> limits count of active stripes. And small max_sectors_kb at lower
> disks prevent that during normal read/write operations.
>
> Overflow easily happens in discard if it's enabled by module parameter
> "devices_handle_discard_safely" and stripe_cache_size is set big enough.
>
> This patch limits requests size with 256Mb - 8Kb to prevent overflows.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Neil Brown <neilb@suse.com>
> Cc: stable@vger.kernel.org
> ---
> drivers/md/raid5.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 92ac251e91e6..cce6057b9aca 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6984,6 +6984,15 @@ static int raid5_run(struct mddev *mddev)
> stripe = (stripe | (stripe-1)) + 1;
> mddev->queue->limits.discard_alignment = stripe;
> mddev->queue->limits.discard_granularity = stripe;
> +
> + /*
> + * We use 16-bit counter of active stripes in bi_phys_segments
> + * (minus one for over-loaded initialization)
> + */
> + blk_queue_max_hw_sectors(mddev->queue, 0xfffe * STRIPE_SECTORS);
> + blk_queue_max_discard_sectors(mddev->queue,
> + 0xfffe * STRIPE_SECTORS);
> +
Could you please to explain why use 0xfffe * STRIPE_SECTORS here ?
Thanks.
Coly
^ permalink raw reply
* Re: raid0 vs. mkfs
From: Chris Murphy @ 2016-11-28 4:11 UTC (permalink / raw)
To: Avi Kivity; +Cc: Linux-RAID
In-Reply-To: <56c83c4e-d451-07e5-88e2-40b085d8681c@scylladb.com>
On Sun, Nov 27, 2016 at 8:24 AM, Avi Kivity <avi@scylladb.com> wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large disk
> that supports TRIM/DISCARD (erase whichever is inappropriate)
Trim is the appropriate term. Term discard refers to a specific mount
time implementation of FITRIM ioctl, and fstrim refers to a user space
tool that does the same and can be scheduled or issued manually.
That is
> because mkfs issues a TRIM/DISCARD (erase whichever is inappropriate) for
> the entire partition. As far as I can tell, md converts the large
> TRIM/DISCARD (erase whichever is inappropriate) into a large number of
> TRIM/DISCARD (erase whichever is inappropriate) requests, one per chunk-size
> worth of disk, and issues them to the RAID components individually.
You could strace the mkfs command. Each filesystem is doing it a
little differently the last time I compared mkfs.xfs and mkfs.btrfs;
but I can't qualify the differences relative to how the device is
going to react to those commands.
It's also possible to enable block device tracing and see the actual
SCSI or ATA commands sent to a drive.
There's a metric f tonne of bugs in this area so before anything I'd
consider researching if there's a firmware update for your hardware
and applying that and retesting. And then also after testing your
ideal deployed version, use something much close to upstream (Arch or
Fedora) and see if the problem is reproducible.
--
Chris Murphy
^ permalink raw reply
* Re: [PATCH v2] Avoid nested function definition
From: Coly Li @ 2016-11-28 3:06 UTC (permalink / raw)
To: Peter Foley, linux-kernel, shli, linux-bcache, linux-raid,
kent.overstreet
In-Reply-To: <20161128025004.18782-1-pefoley2@pefoley.com>
On 2016/11/28 上午10:50, Peter Foley wrote:
> Fixes below error with clang:
> ../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
> { return *((uint16_t *) r) - *((uint16_t *) l); }
> ^
> ../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
> sort(p, n, sizeof(uint16_t), cmp, NULL);
> ^
> 2 errors generated.
>
> v2:
> rename function to __bch_cache_cmp
>
> Signed-off-by: Peter Foley <pefoley2@pefoley.com>
> ---
> drivers/md/bcache/sysfs.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index b3ff57d61dde..8f12089a69e7 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -731,6 +731,11 @@ static struct attribute *bch_cache_set_internal_files[] = {
> };
> KTYPE(bch_cache_set_internal);
>
> +static int __bch_cache_cmp(const void *l, const void *r)
> +{
> + return *((uint16_t *)r) - *((uint16_t *)l);
> +}
> +
> SHOW(__bch_cache)
> {
> struct cache *ca = container_of(kobj, struct cache, kobj);
> @@ -755,9 +760,6 @@ SHOW(__bch_cache)
> CACHE_REPLACEMENT(&ca->sb));
>
> if (attr == &sysfs_priority_stats) {
> - int cmp(const void *l, const void *r)
> - { return *((uint16_t *) r) - *((uint16_t *) l); }
> -
> struct bucket *b;
> size_t n = ca->sb.nbuckets, i;
> size_t unused = 0, available = 0, dirty = 0, meta = 0;
> @@ -786,7 +788,7 @@ SHOW(__bch_cache)
> p[i] = ca->buckets[i].prio;
> mutex_unlock(&ca->set->bucket_lock);
>
> - sort(p, n, sizeof(uint16_t), cmp, NULL);
> + sort(p, n, sizeof(uint16_t), __bch_cache_cmp, NULL);
>
> while (n &&
> !cached[n - 1])
>
Acked-by: Coly Li <colyli@suse.de>
--
Coly Li
^ permalink raw reply
* [PATCH v2] Avoid nested function definition
From: Peter Foley @ 2016-11-28 2:50 UTC (permalink / raw)
To: linux-kernel, shli, linux-bcache, linux-raid, i, kent.overstreet
Cc: Peter Foley
In-Reply-To: <9a8a84ed-ecf7-b1dd-59e8-1a36dbcef088@coly.li>
Fixes below error with clang:
../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
{ return *((uint16_t *) r) - *((uint16_t *) l); }
^
../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
sort(p, n, sizeof(uint16_t), cmp, NULL);
^
2 errors generated.
v2:
rename function to __bch_cache_cmp
Signed-off-by: Peter Foley <pefoley2@pefoley.com>
---
drivers/md/bcache/sysfs.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index b3ff57d61dde..8f12089a69e7 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -731,6 +731,11 @@ static struct attribute *bch_cache_set_internal_files[] = {
};
KTYPE(bch_cache_set_internal);
+static int __bch_cache_cmp(const void *l, const void *r)
+{
+ return *((uint16_t *)r) - *((uint16_t *)l);
+}
+
SHOW(__bch_cache)
{
struct cache *ca = container_of(kobj, struct cache, kobj);
@@ -755,9 +760,6 @@ SHOW(__bch_cache)
CACHE_REPLACEMENT(&ca->sb));
if (attr == &sysfs_priority_stats) {
- int cmp(const void *l, const void *r)
- { return *((uint16_t *) r) - *((uint16_t *) l); }
-
struct bucket *b;
size_t n = ca->sb.nbuckets, i;
size_t unused = 0, available = 0, dirty = 0, meta = 0;
@@ -786,7 +788,7 @@ SHOW(__bch_cache)
p[i] = ca->buckets[i].prio;
mutex_unlock(&ca->set->bucket_lock);
- sort(p, n, sizeof(uint16_t), cmp, NULL);
+ sort(p, n, sizeof(uint16_t), __bch_cache_cmp, NULL);
while (n &&
!cached[n - 1])
--
2.11.0.rc2
^ permalink raw reply related
* Re: MD Remnants After --stop
From: Marc Smith @ 2016-11-28 2:25 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
In-Reply-To: <8760n8a7k5.fsf@notabene.neil.brown.name>
On Sun, Nov 27, 2016 at 5:20 PM, NeilBrown <neilb@suse.com> wrote:
> On Sun, Nov 27 2016, Marc Smith wrote:
>
>> So, I modified mdopen.c to look like this:
>>
>> --- a/mdopen.c 2016-11-25 17:04:25.782299330 -0500
>> +++ b/mdopen.c 2016-11-26 10:57:35.883621355 -0500
>> @@ -416,7 +416,7 @@
>> */
>> int open_mddev(char *dev, int report_errors)
>> {
>> - int mdfd = open(dev, O_RDWR);
>> + int mdfd = open(dev, O_RDONLY);
>> if (mdfd < 0 && errno == EACCES)
>> mdfd = open(dev, O_RDONLY);
>> if (mdfd < 0) {
>>
>>
>> And now, when running 'mdadm --stop' here is what I see...
>>
>> From the output 'udevadm monitor -pku':
>>
>> --snip--
>> KERNEL[297486.536908] offline
>> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
>> ACTION=offline
>> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> SEQNUM=3651
>> SUBSYSTEM=dlm
>>
>> UDEV [297486.537541] offline
>> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
>> ACTION=offline
>> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> SEQNUM=3651
>> SUBSYSTEM=dlm
>> USEC_INITIALIZED=7486537404
>>
>> KERNEL[297486.538325] remove
>> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
>> ACTION=remove
>> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> SEQNUM=3652
>> SUBSYSTEM=dlm
>>
>> UDEV [297486.538644] remove
>> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
>> ACTION=remove
>> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
>> SEQNUM=3652
>> SUBSYSTEM=dlm
>> USEC_INITIALIZED=86538345
>> --snip--
>>
>> And from the kernel log:
>>
>> --snip--
>> [297504.958244] md127: detected capacity change from 73340747776 to 0
>> [297504.958249] md: md127 stopped.
>> [297504.958884] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: leaving the
>> lockspace group...
>> [297504.959004] udevd[487]: seq 3651 queued, 'offline' 'dlm'
>> [297504.959161] udevd[487]: seq 3651 forked new worker [5392]
>> [297504.959417] udevd[5392]: seq 3651 running
>> [297504.959474] udevd[5392]: no db file to read
>> /run/udev/data/+dlm:62fccfd6-605f-19e6-be6d-99a1e3cb987e: No such file
>> or directory
>> [297504.959524] udevd[5392]: passed device to netlink monitor 0x2251c30
>> [297504.959527] udevd[5392]: seq 3651 processed
>> [297504.960101] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: group event done 0 0
>> [297504.960299] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e:
>> release_lockspace final free
>> [297504.960329] md: unbind<dm-0>
>> [297504.960448] udevd[487]: seq 3652 queued, 'remove' 'dlm'
>> [297504.960500] udevd[487]: passed 214 byte device to netlink monitor 0x224b130
>> [297504.960584] udevd[5392]: seq 3652 running
>> [297504.960606] udevd[5392]: no db file to read
>> /run/udev/data/+dlm:62fccfd6-605f-19e6-be6d-99a1e3cb987e: No such file
>> or directory
>> [297504.967168] md: export_rdev(dm-0)
>> [297504.967231] md: unbind<dm-1>
>> [297504.975176] md: export_rdev(dm-1)
>> --snip--
>>
>> So that did get rid of the synthesized CHANGE event, but still no
>> REMOVE event. =)
>>
>> Still trying to rule-out there isn't anything strange with my Linux
>> distro / setup... but I assume even if udev was mishandling something,
>> we should still be seeing a REMOVE event on '--stop'.
>>
>
> Getting rid of the CHANGE events is good as they can muddy the waters a
> bit.
>
> The lack of a REMOVE event suggests that something it still holding a
> reference to the 'md' kobject. There aren't many references though.
> One is held when the array is active. That is dropped by
> mddev_delayed_delete() which is called from a work-queue. So it can be
> delayed a little bit, but not much.
> The other references are held by member devices, and dropped when the
> devices are unbound from the array - and the kernel log shows the
> "unbind" messages.
>
> In either case, if there is a reference outstanding, there should be
> some evidence in the /sys/mdXX/md directory. Either a 'dev-XX'
> directory or the various bitmap attributes..
>
> Can you report the result of
>
> find /sys/block/md127/md
> and
> find /sys/block/md127/md -type file | xargs grep .
>
> after stopping the array?
# find /sys/block/md127/md
/sys/block/md127/md
/sys/block/md127/md/reshape_position
/sys/block/md127/md/layout
/sys/block/md127/md/raid_disks
/sys/block/md127/md/bitmap
/sys/block/md127/md/bitmap/chunksize
/sys/block/md127/md/bitmap/space
/sys/block/md127/md/bitmap/can_clear
/sys/block/md127/md/bitmap/time_base
/sys/block/md127/md/bitmap/metadata
/sys/block/md127/md/bitmap/backlog
/sys/block/md127/md/bitmap/location
/sys/block/md127/md/bitmap/max_backlog_used
/sys/block/md127/md/array_size
/sys/block/md127/md/max_read_errors
/sys/block/md127/md/metadata_version
/sys/block/md127/md/component_size
/sys/block/md127/md/reshape_direction
/sys/block/md127/md/resync_start
/sys/block/md127/md/new_dev
/sys/block/md127/md/safe_mode_delay
/sys/block/md127/md/level
/sys/block/md127/md/chunk_size
/sys/block/md127/md/array_state
# find /sys/block/md127/md -type f | xargs grep .
/sys/block/md127/md/reshape_position:none
/sys/block/md127/md/layout:0
/sys/block/md127/md/bitmap/chunksize:0
/sys/block/md127/md/bitmap/space:0
/sys/block/md127/md/bitmap/time_base:0
/sys/block/md127/md/bitmap/metadata:internal
/sys/block/md127/md/bitmap/backlog:0
/sys/block/md127/md/bitmap/location:none
/sys/block/md127/md/bitmap/max_backlog_used:0
/sys/block/md127/md/array_size:default
/sys/block/md127/md/max_read_errors:20
/sys/block/md127/md/metadata_version:none
/sys/block/md127/md/component_size:0
/sys/block/md127/md/reshape_direction:forwards
/sys/block/md127/md/resync_start:0
grep: /sys/block/md127/md/new_dev: Permission denied
/sys/block/md127/md/safe_mode_delay:0.000
/sys/block/md127/md/chunk_size:0
/sys/block/md127/md/array_state:clear
--Marc
>
> Thanks,
> NeilBrown
^ permalink raw reply
* Re: [PATCH 2/2] raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache
From: Song Liu @ 2016-11-27 23:49 UTC (permalink / raw)
To: Zhengyuan Liu; +Cc: Shaohua Li, NeilBrown, linux-raid, liuyun01@kylinos.cn
In-Reply-To: <1480129034-14700-2-git-send-email-liuzhengyuan@kylinos.cn>
> On Nov 25, 2016, at 6:57 PM, Zhengyuan Liu <liuzhengyuan@kylinos.cn> wrote:
>
> r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as
> the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine
> would release the stripe later and add it into neither r5c_cached_full_stripes
> list or r5c_cached_partial_stripes list and set correct flag. Also we should
> fix the counter corresponding.
>
> Reviewed-by: JackieLiu <liuyun01@kylinos.cn>
> Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
> ---
> drivers/md/raid5-cache.c | 3 +--
> drivers/md/raid5.c | 11 +++++++++++
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 9911164..e0ac758 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1930,9 +1930,8 @@ static void r5c_recovery_load_one_stripe(struct r5l_log *log,
> set_bit(R5_UPTODATE, &dev->flags);
> }
> }
> - set_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state);
> - atomic_inc(&conf->r5c_cached_partial_stripes);
> list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
> + atomic_inc(&log->stripe_in_journal_count);
> }
This makes sense to me. Thanks for catching it.
> /*
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index dbab8c7..8120ce4 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -677,12 +677,23 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> atomic_inc(&conf->active_stripes);
> BUG_ON(list_empty(&sh->lru) &&
> !test_bit(STRIPE_EXPANDING, &sh->state));
> +
> inc_empty_inactive_list_flag = 0;
> if (!list_empty(conf->inactive_list + hash))
> inc_empty_inactive_list_flag = 1;
> list_del_init(&sh->lru);
> if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
> atomic_inc(&conf->empty_inactive_list_nr);
> +
> + if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
> + WARN_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
> + atomic_dec(&conf->r5c_cached_partial_stripes);
> + }
> + if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
> + WARN_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
> + atomic_dec(&conf->r5c_cached_full_stripes);
> + }
> +
I don't think we need these, as we decrease these counters in r5c_make_stripe_write_out().
Thanks,
Song
^ permalink raw reply
* Re: [PATCH] raid5-cache: Fix the logic of raid5-cache recovery
From: Song Liu @ 2016-11-27 23:36 UTC (permalink / raw)
To: Jackie Liu; +Cc: linux-raid, 刘正元, Shaohua Li
In-Reply-To: <tencent_4E54374D19A73C0B4D37AF14@qq.com>
Hi Jackie,
This patch has a lot of good fixes. Thanks for reviewing the code and
proposing the fixes.
Could you explain a little more why we need write an empty block before
calling r5c_recovery_rewrite_data_only_stripes()?
Thanks,
Song
> On Nov 25, 2016, at 3:39 AM, Jackie Liu <liuyun01@kylinos.cn> wrote:
>
> Hi Song.
>
> There is a doubt for r5l_recovery_log. I think we need write an empty block first,
> then call r5c_recovery_rewrite_data_only_stripes functions. this empty
> block will be mark as the last_checkpoint. when the CACHING block is rewritten,
> the superblock should be update this time. at the same time, we cann't be
> released the stripe_head at the front, it also be used in
> r5c_recovery_rewrite_data_only_stripes.
>
> here is the patch
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 5f817bd..fad1808 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -67,7 +67,7 @@ static char *r5c_journal_mode_str[] = {"write-through",
> /*
> * raid5 cache state machine
> *
> - * With rhe RAID cache, each stripe works in two phases:
> + * With the RAID cache, each stripe works in two phases:
> * - caching phase
> * - writing-out phase
> *
> @@ -1674,7 +1674,6 @@ r5l_recovery_replay_one_stripe(struct r5conf *conf,
>
> static struct stripe_head *
> r5c_recovery_alloc_stripe(struct r5conf *conf,
> - struct list_head *recovery_list,
> sector_t stripe_sect,
> sector_t log_start)
> {
> @@ -1855,8 +1854,8 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
> stripe_sect);
>
> if (!sh) {
> - sh = r5c_recovery_alloc_stripe(conf, cached_stripe_list,
> - stripe_sect, ctx->pos);
> + sh = r5c_recovery_alloc_stripe(conf, stripe_sect,
> + ctx->pos);
> /*
> * cannot get stripe from raid5_get_active_stripe
> * try replay some stripes
> @@ -1865,8 +1864,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
> r5c_recovery_replay_stripes(
> cached_stripe_list, ctx);
> sh = r5c_recovery_alloc_stripe(
> - conf, cached_stripe_list,
> - stripe_sect, ctx->pos);
> + conf, stripe_sect, ctx->pos);
> }
> if (!sh) {
> pr_debug("md/raid:%s: Increasing stripe cache size to %d to recovery data on journal.\n",
> @@ -1875,8 +1873,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
> raid5_set_cache_size(mddev,
> conf->min_nr_stripes * 2);
> sh = r5c_recovery_alloc_stripe(
> - conf, cached_stripe_list, stripe_sect,
> - ctx->pos);
> + conf, stripe_sect, ctx->pos);
> }
> if (!sh) {
> pr_err("md/raid:%s: Cannot get enough stripes due to memory pressure. Recovery failed.\n",
> @@ -1986,8 +1983,6 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
> list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
> WARN_ON(!test_bit(STRIPE_R5C_CACHING, &sh->state));
> r5c_recovery_load_one_stripe(log, sh);
> - list_del_init(&sh->lru);
> - raid5_release_stripe(sh);
> ctx->data_only_stripes++;
> }
>
> @@ -2078,7 +2073,6 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
> return -ENOMEM;
> }
>
> - ctx->seq += 10;
> list_for_each_entry(sh, &ctx->cached_list, lru) {
> struct r5l_meta_block *mb;
> int i;
> @@ -2090,7 +2084,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
> ctx->pos, ctx->seq);
> mb = page_address(page);
> offset = le32_to_cpu(mb->meta_size);
> - write_pos = ctx->pos + BLOCK_SECTORS;
> + write_pos = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
>
> for (i = sh->disks; i--; ) {
> struct r5dev *dev = &sh->dev[i];
> @@ -2125,6 +2119,9 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
> sh->log_start = ctx->pos;
> ctx->pos = write_pos;
> ctx->seq += 1;
> +
> + list_del_init(&sh->lru);
> + raid5_release_stripe(sh);
> }
> __free_page(page);
> return 0;
> @@ -2135,6 +2132,7 @@ static int r5l_recovery_log(struct r5l_log *log)
> struct mddev *mddev = log->rdev->mddev;
> struct r5l_recovery_ctx ctx;
> int ret;
> + sector_t pos;
>
> ctx.pos = log->last_checkpoint;
> ctx.seq = log->last_cp_seq;
> @@ -2152,6 +2150,10 @@ static int r5l_recovery_log(struct r5l_log *log)
> if (ret)
> return ret;
>
> + pos = ctx.pos;
> + r5l_log_write_empty_meta_block(log, ctx.pos, (ctx.seq += 10));
> + ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
> +
> if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
> pr_debug("md/raid:%s: starting from clean shutdown\n",
> mdname(mddev));
> @@ -2170,9 +2172,9 @@ static int r5l_recovery_log(struct r5l_log *log)
>
> log->log_start = ctx.pos;
> log->next_checkpoint = ctx.pos;
> + log->last_checkpoint = pos;
> log->seq = ctx.seq;
> - r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq);
> - r5l_write_super(log, ctx.pos);
> + r5l_write_super(log, pos);
> return 0;
> }
>
> --
> 2.7.4
>
> Thanks.
> Jackie
^ permalink raw reply
* Re: MD Remnants After --stop
From: NeilBrown @ 2016-11-27 22:20 UTC (permalink / raw)
To: Marc Smith; +Cc: linux-raid
In-Reply-To: <CAHkw+LfJzZoWVnO8kDi+YVV09fas2DveQoChh10wEBZfYin6aQ@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 4146 bytes --]
On Sun, Nov 27 2016, Marc Smith wrote:
> So, I modified mdopen.c to look like this:
>
> --- a/mdopen.c 2016-11-25 17:04:25.782299330 -0500
> +++ b/mdopen.c 2016-11-26 10:57:35.883621355 -0500
> @@ -416,7 +416,7 @@
> */
> int open_mddev(char *dev, int report_errors)
> {
> - int mdfd = open(dev, O_RDWR);
> + int mdfd = open(dev, O_RDONLY);
> if (mdfd < 0 && errno == EACCES)
> mdfd = open(dev, O_RDONLY);
> if (mdfd < 0) {
>
>
> And now, when running 'mdadm --stop' here is what I see...
>
> From the output 'udevadm monitor -pku':
>
> --snip--
> KERNEL[297486.536908] offline
> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
> ACTION=offline
> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
> SEQNUM=3651
> SUBSYSTEM=dlm
>
> UDEV [297486.537541] offline
> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
> ACTION=offline
> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
> SEQNUM=3651
> SUBSYSTEM=dlm
> USEC_INITIALIZED=7486537404
>
> KERNEL[297486.538325] remove
> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
> ACTION=remove
> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
> SEQNUM=3652
> SUBSYSTEM=dlm
>
> UDEV [297486.538644] remove
> /kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e (dlm)
> ACTION=remove
> DEVPATH=/kernel/dlm/62fccfd6-605f-19e6-be6d-99a1e3cb987e
> LOCKSPACE=62fccfd6-605f-19e6-be6d-99a1e3cb987e
> SEQNUM=3652
> SUBSYSTEM=dlm
> USEC_INITIALIZED=86538345
> --snip--
>
> And from the kernel log:
>
> --snip--
> [297504.958244] md127: detected capacity change from 73340747776 to 0
> [297504.958249] md: md127 stopped.
> [297504.958884] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: leaving the
> lockspace group...
> [297504.959004] udevd[487]: seq 3651 queued, 'offline' 'dlm'
> [297504.959161] udevd[487]: seq 3651 forked new worker [5392]
> [297504.959417] udevd[5392]: seq 3651 running
> [297504.959474] udevd[5392]: no db file to read
> /run/udev/data/+dlm:62fccfd6-605f-19e6-be6d-99a1e3cb987e: No such file
> or directory
> [297504.959524] udevd[5392]: passed device to netlink monitor 0x2251c30
> [297504.959527] udevd[5392]: seq 3651 processed
> [297504.960101] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e: group event done 0 0
> [297504.960299] dlm: 62fccfd6-605f-19e6-be6d-99a1e3cb987e:
> release_lockspace final free
> [297504.960329] md: unbind<dm-0>
> [297504.960448] udevd[487]: seq 3652 queued, 'remove' 'dlm'
> [297504.960500] udevd[487]: passed 214 byte device to netlink monitor 0x224b130
> [297504.960584] udevd[5392]: seq 3652 running
> [297504.960606] udevd[5392]: no db file to read
> /run/udev/data/+dlm:62fccfd6-605f-19e6-be6d-99a1e3cb987e: No such file
> or directory
> [297504.967168] md: export_rdev(dm-0)
> [297504.967231] md: unbind<dm-1>
> [297504.975176] md: export_rdev(dm-1)
> --snip--
>
> So that did get rid of the synthesized CHANGE event, but still no
> REMOVE event. =)
>
> Still trying to rule-out there isn't anything strange with my Linux
> distro / setup... but I assume even if udev was mishandling something,
> we should still be seeing a REMOVE event on '--stop'.
>
Getting rid of the CHANGE events is good as they can muddy the waters a
bit.
The lack of a REMOVE event suggests that something it still holding a
reference to the 'md' kobject. There aren't many references though.
One is held when the array is active. That is dropped by
mddev_delayed_delete() which is called from a work-queue. So it can be
delayed a little bit, but not much.
The other references are held by member devices, and dropped when the
devices are unbound from the array - and the kernel log shows the
"unbind" messages.
In either case, if there is a reference outstanding, there should be
some evidence in the /sys/mdXX/md directory. Either a 'dev-XX'
directory or the various bitmap attributes..
Can you report the result of
find /sys/block/md127/md
and
find /sys/block/md127/md -type file | xargs grep .
after stopping the array?
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: raid0 vs. mkfs
From: Doug Dumitru @ 2016-11-27 19:25 UTC (permalink / raw)
To: Avi Kivity; +Cc: Coly Li, linux-raid
In-Reply-To: <14c4b1d4-2fd3-b97f-934e-414a8d45fb18@scylladb.com>
I recently ran into this issue with a proprietary device mapper target
that supports discard.
mkfs.ext4 looks like it issues 2GB discard requests. blkdiscard looks
like it issues 4GB-4K discard requests. Both of these are "way
bigger" than the iolimits for transfers.
At least this is what I see at my device mapper layer. raid0 might
get some additional filtering by the common raid code.
In the case of my mapper, I actually need to split the bio up and
re-issue the discards at iolimits sizes (this is how my device mapper
expects requests). Fortunately, my mapper is really fast at discards
even at 1MB each (> 8GB/sec on a single thread), so the performance
issue is not that bad.
It would be an easy patch for raid0 to be "smarter" at splitting the
discard request. It might not actually help that much. You should
test your nVME disk to see if the performance of discards is much
different between "chunk size" requests and "big requests". Using
blkdiscard in a script, fill a drive with real data and test discard
speed first using 256K calls to blkdiscard, and then again using 512MB
calls to blkdiscard. Do this to a single drive. I suspect that the
times will not be that far off.
Some drives take a real amount of time to process discards. Even
though it seems like the operation does nothing, the FTL inside of the
SSD is still is getting hammered pretty hard.
If your drives are a "lot" faster with bigger discard requests, then
maybe it would make sense to optimize raid0. I suspect the win is not
that big.
In terms of enlarging IO, the iolimits and buffering start to come
into play. With a discard, the bio only has a size and does not have
any actual buffers. If you push IO really big, then the size of the
bio starts to grow. 1MB is 256 4K biovecs. A bio_vec is a pointer
plus two ints, so it is 16 bytes long (on x86_64). 256 of these just
happen to fit into a single page. This is a linear array, so making
this bigger is hard. Remember that much of the kernel lives inside of
pages and pages (usually 4K) are somewhat of a deity over the entire
kernel.
Then again, you have another option to format your array that will be
very fast and even more effective:
a) secure erase the drives
b) create your raid0 array
c) create your file system with -o nodiscard
Doug Dumitru
On Sun, Nov 27, 2016 at 9:25 AM, Avi Kivity <avi@scylladb.com> wrote:
> On 11/27/2016 07:09 PM, Coly Li wrote:
>>
>> On 2016/11/27 下午11:24, Avi Kivity wrote:
>>>
>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>>> inappropriate) for the entire partition. As far as I can tell, md
>>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>>> requests, one per chunk-size worth of disk, and issues them to the RAID
>>> components individually.
>>>
>>>
>>> It seems to me that md can convert the large TRIM/DISCARD (erase
>>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>>> whichever is inappropriate) per RAID component, converting an O(disk
>>> size / chunk size) operation into an O(number of RAID components)
>>> operation, which is much faster.
>>>
>>>
>>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>>> the operation taking about a quarter of an hour, continuously pushing
>>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>>> to the disk. Linux 4.1.12.
>>
>> It might be possible to improve a bit for DISCARD performance, by your
>> suggestion. The implementation might be tricky, but it is worthy to try.
>>
>> Indeed, it is not only for DISCARD, for read or write, it might be
>> helpful for better performance as well. We can check the bio size, if,
>> bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD
>> it means on each underlying device, we have more then SOMETHRESHOLD
>> continuous chunks to issue, and they can be merged into a larger bio.
>
>
> It's true that this does not strictly apply to TRIM/DISCARD (erase whichever
> is inappropriate), but to see any gain for READ/WRITE, you need a request
> that is larger than (chunk size) * (raid elements), which is unlikely for
> reasonable values of those parameters. But a common implementation can of
> course work for multiple request types.
>
>> IMHO it's interesting, good suggestion!
>
>
> Looking forward to seeing an implementation!
>
>
>>
>> Coly
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Doug Dumitru
EasyCo LLC
^ permalink raw reply
* Re: raid0 vs. mkfs
From: Avi Kivity @ 2016-11-27 17:25 UTC (permalink / raw)
To: Coly Li; +Cc: linux-raid
In-Reply-To: <470ba5d0-e54d-3e5e-c639-4591549b9574@suse.de>
On 11/27/2016 07:09 PM, Coly Li wrote:
> On 2016/11/27 下午11:24, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>> inappropriate) for the entire partition. As far as I can tell, md
>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>> requests, one per chunk-size worth of disk, and issues them to the RAID
>> components individually.
>>
>>
>> It seems to me that md can convert the large TRIM/DISCARD (erase
>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>> whichever is inappropriate) per RAID component, converting an O(disk
>> size / chunk size) operation into an O(number of RAID components)
>> operation, which is much faster.
>>
>>
>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>> the operation taking about a quarter of an hour, continuously pushing
>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>> to the disk. Linux 4.1.12.
> It might be possible to improve a bit for DISCARD performance, by your
> suggestion. The implementation might be tricky, but it is worthy to try.
>
> Indeed, it is not only for DISCARD, for read or write, it might be
> helpful for better performance as well. We can check the bio size, if,
> bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD
> it means on each underlying device, we have more then SOMETHRESHOLD
> continuous chunks to issue, and they can be merged into a larger bio.
It's true that this does not strictly apply to TRIM/DISCARD (erase
whichever is inappropriate), but to see any gain for READ/WRITE, you
need a request that is larger than (chunk size) * (raid elements), which
is unlikely for reasonable values of those parameters. But a common
implementation can of course work for multiple request types.
> IMHO it's interesting, good suggestion!
Looking forward to seeing an implementation!
>
> Coly
>
^ permalink raw reply
* Re: raid0 vs. mkfs
From: Coly Li @ 2016-11-27 17:09 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-raid
In-Reply-To: <56c83c4e-d451-07e5-88e2-40b085d8681c@scylladb.com>
On 2016/11/27 下午11:24, Avi Kivity wrote:
> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
> That is because mkfs issues a TRIM/DISCARD (erase whichever is
> inappropriate) for the entire partition. As far as I can tell, md
> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
> a large number of TRIM/DISCARD (erase whichever is inappropriate)
> requests, one per chunk-size worth of disk, and issues them to the RAID
> components individually.
>
>
> It seems to me that md can convert the large TRIM/DISCARD (erase
> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
> whichever is inappropriate) per RAID component, converting an O(disk
> size / chunk size) operation into an O(number of RAID components)
> operation, which is much faster.
>
>
> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
> the operation taking about a quarter of an hour, continuously pushing
> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
> to the disk. Linux 4.1.12.
It might be possible to improve a bit for DISCARD performance, by your
suggestion. The implementation might be tricky, but it is worthy to try.
Indeed, it is not only for DISCARD, for read or write, it might be
helpful for better performance as well. We can check the bio size, if,
bio_sectors(bio)/conf->nr_strip_zones >= SOMETHRESHOLD
it means on each underlying device, we have more then SOMETHRESHOLD
continuous chunks to issue, and they can be merged into a larger bio.
IMHO it's interesting, good suggestion!
Coly
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox