Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04  8:13 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <877f31kwti.fsf@notabene.neil.brown.name>

Hi, Neil

On 04/03/2017 11:25 PM, NeilBrown wrote:
> On Mon, Apr 03 2017, Michael Wang wrote:
> 
>> blk_attempt_plug_merge() try to merge bio into request and chain them
>> by 'bi_next', while after the bio is done inside request, we forgot to
>> reset the 'bi_next'.
>>
>> This lead into BUG while removing all the underlying devices from md-raid1,
>> the bio once go through:
>>
>>   md_do_sync()
>>     sync_request()
>>       generic_make_request()
> 
> This is a read request from the "first" device.
> 
>>         blk_queue_bio()
>>           blk_attempt_plug_merge()
>>             CHAINED HERE
>>
>> will keep chained and reused by:
>>
>>   raid1d()
>>     sync_request_write()
>>       generic_make_request()
> 
> This is a write request to some other device, isn't it?
> 
> If sync_request_write() is using a bio that has already been used, it
> should call bio_reset() and fill in the details again.
> However I don't see how that would happen.
> Can you give specific details on the situation that triggers the bug?

We have storage side mapping lv through scst to server, on server side
we assemble them into multipath device, and then assemble these dm into
two raid1.

The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
side we unmap all the lv (could during mkfs or fio), then on server side
we hit the BUG (reproducible).

The path of bio was confirmed by add tracing, it is reused in sync_request_write()
with 'bi_next' once chained inside blk_attempt_plug_merge().

We also tried to reset the bi_next inside sync_request_write() before
generic_make_request() which also works.

The testing was done with 4.4, but we found upstream also left bi_next
chained after done in request, thus we post this RFC.

Regarding raid1, we haven't found the place on path where the bio was
reset... where does it supposed to be?

BTW the fix_sync_read_error() also invoked and succeed before trigger
the BUG.

Regards,
Michael Wang

> 
> Thanks,
> NeilBrown
> 
> 
>>         BUG_ON(bio->bi_next)
>>
>> After reset the 'bi_next' this can no longer happen.
>>
>> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
>> ---
>>  block/blk-core.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 43b7d06..91223b2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>>                 struct bio *bio = req->bio;
>>                 unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>>
>> -               if (bio_bytes == bio->bi_iter.bi_size)
>> +               if (bio_bytes == bio->bi_iter.bi_size) {
>>                         req->bio = bio->bi_next;
>> +                       bio->bi_next = NULL;
>> +               }
>>
>>                 req_bio_endio(req, bio, bio_bytes, error);
>>
>> -- 
>> 2.5.0

^ permalink raw reply

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: NeilBrown @ 2017-04-04  9:37 UTC (permalink / raw)
  To: Michael Wang, linux-kernel@vger.kernel.org, linux-block,
	linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <9be3ca00-d802-bf64-bcdc-1e76608147f0@profitbricks.com>

[-- Attachment #1: Type: text/plain, Size: 3240 bytes --]

On Tue, Apr 04 2017, Michael Wang wrote:

> Hi, Neil
>
> On 04/03/2017 11:25 PM, NeilBrown wrote:
>> On Mon, Apr 03 2017, Michael Wang wrote:
>> 
>>> blk_attempt_plug_merge() try to merge bio into request and chain them
>>> by 'bi_next', while after the bio is done inside request, we forgot to
>>> reset the 'bi_next'.
>>>
>>> This lead into BUG while removing all the underlying devices from md-raid1,
>>> the bio once go through:
>>>
>>>   md_do_sync()
>>>     sync_request()
>>>       generic_make_request()
>> 
>> This is a read request from the "first" device.
>> 
>>>         blk_queue_bio()
>>>           blk_attempt_plug_merge()
>>>             CHAINED HERE
>>>
>>> will keep chained and reused by:
>>>
>>>   raid1d()
>>>     sync_request_write()
>>>       generic_make_request()
>> 
>> This is a write request to some other device, isn't it?
>> 
>> If sync_request_write() is using a bio that has already been used, it
>> should call bio_reset() and fill in the details again.
>> However I don't see how that would happen.
>> Can you give specific details on the situation that triggers the bug?
>
> We have storage side mapping lv through scst to server, on server side
> we assemble them into multipath device, and then assemble these dm into
> two raid1.
>
> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
> side we unmap all the lv (could during mkfs or fio), then on server side
> we hit the BUG (reproducible).

So I assume the initial resync is still happening at this point?
And you unmap *all* the lv's so you expect IO to fail?
I can see that the code would behave strangely if you have a
bad-block-list configured (which is the default).
Do you have a bbl?  If you create the array without the bbl, does it
still crash?

>
> The path of bio was confirmed by add tracing, it is reused in sync_request_write()
> with 'bi_next' once chained inside blk_attempt_plug_merge().

I still don't see why it is re-used.
I assume you didn't explicitly ask for a check/repair (i.e. didn't write
to .../md/sync_action at all?).  In that case MD_RECOVERY_REQUESTED is
not set.
So sync_request() sends only one bio to generic_make_request():
   r1_bio->bios[r1_bio->read_disk];

then sync_request_write() *doesn't* send that bio again, but does send
all the others.

So where does it reuse a bio?

>
> We also tried to reset the bi_next inside sync_request_write() before
> generic_make_request() which also works.
>
> The testing was done with 4.4, but we found upstream also left bi_next
> chained after done in request, thus we post this RFC.
>
> Regarding raid1, we haven't found the place on path where the bio was
> reset... where does it supposed to be?

I'm not sure what you mean.
We only reset bios when they are being reused.
One place is in process_checks() where bio_reset() is called before
filling in all the details.


Maybe, in sync_request_write(), before

	wbio->bi_rw = WRITE;

add something like
  if (wbio->bi_next)
     printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
          i, r1_bio->read_disk, wbio->bi_end_io);

that might help narrow down what is happening.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH] raid6/altivec: adding vpermxor implementation for raid6 Q syndrome
From: Michael Ellerman @ 2017-04-04 10:04 UTC (permalink / raw)
  To: Daniel Axtens, Matt Brown, linux-raid; +Cc: linuxppc-dev
In-Reply-To: <87wpb1kvy7.fsf@possimpible.ozlabs.ibm.com>

Daniel Axtens <dja@axtens.net> writes:

>> In that function, the flow is:
>>  pagefault_disable();
>>  enable_kernel_altivec();
>>  <vectorised function>
>>  pagefault_enable();
>>
>> There are a few things that it would be nice (but by no means essential)
>> to find out:
>>  - what is the difference between pagefault and prempt enable/disable
>>  - is it required to disable altivec after the end of the function or
>>    can we leave that enabled?
>
> Answering my own question here, dc4fbba11e46 ("powerpc: Create
> disable_kernel_{fp,altivec,vsx,spe}()") adds the disable_ function, and
> it's a no-op except under debug conditions. So it should stay.

Yeah enabling altivec for use in the kernel requires saving the
userspace altivec state first (so we don't clobber it).

But once we've enabled it in the kernel, we can just leave it enabled
until we return to userspace and save the cost of disabling it. There's
a small risk that leaving altivec enabled allows some other kernel code
to use altivec when it shouldn't, hence the debug option to make
disable_kernel_altivec() actually disable it.

cheers

^ permalink raw reply

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04 10:23 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <871st8jyya.fsf@notabene.neil.brown.name>


On 04/04/2017 11:37 AM, NeilBrown wrote:
> On Tue, Apr 04 2017, Michael Wang wrote:
[snip]
>>>
>>> If sync_request_write() is using a bio that has already been used, it
>>> should call bio_reset() and fill in the details again.
>>> However I don't see how that would happen.
>>> Can you give specific details on the situation that triggers the bug?
>>
>> We have storage side mapping lv through scst to server, on server side
>> we assemble them into multipath device, and then assemble these dm into
>> two raid1.
>>
>> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
>> side we unmap all the lv (could during mkfs or fio), then on server side
>> we hit the BUG (reproducible).
> 
> So I assume the initial resync is still happening at this point?
> And you unmap *all* the lv's so you expect IO to fail?
> I can see that the code would behave strangely if you have a
> bad-block-list configured (which is the default).
> Do you have a bbl?  If you create the array without the bbl, does it
> still crash?

The resync is at least happen concurrently in this case, we try
to simulate the case that all the connections dropped, the IO do
failed, also bunch of kernel log like:

md: super_written gets error=-5
blk_update_request: I/O error, dev dm-3, sector 64184
md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848

we expect that to happen, but server should not crash on BUG.

And we haven't done any thing special regarding bbl, the bad_blocks
in sysfs are all empty.

> 
>>
>> The path of bio was confirmed by add tracing, it is reused in sync_request_write()
>> with 'bi_next' once chained inside blk_attempt_plug_merge().
> 
> I still don't see why it is re-used.
> I assume you didn't explicitly ask for a check/repair (i.e. didn't write
> to .../md/sync_action at all?).  In that case MD_RECOVERY_REQUESTED is
> not set.

Just unmap lv on storage side, no operation on server side.

> So sync_request() sends only one bio to generic_make_request():
>    r1_bio->bios[r1_bio->read_disk];
> 
> then sync_request_write() *doesn't* send that bio again, but does send
> all the others.
> 
> So where does it reuse a bio?

If that's the design then it would be strange... the log do showing the path
of that bio go through sync_request(), will do more investigation.

> 
>>
>> We also tried to reset the bi_next inside sync_request_write() before
>> generic_make_request() which also works.
>>
>> The testing was done with 4.4, but we found upstream also left bi_next
>> chained after done in request, thus we post this RFC.
>>
>> Regarding raid1, we haven't found the place on path where the bio was
>> reset... where does it supposed to be?
> 
> I'm not sure what you mean.
> We only reset bios when they are being reused.
> One place is in process_checks() where bio_reset() is called before
> filling in all the details.
> 
> 
> Maybe, in sync_request_write(), before
> 
> 	wbio->bi_rw = WRITE;
> 
> add something like
>   if (wbio->bi_next)
>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>           i, r1_bio->read_disk, wbio->bi_end_io);
> 
> that might help narrow down what is happening.

Just triggered again in 4.4, dmesg like:

[  399.240230] md: super_written gets error=-5
[  399.240286] md: super_written gets error=-5
[  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]
[  399.240363] ------------[ cut here ]------------
[  399.240364] kernel BUG at block/blk-core.c:2147!
[  399.240365] invalid opcode: 0000 [#1] SMP 
[  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: scsi_transport_srp]
[  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ #26
[  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 1.3.6 05/26/2016
[  399.240381] task: ffff8804031b6200 ti: ffff8800d72b4000 task.ti: ffff8800d72b4000
[  399.240385] RIP: 0010:[<ffffffff813fcd9e>]  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
[  399.240385] RSP: 0018:ffff8800d72b7d10  EFLAGS: 00010286
[  399.240386] RAX: ffff8804031b6200 RBX: ffff8800d2577e00 RCX: 000000003fffffff
[  399.240387] RDX: ffffffffc0000001 RSI: 0000000000000001 RDI: ffff8800d5e8c1e0
[  399.240387] RBP: ffff8800d72b7d50 R08: 0000000000000000 R09: 000000000000003f
[  399.240388] R10: 0000000000000004 R11: 00000000001db9ac R12: 00000000ffffffff
[  399.240388] R13: ffff8800d2748e00 R14: ffff88040a016400 R15: ffff8800d2748e40
[  399.240389] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
[  399.240390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  399.240390] CR2: 00007fb49246a000 CR3: 000000040215c000 CR4: 00000000003406e0
[  399.240391] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  399.240391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  399.240392] Stack:
[  399.240393]  ffff8800d72b7d18 ffff8800d72b7d30 0000000000000000 0000000000000000
[  399.240394]  ffffffffa079c290 ffff8800d2577e00 0000000000000000 ffff8800d2748e00
[  399.240395]  ffff8800d72b7e58 ffffffffa079e74c ffff88040b661c00 ffff8800d2577e00
[  399.240396] Call Trace:
[  399.240398]  [<ffffffffa079c290>] ? sync_request+0xb20/0xb20 [raid1]
[  399.240400]  [<ffffffffa079e74c>] raid1d+0x65c/0x1060 [raid1]
[  399.240403]  [<ffffffff810b6800>] ? trace_raw_output_itimer_expire+0x80/0x80
[  399.240407]  [<ffffffffa0772040>] md_thread+0x130/0x140 [md_mod]
[  399.240409]  [<ffffffff81094790>] ? wait_woken+0x80/0x80
[  399.240412]  [<ffffffffa0771f10>] ? find_pers+0x70/0x70 [md_mod]
[  399.240414]  [<ffffffff81075066>] kthread+0xd6/0xf0
[  399.240415]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
[  399.240417]  [<ffffffff8180411f>] ret_from_fork+0x3f/0x70
[  399.240418]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
[  399.240433] Code: 89 04 24 e9 2d ff ff ff 49 8d bd d8 07 00 00 f0 49 83 ad d8 07 00 00 01 74 05 e9 8b fe ff ff 41 ff 95 e8 07 00 00 e9 7f fe ff ff <0f> 0b 55 48 63 c7 48 89 e5 41 54 53 48 89 f3 48 83 ec 28 48 0b 
[  399.240434] RIP  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
[  399.240435]  RSP <ffff8800d72b7d10>


Regards,
Michael Wang

> 
> NeilBrown
> 

^ permalink raw reply

* [PATCH 0/4] PPL fixes and improvements
From: Artur Paszkiewicz @ 2017-04-04 11:13 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

These are mostly fixes for possible deadlocks related to memory allocation and
one optimization.

Artur Paszkiewicz (4):
  raid5-ppl: use a single mempool for ppl_io_unit and header_page
  raid5-ppl: move no_mem_stripes to struct ppl_conf
  raid5-ppl: use resize_stripes() when enabling or disabling ppl
  raid5-ppl: partial parity calculation optimization

 drivers/md/raid5-log.h |   5 ++-
 drivers/md/raid5-ppl.c | 112 +++++++++++++++++++++++++++++++------------------
 drivers/md/raid5.c     |  94 ++++++++++++++++++-----------------------
 3 files changed, 115 insertions(+), 96 deletions(-)

-- 
2.11.0


^ permalink raw reply

* [PATCH 1/4] raid5-ppl: use a single mempool for ppl_io_unit and header_page
From: Artur Paszkiewicz @ 2017-04-04 11:13 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170404111358.14829-1-artur.paszkiewicz@intel.com>

Allocate both struct ppl_io_unit and its header_page from a shared
mempool to avoid a possible deadlock. Implement allocate and free
functions for the mempool, remove the second pool for allocating
header_page. The header_pages are now freed with their io_units, not
when the ppl bio completes. Also, use GFP_NOWAIT instead of GFP_ATOMIC
for allocating ppl_io_unit because we can handle failed allocations and
there is no reason to utilize emergency reserves.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 53 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 86ea9addb51a..42e43467d1e8 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -102,7 +102,6 @@ struct ppl_conf {
 	struct kmem_cache *io_kc;
 	mempool_t *io_pool;
 	struct bio_set *bs;
-	mempool_t *meta_pool;
 
 	/* used only for recovery */
 	int recovered_entries;
@@ -195,6 +194,33 @@ ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
 	return tx;
 }
 
+static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	struct kmem_cache *kc = pool_data;
+	struct ppl_io_unit *io;
+
+	io = kmem_cache_alloc(kc, gfp_mask);
+	if (!io)
+		return NULL;
+
+	io->header_page = alloc_page(gfp_mask);
+	if (!io->header_page) {
+		kmem_cache_free(kc, io);
+		return NULL;
+	}
+
+	return io;
+}
+
+static void ppl_io_pool_free(void *element, void *pool_data)
+{
+	struct kmem_cache *kc = pool_data;
+	struct ppl_io_unit *io = element;
+
+	__free_page(io->header_page);
+	kmem_cache_free(kc, io);
+}
+
 static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
 					  struct stripe_head *sh)
 {
@@ -202,18 +228,19 @@ static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
 	struct ppl_io_unit *io;
 	struct ppl_header *pplhdr;
 
-	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
+	io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
 	if (!io)
 		return NULL;
 
-	memset(io, 0, sizeof(*io));
 	io->log = log;
+	io->entries_count = 0;
+	io->pp_size = 0;
+	io->submitted = false;
 	INIT_LIST_HEAD(&io->log_sibling);
 	INIT_LIST_HEAD(&io->stripe_list);
 	atomic_set(&io->pending_stripes, 0);
 	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
 
-	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
 	pplhdr = page_address(io->header_page);
 	clear_page(pplhdr);
 	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
@@ -369,8 +396,6 @@ static void ppl_log_endio(struct bio *bio)
 	if (bio->bi_error)
 		md_error(ppl_conf->mddev, log->rdev);
 
-	mempool_free(io->header_page, ppl_conf->meta_pool);
-
 	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
 		list_del_init(&sh->log_list);
 
@@ -998,7 +1023,6 @@ static void __ppl_exit_log(struct ppl_conf *ppl_conf)
 
 	kfree(ppl_conf->child_logs);
 
-	mempool_destroy(ppl_conf->meta_pool);
 	if (ppl_conf->bs)
 		bioset_free(ppl_conf->bs);
 	mempool_destroy(ppl_conf->io_pool);
@@ -1104,25 +1128,20 @@ int ppl_init_log(struct r5conf *conf)
 
 	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
 	if (!ppl_conf->io_kc) {
-		ret = -EINVAL;
+		ret = -ENOMEM;
 		goto err;
 	}
 
-	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
+	ppl_conf->io_pool = mempool_create(conf->raid_disks, ppl_io_pool_alloc,
+					   ppl_io_pool_free, ppl_conf->io_kc);
 	if (!ppl_conf->io_pool) {
-		ret = -EINVAL;
+		ret = -ENOMEM;
 		goto err;
 	}
 
 	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
 	if (!ppl_conf->bs) {
-		ret = -EINVAL;
-		goto err;
-	}
-
-	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
-	if (!ppl_conf->meta_pool) {
-		ret = -EINVAL;
+		ret = -ENOMEM;
 		goto err;
 	}
 
-- 
2.11.0


^ permalink raw reply related

* [PATCH 2/4] raid5-ppl: move no_mem_stripes to struct ppl_conf
From: Artur Paszkiewicz @ 2017-04-04 11:13 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170404111358.14829-1-artur.paszkiewicz@intel.com>

Use a single no_mem_stripes list instead of per member device lists for
handling stripes that need retrying in case of failed io_unit
allocation. Because io_units are allocated from a memory pool shared
between all member disks, the no_mem_stripes list should be checked when
an io_unit for any member is freed. This fixes a deadlock that could
happen if there are stripes in more than one no_mem_stripes list.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 42e43467d1e8..0a39a6bbcbde 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -106,6 +106,10 @@ struct ppl_conf {
 	/* used only for recovery */
 	int recovered_entries;
 	int mismatch_count;
+
+	/* stripes to retry if failed to allocate io_unit */
+	struct list_head no_mem_stripes;
+	spinlock_t no_mem_stripes_lock;
 };
 
 struct ppl_log {
@@ -118,8 +122,6 @@ struct ppl_log {
 					 * always at the end of io_list */
 	spinlock_t io_list_lock;
 	struct list_head io_list;	/* all io_units of this log */
-	struct list_head no_mem_stripes;/* stripes to retry if failed to
-					 * allocate io_unit */
 };
 
 #define PPL_IO_INLINE_BVECS 32
@@ -374,9 +376,9 @@ int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh)
 	atomic_inc(&sh->count);
 
 	if (ppl_log_stripe(log, sh)) {
-		spin_lock_irq(&log->io_list_lock);
-		list_add_tail(&sh->log_list, &log->no_mem_stripes);
-		spin_unlock_irq(&log->io_list_lock);
+		spin_lock_irq(&ppl_conf->no_mem_stripes_lock);
+		list_add_tail(&sh->log_list, &ppl_conf->no_mem_stripes);
+		spin_unlock_irq(&ppl_conf->no_mem_stripes_lock);
 	}
 
 	mutex_unlock(&log->io_mutex);
@@ -517,25 +519,32 @@ void ppl_write_stripe_run(struct r5conf *conf)
 static void ppl_io_unit_finished(struct ppl_io_unit *io)
 {
 	struct ppl_log *log = io->log;
+	struct ppl_conf *ppl_conf = log->ppl_conf;
 	unsigned long flags;
 
 	pr_debug("%s: seq: %llu\n", __func__, io->seq);
 
-	spin_lock_irqsave(&log->io_list_lock, flags);
+	local_irq_save(flags);
 
+	spin_lock(&log->io_list_lock);
 	list_del(&io->log_sibling);
-	mempool_free(io, log->ppl_conf->io_pool);
+	spin_unlock(&log->io_list_lock);
+
+	mempool_free(io, ppl_conf->io_pool);
+
+	spin_lock(&ppl_conf->no_mem_stripes_lock);
+	if (!list_empty(&ppl_conf->no_mem_stripes)) {
+		struct stripe_head *sh;
 
-	if (!list_empty(&log->no_mem_stripes)) {
-		struct stripe_head *sh = list_first_entry(&log->no_mem_stripes,
-							  struct stripe_head,
-							  log_list);
+		sh = list_first_entry(&ppl_conf->no_mem_stripes,
+				      struct stripe_head, log_list);
 		list_del_init(&sh->log_list);
 		set_bit(STRIPE_HANDLE, &sh->state);
 		raid5_release_stripe(sh);
 	}
+	spin_unlock(&ppl_conf->no_mem_stripes_lock);
 
-	spin_unlock_irqrestore(&log->io_list_lock, flags);
+	local_irq_restore(flags);
 }
 
 void ppl_stripe_write_finished(struct stripe_head *sh)
@@ -1154,6 +1163,8 @@ int ppl_init_log(struct r5conf *conf)
 	}
 
 	atomic64_set(&ppl_conf->seq, 0);
+	INIT_LIST_HEAD(&ppl_conf->no_mem_stripes);
+	spin_lock_init(&ppl_conf->no_mem_stripes_lock);
 
 	if (!mddev->external) {
 		ppl_conf->signature = ~crc32c_le(~0, mddev->uuid, sizeof(mddev->uuid));
@@ -1169,7 +1180,6 @@ int ppl_init_log(struct r5conf *conf)
 		mutex_init(&log->io_mutex);
 		spin_lock_init(&log->io_list_lock);
 		INIT_LIST_HEAD(&log->io_list);
-		INIT_LIST_HEAD(&log->no_mem_stripes);
 
 		log->ppl_conf = ppl_conf;
 		log->rdev = rdev;
-- 
2.11.0


^ permalink raw reply related

* [PATCH 3/4] raid5-ppl: use resize_stripes() when enabling or disabling ppl
From: Artur Paszkiewicz @ 2017-04-04 11:13 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170404111358.14829-1-artur.paszkiewicz@intel.com>

Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate
or free sh->ppl_page at runtime for all stripes in the stripe cache.
raid5_reset_stripe_cache() required suspending the mddev and could
deadlock because of GFP_KERNEL allocations.

Move the 'newsize' check to check_reshape() to allow reallocating the
stripes with the same number of disks. Allocate sh->ppl_page in
alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as
a parameter to alloc_stripe() because it is needed to check whether to
allocate ppl_page. Add free_stripe() and use it to free stripes rather
than directly call kmem_cache_free(). Also free sh->ppl_page in
free_stripe().

Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly
setting it in advance and add another parameter to log_init() to allow
calling ppl_init_log() without the bit set. Don't try to calculate
partial parity or add a stripe to log if it does not have ppl_page set.

Enabling ppl can now be performed without suspending the mddev, because
the log won't be used until new stripes are allocated with ppl_page.
Calling mddev_suspend/resume is still necessary when disabling ppl,
because we want all stripes to finish before stopping the log, but
resize_stripes() can be called after mddev_resume() when ppl is no
longer active.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-log.h |  5 +--
 drivers/md/raid5-ppl.c |  3 +-
 drivers/md/raid5.c     | 88 ++++++++++++++++++++++----------------------------
 3 files changed, 43 insertions(+), 53 deletions(-)

diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index 738930ff5d17..27097101ccca 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -93,11 +93,12 @@ static inline void log_exit(struct r5conf *conf)
 		ppl_exit_log(conf);
 }
 
-static inline int log_init(struct r5conf *conf, struct md_rdev *journal_dev)
+static inline int log_init(struct r5conf *conf, struct md_rdev *journal_dev,
+			   bool ppl)
 {
 	if (journal_dev)
 		return r5l_init_log(conf, journal_dev);
-	else if (raid5_has_ppl(conf))
+	else if (ppl)
 		return ppl_init_log(conf);
 
 	return 0;
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 0a39a6bbcbde..e938669810c4 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -355,7 +355,7 @@ int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh)
 	struct ppl_io_unit *io = sh->ppl_io;
 	struct ppl_log *log;
 
-	if (io || test_bit(STRIPE_SYNCING, &sh->state) ||
+	if (io || test_bit(STRIPE_SYNCING, &sh->state) || !sh->ppl_page ||
 	    !test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags) ||
 	    !test_bit(R5_Insync, &sh->dev[sh->pd_idx].flags)) {
 		clear_bit(STRIPE_LOG_TRAPPED, &sh->state);
@@ -1223,6 +1223,7 @@ int ppl_init_log(struct r5conf *conf)
 	}
 
 	conf->log_private = ppl_conf;
+	set_bit(MD_HAS_PPL, &ppl_conf->mddev->flags);
 
 	return 0;
 err:
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6036d5e41ddd..9cab2fe078c2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -471,11 +471,6 @@ static void shrink_buffers(struct stripe_head *sh)
 		sh->dev[i].page = NULL;
 		put_page(p);
 	}
-
-	if (sh->ppl_page) {
-		put_page(sh->ppl_page);
-		sh->ppl_page = NULL;
-	}
 }
 
 static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
@@ -493,12 +488,6 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
 		sh->dev[i].orig_page = page;
 	}
 
-	if (raid5_has_ppl(sh->raid_conf)) {
-		sh->ppl_page = alloc_page(gfp);
-		if (!sh->ppl_page)
-			return 1;
-	}
-
 	return 0;
 }
 
@@ -2132,8 +2121,15 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 	put_cpu();
 }
 
+static void free_stripe(struct kmem_cache *sc, struct stripe_head *sh)
+{
+	if (sh->ppl_page)
+		__free_page(sh->ppl_page);
+	kmem_cache_free(sc, sh);
+}
+
 static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
-	int disks)
+	int disks, struct r5conf *conf)
 {
 	struct stripe_head *sh;
 	int i;
@@ -2147,6 +2143,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		INIT_LIST_HEAD(&sh->r5c);
 		INIT_LIST_HEAD(&sh->log_list);
 		atomic_set(&sh->count, 1);
+		sh->raid_conf = conf;
 		sh->log_start = MaxSector;
 		for (i = 0; i < disks; i++) {
 			struct r5dev *dev = &sh->dev[i];
@@ -2154,6 +2151,14 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 			bio_init(&dev->req, &dev->vec, 1);
 			bio_init(&dev->rreq, &dev->rvec, 1);
 		}
+
+		if (raid5_has_ppl(conf)) {
+			sh->ppl_page = alloc_page(gfp);
+			if (!sh->ppl_page) {
+				free_stripe(sc, sh);
+				sh = NULL;
+			}
+		}
 	}
 	return sh;
 }
@@ -2161,15 +2166,13 @@ static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
 {
 	struct stripe_head *sh;
 
-	sh = alloc_stripe(conf->slab_cache, gfp, conf->pool_size);
+	sh = alloc_stripe(conf->slab_cache, gfp, conf->pool_size, conf);
 	if (!sh)
 		return 0;
 
-	sh->raid_conf = conf;
-
 	if (grow_buffers(sh, gfp)) {
 		shrink_buffers(sh);
-		kmem_cache_free(conf->slab_cache, sh);
+		free_stripe(conf->slab_cache, sh);
 		return 0;
 	}
 	sh->hash_lock_index =
@@ -2314,9 +2317,6 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 	int i;
 	int hash, cnt;
 
-	if (newsize <= conf->pool_size)
-		return 0; /* never bother to shrink */
-
 	err = md_allow_write(conf->mddev);
 	if (err)
 		return err;
@@ -2332,11 +2332,10 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 	mutex_lock(&conf->cache_size_mutex);
 
 	for (i = conf->max_nr_stripes; i; i--) {
-		nsh = alloc_stripe(sc, GFP_KERNEL, newsize);
+		nsh = alloc_stripe(sc, GFP_KERNEL, newsize, conf);
 		if (!nsh)
 			break;
 
-		nsh->raid_conf = conf;
 		list_add(&nsh->lru, &newstripes);
 	}
 	if (i) {
@@ -2344,7 +2343,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 		while (!list_empty(&newstripes)) {
 			nsh = list_entry(newstripes.next, struct stripe_head, lru);
 			list_del(&nsh->lru);
-			kmem_cache_free(sc, nsh);
+			free_stripe(sc, nsh);
 		}
 		kmem_cache_destroy(sc);
 		mutex_unlock(&conf->cache_size_mutex);
@@ -2370,7 +2369,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 			nsh->dev[i].orig_page = osh->dev[i].page;
 		}
 		nsh->hash_lock_index = hash;
-		kmem_cache_free(conf->slab_cache, osh);
+		free_stripe(conf->slab_cache, osh);
 		cnt++;
 		if (cnt >= conf->max_nr_stripes / NR_STRIPE_HASH_LOCKS +
 		    !!((conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS) > hash)) {
@@ -2445,7 +2444,7 @@ static int drop_one_stripe(struct r5conf *conf)
 		return 0;
 	BUG_ON(atomic_read(&sh->count));
 	shrink_buffers(sh);
-	kmem_cache_free(conf->slab_cache, sh);
+	free_stripe(conf->slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
 	conf->max_nr_stripes--;
 	return 1;
@@ -3168,7 +3167,7 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		s->locked++;
 	}
 
-	if (raid5_has_ppl(sh->raid_conf) &&
+	if (raid5_has_ppl(sh->raid_conf) && sh->ppl_page &&
 	    test_bit(STRIPE_OP_BIODRAIN, &s->ops_request) &&
 	    !test_bit(STRIPE_FULL_WRITE, &sh->state) &&
 	    test_bit(R5_Insync, &sh->dev[pd_idx].flags))
@@ -7414,7 +7413,7 @@ static int raid5_run(struct mddev *mddev)
 		blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);
 	}
 
-	if (log_init(conf, journal_dev))
+	if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
 		goto abort;
 
 	return 0;
@@ -7623,7 +7622,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * The array is in readonly mode if journal is missing, so no
 		 * write requests running. We should be safe
 		 */
-		log_init(conf, rdev);
+		log_init(conf, rdev, false);
 		return 0;
 	}
 	if (mddev->recovery_disabled == conf->recovery_disabled)
@@ -7773,6 +7772,9 @@ static int check_reshape(struct mddev *mddev)
 				      mddev->chunk_sectors)
 			    ) < 0)
 			return -ENOMEM;
+
+	if (conf->previous_raid_disks + mddev->delta_disks <= conf->pool_size)
+		return 0; /* never bother to shrink */
 	return resize_stripes(conf, (conf->previous_raid_disks
 				     + mddev->delta_disks));
 }
@@ -8263,20 +8265,6 @@ static void *raid6_takeover(struct mddev *mddev)
 	return setup_conf(mddev);
 }
 
-static void raid5_reset_stripe_cache(struct mddev *mddev)
-{
-	struct r5conf *conf = mddev->private;
-
-	mutex_lock(&conf->cache_size_mutex);
-	while (conf->max_nr_stripes &&
-	       drop_one_stripe(conf))
-		;
-	while (conf->min_nr_stripes > conf->max_nr_stripes &&
-	       grow_one_stripe(conf, GFP_KERNEL))
-		;
-	mutex_unlock(&conf->cache_size_mutex);
-}
-
 static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
 {
 	struct r5conf *conf;
@@ -8291,23 +8279,23 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
 		return -ENODEV;
 	}
 
-	if (strncmp(buf, "ppl", 3) == 0 && !raid5_has_ppl(conf)) {
+	if (strncmp(buf, "ppl", 3) == 0) {
 		/* ppl only works with RAID 5 */
-		if (conf->level == 5) {
-			mddev_suspend(mddev);
-			set_bit(MD_HAS_PPL, &mddev->flags);
-			err = log_init(conf, NULL);
-			if (!err)
-				raid5_reset_stripe_cache(mddev);
-			mddev_resume(mddev);
+		if (!raid5_has_ppl(conf) && conf->level == 5) {
+			err = log_init(conf, NULL, true);
+			if (!err) {
+				err = resize_stripes(conf, conf->pool_size);
+				if (err)
+					log_exit(conf);
+			}
 		} else
 			err = -EINVAL;
 	} else if (strncmp(buf, "resync", 6) == 0) {
 		if (raid5_has_ppl(conf)) {
 			mddev_suspend(mddev);
 			log_exit(conf);
-			raid5_reset_stripe_cache(mddev);
 			mddev_resume(mddev);
+			err = resize_stripes(conf, conf->pool_size);
 		} else if (test_bit(MD_HAS_JOURNAL, &conf->mddev->flags) &&
 			   r5l_log_disk_error(conf)) {
 			bool journal_dev_exists = false;
-- 
2.11.0


^ permalink raw reply related

* [PATCH 4/4] raid5-ppl: partial parity calculation optimization
From: Artur Paszkiewicz @ 2017-04-04 11:13 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170404111358.14829-1-artur.paszkiewicz@intel.com>

In case of read-modify-write, partial partity is the same as the result
of ops_run_prexor5(), so we can just copy sh->dev[pd_idx].page into
sh->ppl_page instead of calculating it again.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 20 ++++++++++----------
 drivers/md/raid5.c     |  6 +++---
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index e938669810c4..6fef999f2150 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -152,7 +152,7 @@ ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
 		       struct dma_async_tx_descriptor *tx)
 {
 	int disks = sh->disks;
-	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
+	struct page **srcs = flex_array_get(percpu->scribble, 0);
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct async_submit_ctl submit;
 
@@ -165,18 +165,18 @@ ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
 	 * differently.
 	 */
 	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
-		/* rmw: xor old data and parity from updated disks */
-		for (i = disks; i--;) {
-			struct r5dev *dev = &sh->dev[i];
-			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
-				xor_srcs[count++] = dev->page;
-		}
+		/*
+		 * rmw: xor old data and parity from updated disks
+		 * This is calculated earlier by ops_run_prexor5() so just copy
+		 * the parity dev page.
+		 */
+		srcs[count++] = sh->dev[pd_idx].page;
 	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
 		/* rcw: xor data from all not updated disks */
 		for (i = disks; i--;) {
 			struct r5dev *dev = &sh->dev[i];
 			if (test_bit(R5_UPTODATE, &dev->flags))
-				xor_srcs[count++] = dev->page;
+				srcs[count++] = dev->page;
 		}
 	} else {
 		return tx;
@@ -187,10 +187,10 @@ ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
 			  + sizeof(struct page *) * (sh->disks + 2));
 
 	if (count == 1)
-		tx = async_memcpy(sh->ppl_page, xor_srcs[0], 0, 0, PAGE_SIZE,
+		tx = async_memcpy(sh->ppl_page, srcs[0], 0, 0, PAGE_SIZE,
 				  &submit);
 	else
-		tx = async_xor(sh->ppl_page, xor_srcs, 0, count, PAGE_SIZE,
+		tx = async_xor(sh->ppl_page, srcs, 0, count, PAGE_SIZE,
 			       &submit);
 
 	return tx;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9cab2fe078c2..f99a12f50f9a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2079,9 +2079,6 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 			async_tx_ack(tx);
 	}
 
-	if (test_bit(STRIPE_OP_PARTIAL_PARITY, &ops_request))
-		tx = ops_run_partial_parity(sh, percpu, tx);
-
 	if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
 		if (level < 6)
 			tx = ops_run_prexor5(sh, percpu, tx);
@@ -2089,6 +2086,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 			tx = ops_run_prexor6(sh, percpu, tx);
 	}
 
+	if (test_bit(STRIPE_OP_PARTIAL_PARITY, &ops_request))
+		tx = ops_run_partial_parity(sh, percpu, tx);
+
 	if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) {
 		tx = ops_run_biodrain(sh, tx);
 		overlap_clear++;
-- 
2.11.0


^ permalink raw reply related

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04 12:24 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <2a6f8c30-616c-d6fe-1c3f-ab687c145cd7@profitbricks.com>

On 04/04/2017 12:23 PM, Michael Wang wrote:
[snip]
>> add something like
>>   if (wbio->bi_next)
>>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>           i, r1_bio->read_disk, wbio->bi_end_io);
>>
>> that might help narrow down what is happening.
> 
> Just triggered again in 4.4, dmesg like:
> 
> [  399.240230] md: super_written gets error=-5
> [  399.240286] md: super_written gets error=-5
> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]

Is it possible that the fail fast who changed the 'bi_end_io' inside
fix_sync_read_error() help the used bio pass the check?

I'm not sure but if the read bio was supposed to be reused as write
for fail fast, maybe we should reset it like this?

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
                /* Don't try recovering from here - just fail it
                 * ... unless it is the last working device of course */
                md_error(mddev, rdev);
-               if (test_bit(Faulty, &rdev->flags))
+               if (test_bit(Faulty, &rdev->flags)) {
                        /* Don't try to read from here, but make sure
                         * put_buf does it's thing
                         */
                        bio->bi_end_io = end_sync_write;
+                       bio->bi_next = NULL;
+               }
        }
 
        while(sectors) {

Regards,
Michael Wang


> [  399.240363] ------------[ cut here ]------------
> [  399.240364] kernel BUG at block/blk-core.c:2147!
> [  399.240365] invalid opcode: 0000 [#1] SMP 
> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: scsi_transport_srp]
> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ #26
> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 1.3.6 05/26/2016
> [  399.240381] task: ffff8804031b6200 ti: ffff8800d72b4000 task.ti: ffff8800d72b4000
> [  399.240385] RIP: 0010:[<ffffffff813fcd9e>]  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
> [  399.240385] RSP: 0018:ffff8800d72b7d10  EFLAGS: 00010286
> [  399.240386] RAX: ffff8804031b6200 RBX: ffff8800d2577e00 RCX: 000000003fffffff
> [  399.240387] RDX: ffffffffc0000001 RSI: 0000000000000001 RDI: ffff8800d5e8c1e0
> [  399.240387] RBP: ffff8800d72b7d50 R08: 0000000000000000 R09: 000000000000003f
> [  399.240388] R10: 0000000000000004 R11: 00000000001db9ac R12: 00000000ffffffff
> [  399.240388] R13: ffff8800d2748e00 R14: ffff88040a016400 R15: ffff8800d2748e40
> [  399.240389] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
> [  399.240390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  399.240390] CR2: 00007fb49246a000 CR3: 000000040215c000 CR4: 00000000003406e0
> [  399.240391] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  399.240391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  399.240392] Stack:
> [  399.240393]  ffff8800d72b7d18 ffff8800d72b7d30 0000000000000000 0000000000000000
> [  399.240394]  ffffffffa079c290 ffff8800d2577e00 0000000000000000 ffff8800d2748e00
> [  399.240395]  ffff8800d72b7e58 ffffffffa079e74c ffff88040b661c00 ffff8800d2577e00
> [  399.240396] Call Trace:
> [  399.240398]  [<ffffffffa079c290>] ? sync_request+0xb20/0xb20 [raid1]
> [  399.240400]  [<ffffffffa079e74c>] raid1d+0x65c/0x1060 [raid1]
> [  399.240403]  [<ffffffff810b6800>] ? trace_raw_output_itimer_expire+0x80/0x80
> [  399.240407]  [<ffffffffa0772040>] md_thread+0x130/0x140 [md_mod]
> [  399.240409]  [<ffffffff81094790>] ? wait_woken+0x80/0x80
> [  399.240412]  [<ffffffffa0771f10>] ? find_pers+0x70/0x70 [md_mod]
> [  399.240414]  [<ffffffff81075066>] kthread+0xd6/0xf0
> [  399.240415]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
> [  399.240417]  [<ffffffff8180411f>] ret_from_fork+0x3f/0x70
> [  399.240418]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
> [  399.240433] Code: 89 04 24 e9 2d ff ff ff 49 8d bd d8 07 00 00 f0 49 83 ad d8 07 00 00 01 74 05 e9 8b fe ff ff 41 ff 95 e8 07 00 00 e9 7f fe ff ff <0f> 0b 55 48 63 c7 48 89 e5 41 54 53 48 89 f3 48 83 ec 28 48 0b 
> [  399.240434] RIP  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
> [  399.240435]  RSP <ffff8800d72b7d10>
> 
> 
> Regards,
> Michael Wang
> 
>>
>> NeilBrown
>>

^ permalink raw reply related

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04 12:48 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <d84a1dcf-6f60-d089-f81d-85df5a504c19@profitbricks.com>



On 04/04/2017 02:24 PM, Michael Wang wrote:
> On 04/04/2017 12:23 PM, Michael Wang wrote:
> [snip]
>>> add something like
>>>   if (wbio->bi_next)
>>>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>>           i, r1_bio->read_disk, wbio->bi_end_io);
>>>
>>> that might help narrow down what is happening.
>>
>> Just triggered again in 4.4, dmesg like:
>>
>> [  399.240230] md: super_written gets error=-5
>> [  399.240286] md: super_written gets error=-5
>> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]
> 
> Is it possible that the fail fast who changed the 'bi_end_io' inside
> fix_sync_read_error() help the used bio pass the check?

Hi, NeilBrown, below patch fixed the issue in our testing, I'll post a md
RFC patch so we can continue the discussion there.

Regards,
Michael Wang

> 
> I'm not sure but if the read bio was supposed to be reused as write
> for fail fast, maybe we should reset it like this?
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>                 /* Don't try recovering from here - just fail it
>                  * ... unless it is the last working device of course */
>                 md_error(mddev, rdev);
> -               if (test_bit(Faulty, &rdev->flags))
> +               if (test_bit(Faulty, &rdev->flags)) {
>                         /* Don't try to read from here, but make sure
>                          * put_buf does it's thing
>                          */
>                         bio->bi_end_io = end_sync_write;
> +                       bio->bi_next = NULL;
> +               }
>         }
>  
>         while(sectors) {
> 
> Regards,
> Michael Wang
> 
> 
>> [  399.240363] ------------[ cut here ]------------
>> [  399.240364] kernel BUG at block/blk-core.c:2147!
>> [  399.240365] invalid opcode: 0000 [#1] SMP 
>> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: scsi_transport_srp]
>> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ #26
>> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 1.3.6 05/26/2016
>> [  399.240381] task: ffff8804031b6200 ti: ffff8800d72b4000 task.ti: ffff8800d72b4000
>> [  399.240385] RIP: 0010:[<ffffffff813fcd9e>]  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
>> [  399.240385] RSP: 0018:ffff8800d72b7d10  EFLAGS: 00010286
>> [  399.240386] RAX: ffff8804031b6200 RBX: ffff8800d2577e00 RCX: 000000003fffffff
>> [  399.240387] RDX: ffffffffc0000001 RSI: 0000000000000001 RDI: ffff8800d5e8c1e0
>> [  399.240387] RBP: ffff8800d72b7d50 R08: 0000000000000000 R09: 000000000000003f
>> [  399.240388] R10: 0000000000000004 R11: 00000000001db9ac R12: 00000000ffffffff
>> [  399.240388] R13: ffff8800d2748e00 R14: ffff88040a016400 R15: ffff8800d2748e40
>> [  399.240389] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
>> [  399.240390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  399.240390] CR2: 00007fb49246a000 CR3: 000000040215c000 CR4: 00000000003406e0
>> [  399.240391] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  399.240391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [  399.240392] Stack:
>> [  399.240393]  ffff8800d72b7d18 ffff8800d72b7d30 0000000000000000 0000000000000000
>> [  399.240394]  ffffffffa079c290 ffff8800d2577e00 0000000000000000 ffff8800d2748e00
>> [  399.240395]  ffff8800d72b7e58 ffffffffa079e74c ffff88040b661c00 ffff8800d2577e00
>> [  399.240396] Call Trace:
>> [  399.240398]  [<ffffffffa079c290>] ? sync_request+0xb20/0xb20 [raid1]
>> [  399.240400]  [<ffffffffa079e74c>] raid1d+0x65c/0x1060 [raid1]
>> [  399.240403]  [<ffffffff810b6800>] ? trace_raw_output_itimer_expire+0x80/0x80
>> [  399.240407]  [<ffffffffa0772040>] md_thread+0x130/0x140 [md_mod]
>> [  399.240409]  [<ffffffff81094790>] ? wait_woken+0x80/0x80
>> [  399.240412]  [<ffffffffa0771f10>] ? find_pers+0x70/0x70 [md_mod]
>> [  399.240414]  [<ffffffff81075066>] kthread+0xd6/0xf0
>> [  399.240415]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
>> [  399.240417]  [<ffffffff8180411f>] ret_from_fork+0x3f/0x70
>> [  399.240418]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
>> [  399.240433] Code: 89 04 24 e9 2d ff ff ff 49 8d bd d8 07 00 00 f0 49 83 ad d8 07 00 00 01 74 05 e9 8b fe ff ff 41 ff 95 e8 07 00 00 e9 7f fe ff ff <0f> 0b 55 48 63 c7 48 89 e5 41 54 53 48 89 f3 48 83 ec 28 48 0b 
>> [  399.240434] RIP  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
>> [  399.240435]  RSP <ffff8800d72b7d10>
>>
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> NeilBrown
>>>

^ permalink raw reply

* [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
From: Michael Wang @ 2017-04-04 13:50 UTC (permalink / raw)
  To: linux-raid, linux-kernel@vger.kernel.org
  Cc: Shaohua Li, NeilBrown, Jinpu Wang


During the testing we found the sync read bio can go through
path:

  md_do_sync()
    sync_request()
      generic_make_request()
        blk_queue_bio()
          blk_attempt_plug_merge()
            bio->bi_next CHAINED HERE

  ...

  raid1d()
    sync_request_write()
      fix_sync_read_error()
        if FailFast && Faulty
          bio->bi_end_io = end_sync_write
      generic_make_request()
        BUG_ON(bio->bi_next)

This need to meet the conditions:
  * bio once merged
  * read disk have FailFast enabled
  * read disk is Faulty

And since the block layer won't reset the 'bi_next' after bio
is done inside request, we hit the BUG like that.

This patch simply reset the bi_next before we reuse it.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
---
 drivers/md/raid1.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 		/* Don't try recovering from here - just fail it
 		 * ... unless it is the last working device of course */
 		md_error(mddev, rdev);
-		if (test_bit(Faulty, &rdev->flags))
+		if (test_bit(Faulty, &rdev->flags)) {
 			/* Don't try to read from here, but make sure
 			 * put_buf does it's thing
 			 */
 			bio->bi_end_io = end_sync_write;
+			bio->bi_next = NULL;
+		}
 	}
 
 	while(sectors) {
-- 
2.5.0

^ permalink raw reply related

* md-cluster Oops 4.9.13
From: Marc Smith @ 2017-04-04 14:06 UTC (permalink / raw)
  To: linux-raid

Hi,

I encountered an oops this morning when stopping a MD array
(md-cluster)... there were 4 md-cluster array started, and they were
in the middle of a rebuild. I stopped the first one and then stopped
the second one immediately after and got the oops, here is a
transcript of what was on my terminal session:

[root@brimstone-1b ~]# mdadm --stop /dev/md/array1
mdadm: stopped /dev/md/array1
[root@brimstone-1b ~]# mdadm --stop /dev/md/array2

Message from syslogd@brimstone-1b at Tue Apr  4 09:54:40 2017 ...
brimstone-1b kernel: [649162.174685] BUG: unable to handle kernel NULL
pointer dereference at 0000000000000098

Using Linux 4.9.13 and here is the output from the kernel messages:

--snip--
[649158.014731] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: leaving the
lockspace group...
[649158.015233] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: group event done 0 0
[649158.015303] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8:
release_lockspace final free
[649158.015331] md: unbind<nvme0n1p1>
[649158.042540] md: export_rdev(nvme0n1p1)
[649158.042546] md: unbind<nvme1n1p1>
[649158.048501] md: export_rdev(nvme1n1p1)
[649161.759022] md127: detected capacity change from 1000068874240 to 0
[649161.759025] md: md127 stopped.
[649162.174685] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000098
[649162.174727] IP: [<ffffffff81868b40>] recv_daemon+0x1e9/0x373
[649162.174766] PGD 0
[649162.174776]
[649162.174792] Oops: 0000 [#1] SMP
[649162.174806] Modules linked in: qla2xxx bonding mlx5_core bna
[649162.174850] CPU: 53 PID: 37118 Comm: md127_cluster_r Not tainted
4.9.13-esos.prod #1
[649162.174902] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS,
BIOS 2.0 10/28/2016
[649162.174926] task: ffff88105a259880 task.stack: ffffc9002fb54000
[649162.174946] RIP: 0010:[<ffffffff81868b40>]  [<ffffffff81868b40>]
recv_daemon+0x1e9/0x373
[649162.175287] RSP: 0018:ffffc9002fb57e00  EFLAGS: 00010282
[649162.175462] RAX: ffff88084fa22d40 RBX: ffff881015807000 RCX:
0000000000000000
[649162.175799] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
ffff881015807000
[649162.176132] RBP: ffff88105a27e400 R08: 0000000000019b40 R09:
ffff88084fa22d40
[649162.176464] R10: ffffc90006493e38 R11: 000000204867dce0 R12:
0000000000000000
[649162.176796] R13: ffff88085a3e4d80 R14: 00000000024ed800 R15:
00000000024fd800
[649162.177130] FS:  0000000000000000(0000) GS:ffff88105f440000(0000)
knlGS:0000000000000000
[649162.177467] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[649162.177641] CR2: 0000000000000098 CR3: 0000000002010000 CR4:
00000000003406e0
[649162.177974] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[649162.178307] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[649162.178640] Stack:
[649162.178804]  ffff88105a27e430 ffff881059e9e700 ffff88084fa22d40
0000000000000001
[649162.179151]  00000000024ed800 00000000024fd800 0000000000000000
0000000000000000
[649162.179502]  0000000000000000 000000004d2f4151 ffff8810152256c0
ffff88105a259880
[649162.179853] Call Trace:
[649162.180028]  [<ffffffff810686f8>] ? do_group_exit+0x39/0x91
[649162.180206]  [<ffffffff8188245a>] ? md_thread+0xff/0x113
[649162.180383]  [<ffffffff8108fde5>] ? wake_up_bit+0x1b/0x1b
[649162.180558]  [<ffffffff8188235b>] ? md_wait_for_blocked_rdev+0xe4/0xe4
[649162.180736]  [<ffffffff8107abf0>] ? kthread+0xc2/0xca
[649162.180910]  [<ffffffff8107ab2e>] ? kthread_park+0x4e/0x4e
[649162.181092]  [<ffffffff81a89ee2>] ? ret_from_fork+0x22/0x30
[649162.181267] Code: 00 e8 db 8d 8b ff 48 85 c0 0f 84 32 01 00 00 44
89 20 4c 89 70 08 48 89 df 4c 89 78 10 48 8b 53 08 be 01 00 00 00 48
89 44 24 10 <ff> 92 98 00 00 00 48 8b 53 08 31 f6 48 89 df ff 92 98 00
00 00
[649162.182024] RIP  [<ffffffff81868b40>] recv_daemon+0x1e9/0x373
[649162.182201]  RSP <ffffc9002fb57e00>
[649162.182369] CR2: 0000000000000098
[649162.183031] ---[ end trace 71c20646840bbbd3 ]---
[649162.183278] BUG: unable to handle kernel NULL pointer dereference
at           (null)
[649162.183801] IP: [<ffffffff8108f91e>] __wake_up_common+0x1d/0x73
[649162.184100] PGD 0
[649162.184170]
[649162.184459] Oops: 0000 [#2] SMP
[649162.184687] Modules linked in: qla2xxx bonding mlx5_core bna
[649162.185242] CPU: 53 PID: 37118 Comm: md127_cluster_r Tainted: G
  D         4.9.13-esos.prod #1
[649162.185640] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS,
BIOS 2.0 10/28/2016
[649162.186036] task: ffff88105a259880 task.stack: ffffc9002fb54000
[649162.186272] RIP: 0010:[<ffffffff8108f91e>]  [<ffffffff8108f91e>]
__wake_up_common+0x1d/0x73
[649162.186734] RSP: 0018:ffffc9002fb57e78  EFLAGS: 00010046
[649162.186966] RAX: 0000000000000286 RBX: ffffc9002fb57f20 RCX:
0000000000000000
[649162.187357] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
ffffc9002fb57f20
[649162.187749] RBP: ffffc9002fb57f28 R08: 0000000000000000 R09:
0000000000000000
[649162.188157] R10: 00000000ffff8810 R11: 000000000000ffa0 R12:
0000000000000003
[649162.188548] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000001
[649162.188939] FS:  0000000000000000(0000) GS:ffff88105f440000(0000)
knlGS:0000000000000000
[649162.189330] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[649162.189562] CR2: 0000000000000000 CR3: 0000000002010000 CR4:
00000000003406e0
[649162.189952] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[649162.190341] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[649162.190730] Stack:
[649162.190955]  0000000000000000 ffffc9002fb57f20 ffffc9002fb57f18
0000000000000286
[649162.191605]  ffffc9002fb57f10 0000000000000000 0000000000030001
ffffffff810900f7
[649162.192252]  0000000000000000 ffff88105a259880 0000000000000001
ffffffff810633f4
[649162.192897] Call Trace:
[649162.193124]  [<ffffffff810900f7>] ? complete+0x2b/0x3a
[649162.193357]  [<ffffffff810633f4>] ? mm_release+0xe3/0xed
[649162.193589]  [<ffffffff8106745c>] ? do_exit+0x265/0x886
[649162.193824]  [<ffffffff81a8b6e7>] ? rewind_stack_do_exit+0x17/0x20
[649162.194058] Code: 07 00 00 00 00 48 89 47 08 48 89 47 10 c3 41 57
41 56 41 89 d7 41 55 41 54 41 89 cd 55 53 48 8d 6f 08 41 51 48 8b 57
08 41 89 f4 <48> 8b 1a 48 8d 42 e8 48 83 eb 18 48 8d 50 18 48 39 ea 74
36 4c
[649162.198789] RIP  [<ffffffff8108f91e>] __wake_up_common+0x1d/0x73
[649162.199085]  RSP <ffffc9002fb57e78>
[649162.199310] CR2: 0000000000000000
[649162.199536] ---[ end trace 71c20646840bbbd4 ]---
[649162.199764] Fixing recursive fault but reboot is needed!
[649225.679190] INFO: rcu_sched self-detected stall on CPU
[649225.679480]         46-...: (59999 ticks this GP)
idle=6dd/140000000000001/0 softirq=118748/118748 fqs=15000
[649225.679862]          (t=60001 jiffies g=395492 c=395491 q=3050)
[649225.680153] Task dump for CPU 46:
[649225.680373] killall         R  running task        0 37212      1 0x0000000c
[649225.680726]  0000000000000000 ffffffff81081cbf 000000000000002e
ffffffff82049000
[649225.681352]  ffffffff810e4ef5 ffffffff82049000 ffff88105f297640
ffffc9002fb9fbe8
[649225.681981]  0000000000000001 ffffffff810a63d1 0000000000000bea
000000000000002e
[649225.682610] Call Trace:
[649225.682828]  <IRQ>
[649225.682901]  [<ffffffff81081cbf>] ? sched_show_task+0xc3/0xd0
[649225.687385]  [<ffffffff810e4ef5>] ? rcu_dump_cpu_stacks+0x72/0x95
[649225.687616]  [<ffffffff810a63d1>] ? rcu_check_callbacks+0x227/0x604
[649225.687844]  [<ffffffff810a8c41>] ? update_process_times+0x23/0x45
[649225.688073]  [<ffffffff810b4094>] ? tick_sched_handle+0x2e/0x38
[649225.688300]  [<ffffffff810b40cd>] ? tick_sched_timer+0x2f/0x53
[649225.688527]  [<ffffffff810a9414>] ? __hrtimer_run_queues+0x71/0xe8
[649225.688754]  [<ffffffff810a963a>] ? hrtimer_interrupt+0x8b/0x145
[649225.688984]  [<ffffffff8102d94b>] ? smp_apic_timer_interrupt+0x34/0x43
[649225.689213]  [<ffffffff81a8a88f>] ? apic_timer_interrupt+0x7f/0x90
[649225.689439]  <EOI>
[649225.689509]  [<ffffffff810916a5>] ? queued_spin_lock_slowpath+0x48/0x15d
[649225.689951]  [<ffffffff81177018>] ? pid_revalidate+0x52/0xa0
[649225.690178]  [<ffffffff81132989>] ? lookup_fast+0x1f5/0x267
[649225.690405]  [<ffffffff81134a03>] ? path_openat+0x39d/0xeb1
[649225.690630]  [<ffffffff81132900>] ? lookup_fast+0x16c/0x267
[649225.690858]  [<ffffffff81079105>] ? get_pid_task+0x5/0xf
[649225.691091]  [<ffffffff8113a5bb>] ? dput+0x30/0x1cb
[649225.691315]  [<ffffffff811336c0>] ? path_lookupat+0xea/0xfe
[649225.691541]  [<ffffffff8113555f>] ? do_filp_open+0x48/0x9e
[649225.691767]  [<ffffffff81062b07>] ? get_task_mm+0x10/0x33
[649225.691994]  [<ffffffff813a28b3>] ? lockref_put_or_lock+0x3a/0x50
[649225.692221]  [<ffffffff8113a65a>] ? dput+0xcf/0x1cb
[649225.692446]  [<ffffffff8112799b>] ? do_sys_open+0x135/0x1bc
[649225.692671]  [<ffffffff8112799b>] ? do_sys_open+0x135/0x1bc
[649225.692898]  [<ffffffff81a89ca0>] ? entry_SYSCALL_64_fastpath+0x13/0x94
[649225.693127] INFO: rcu_sched detected stalls on CPUs/tasks:
[649225.693413]         46-...: (60001 ticks this GP)
idle=6dd/140000000000000/0 softirq=118748/118748 fqs=15001
[649225.693786]         (detected by 33, t=60015 jiffies, g=395492,
c=395491, q=3050)
[649225.694070] Task dump for CPU 46:
[649225.694284] killall         R  running task        0 37212      1 0x0000000c
[649225.694627]  ffffffff00000000 ffff881014971d18 ffff881022ddc020
ffff88105b9d5ec0
[649225.695236]  ffffc9002fb9fdc0 ffffffff811336c0 000000002fb9fe38
0000000000000001
[649225.695844]  ffffc9002fb9fef0 ffffc9002fb9ff0c 00000000000090fe
00007f540db18690
[649225.696452] Call Trace:
[649225.696668]  [<ffffffff811336c0>] ? path_lookupat+0xea/0xfe
[649225.696888]  [<ffffffff8113555f>] ? do_filp_open+0x48/0x9e
[649225.697108]  [<ffffffff81062b07>] ? get_task_mm+0x10/0x33
[649225.697328]  [<ffffffff813a28b3>] ? lockref_put_or_lock+0x3a/0x50
[649225.697549]  [<ffffffff8113a65a>] ? dput+0xcf/0x1cb
[649225.697770]  [<ffffffff8112799b>] ? do_sys_open+0x135/0x1bc
[649225.697989]  [<ffffffff8112799b>] ? do_sys_open+0x135/0x1bc
[649225.698211]  [<ffffffff81a89ca0>] ? entry_SYSCALL_64_fastpath+0x13/0x94
--snip--

Perhaps this is already fixed in later versions? Let me know if you
need any additional information.


Thanks,

Marc

^ permalink raw reply

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: NeilBrown @ 2017-04-04 21:52 UTC (permalink / raw)
  To: Michael Wang, linux-kernel@vger.kernel.org, linux-block,
	linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <2a6f8c30-616c-d6fe-1c3f-ab687c145cd7@profitbricks.com>

[-- Attachment #1: Type: text/plain, Size: 4628 bytes --]

On Tue, Apr 04 2017, Michael Wang wrote:

> On 04/04/2017 11:37 AM, NeilBrown wrote:
>> On Tue, Apr 04 2017, Michael Wang wrote:
> [snip]
>>>>
>>>> If sync_request_write() is using a bio that has already been used, it
>>>> should call bio_reset() and fill in the details again.
>>>> However I don't see how that would happen.
>>>> Can you give specific details on the situation that triggers the bug?
>>>
>>> We have storage side mapping lv through scst to server, on server side
>>> we assemble them into multipath device, and then assemble these dm into
>>> two raid1.
>>>
>>> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
>>> side we unmap all the lv (could during mkfs or fio), then on server side
>>> we hit the BUG (reproducible).
>> 
>> So I assume the initial resync is still happening at this point?
>> And you unmap *all* the lv's so you expect IO to fail?
>> I can see that the code would behave strangely if you have a
>> bad-block-list configured (which is the default).
>> Do you have a bbl?  If you create the array without the bbl, does it
>> still crash?
>
> The resync is at least happen concurrently in this case, we try
> to simulate the case that all the connections dropped, the IO do
> failed, also bunch of kernel log like:
>
> md: super_written gets error=-5
> blk_update_request: I/O error, dev dm-3, sector 64184
> md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848
>
> we expect that to happen, but server should not crash on BUG.
>
> And we haven't done any thing special regarding bbl, the bad_blocks
> in sysfs are all empty.
>
>> 
>>>
>>> The path of bio was confirmed by add tracing, it is reused in sync_request_write()
>>> with 'bi_next' once chained inside blk_attempt_plug_merge().
>> 
>> I still don't see why it is re-used.
>> I assume you didn't explicitly ask for a check/repair (i.e. didn't write
>> to .../md/sync_action at all?).  In that case MD_RECOVERY_REQUESTED is
>> not set.
>
> Just unmap lv on storage side, no operation on server side.
>
>> So sync_request() sends only one bio to generic_make_request():
>>    r1_bio->bios[r1_bio->read_disk];
>> 
>> then sync_request_write() *doesn't* send that bio again, but does send
>> all the others.
>> 
>> So where does it reuse a bio?
>
> If that's the design then it would be strange... the log do showing the path
> of that bio go through sync_request(), will do more investigation.
>
>> 
>>>
>>> We also tried to reset the bi_next inside sync_request_write() before
>>> generic_make_request() which also works.
>>>
>>> The testing was done with 4.4, but we found upstream also left bi_next
>>> chained after done in request, thus we post this RFC.
>>>
>>> Regarding raid1, we haven't found the place on path where the bio was
>>> reset... where does it supposed to be?
>> 
>> I'm not sure what you mean.
>> We only reset bios when they are being reused.
>> One place is in process_checks() where bio_reset() is called before
>> filling in all the details.
>> 
>> 
>> Maybe, in sync_request_write(), before
>> 
>> 	wbio->bi_rw = WRITE;
>> 
>> add something like
>>   if (wbio->bi_next)
>>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>           i, r1_bio->read_disk, wbio->bi_end_io);
>> 
>> that might help narrow down what is happening.
>
> Just triggered again in 4.4, dmesg like:
>
> [  399.240230] md: super_written gets error=-5
> [  399.240286] md: super_written gets error=-5
> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]

It is very peculiar for r1_bio->bios[r1_bio->read_disk].bi_end_io to be
end_sync_write.
I can only see this happening when sync_request_write() assigns that,
and this printk is before there.
That seems to suggest that sync_request_write() is being performed on
the same r1_bio twice, which is also very peculiar.

I might have to try to reproduce this myself and see what is happening.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [RFC PATCH] raid1: reset 'bi_next' before reuse the bio
From: NeilBrown @ 2017-04-04 22:17 UTC (permalink / raw)
  To: Michael Wang, linux-raid, linux-kernel@vger.kernel.org
  Cc: Shaohua Li, Jinpu Wang
In-Reply-To: <dcd33b53-5c6f-3ebc-8c07-04f0c0372796@profitbricks.com>

[-- Attachment #1: Type: text/plain, Size: 2653 bytes --]

On Tue, Apr 04 2017, Michael Wang wrote:

> During the testing we found the sync read bio can go through
> path:
>
>   md_do_sync()
>     sync_request()
>       generic_make_request()
>         blk_queue_bio()
>           blk_attempt_plug_merge()
>             bio->bi_next CHAINED HERE
>
>   ...
>
>   raid1d()
>     sync_request_write()
>       fix_sync_read_error()
>         if FailFast && Faulty
>           bio->bi_end_io = end_sync_write
>       generic_make_request()
>         BUG_ON(bio->bi_next)
>
> This need to meet the conditions:
>   * bio once merged
>   * read disk have FailFast enabled
>   * read disk is Faulty
>
> And since the block layer won't reset the 'bi_next' after bio
> is done inside request, we hit the BUG like that.
>
> This patch simply reset the bi_next before we reuse it.
>
> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
> ---
>  drivers/md/raid1.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>  		/* Don't try recovering from here - just fail it
>  		 * ... unless it is the last working device of course */
>  		md_error(mddev, rdev);
> -		if (test_bit(Faulty, &rdev->flags))
> +		if (test_bit(Faulty, &rdev->flags)) {
>  			/* Don't try to read from here, but make sure
>  			 * put_buf does it's thing
>  			 */
>  			bio->bi_end_io = end_sync_write;
> +			bio->bi_next = NULL;
> +		}
>  	}
>  
>  	while(sectors) {


Ah - I see what is happening now.  I was looking at the vanilla 4.4
code, which doesn't have the failfast changes.

I don't think your patch is correct though.  We really shouldn't be
re-using that bio, and setting bi_next to NULL just hides the bug.  It
doesn't fix it.
As the rdev is now Faulty, it doesn't make sense for
sync_request_write() to submit a write request to it.

Can you confirm that this works please.

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d2d8b8a5bd56..219f1e1f1d1d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2180,6 +2180,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
 		     (i == r1_bio->read_disk ||
 		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
 			continue;
+		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
+			continue;
 
 		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
 		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* 20047 linux-raid
From: een @ 2017-04-05  2:50 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: 05452250.zip --]
[-- Type: application/zip, Size: 3813 bytes --]

^ permalink raw reply

* Re: md-cluster Oops 4.9.13
From: Guoqing Jiang @ 2017-04-05  3:01 UTC (permalink / raw)
  To: Marc Smith, linux-raid
In-Reply-To: <CAHkw+Lcr1+sWsQPK_ijfTHwG+zxRYT+iNhWGdpYTNxVuQi8TOw@mail.gmail.com>



On 04/04/2017 10:06 PM, Marc Smith wrote:
> Hi,
>
> I encountered an oops this morning when stopping a MD array
> (md-cluster)... there were 4 md-cluster array started, and they were
> in the middle of a rebuild. I stopped the first one and then stopped
> the second one immediately after and got the oops, here is a
> transcript of what was on my terminal session:
>
> [root@brimstone-1b ~]# mdadm --stop /dev/md/array1
> mdadm: stopped /dev/md/array1
> [root@brimstone-1b ~]# mdadm --stop /dev/md/array2
>
> Message from syslogd@brimstone-1b at Tue Apr  4 09:54:40 2017 ...
> brimstone-1b kernel: [649162.174685] BUG: unable to handle kernel NULL
> pointer dereference at 0000000000000098
>
> Using Linux 4.9.13 and here is the output from the kernel messages:
>
> --snip--
> [649158.014731] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: leaving the
> lockspace group...
> [649158.015233] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: group event done 0 0
> [649158.015303] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8:
> release_lockspace final free
> [649158.015331] md: unbind<nvme0n1p1>
> [649158.042540] md: export_rdev(nvme0n1p1)
> [649158.042546] md: unbind<nvme1n1p1>
> [649158.048501] md: export_rdev(nvme1n1p1)
> [649161.759022] md127: detected capacity change from 1000068874240 to 0
> [649161.759025] md: md127 stopped.
> [649162.174685] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000098
> [649162.174727] IP: [<ffffffff81868b40>] recv_daemon+0x1e9/0x373

Looks like the recv_daemon is still running after stop array, commit
48df498 "md: move bitmap_destroy to the beginning of __md_stop"
ensure it won't happen.


[snip]

> Perhaps this is already fixed in later versions? Let me know if you
> need any additional information.

Could you pls try with the latest version? Please let me know if you
still see it, thanks.

Regards,
Guoqing


^ permalink raw reply

* [md PATCH 00/10] Simplify bio splitting and related code.
From: NeilBrown @ 2017-04-05  4:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

This is part of my little project to make bio splitting
in Linux uniform and dead-lock free, in a way that will mean
that we can get rid of all the bioset threads.

The basic approach is that when a bio needs to be split, we call
bio_split(), bio_chain() and then generic_make_request().
We then proceed to handle the remainder without further splitting.
Recent changes to generic_make_request() ensure that this will
be safe from deadlocks, providing each bioset is used only once
in the stack.

This leads to simpler code in various places.  In particular, the
splitting of bios that is needed to work around known bad blocks
is now much less complex.  There is only ever one r1bio per bio.

As you can see from
 10 files changed, 335 insertions(+), 540 deletions(-)
there is a net reduction in code.

Thanks,
NeilBrown

---

NeilBrown (10):
      md/raid1: simplify the splitting of requests.
      md/raid1: simplify alloc_behind_master_bio()
      Revert "block: introduce bio_copy_data_partial"
      md/raid1: simplify handle_read_error().
      md/raid1: factor out flush_bio_list()
      md/raid10: simplify the splitting of requests.
      md/raid10: simplify handle_read_error()
      md/raid5: make chunk_aligned_read() split bios more cleanly.
      md/linear: improve bio splitting.
      md/raid0: fix up bio splitting.

 block/bio.c         |   60 ++-------
 drivers/md/linear.c |   75 +++++------
 drivers/md/raid0.c  |   73 +++++------
 drivers/md/raid1.c  |  346 ++++++++++++++++++++-------------------------------
 drivers/md/raid1.h  |    2 
 drivers/md/raid10.c |  282 ++++++++++++++----------------------------
 drivers/md/raid10.h |    1 
 drivers/md/raid5.c  |   33 +++--
 drivers/md/raid5.h  |    1 
 include/linux/bio.h |    2 
 10 files changed, 335 insertions(+), 540 deletions(-)

--
Signature

^ permalink raw reply

* [md PATCH 01/10] md/raid1: simplify the splitting of requests.
From: NeilBrown @ 2017-04-05  4:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <149136485390.25893.1797855041954158826.stgit@noble>

raid1 currently splits requests in two different ways for
two different reasons.

First, bio_split() is used to ensure the bio fits within a
resync accounting region.
Second, multiple r1bios are allocated for each bio to handle
the possiblity of known bad blocks on some devices.

This can be simplified to just use bio_split() once, and not
use multiple r1bios.
We delay the split until we know a maximum bio size that can
be handled with a single r1bio, and then split the bio and
queue the remainder for later handling.

This avoids all loops inside raid1.c request handling.  Just
a single read, or a single set of writes, is submitted to
lower-level devices for each bio that comes from
generic_make_request().

When the bio needs to be split, generic_make_request() will
do the necessary looping and call md_make_request() multiple
times.

raid1_make_request() no longer queues request for raid1 to handle,
so we can remove that branch from the 'if'.

This patch also creates a new private bio_set
(conf->bio_split) for splitting bios.  Using fs_bio_set
is wrong, as it is meant to be used by filesystems, not
block devices.  Using it inside md can lead to deadlocks
under high memory pressure.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |  138 ++++++++++++++++++----------------------------------
 drivers/md/raid1.h |    2 +
 2 files changed, 51 insertions(+), 89 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d6723558fd8..83853f65462a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1202,7 +1202,8 @@ alloc_r1bio(struct mddev *mddev, struct bio *bio, sector_t sectors_handled)
 	return r1_bio;
 }
 
-static void raid1_read_request(struct mddev *mddev, struct bio *bio)
+static void raid1_read_request(struct mddev *mddev, struct bio *bio,
+			       int max_read_sectors)
 {
 	struct r1conf *conf = mddev->private;
 	struct raid1_info *mirror;
@@ -1211,7 +1212,6 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio)
 	struct bitmap *bitmap = mddev->bitmap;
 	const int op = bio_op(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
-	int sectors_handled;
 	int max_sectors;
 	int rdisk;
 
@@ -1222,12 +1222,12 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio)
 	wait_read_barrier(conf, bio->bi_iter.bi_sector);
 
 	r1_bio = alloc_r1bio(mddev, bio, 0);
+	r1_bio->sectors = max_read_sectors;
 
 	/*
 	 * make_request() can abort the operation when read-ahead is being
 	 * used and no empty request is available.
 	 */
-read_again:
 	rdisk = read_balance(conf, r1_bio, &max_sectors);
 
 	if (rdisk < 0) {
@@ -1247,11 +1247,20 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio)
 		wait_event(bitmap->behind_wait,
 			   atomic_read(&bitmap->behind_writes) == 0);
 	}
+
+	if (max_sectors < bio_sectors(bio)) {
+		struct bio *split = bio_split(bio, max_sectors,
+					      GFP_NOIO, conf->bio_split);
+		bio_chain(split, bio);
+		generic_make_request(bio);
+		bio = split;
+		r1_bio->master_bio = bio;
+		r1_bio->sectors = max_sectors;
+	}
+
 	r1_bio->read_disk = rdisk;
 
 	read_bio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
-	bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
-		 max_sectors);
 
 	r1_bio->bios[rdisk] = read_bio;
 
@@ -1270,30 +1279,11 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio)
 	                              read_bio, disk_devt(mddev->gendisk),
 	                              r1_bio->sector);
 
-	if (max_sectors < r1_bio->sectors) {
-		/*
-		 * could not read all from this device, so we will need another
-		 * r1_bio.
-		 */
-		sectors_handled = (r1_bio->sector + max_sectors
-				   - bio->bi_iter.bi_sector);
-		r1_bio->sectors = max_sectors;
-		bio_inc_remaining(bio);
-
-		/*
-		 * Cannot call generic_make_request directly as that will be
-		 * queued in __make_request and subsequent mempool_alloc might
-		 * block waiting for it.  So hand bio over to raid1d.
-		 */
-		reschedule_retry(r1_bio);
-
-		r1_bio = alloc_r1bio(mddev, bio, sectors_handled);
-		goto read_again;
-	} else
-		generic_make_request(read_bio);
+	generic_make_request(read_bio);
 }
 
-static void raid1_write_request(struct mddev *mddev, struct bio *bio)
+static void raid1_write_request(struct mddev *mddev, struct bio *bio,
+				int max_write_sectors)
 {
 	struct r1conf *conf = mddev->private;
 	struct r1bio *r1_bio;
@@ -1304,7 +1294,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 	struct blk_plug_cb *cb;
 	struct raid1_plug_cb *plug = NULL;
 	int first_clone;
-	int sectors_handled;
 	int max_sectors;
 	sector_t offset;
 
@@ -1345,6 +1334,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 	wait_barrier(conf, bio->bi_iter.bi_sector);
 
 	r1_bio = alloc_r1bio(mddev, bio, 0);
+	r1_bio->sectors = max_write_sectors;
 
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
@@ -1443,10 +1433,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 		goto retry_write;
 	}
 
-	if (max_sectors < r1_bio->sectors)
+	if (max_sectors < bio_sectors(bio)) {
+		struct bio *split = bio_split(bio, max_sectors,
+					      GFP_NOIO, conf->bio_split);
+		bio_chain(split, bio);
+		generic_make_request(bio);
+		bio = split;
+		r1_bio->master_bio = bio;
 		r1_bio->sectors = max_sectors;
-
-	sectors_handled = r1_bio->sector + max_sectors - bio->bi_iter.bi_sector;
+	}
 
 	atomic_set(&r1_bio->remaining, 1);
 	atomic_set(&r1_bio->behind_remaining, 0);
@@ -1470,7 +1465,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 			     < mddev->bitmap_info.max_write_behind) &&
 			    !waitqueue_active(&bitmap->behind_wait)) {
 				mbio = alloc_behind_master_bio(r1_bio, bio,
-							       offset << 9,
+							       0,
 							       max_sectors << 9);
 			}
 
@@ -1486,10 +1481,8 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 				mbio = bio_clone_fast(r1_bio->behind_master_bio,
 						      GFP_NOIO,
 						      mddev->bio_set);
-			else {
+			else
 				mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
-				bio_trim(mbio, offset, max_sectors);
-			}
 		}
 
 		if (r1_bio->behind_master_bio) {
@@ -1536,19 +1529,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 		if (!plug)
 			md_wakeup_thread(mddev->thread);
 	}
-	/* Mustn't call r1_bio_write_done before this next test,
-	 * as it could result in the bio being freed.
-	 */
-	if (sectors_handled < bio_sectors(bio)) {
-		/* We need another r1_bio, which must be counted */
-		sector_t sect = bio->bi_iter.bi_sector + sectors_handled;
-
-		inc_pending(conf, sect);
-		bio_inc_remaining(bio);
-		r1_bio_write_done(r1_bio);
-		r1_bio = alloc_r1bio(mddev, bio, sectors_handled);
-		goto retry_write;
-	}
 
 	r1_bio_write_done(r1_bio);
 
@@ -1558,7 +1538,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 
 static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 {
-	struct bio *split;
 	sector_t sectors;
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
@@ -1566,43 +1545,19 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
-	/* if bio exceeds barrier unit boundary, split it */
-	do {
-		sectors = align_to_barrier_unit_end(
-				bio->bi_iter.bi_sector, bio_sectors(bio));
-		if (sectors < bio_sectors(bio)) {
-			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
-
-		if (bio_data_dir(split) == READ) {
-			raid1_read_request(mddev, split);
+	/* There is a limit to the maximum size, but
+	 * the read/write handler might find a lower limit
+	 * due to bad blocks.  To avoid multiple splits,
+	 * we pass the maximum number of sectors down
+	 * and let the lower level perform the split.
+	 */
+	sectors = align_to_barrier_unit_end(
+		bio->bi_iter.bi_sector, bio_sectors(bio));
 
-			/*
-			 * If a bio is splitted, the first part of bio will
-			 * pass barrier but the bio is queued in
-			 * current->bio_list (see generic_make_request). If
-			 * there is a raise_barrier() called here, the second
-			 * part of bio can't pass barrier. But since the first
-			 * part bio isn't dispatched to underlaying disks yet,
-			 * the barrier is never released, hence raise_barrier
-			 * will alays wait. We have a deadlock.
-			 * Note, this only happens in read path. For write
-			 * path, the first part of bio is dispatched in a
-			 * schedule() call (because of blk plug) or offloaded
-			 * to raid10d.
-			 * Quitting from the function immediately can change
-			 * the bio order queued in bio_list and avoid the deadlock.
-			 */
-			if (split != bio) {
-				generic_make_request(bio);
-				break;
-			}
-		} else
-			raid1_write_request(mddev, split);
-	} while (split != bio);
+	if (bio_data_dir(bio) == READ)
+		raid1_read_request(mddev, bio, sectors);
+	else
+		raid1_write_request(mddev, bio, sectors);
 }
 
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
@@ -2645,10 +2600,7 @@ static void raid1d(struct md_thread *thread)
 		else if (test_bit(R1BIO_ReadError, &r1_bio->state))
 			handle_read_error(conf, r1_bio);
 		else
-			/* just a partial read to be scheduled from separate
-			 * context
-			 */
-			generic_make_request(r1_bio->bios[r1_bio->read_disk]);
+			WARN_ON_ONCE(1);
 
 		cond_resched();
 		if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
@@ -3036,6 +2988,10 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	if (!conf->r1bio_pool)
 		goto abort;
 
+	conf->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	if (!conf->bio_split)
+		goto abort;
+
 	conf->poolinfo->mddev = mddev;
 
 	err = -EINVAL;
@@ -3117,6 +3073,8 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		kfree(conf->nr_waiting);
 		kfree(conf->nr_queued);
 		kfree(conf->barrier);
+		if (conf->bio_split)
+			bioset_free(conf->bio_split);
 		kfree(conf);
 	}
 	return ERR_PTR(err);
@@ -3222,6 +3180,8 @@ static void raid1_free(struct mddev *mddev, void *priv)
 	kfree(conf->nr_waiting);
 	kfree(conf->nr_queued);
 	kfree(conf->barrier);
+	if (conf->bio_split)
+		bioset_free(conf->bio_split);
 	kfree(conf);
 }
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 4271cd7ac2de..b0ab0da6e39e 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -107,6 +107,8 @@ struct r1conf {
 	mempool_t		*r1bio_pool;
 	mempool_t		*r1buf_pool;
 
+	struct bio_set		*bio_split;
+
 	/* temporary buffer to synchronous IO when attempting to repair
 	 * a read error.
 	 */



^ permalink raw reply related

* [md PATCH 02/10] md/raid1: simplify alloc_behind_master_bio()
From: NeilBrown @ 2017-04-05  4:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <149136485390.25893.1797855041954158826.stgit@noble>

Now that we always always pass an offset of 0 and a size
that matches the bio to alloc_behind_master_bio(),
we can remove the offset/size args and simplify the code.

We could probably remove bio_copy_data_partial() too.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 83853f65462a..06a13010f1b7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1091,9 +1091,9 @@ static void unfreeze_array(struct r1conf *conf)
 }
 
 static struct bio *alloc_behind_master_bio(struct r1bio *r1_bio,
-					   struct bio *bio,
-					   int offset, int size)
+					   struct bio *bio)
 {
+	int size = bio->bi_iter.bi_size;
 	unsigned vcnt = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	int i = 0;
 	struct bio *behind_bio = NULL;
@@ -1120,8 +1120,7 @@ static struct bio *alloc_behind_master_bio(struct r1bio *r1_bio,
 		i++;
 	}
 
-	bio_copy_data_partial(behind_bio, bio, offset,
-			      behind_bio->bi_iter.bi_size);
+	bio_copy_data(behind_bio, bio);
 skip_copy:
 	r1_bio->behind_master_bio = behind_bio;;
 	set_bit(R1BIO_BehindIO, &r1_bio->state);
@@ -1464,9 +1463,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 			    (atomic_read(&bitmap->behind_writes)
 			     < mddev->bitmap_info.max_write_behind) &&
 			    !waitqueue_active(&bitmap->behind_wait)) {
-				mbio = alloc_behind_master_bio(r1_bio, bio,
-							       0,
-							       max_sectors << 9);
+				mbio = alloc_behind_master_bio(r1_bio, bio);
 			}
 
 			bitmap_startwrite(bitmap, r1_bio->sector,



^ permalink raw reply related

* [md PATCH 03/10] Revert "block: introduce bio_copy_data_partial"
From: NeilBrown @ 2017-04-05  4:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <149136485390.25893.1797855041954158826.stgit@noble>

This reverts commit 6f8802852f7e58a12177a86179803b9efaad98e2.
bio_copy_data_partial() is no longer needed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c         |   60 +++++++++++----------------------------------------
 include/linux/bio.h |    2 --
 2 files changed, 13 insertions(+), 49 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 036435995c55..12c2837c4277 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -990,8 +990,19 @@ int bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(bio_alloc_pages);
 
-static void __bio_copy_data(struct bio *dst, struct bio *src,
-			    int offset, int size)
+/**
+ * bio_copy_data - copy contents of data buffers from one chain of bios to
+ * another
+ * @src: source bio list
+ * @dst: destination bio list
+ *
+ * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
+ * @src and @dst as linked lists of bios.
+ *
+ * Stops when it reaches the end of either @src or @dst - that is, copies
+ * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
+ */
+void bio_copy_data(struct bio *dst, struct bio *src)
 {
 	struct bvec_iter src_iter, dst_iter;
 	struct bio_vec src_bv, dst_bv;
@@ -1001,12 +1012,6 @@ static void __bio_copy_data(struct bio *dst, struct bio *src,
 	src_iter = src->bi_iter;
 	dst_iter = dst->bi_iter;
 
-	/* for supporting partial copy */
-	if (offset || size != src->bi_iter.bi_size) {
-		bio_advance_iter(src, &src_iter, offset);
-		src_iter.bi_size = size;
-	}
-
 	while (1) {
 		if (!src_iter.bi_size) {
 			src = src->bi_next;
@@ -1043,47 +1048,8 @@ static void __bio_copy_data(struct bio *dst, struct bio *src,
 		bio_advance_iter(dst, &dst_iter, bytes);
 	}
 }
-
-/**
- * bio_copy_data - copy contents of data buffers from one chain of bios to
- * another
- * @src: source bio list
- * @dst: destination bio list
- *
- * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
- * @src and @dst as linked lists of bios.
- *
- * Stops when it reaches the end of either @src or @dst - that is, copies
- * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
- */
-void bio_copy_data(struct bio *dst, struct bio *src)
-{
-	__bio_copy_data(dst, src, 0, src->bi_iter.bi_size);
-}
 EXPORT_SYMBOL(bio_copy_data);
 
-/**
- * bio_copy_data_partial - copy partial contents of data buffers from one
- * chain of bios to another
- * @dst: destination bio list
- * @src: source bio list
- * @offset: starting copy from the offset
- * @size: how many bytes to copy
- *
- * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
- * @src and @dst as linked lists of bios.
- *
- * Stops when it reaches the end of either @src or @dst - that is, copies
- * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
- */
-void bio_copy_data_partial(struct bio *dst, struct bio *src,
-			   int offset, int size)
-{
-	__bio_copy_data(dst, src, offset, size);
-
-}
-EXPORT_SYMBOL(bio_copy_data_partial);
-
 struct bio_map_data {
 	int is_our_pages;
 	struct iov_iter iter;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index fafef6343d1b..7cf8a6c70a3f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -461,8 +461,6 @@ static inline void bio_flush_dcache_pages(struct bio *bi)
 #endif
 
 extern void bio_copy_data(struct bio *dst, struct bio *src);
-extern void bio_copy_data_partial(struct bio *dst, struct bio *src,
-				  int offset, int size);
 extern int bio_alloc_pages(struct bio *bio, gfp_t gfp);
 extern void bio_free_pages(struct bio *bio);
 



^ permalink raw reply related

* [md PATCH 04/10] md/raid1: simplify handle_read_error().
From: NeilBrown @ 2017-04-05  4:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <149136485390.25893.1797855041954158826.stgit@noble>

handle_read_error() duplicates a lot of the work that raid1_read_request()
does, so it makes sense to just use that function.
This doesn't quite work as handle_read_error() relies on the same r1bio
being re-used so that, in the case of a read-only array, setting
IO_BLOCKED in r1bio->bios[] ensures read_balance() won't re-use
that device.
So we need to allow a r1bio to be passed to raid1_read_request(), and to
have that function mostly initialise the r1bio, but leave the bios[]
array untouched.

Two parts of handle_read_error() that need to be preserved are the warning
message it prints, so they are conditionally added to raid1_read_request().

Note that this highlights a minor bug on alloc_r1bio().  It doesn't
initalise the bios[] array, so it is possible that old content is there,
which might cause read_balance() to ignore some devices with no good reason.

With this change, we no longer need inc_pending(), or the sectors_handled
arg to alloc_r1bio().


As handle_read_error() is called from raid1d() and allocates memory,
there is tiny chance of a deadlock.  All element of various pools
could be queued waiting for raid1 to handle them, and there may be no
extra memory free.
Achieving guaranteed forward progress would probably require a second
thread and another mempool.  Instead of that complexity, add
__GFP_HIGH to any allocations when read1_read_request() is called
from raid1d.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |  138 ++++++++++++++++++++++------------------------------
 1 file changed, 58 insertions(+), 80 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 06a13010f1b7..d15268fff052 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -988,16 +988,6 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void inc_pending(struct r1conf *conf, sector_t bi_sector)
-{
-	/* The current request requires multiple r1_bio, so
-	 * we need to increment the pending count, and the corresponding
-	 * window count.
-	 */
-	int idx = sector_to_idx(bi_sector);
-	atomic_inc(&conf->nr_pending[idx]);
-}
-
 static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	int idx = sector_to_idx(sector_nr);
@@ -1184,35 +1174,58 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
+static void init_r1bio(struct r1bio *r1_bio, struct mddev *mddev, struct bio *bio)
+{
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio);
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector;
+}
+
 static inline struct r1bio *
-alloc_r1bio(struct mddev *mddev, struct bio *bio, sector_t sectors_handled)
+alloc_r1bio(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
 	struct r1bio *r1_bio;
 
 	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
-
-	r1_bio->master_bio = bio;
-	r1_bio->sectors = bio_sectors(bio) - sectors_handled;
-	r1_bio->state = 0;
-	r1_bio->mddev = mddev;
-	r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
-
+	/* Ensure no bio records IO_BLOCKED */
+	memset(r1_bio->bios, 0, conf->raid_disks * sizeof(r1_bio->bios[0]));
+	init_r1bio(r1_bio, mddev, bio);
 	return r1_bio;
 }
 
 static void raid1_read_request(struct mddev *mddev, struct bio *bio,
-			       int max_read_sectors)
+			       int max_read_sectors, struct r1bio *r1_bio)
 {
 	struct r1conf *conf = mddev->private;
 	struct raid1_info *mirror;
-	struct r1bio *r1_bio;
 	struct bio *read_bio;
 	struct bitmap *bitmap = mddev->bitmap;
 	const int op = bio_op(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
 	int max_sectors;
 	int rdisk;
+	bool print_msg = !!r1_bio;
+	char b[BDEVNAME_SIZE];
+	/* If r1_bio is set, we are blocking the raid1d thread
+	 * so there is a tiny risk of deadlock.  So ask for
+	 * emergency memory if needed.
+	 */
+	gfp_t gfp = r1_bio ? (GFP_NOIO | __GFP_HIGH) : GFP_NOIO;
+
+	if (print_msg) {
+		/* Need to get the block device name carefully */
+		struct md_rdev *rdev;
+		rcu_read_lock();
+		rdev = rcu_dereference(conf->mirrors[r1_bio->read_disk].rdev);
+		if (rdev)
+			bdevname(rdev->bdev, b);
+		else
+			strcpy(b, "???");
+		rcu_read_unlock();
+	}
 
 	/*
 	 * Still need barrier for READ in case that whole
@@ -1220,7 +1233,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	 */
 	wait_read_barrier(conf, bio->bi_iter.bi_sector);
 
-	r1_bio = alloc_r1bio(mddev, bio, 0);
+	if (!r1_bio)
+		r1_bio = alloc_r1bio(mddev, bio);
+	else
+		init_r1bio(r1_bio, mddev, bio);
 	r1_bio->sectors = max_read_sectors;
 
 	/*
@@ -1231,11 +1247,23 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 
 	if (rdisk < 0) {
 		/* couldn't find anywhere to read from */
+		if (print_msg) {
+			pr_crit_ratelimited("md/raid1:%s: %s: unrecoverable I/O read error for block %llu\n",
+					    mdname(mddev),
+					    b,
+					    (unsigned long long)r1_bio->sector);
+		}
 		raid_end_bio_io(r1_bio);
 		return;
 	}
 	mirror = conf->mirrors + rdisk;
 
+	if (print_msg)
+		pr_info_ratelimited("md/raid1:%s: redirecting sector %llu to other mirror: %s\n",
+				    mdname(mddev),
+				    (unsigned long long)r1_bio->sector,
+				    bdevname(mirror->rdev->bdev, b));
+
 	if (test_bit(WriteMostly, &mirror->rdev->flags) &&
 	    bitmap) {
 		/*
@@ -1249,7 +1277,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 
 	if (max_sectors < bio_sectors(bio)) {
 		struct bio *split = bio_split(bio, max_sectors,
-					      GFP_NOIO, conf->bio_split);
+					      gfp, conf->bio_split);
 		bio_chain(split, bio);
 		generic_make_request(bio);
 		bio = split;
@@ -1259,7 +1287,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 
 	r1_bio->read_disk = rdisk;
 
-	read_bio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
+	read_bio = bio_clone_fast(bio, gfp, mddev->bio_set);
 
 	r1_bio->bios[rdisk] = read_bio;
 
@@ -1332,7 +1360,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	}
 	wait_barrier(conf, bio->bi_iter.bi_sector);
 
-	r1_bio = alloc_r1bio(mddev, bio, 0);
+	r1_bio = alloc_r1bio(mddev, bio);
 	r1_bio->sectors = max_write_sectors;
 
 	if (conf->pending_count >= max_queued_requests) {
@@ -1552,7 +1580,7 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 		bio->bi_iter.bi_sector, bio_sectors(bio));
 
 	if (bio_data_dir(bio) == READ)
-		raid1_read_request(mddev, bio, sectors);
+		raid1_read_request(mddev, bio, sectors, NULL);
 	else
 		raid1_write_request(mddev, bio, sectors);
 }
@@ -2442,11 +2470,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 
 static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 {
-	int disk;
-	int max_sectors;
 	struct mddev *mddev = conf->mddev;
 	struct bio *bio;
-	char b[BDEVNAME_SIZE];
 	struct md_rdev *rdev;
 	dev_t bio_dev;
 	sector_t bio_sector;
@@ -2462,7 +2487,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	 */
 
 	bio = r1_bio->bios[r1_bio->read_disk];
-	bdevname(bio->bi_bdev, b);
 	bio_dev = bio->bi_bdev->bd_dev;
 	bio_sector = conf->mirrors[r1_bio->read_disk].rdev->data_offset + r1_bio->sector;
 	bio_put(bio);
@@ -2480,58 +2504,12 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	}
 
 	rdev_dec_pending(rdev, conf->mddev);
+	allow_barrier(conf, r1_bio->sector);
+	bio = r1_bio->master_bio;
 
-read_more:
-	disk = read_balance(conf, r1_bio, &max_sectors);
-	if (disk == -1) {
-		pr_crit_ratelimited("md/raid1:%s: %s: unrecoverable I/O read error for block %llu\n",
-				    mdname(mddev), b, (unsigned long long)r1_bio->sector);
-		raid_end_bio_io(r1_bio);
-	} else {
-		const unsigned long do_sync
-			= r1_bio->master_bio->bi_opf & REQ_SYNC;
-		r1_bio->read_disk = disk;
-		bio = bio_clone_fast(r1_bio->master_bio, GFP_NOIO,
-				     mddev->bio_set);
-		bio_trim(bio, r1_bio->sector - bio->bi_iter.bi_sector,
-			 max_sectors);
-		r1_bio->bios[r1_bio->read_disk] = bio;
-		rdev = conf->mirrors[disk].rdev;
-		pr_info_ratelimited("md/raid1:%s: redirecting sector %llu to other mirror: %s\n",
-				    mdname(mddev),
-				    (unsigned long long)r1_bio->sector,
-				    bdevname(rdev->bdev, b));
-		bio->bi_iter.bi_sector = r1_bio->sector + rdev->data_offset;
-		bio->bi_bdev = rdev->bdev;
-		bio->bi_end_io = raid1_end_read_request;
-		bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
-		if (test_bit(FailFast, &rdev->flags) &&
-		    test_bit(R1BIO_FailFast, &r1_bio->state))
-			bio->bi_opf |= MD_FAILFAST;
-		bio->bi_private = r1_bio;
-		if (max_sectors < r1_bio->sectors) {
-			/* Drat - have to split this up more */
-			struct bio *mbio = r1_bio->master_bio;
-			int sectors_handled = (r1_bio->sector + max_sectors
-					       - mbio->bi_iter.bi_sector);
-			r1_bio->sectors = max_sectors;
-			bio_inc_remaining(mbio);
-			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
-					      bio, bio_dev, bio_sector);
-			generic_make_request(bio);
-			bio = NULL;
-
-			r1_bio = alloc_r1bio(mddev, mbio, sectors_handled);
-			set_bit(R1BIO_ReadError, &r1_bio->state);
-			inc_pending(conf, r1_bio->sector);
-
-			goto read_more;
-		} else {
-			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
-					      bio, bio_dev, bio_sector);
-			generic_make_request(bio);
-		}
-	}
+	/* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
+	r1_bio->state = 0;
+	raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
 }
 
 static void raid1d(struct md_thread *thread)



^ permalink raw reply related

* [md PATCH 05/10] md/raid1: factor out flush_bio_list()
From: NeilBrown @ 2017-04-05  4:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <149136485390.25893.1797855041954158826.stgit@noble>

flush_pending_writes() and raid1_unplug() each contain identical
copies of a fairly large slab of code.  So factor that out into
new flush_bio_list() to simplify maintenance.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |   67 +++++++++++++++++++++-------------------------------
 1 file changed, 27 insertions(+), 40 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d15268fff052..a70283753a35 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -787,6 +787,31 @@ static int raid1_congested(struct mddev *mddev, int bits)
 	return ret;
 }
 
+static void flush_bio_list(struct r1conf *conf, struct bio *bio)
+{
+	/* flush any pending bitmap writes to
+	 * disk before proceeding w/ I/O */
+	bitmap_unplug(conf->mddev->bitmap);
+	wake_up(&conf->wait_barrier);
+
+	while (bio) { /* submit pending writes */
+		struct bio *next = bio->bi_next;
+		struct md_rdev *rdev = (void*)bio->bi_bdev;
+		bio->bi_next = NULL;
+		bio->bi_bdev = rdev->bdev;
+		if (test_bit(Faulty, &rdev->flags)) {
+			bio->bi_error = -EIO;
+			bio_endio(bio);
+		} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
+				    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(bio);
+		else
+			generic_make_request(bio);
+		bio = next;
+	}
+}
+
 static void flush_pending_writes(struct r1conf *conf)
 {
 	/* Any writes that have been queued but are awaiting
@@ -799,27 +824,7 @@ static void flush_pending_writes(struct r1conf *conf)
 		bio = bio_list_get(&conf->pending_bio_list);
 		conf->pending_count = 0;
 		spin_unlock_irq(&conf->device_lock);
-		/* flush any pending bitmap writes to
-		 * disk before proceeding w/ I/O */
-		bitmap_unplug(conf->mddev->bitmap);
-		wake_up(&conf->wait_barrier);
-
-		while (bio) { /* submit pending writes */
-			struct bio *next = bio->bi_next;
-			struct md_rdev *rdev = (void*)bio->bi_bdev;
-			bio->bi_next = NULL;
-			bio->bi_bdev = rdev->bdev;
-			if (test_bit(Faulty, &rdev->flags)) {
-				bio->bi_error = -EIO;
-				bio_endio(bio);
-			} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
-					    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
-				/* Just ignore it */
-				bio_endio(bio);
-			else
-				generic_make_request(bio);
-			bio = next;
-		}
+		flush_bio_list(conf, bio);
 	} else
 		spin_unlock_irq(&conf->device_lock);
 }
@@ -1152,25 +1157,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
 	/* we aren't scheduling, so we can do the write-out directly. */
 	bio = bio_list_get(&plug->pending);
-	bitmap_unplug(mddev->bitmap);
-	wake_up(&conf->wait_barrier);
-
-	while (bio) { /* submit pending writes */
-		struct bio *next = bio->bi_next;
-		struct md_rdev *rdev = (void*)bio->bi_bdev;
-		bio->bi_next = NULL;
-		bio->bi_bdev = rdev->bdev;
-		if (test_bit(Faulty, &rdev->flags)) {
-			bio->bi_error = -EIO;
-			bio_endio(bio);
-		} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
-				    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
-			/* Just ignore it */
-			bio_endio(bio);
-		else
-			generic_make_request(bio);
-		bio = next;
-	}
+	flush_bio_list(conf, bio);
 	kfree(plug);
 }
 



^ permalink raw reply related

* [md PATCH 06/10] md/raid10: simplify the splitting of requests.
From: NeilBrown @ 2017-04-05  4:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <149136485390.25893.1797855041954158826.stgit@noble>

raid10 splits requests in two different ways for two different
reasons.

First, bio_split() is used to ensure the bio fits with a chunk.
Second, multiple r10bio structures are allocated to represent the
different sections that need to go to different devices, to avoid
known bad blocks.

This can be simplified to just use bio_split() once, and not to use
multiple r10bios.
We delay the split until we know a maximum bio size that can
be handled with a single r10bio, and then split the bio and queue
the remainder for later handling.

As with raid1, we allocate a new bio_set to help with the splitting.
It is not correct to use fs_bio_set in a device driver.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid10.c |  164 ++++++++++++++++-----------------------------------
 drivers/md/raid10.h |    1 
 2 files changed, 51 insertions(+), 114 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0f13d57ef646..ff02c058dbdd 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1127,7 +1127,6 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 	struct bio *read_bio;
 	const int op = bio_op(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
-	int sectors_handled;
 	int max_sectors;
 	sector_t sectors;
 	struct md_rdev *rdev;
@@ -1140,7 +1139,7 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 	 */
 	wait_barrier(conf);
 
-	sectors = bio_sectors(bio);
+	sectors = r10_bio->sectors;
 	while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 	    bio->bi_iter.bi_sector < conf->reshape_progress &&
 	    bio->bi_iter.bi_sector + sectors > conf->reshape_progress) {
@@ -1157,17 +1156,23 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 		wait_barrier(conf);
 	}
 
-read_again:
 	rdev = read_balance(conf, r10_bio, &max_sectors);
 	if (!rdev) {
 		raid_end_bio_io(r10_bio);
 		return;
 	}
+	if (max_sectors < bio_sectors(bio)) {
+		struct bio *split = bio_split(bio, max_sectors,
+					      GFP_NOIO, conf->bio_split);
+		bio_chain(split, bio);
+		generic_make_request(bio);
+		bio = split;
+		r10_bio->master_bio = bio;
+		r10_bio->sectors = max_sectors;
+	}
 	slot = r10_bio->read_slot;
 
 	read_bio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
-	bio_trim(read_bio, r10_bio->sector - bio->bi_iter.bi_sector,
-		 max_sectors);
 
 	r10_bio->devs[slot].bio = read_bio;
 	r10_bio->devs[slot].rdev = rdev;
@@ -1186,40 +1191,13 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 	        trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
 	                              read_bio, disk_devt(mddev->gendisk),
 	                              r10_bio->sector);
-	if (max_sectors < r10_bio->sectors) {
-		/*
-		 * Could not read all from this device, so we will need another
-		 * r10_bio.
-		 */
-		sectors_handled = (r10_bio->sector + max_sectors
-				   - bio->bi_iter.bi_sector);
-		r10_bio->sectors = max_sectors;
-		inc_pending(conf);
-		bio_inc_remaining(bio);
-		/*
-		 * Cannot call generic_make_request directly as that will be
-		 * queued in __generic_make_request and subsequent
-		 * mempool_alloc might block waiting for it.  so hand bio over
-		 * to raid10d.
-		 */
-		reschedule_retry(r10_bio);
-
-		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
-
-		r10_bio->master_bio = bio;
-		r10_bio->sectors = bio_sectors(bio) - sectors_handled;
-		r10_bio->state = 0;
-		r10_bio->mddev = mddev;
-		r10_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
-		goto read_again;
-	} else
-		generic_make_request(read_bio);
+	generic_make_request(read_bio);
 	return;
 }
 
 static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 				  struct bio *bio, bool replacement,
-				  int n_copy, int max_sectors)
+				  int n_copy)
 {
 	const int op = bio_op(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
@@ -1243,7 +1221,6 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 		rdev = conf->mirrors[devnum].rdev;
 
 	mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
-	bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector, max_sectors);
 	if (replacement)
 		r10_bio->devs[n_copy].repl_bio = mbio;
 	else
@@ -1294,7 +1271,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	int i;
 	struct md_rdev *blocked_rdev;
 	sector_t sectors;
-	int sectors_handled;
 	int max_sectors;
 
 	md_write_start(mddev, bio);
@@ -1306,7 +1282,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	 */
 	wait_barrier(conf);
 
-	sectors = bio_sectors(bio);
+	sectors = r10_bio->sectors;
 	while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 	    bio->bi_iter.bi_sector < conf->reshape_progress &&
 	    bio->bi_iter.bi_sector + sectors > conf->reshape_progress) {
@@ -1476,44 +1452,29 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 
 	if (max_sectors < r10_bio->sectors)
 		r10_bio->sectors = max_sectors;
-	sectors_handled = r10_bio->sector + max_sectors -
-		bio->bi_iter.bi_sector;
+
+	if (r10_bio->sectors < bio_sectors(bio)) {
+		struct bio *split = bio_split(bio, r10_bio->sectors,
+					      GFP_NOIO, conf->bio_split);
+		bio_chain(split, bio);
+		generic_make_request(bio);
+		bio = split;
+		r10_bio->master_bio = bio;
+	}
 
 	atomic_set(&r10_bio->remaining, 1);
 	bitmap_startwrite(mddev->bitmap, r10_bio->sector, r10_bio->sectors, 0);
 
 	for (i = 0; i < conf->copies; i++) {
 		if (r10_bio->devs[i].bio)
-			raid10_write_one_disk(mddev, r10_bio, bio, false,
-					      i, max_sectors);
+			raid10_write_one_disk(mddev, r10_bio, bio, false, i);
 		if (r10_bio->devs[i].repl_bio)
-			raid10_write_one_disk(mddev, r10_bio, bio, true,
-					      i, max_sectors);
-	}
-
-	/* Don't remove the bias on 'remaining' (one_write_done) until
-	 * after checking if we need to go around again.
-	 */
-
-	if (sectors_handled < bio_sectors(bio)) {
-		/* We need another r10_bio and it needs to be counted */
-		inc_pending(conf);
-		bio_inc_remaining(bio);
-		one_write_done(r10_bio);
-		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
-
-		r10_bio->master_bio = bio;
-		r10_bio->sectors = bio_sectors(bio) - sectors_handled;
-
-		r10_bio->mddev = mddev;
-		r10_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
-		r10_bio->state = 0;
-		goto retry_write;
+			raid10_write_one_disk(mddev, r10_bio, bio, true, i);
 	}
 	one_write_done(r10_bio);
 }
 
-static void __make_request(struct mddev *mddev, struct bio *bio)
+static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 {
 	struct r10conf *conf = mddev->private;
 	struct r10bio *r10_bio;
@@ -1521,7 +1482,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 	r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
 	r10_bio->master_bio = bio;
-	r10_bio->sectors = bio_sectors(bio);
+	r10_bio->sectors = sectors;
 
 	r10_bio->mddev = mddev;
 	r10_bio->sector = bio->bi_iter.bi_sector;
@@ -1538,54 +1499,26 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 	struct r10conf *conf = mddev->private;
 	sector_t chunk_mask = (conf->geo.chunk_mask & conf->prev.chunk_mask);
 	int chunk_sects = chunk_mask + 1;
-
-	struct bio *split;
+	int sectors = bio_sectors(bio);
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
 		md_flush_request(mddev, bio);
 		return;
 	}
 
-	do {
-
-		/*
-		 * If this request crosses a chunk boundary, we need to split
-		 * it.
-		 */
-		if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
-			     bio_sectors(bio) > chunk_sects
-			     && (conf->geo.near_copies < conf->geo.raid_disks
-				 || conf->prev.near_copies <
-				 conf->prev.raid_disks))) {
-			split = bio_split(bio, chunk_sects -
-					  (bio->bi_iter.bi_sector &
-					   (chunk_sects - 1)),
-					  GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
-
-		/*
-		 * If a bio is splitted, the first part of bio will pass
-		 * barrier but the bio is queued in current->bio_list (see
-		 * generic_make_request). If there is a raise_barrier() called
-		 * here, the second part of bio can't pass barrier. But since
-		 * the first part bio isn't dispatched to underlaying disks
-		 * yet, the barrier is never released, hence raise_barrier will
-		 * alays wait. We have a deadlock.
-		 * Note, this only happens in read path. For write path, the
-		 * first part of bio is dispatched in a schedule() call
-		 * (because of blk plug) or offloaded to raid10d.
-		 * Quitting from the function immediately can change the bio
-		 * order queued in bio_list and avoid the deadlock.
-		 */
-		__make_request(mddev, split);
-		if (split != bio && bio_data_dir(bio) == READ) {
-			generic_make_request(bio);
-			break;
-		}
-	} while (split != bio);
+	/*
+	 * If this request crosses a chunk boundary, we need to split
+	 * it.
+	 */
+	if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
+		     sectors > chunk_sects
+		     && (conf->geo.near_copies < conf->geo.raid_disks
+			 || conf->prev.near_copies <
+			 conf->prev.raid_disks)))
+		sectors = chunk_sects -
+			(bio->bi_iter.bi_sector &
+			 (chunk_sects - 1));
+	__make_request(mddev, bio, sectors);
 
 	/* In case raid10d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
@@ -2873,13 +2806,8 @@ static void raid10d(struct md_thread *thread)
 			recovery_request_write(mddev, r10_bio);
 		else if (test_bit(R10BIO_ReadError, &r10_bio->state))
 			handle_read_error(mddev, r10_bio);
-		else {
-			/* just a partial read to be scheduled from a
-			 * separate context
-			 */
-			int slot = r10_bio->read_slot;
-			generic_make_request(r10_bio->devs[slot].bio);
-		}
+		else
+			WARN_ON_ONCE(1);
 
 		cond_resched();
 		if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
@@ -3652,6 +3580,10 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	if (!conf->r10bio_pool)
 		goto out;
 
+	conf->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	if (!conf->bio_split)
+		goto out;
+
 	calc_sectors(conf, mddev->dev_sectors);
 	if (mddev->reshape_position == MaxSector) {
 		conf->prev = conf->geo;
@@ -3689,6 +3621,8 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 		mempool_destroy(conf->r10bio_pool);
 		kfree(conf->mirrors);
 		safe_put_page(conf->tmppage);
+		if (conf->bio_split)
+			bioset_free(conf->bio_split);
 		kfree(conf);
 	}
 	return ERR_PTR(err);
@@ -3898,6 +3832,8 @@ static void raid10_free(struct mddev *mddev, void *priv)
 	kfree(conf->mirrors);
 	kfree(conf->mirrors_old);
 	kfree(conf->mirrors_new);
+	if (conf->bio_split)
+		bioset_free(conf->bio_split);
 	kfree(conf);
 }
 
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 3162615e57bd..735ce1a3d260 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -82,6 +82,7 @@ struct r10conf {
 	mempool_t		*r10bio_pool;
 	mempool_t		*r10buf_pool;
 	struct page		*tmppage;
+	struct bio_set		*bio_split;
 
 	/* When taking over an array from a different personality, we store
 	 * the new thread here until we fully activate the array.



^ permalink raw reply related

* [md PATCH 07/10] md/raid10: simplify handle_read_error()
From: NeilBrown @ 2017-04-05  4:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <149136485390.25893.1797855041954158826.stgit@noble>

handle_read_error() duplicates a lot of the work that raid10_read_request()
does, so it makes sense to just use that function.

handle_read_error() relies on the same r10bio being re-used so that,
in the case of a read-only array, setting IO_BLOCKED in r1bio->devs[].bio
ensures read_balance() won't re-use that device.
So when called from raid10_make_request() we clear that array, but not
when called from handle_read_error().

Two parts of handle_read_error() that need to be preserved are the warning
message it prints, so they are conditionally added to
raid10_read_request().  If the failing rdev can be found, messages
are printed.  Otherwise they aren't.

Not that as rdev_dec_pending() has already been called on the failing
rdev, we need to use rcu_read_lock() to get a new reference from
the conf.  We only use this to get the name of the failing block device.

With this change, we no longer need inc_pending().

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid10.c |  120 +++++++++++++++++++--------------------------------
 1 file changed, 45 insertions(+), 75 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ff02c058dbdd..ca722570bd38 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1008,15 +1008,6 @@ static void wait_barrier(struct r10conf *conf)
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void inc_pending(struct r10conf *conf)
-{
-	/* The current request requires multiple r10_bio, so
-	 * we need to increment the pending count.
-	 */
-	WARN_ON(!atomic_read(&conf->nr_pending));
-	atomic_inc(&conf->nr_pending);
-}
-
 static void allow_barrier(struct r10conf *conf)
 {
 	if ((atomic_dec_and_test(&conf->nr_pending)) ||
@@ -1130,8 +1121,36 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 	int max_sectors;
 	sector_t sectors;
 	struct md_rdev *rdev;
-	int slot;
+	char b[BDEVNAME_SIZE];
+	int slot = r10_bio->read_slot;
+	struct md_rdev *err_rdev = NULL;
+	gfp_t gfp = GFP_NOIO;
+
+	if (r10_bio->devs[slot].rdev) {
+		/* This is an error retry, but we cannot
+		 * safely dereference the rdev in the r10_bio,
+		 * we must use the one in conf.
+		 * If it has already been disconnected (unlikely)
+		 * we lose the device name in error messages.
+		 */
+		int disk;
+		/* As we are blocking raid10, it is a little safer to
+		 * use __GFP_HIGH.
+		 */
+		gfp = GFP_NOIO | __GFP_HIGH;
 
+		rcu_read_lock();
+		disk = r10_bio->devs[slot].devnum;
+		err_rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		if (err_rdev)
+			bdevname(err_rdev->bdev, b);
+		else {
+			strcpy(b, "???");
+			/* This never gets dereferenced */
+			err_rdev = r10_bio->devs[slot].rdev;
+		}
+		rcu_read_unlock();
+	}
 	/*
 	 * Register the new request and wait if the reconstruction
 	 * thread has put up a bar for new requests.
@@ -1158,12 +1177,22 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 
 	rdev = read_balance(conf, r10_bio, &max_sectors);
 	if (!rdev) {
+		if (err_rdev) {
+			pr_crit_ratelimited("md/raid10:%s: %s: unrecoverable I/O read error for block %llu\n",
+					    mdname(mddev), b,
+					    (unsigned long long)r10_bio->sector);
+		}
 		raid_end_bio_io(r10_bio);
 		return;
 	}
+	if (err_rdev)
+		pr_err_ratelimited("md/raid10:%s: %s: redirecting sector %llu to another mirror\n",
+				   mdname(mddev),
+				   bdevname(rdev->bdev, b),
+				   (unsigned long long)r10_bio->sector);
 	if (max_sectors < bio_sectors(bio)) {
 		struct bio *split = bio_split(bio, max_sectors,
-					      GFP_NOIO, conf->bio_split);
+					      gfp, conf->bio_split);
 		bio_chain(split, bio);
 		generic_make_request(bio);
 		bio = split;
@@ -1172,7 +1201,7 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 	}
 	slot = r10_bio->read_slot;
 
-	read_bio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
+	read_bio = bio_clone_fast(bio, gfp, mddev->bio_set);
 
 	r10_bio->devs[slot].bio = read_bio;
 	r10_bio->devs[slot].rdev = rdev;
@@ -1487,6 +1516,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 	r10_bio->mddev = mddev;
 	r10_bio->sector = bio->bi_iter.bi_sector;
 	r10_bio->state = 0;
+	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->copies);
 
 	if (bio_data_dir(bio) == READ)
 		raid10_read_request(mddev, bio, r10_bio);
@@ -2556,9 +2586,6 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	struct bio *bio;
 	struct r10conf *conf = mddev->private;
 	struct md_rdev *rdev = r10_bio->devs[slot].rdev;
-	char b[BDEVNAME_SIZE];
-	unsigned long do_sync;
-	int max_sectors;
 	dev_t bio_dev;
 	sector_t bio_last_sector;
 
@@ -2571,7 +2598,6 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	 * frozen.
 	 */
 	bio = r10_bio->devs[slot].bio;
-	bdevname(bio->bi_bdev, b);
 	bio_dev = bio->bi_bdev->bd_dev;
 	bio_last_sector = r10_bio->devs[slot].addr + rdev->data_offset + r10_bio->sectors;
 	bio_put(bio);
@@ -2587,65 +2613,9 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 		md_error(mddev, rdev);
 
 	rdev_dec_pending(rdev, mddev);
-
-read_more:
-	rdev = read_balance(conf, r10_bio, &max_sectors);
-	if (rdev == NULL) {
-		pr_crit_ratelimited("md/raid10:%s: %s: unrecoverable I/O read error for block %llu\n",
-				    mdname(mddev), b,
-				    (unsigned long long)r10_bio->sector);
-		raid_end_bio_io(r10_bio);
-		return;
-	}
-
-	do_sync = (r10_bio->master_bio->bi_opf & REQ_SYNC);
-	slot = r10_bio->read_slot;
-	pr_err_ratelimited("md/raid10:%s: %s: redirecting sector %llu to another mirror\n",
-			   mdname(mddev),
-			   bdevname(rdev->bdev, b),
-			   (unsigned long long)r10_bio->sector);
-	bio = bio_clone_fast(r10_bio->master_bio, GFP_NOIO, mddev->bio_set);
-	bio_trim(bio, r10_bio->sector - bio->bi_iter.bi_sector, max_sectors);
-	r10_bio->devs[slot].bio = bio;
-	r10_bio->devs[slot].rdev = rdev;
-	bio->bi_iter.bi_sector = r10_bio->devs[slot].addr
-		+ choose_data_offset(r10_bio, rdev);
-	bio->bi_bdev = rdev->bdev;
-	bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
-	if (test_bit(FailFast, &rdev->flags) &&
-	    test_bit(R10BIO_FailFast, &r10_bio->state))
-		bio->bi_opf |= MD_FAILFAST;
-	bio->bi_private = r10_bio;
-	bio->bi_end_io = raid10_end_read_request;
-	trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
-			      bio, bio_dev,
-			      bio_last_sector - r10_bio->sectors);
-
-	if (max_sectors < r10_bio->sectors) {
-		/* Drat - have to split this up more */
-		struct bio *mbio = r10_bio->master_bio;
-		int sectors_handled =
-			r10_bio->sector + max_sectors
-			- mbio->bi_iter.bi_sector;
-		r10_bio->sectors = max_sectors;
-		bio_inc_remaining(mbio);
-		inc_pending(conf);
-		generic_make_request(bio);
-
-		r10_bio = mempool_alloc(conf->r10bio_pool,
-					GFP_NOIO);
-		r10_bio->master_bio = mbio;
-		r10_bio->sectors = bio_sectors(mbio) - sectors_handled;
-		r10_bio->state = 0;
-		set_bit(R10BIO_ReadError,
-			&r10_bio->state);
-		r10_bio->mddev = mddev;
-		r10_bio->sector = mbio->bi_iter.bi_sector
-			+ sectors_handled;
-
-		goto read_more;
-	} else
-		generic_make_request(bio);
+	allow_barrier(conf);
+	r10_bio->state = 0;
+	raid10_read_request(mddev, r10_bio->master_bio, r10_bio);
 }
 
 static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox