From: Jinpu Wang <jinpu.wang@profitbricks.com>
To: NeilBrown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org, Shaohua Li <shli@fb.com>,
Nate Dailey <nate.dailey@stratus.com>
Subject: Re: [BUG] MD/RAID1 hung forever on freeze_array
Date: Wed, 14 Dec 2016 13:13:27 +0100 [thread overview]
Message-ID: <CAMGffE=KoVdoYRzkHdRMuCopjmUdcrP9-woFFr-4-VszGsSHRQ@mail.gmail.com> (raw)
In-Reply-To: <CAMGffEnCesgUp4gBsPN2L9qg3WSxNXsCcYEPWH-BaeEEktaqcw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 8255 bytes --]
On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang
<jinpu.wang@profitbricks.com> wrote:
> Thanks Neil,
>
> On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote:
>> On Wed, Dec 14 2016, Jinpu Wang wrote:
>>
>>>
>>> As you suggested, I re-ran the same test on 4.4.36 without any of our own patches on MD.
>>> I can still reproduce the same bug; nr_pending on the healthy leg (loop1) is still 1.
>>>
>>
>> Thanks.
>>
>> I have a hypothesis.
>>
>> md_make_request() calls blk_queue_split().
>> If that does split the request, it will call generic_make_request()
>> on the first half. That will call back into md_make_request() and
>> raid1_make_request(), which will submit requests to the underlying
>> devices. These will get caught on the bio_list_on_stack queue in
>> generic_make_request().
>> This is a queue which is not accounted for in nr_queued.
>>
>> When blk_queue_split() completes, 'bio' will be the second half of the
>> bio.
>> This enters raid1_make_request(), and by this time the array has been
>> frozen.
>> So wait_barrier() has to wait for pending requests to complete, and that
>> includes the one that is stuck on bio_list_on_stack, which will never
>> complete now.
>>
>> To see if this might be happening, please change the
>>
>> blk_queue_split(q, &bio, q->bio_split);
>>
>> call in md_make_request() to
>>
>> struct bio *tmp = bio;
>> blk_queue_split(q, &bio, q->bio_split);
>> WARN_ON_ONCE(bio != tmp);
>>
>> If that ever triggers, then the above is a real possibility.
>
> I triggered the warning as you expected, so we can confirm the bug is
> caused by the scenario in your hypothesis above.
> [ 429.282235] ------------[ cut here ]------------
> [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 md_set_array_sectors+0xac0/0xc30 [md_mod]()
> [ 429.285288] Modules linked in: raid1 ibnbd_client(O) ibtrs_client(O)
> dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib
> ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel
> udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink mlx4_core
> mlx_compat loop md_mod kvm_amd edac_mce_amd kvm edac_core irqbypass
> acpi_cpufreq tpm_infineon tpm_tis i2c_piix4 tpm serio_raw evdev
> k10temp processor button fam15h_power crct10dif_pclmul crc32_pclmul sg
> sd_mod ahci libahci libata scsi_mod crc32c_intel r8169 psmouse
> xhci_pci xhci_hcd [last unloaded: mlx_compat]
> [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pserver #1
> [ 429.288825] Hardware name: To be filled by O.E.M. To be filled by
> O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014
> [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486
> 0000000000000000
> [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60
> ffff8800b8f3e000
> [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700
> ffff880234f9a700
> [ 429.290594] Call Trace:
> [ 429.290743] [<ffffffff81424486>] dump_stack+0x4d/0x67
> [ 429.290893] [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0
> [ 429.291046] [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20
> [ 429.291202] [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod]
> [ 429.291358] [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0
> [ 429.291540] [<ffffffff813fd522>] submit_bio+0x62/0x150
> [ 429.291693] [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60
> [ 429.291847] [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0
> [ 429.292011] [<ffffffffa0834f64>] ? ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client]
> [ 429.292271] [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10
> [ 429.292417] [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40
> [ 429.292566] [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50
> [ 429.292746] [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580
> [ 429.292894] [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110
> [ 429.293073] [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40
> [ 429.293284] [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0
> [ 429.293492] [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200
> [ 429.293703] [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180
> [ 429.293901] [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0
> [ 429.294047] [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0
> [ 429.294194] [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10
> [ 429.294375] [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [ 429.294610] ---[ end trace 25d1cece0e01494b ]---
>
> I double-checked: nr_pending on the healthy leg is still 1, as before.
>
>>
>> Fixing the problem isn't very easy...
>>
>> You could try:
>> 1/ write a function in raid1.c which calls punt_bios_to_rescuer()
>> (which you will need to export from block/bio.c),
>> passing mddev->queue->bio_split as the bio_set.
>>
>> 2/ change the wait_event_lock_irq() call in wait_barrier() to
>> wait_event_lock_irq_cmd(), and pass the new function as the command.
>>
>> That way, if wait_barrier() ever blocks, all the requests in
>> bio_list_on_stack will be handled by a separate thread.
>>
>> NeilBrown
>
> I will try your suggested approach to see if it fixes the bug; I will report back soon.
>
Hi Neil,
Sorry, bad news: with the two patches attached, I can still reproduce the same bug.
nr_pending on the healthy leg is still 1, as before.
crash> struct r1conf 0xffff8800b7176100
struct r1conf {
  mddev = 0xffff8800b59b0000,
  mirrors = 0xffff88022bab7900,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  retry_list = {
    next = 0xffff880211b2ec40,
    prev = 0xffff88022819ad40
  },
  bio_end_io_list = {
    next = 0xffff880227e9a9c0,
    prev = 0xffff8802119c6140
  },
  pending_bio_list = {
    head = 0x0,
    tail = 0x0
  },
  pending_count = 0,
  wait_barrier = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8800adf3b818,
      prev = 0xffff88021180f7a8
    }
  },
  resync_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  nr_pending = 1675,
  nr_waiting = 100,
  nr_queued = 1673,
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 1,
  poolinfo = 0xffff88022c80f640,
  r1bio_pool = 0xffff88022b8b6a20,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0008a90c80,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}

  kobj = {
    name = 0xffff88022b7194a0 "dev-loop1",
    entry = {
      next = 0xffff880231495280,
      prev = 0xffff880231495280
    },
    parent = 0xffff8800b59b0050,
    kset = 0x0,
    ktype = 0xffffffffa0564060 <rdev_ktype>,
    sd = 0xffff8800b6510960,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8802314952c8,
      prev = 0xffff8802314952c8
    }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 0,
    journal_tail = 0
  },
  nr_pending = {
    counter = 1
  },
--
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL: https://www.profitbricks.de
Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
[-- Attachment #2: 0001-block-export-punt_bios_to_rescuer.patch --]
[-- Type: text/x-patch, Size: 1566 bytes --]
From e7adbbb1a8d542ea68ada5996e0f9ffe87c479b6 Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@profitbricks.com>
Date: Wed, 14 Dec 2016 11:26:23 +0100
Subject: [PATCH 1/2] block: export punt_bios_to_rescuer
We will need it in the following raid1 patch to punt deferred bios to the
rescue workqueue.
Suggested-by: Neil Brown <neil.brown@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
block/bio.c | 3 ++-
include/linux/bio.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index 46e2cc1..f6a250d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -354,7 +354,7 @@ static void bio_alloc_rescue(struct work_struct *work)
}
}
-static void punt_bios_to_rescuer(struct bio_set *bs)
+void punt_bios_to_rescuer(struct bio_set *bs)
{
struct bio_list punt, nopunt;
struct bio *bio;
@@ -384,6 +384,7 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
queue_work(bs->rescue_workqueue, &bs->rescue_work);
}
+EXPORT_SYMBOL(punt_bios_to_rescuer);
/**
* bio_alloc_bioset - allocate a bio for I/O
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 42e4e3c..6256ba7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -479,6 +479,7 @@ extern void bio_advance(struct bio *, unsigned);
extern void bio_init(struct bio *);
extern void bio_reset(struct bio *);
void bio_chain(struct bio *, struct bio *);
+void punt_bios_to_rescuer(struct bio_set *);
extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
--
2.7.4
[-- Attachment #3: 0002-raid1-fix-deadlock.patch --]
[-- Type: text/x-patch, Size: 1420 bytes --]
From 2ad4cc5e8b5d7ec9db7a6fffaa2fdcd5f20419bf Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@profitbricks.com>
Date: Wed, 14 Dec 2016 11:35:52 +0100
Subject: [PATCH 2/2] raid1: fix deadlock
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
drivers/md/raid1.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 478223c..61dafb1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -190,6 +190,11 @@ static void put_all_bios(struct r1conf *conf, struct r1bio *r1_bio)
}
}
+static void raid1_punt_bios_to_rescuer(struct r1conf *conf)
+{
+ punt_bios_to_rescuer(conf->mddev->queue->bio_split);
+}
+
static void free_r1bio(struct r1bio *r1_bio)
{
struct r1conf *conf = r1_bio->mddev->private;
@@ -871,14 +876,15 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
* that queue to allow conf->start_next_window
* to increase.
*/
- wait_event_lock_irq(conf->wait_barrier,
+ wait_event_lock_irq_cmd(conf->wait_barrier,
!conf->array_frozen &&
(!conf->barrier ||
((conf->start_next_window <
conf->next_resync + RESYNC_SECTORS) &&
current->bio_list &&
!bio_list_empty(current->bio_list))),
- conf->resync_lock);
+ conf->resync_lock,
+ raid1_punt_bios_to_rescuer(conf));
conf->nr_waiting--;
}
--
2.7.4