All of lore.kernel.org
 help / color / mirror / Atom feed
From: Guoqing Jiang <gqjiang@suse.com>
To: Stephane Thiell <sthiell@stanford.edu>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: BUG_ON(sh->batch_head) in init_stripe()
Date: Tue, 15 Aug 2017 17:24:53 +0800	[thread overview]
Message-ID: <5992BDE5.8060108@suse.com> (raw)
In-Reply-To: <D70C6DBD-6B79-4FBE-A22D-8BC2E17CC2EB@stanford.edu>



On 08/14/2017 01:29 AM, Stephane Thiell wrote:
> Also known as "kernel BUG at drivers/md/raid5.c:527!"
>
> I’m still struggling with this issue in raid5.c. I currently have one crash per week in average on a system using Lustre and 24 x 10-disk raid6 per server. It did happen with or without check running. I now think there is a race at some point where a released stripe head (batched) is added to conf->inactive_list without sh->batch_head being set back to NULL.
>
> Because sh->batch_head is only being set back to NULL in a single function, break_stripe_batch_list(), my guess is that a call to raid5_release_stripe() on the head stripe is performed elsewhere. I didn’t find where this could come from yet. So perhaps sh->batch_head should be just set to NULL in do_release_stripe(), before being added to the inactive list?
>
> Alternatively, I wonder if it would make more sense to set batch_head to NULL where sh->overwrite_disks is also set back to 0 (like in ops_run_biodrain() today).
>
> Because sh->overwrite_disks is also set to 0 in init_stripe(), my current plan is to try this change, at least as a workaround: remove the BUG_ON(sh->batch_head) from init_stripe() and set sh->batch_head = NULL (I don’t think I need to lock on sh->stripe_lock there?). I will get new hardware soon so I will give it a try on a test system. Any suggestions welcomed…
>
> Quick analysis of the last vmcore:
>
> crash> ps | grep ">"
> ...
>> 16119  16111  22  ffff881f93bace70  RU   0.0  150772  10748  e2label
>> 151448      2  34  ffff881ff753bec0  RU   0.0       0      0  [kiblnd_sd_00_00]
>> 151449      2   2  ffff881ff753edd0  RU   0.0       0      0  [kiblnd_sd_00_01]
>> 154637      2  29  ffff881f776b1f60  RU   0.0       0      0  [md32_raid6]
>> 163643      2  11  ffff8820d200bec0  UN   0.0       0      0  [ll_ost_io01_008]
>> 164083      2  13  ffff8820d2753ec0  UN   0.0       0      0  [ll_ost_io01_034]
>> 171389      2   5  ffff883fd072de20  UN   0.0       0      0  [ll_ost_io01_058]
>> 171391      2  17  ffff883fd072ce70  UN   0.0       0      0  [ll_ost_io01_060]
>> 171399      2  33  ffff883fd9e9ce70  UN   0.0       0      0  [ll_ost_io01_068]
>> 237019      2  15  ffff882715628fb0  UN   0.0       0      0  [ll_ost_io01_084]
>> 237033      2   3  ffff881fb0250fb0  UN   0.0       0      0  [ll_ost_io01_098]
> ...
>
> crash>  bt 237033
> PID: 237033  TASK: ffff881fb0250fb0  CPU: 3   COMMAND: "ll_ost_io01_098"
>   #0 [ffff8826fa3e7240] machine_kexec at ffffffff81059bdb
>   #1 [ffff8826fa3e72a0] __crash_kexec at ffffffff811057c2
>   #2 [ffff8826fa3e7370] crash_kexec at ffffffff811058b0
>   #3 [ffff8826fa3e7388] oops_end at ffffffff8168fac8
>   #4 [ffff8826fa3e73b0] die at ffffffff8102e93b
>   #5 [ffff8826fa3e73e0] do_trap at ffffffff8168f180
>   #6 [ffff8826fa3e7430] do_invalid_op at ffffffff8102b144
>   #7 [ffff8826fa3e74e0] invalid_op at ffffffff81698b5e
>      [exception RIP: raid5_get_active_stripe+1809]
>      RIP: ffffffffa0fefd71  RSP: ffff8826fa3e7590  RFLAGS: 00010086
>      RAX: 0000000000000000  RBX: ffff883df051641c  RCX: dead000000000200
>      RDX: 0000000000000000  RSI: 0000000000000005  RDI: ffff881e08d54d30
>      RBP: ffff8826fa3e7638   R8: ffff881e08d54d40   R9: 0000000000000080
>      R10: 0000000000000000  R11: 0000000000000000  R12: ffff883df0516400
>      R13: ffff881e08d54d30  R14: 0000000000000000  R15: 0000000000000000
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   #8 [ffff8826fa3e7640] raid5_make_request at ffffffffa0feff55 [raid456]
>   #9 [ffff8826fa3e7730] md_make_request at ffffffff814f838c
> #10 [ffff8826fa3e7790] generic_make_request at ffffffff812eedc9
> #11 [ffff8826fa3e77d8] submit_bio at ffffffff812eef11
> #12 [ffff8826fa3e7830] osd_submit_bio at ffffffffa10cdbcc [osd_ldiskfs]
> #13 [ffff8826fa3e7840] osd_do_bio at ffffffffa10d03c1 [osd_ldiskfs]
> #14 [ffff8826fa3e7960] osd_write_commit at ffffffffa10d0b94 [osd_ldiskfs]
> #15 [ffff8826fa3e7a00] ofd_commitrw_write at ffffffffa120c734 [ofd]
> #16 [ffff8826fa3e7a78] ofd_commitrw at ffffffffa120ff07 [ofd]
> #17 [ffff8826fa3e7b08] obd_commitrw at ffffffffa0c7dc22 [ptlrpc]
> #18 [ffff8826fa3e7b70] tgt_brw_write at ffffffffa0c55fc1 [ptlrpc]
> #19 [ffff8826fa3e7cd8] tgt_request_handle at ffffffffa0c52275 [ptlrpc]
> #20 [ffff8826fa3e7d20] ptlrpc_server_handle_request at ffffffffa0bfe1fb [ptlrpc]
> #21 [ffff8826fa3e7de8] ptlrpc_main at ffffffffa0c022b0 [ptlrpc]
> #22 [ffff8826fa3e7ec8] kthread at ffffffff810b0a4f
> #23 [ffff8826fa3e7f50] ret_from_fork at ffffffff81697218
>
>
> After some disassembly work, I found out that all active tasks are working on the same md (all are raid6):
>
> md32’s (struct r5conf *) is 0xffff883df0516400
>
> Our crashed task:
>
> crash> printf "%s", ((struct r5conf)*0xffff883df0516400).cache_name[0]
> raid6-md32
>
> Example for md32_raid6 task in release_inactive_stripe_list:
>
> crash> bt 154637
> PID: 154637  TASK: ffff881f776b1f60  CPU: 29  COMMAND: "md32_raid6"
>   #0 [ffff883ffc985e48] crash_nmi_callback at ffffffff8104d2e2
>   #1 [ffff883ffc985e58] nmi_handle at ffffffff8168fc77
>   #2 [ffff883ffc985eb0] do_nmi at ffffffff8168fe83
>   #3 [ffff883ffc985ef0] end_repeat_nmi at ffffffff8168f093
>      [exception RIP: _raw_spin_lock_irqsave+64]
>      RIP: ffffffff8168e5f0  RSP: ffff881e8d0cbc60  RFLAGS: 00000006
>      RAX: 00000000000031bb  RBX: ffff883df0516740  RCX: 000000000000009a
>      RDX: 0000000000000098  RSI: 000000000000009a  RDI: ffff883df051641c
>      RBP: ffff881e8d0cbc60   R8: 0000000000000206   R9: 0000000000000000
>      R10: 0000000000000000  R11: 0000000000000007  R12: ffff883df0516400
>      R13: ffff883df051641c  R14: 0000000000000006  R15: 0000000000000005
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>   #4 [ffff881e8d0cbc60] _raw_spin_lock_irqsave at ffffffff8168e5f0
>   #5 [ffff881e8d0cbc68] release_inactive_stripe_list at ffffffffa0fe5500 [raid456]
>   #6 [ffff881e8d0cbcb0] handle_active_stripes at ffffffffa0ff34c2 [raid456]
>   #7 [ffff881e8d0cbd60] raid5d at ffffffffa0ff3c28 [raid456]
>   #8 [ffff881e8d0cbe50] md_thread at ffffffff814fb745
>   #9 [ffff881e8d0cbec8] kthread at ffffffff810b0a4f
> #10 [ffff881e8d0cbf50] ret_from_fork at ffffffff81697218
>
>
> crash> bt 163643 164083 171389 171391 171399 237019 237033 |grep "exception RIP:"
>      [exception RIP: _raw_spin_lock_irq+63]
>      [exception RIP: _raw_spin_lock_irq+63]
>      [exception RIP: _raw_spin_lock_irq+58]
>      [exception RIP: _raw_spin_lock_irq+63]
>      [exception RIP: _raw_spin_lock_irq+58]
>      [exception RIP: _raw_spin_lock_irq+63]
>      [exception RIP: raid5_get_active_stripe+1809]
>
> crash> bt 163643 164083 171389 171391 171399 237019 237033 |grep "raid5_get_active_stripe"
>   #5 [ffff883bcee93588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
>   #5 [ffff883ba4af7588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
>   #5 [ffff88390877b588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
>   #5 [ffff88390803b588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
>   #5 [ffff88390831f588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
>   #5 [ffff8826fa7f7588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
>      [exception RIP: raid5_get_active_stripe+1809]
>
> crash> bt 163643 164083 171389 171391 171399 237019 237033 |grep R12
>      R10: 0000000000000003  R11: 0000000000000000  R12: ffff883df0516400
>      R10: 0000000000000001  R11: 0000000000000000  R12: ffff883df0516400
>      R10: 0000000000000005  R11: 0000000000000000  R12: ffff883df0516400
>      R10: 0000000000000007  R11: 0000000000000000  R12: ffff883df0516400
>      R10: 0000000000000009  R11: 0000000000000000  R12: ffff883df0516400
>      R10: 0000000000000009  R11: 0000000000000000  R12: ffff883df0516400
>      R10: 0000000000000000  R11: 0000000000000000  R12: ffff883df0516400
>
>
> The bad stripe_head (sh) is available at (usually in %r13):
>
> crash> p ((struct stripe_head)*0xffff881e08d54d30).raid_conf
> $98 = (struct r5conf *) 0xffff883df0516400
>
> and batch_head is not NULL... that's the issue:
>
> crash> p ((struct stripe_head)*0xffff881e08d54d30).batch_head
> $99 = (struct stripe_head *) 0xffff881e08d54d30
>
> Full struct dump below. We can see than overwrite_disks is set to 0, state is 0… but this batch_head not set to NULL makes it crash in init_stripe():
>
> crash> p ((struct stripe_head)*0xffff881e08d54d30)
> $6 = {
>    hash = {
>      next = 0x0,
>      pprev = 0x0
>    },
>    lru = {
>      next = 0xffff881e08d54d40,
>      prev = 0xffff881e08d54d40
>    },
>    release_list = {
>      next = 0xffff881dfe405cc0
>    },
>    raid_conf = 0xffff883df0516400,
>    generation = 0,
>    sector = 3701224808,
>    pd_idx = 1,
>    qd_idx = 2,
>    ddf_layout = 0,
>    hash_lock_index = 5,
>    state = 0,
>    count = {
>      counter = 0
>    },
>    bm_seq = 22708787,
>    disks = 10,
>    overwrite_disks = 0,
>    check_state = check_state_idle,
>    reconstruct_state = reconstruct_state_idle,
>    stripe_lock = {
>      {
>        rlock = {
>          raw_lock = {
>            {
>              head_tail = 3709263126,
>              tickets = {
>                head = 56598,
>                tail = 56598
>              }
>            }
>          }
>        }
>      }
>    },
>    cpu = 15,
>    group = 0x0,
>    batch_head = 0xffff881e08d54d30,
>    batch_lock = {
>      {
>        rlock = {
>          raw_lock = {
>            {
>              head_tail = 9961624,
>              tickets = {
>                head = 152,
>                tail = 152
>              }
>            }
>          }
>        }
>      }
>    },
>    batch_list = {
>      next = 0xffff881dfe405d28,
>      prev = 0xffff881e0e2dec98
>    },
>    log_io = 0x0,
>    log_list = {
>      next = 0x0,
>      prev = 0x0
>    },
>    ops = {
>      target = 0,
>      target2 = 0,
>      zero_sum_result = (unknown: 0)
>    },
>    dev = {{
>        req = {
>          bi_sector = 0,
>          bi_next = 0x0,
>          bi_bdev = 0x0,
>          bi_flags = 1,
>          bi_rw = 0,
>          bi_vcnt = 0,
>          bi_idx = 0,
>          bi_phys_segments = 0,
>          bi_size = 0,
>          bi_seg_front_size = 0,
>          bi_seg_back_size = 0,
>          bi_end_io = 0x0,
>          bi_private = 0xffff881e08d54d30,
>          bi_ioc = 0x0,
>          bi_css = 0x0,
>          bi_integrity = 0x0,
>          bi_max_vecs = 1,
>          bi_cnt = {
>            counter = 1
>          },
>          bi_io_vec = 0xffff881e08d54f00,
>          bi_pool = 0x0,
>          {
>            bio_aux = 0x0,
>            __UNIQUE_ID_rh_kabi_hide15 = {
>              rh_reserved1 = 0x0
>            },
>            {<No data fields>}
>          },
>          bi_inline_vecs = 0xffff881e08d54e78
>        },
>        rreq = {
>          bi_sector = 0,
>          bi_next = 0x0,
>          bi_bdev = 0x0,
>          bi_flags = 1,
>          bi_rw = 0,
>          bi_vcnt = 0,
>          bi_idx = 0,
>          bi_phys_segments = 0,
>          bi_size = 0,
>          bi_seg_front_size = 0,
>          bi_seg_back_size = 0,
>          bi_end_io = 0x0,
>          bi_private = 0xffff881e08d54d30,
>          bi_ioc = 0x0,
>          bi_css = 0x0,
>          bi_integrity = 0x0,
>          bi_max_vecs = 1,
>          bi_cnt = {
>            counter = 1
>          },
>          bi_io_vec = 0xffff881e08d54f10,
>          bi_pool = 0x0,
>          {
>            bio_aux = 0x0,
>            __UNIQUE_ID_rh_kabi_hide15 = {
>              rh_reserved1 = 0x0
>            },
>            {<No data fields>}
>          },
>          bi_inline_vecs = 0xffff881e08d54f00
>        },
>        vec = {
>          bv_page = 0xffffea0078236080,
>          bv_len = 4096,
>          bv_offset = 0
>        },
>        rvec = {
>          bv_page = 0x0,
>          bv_len = 0,
>          bv_offset = 0
>        },
>        page = 0xffffea0078236080,
>        orig_page = 0xffffea0078236080,
>        toread = 0x0,
>        read = 0x0,
>        towrite = 0x0,
>        written = 0xffff8824832a7c00,
>        sector = 29609798632,
>        flags = 24,

Seems the flags is R5_OVERWRITE | R5_Insync, handle_stripe_clean_event
only calls break_stripe_batch_list if do_endio is set to true.

if (head_sh->batch_head && do_endio)
                 break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);

And do_endio is set to true if we meet the below condition.

if (!test_bit(R5_LOCKED, &dev->flags) &&
                             (test_bit(R5_UPTODATE, &dev->flags) ||
                              test_bit(R5_Discard, &dev->flags) ||
                              test_bit(R5_SkipCopy, &dev->flags))) {

I guess it is why the batch_head is not null, just my 2 cents.

>        log_checksum = 0
>      }}
> }

Thanks,
Guoqing


  reply	other threads:[~2017-08-15  9:24 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-08 17:15 BUG_ON(sh->batch_head) in init_stripe() Stephane Thiell
2017-08-13 17:29 ` Stephane Thiell
2017-08-15  9:24   ` Guoqing Jiang [this message]
2017-08-15 17:59     ` Stephane Thiell
2017-08-15 18:39       ` Shaohua Li
2017-08-15 19:27         ` Stephane Thiell
2017-08-17 22:58           ` Shaohua Li
2017-08-18  4:13             ` Stephane Thiell
2017-08-18  5:50               ` Shaohua Li
     [not found]                 ` <C45EC19A-1CF2-45BA-B907-BCD5B4E644F0@stanford.edu>
2017-08-21 16:10                   ` Shaohua Li
2017-08-21 17:35                     ` Stephane Thiell
2017-08-21 17:46                       ` Shaohua Li
2017-09-05 17:54                         ` Stephane Thiell
2017-08-15 18:36     ` Shaohua Li
2017-08-15 19:07       ` Stephane Thiell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5992BDE5.8060108@suse.com \
    --to=gqjiang@suse.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=sthiell@stanford.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.