linux-raid.vger.kernel.org archive mirror
From: Xiao Ni <xni@redhat.com>
To: NeilBrown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
Date: Sat, 16 Sep 2017 09:15:57 -0400 (EDT)	[thread overview]
Message-ID: <1823716408.11533021.1505567757693.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <393232447.10845976.1505375841983.JavaMail.zimbra@redhat.com>



----- Original Message -----
> From: "Xiao Ni" <xni@redhat.com>
> To: "NeilBrown" <neilb@suse.com>
> Cc: linux-raid@vger.kernel.org
> Sent: Thursday, September 14, 2017 3:57:21 PM
> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without
> 
> 
> 
> ----- Original Message -----
> > From: "NeilBrown" <neilb@suse.com>
> > To: "Xiao Ni" <xni@redhat.com>
> > Cc: linux-raid@vger.kernel.org
> > Sent: Thursday, September 14, 2017 1:32:02 PM
> > Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata
> > without
> > 
> > On Thu, Sep 14 2017, Xiao Ni wrote:
> > 
> > > ----- Original Message -----
> > >> From: "NeilBrown" <neilb@suse.com>
> > >> To: "Xiao Ni" <xni@redhat.com>
> > >> Cc: linux-raid@vger.kernel.org
> > >> Sent: Thursday, September 14, 2017 7:05:20 AM
> > >> Subject: Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with
> > >> metadata
> > >> without
> > >> 
> > >> On Wed, Sep 13 2017, Xiao Ni wrote:
> > >> >
> > >> > Hi Neil
> > >> >
> > >> > Sorry for the bad news. The test is still running and it's stuck
> > >> > again.
> > >> 
> > >> Any details?  Anything at all?  Just a little hint maybe?
> > >> 
> > >> Just saying "it's stuck again" is very nearly useless.
> > >> 
> > > Hi Neil
> > >
> > > It doesn't show any useful information in /var/log/messages
> > >
> > > echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
> > > There aren't any messages either.
> > >
> > > It looks like another problem.
> > >
> > > [root@dell-pr1700-02 ~]# ps auxf | grep D
> > > USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> > > root      8381  0.0  0.0      0     0 ?        D    Sep13   0:00  \_
> > > [kworker/u8:1]
> > > root      8966  0.0  0.0      0     0 ?        D    Sep13   0:00  \_
> > > [jbd2/md0-8]
> > > root       824  0.0  0.1 216856  8492 ?        Ss   Sep03   0:06
> > > /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible
> > > recursive locking detected ernel BUG at list_del corruption list_add
> > > corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral
> > > protection fault nable to handle kernel ouble fault: RTNL: assertion
> > > failed eek! page_mapcount(page) went negative! adness at NETDEV WATCHDOG
> > > ysctl table check failed : nobody cared IRQ handler type mismatch Machine
> > > Check Exception: Machine check events logged divide error: bounds:
> > > coprocessor segment overrun: invalid TSS: segment not present: invalid
> > > opcode: alignment check: stack segment: fpu exception: simd exception:
> > > iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
> > > root       836  0.0  0.0 195052  3200 ?        Ssl  Sep03   0:00
> > > /usr/sbin/gssproxy -D
> > > root      1225  0.0  0.0 106008  7436 ?        Ss   Sep03   0:00
> > > /usr/sbin/sshd -D
> > > root     12411  0.0  0.0 112672  2264 pts/0    S+   00:50   0:00
> > > \_ grep --color=auto D
> > > root      8987  0.0  0.0 109000  2728 pts/2    D+   Sep13   0:04
> > > \_ dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
> > > root      8983  0.0  0.0   7116  2080 ?        Ds   Sep13   0:00
> > > /usr/sbin/mdadm --grow --continue /dev/md0
> > >
> > > [root@dell-pr1700-02 ~]# cat /proc/mdstat
> > > Personalities : [raid6] [raid5] [raid4]
> > > md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] loop1[1] loop0[0]
> > >       2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
> > >       [>....................]  reshape =  0.0% (1/509952) finish=1059.5min speed=7K/sec
> > >       
> > > unused devices: <none>
> > >
> > >
> > > It looks like the reshape doesn't start. This time I didn't add the code to
> > > check the values of mddev->suspended and active_stripes; I just applied the
> > > patches to the source tree. Do you have any suggestions for what else to
> > > check?
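> > >
> > > (For reference, the kind of debug print I mean -- a rough sketch, assuming a
> > > spot like raid5d() where both mddev and conf are in scope; field names are as
> > > I read the current md/raid5 code:)
> > >
> > > pr_info("md: suspended=%d active_io=%d active_stripes=%d quiesce=%d\n",
> > >         mddev->suspended, atomic_read(&mddev->active_io),
> > >         atomic_read(&conf->active_stripes), conf->quiesce);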
> > >
> > > Best Regards
> > > Xiao
> > 
> > What do
> >  cat /proc/8987/stack
> >  cat /proc/8983/stack
> >  cat /proc/8966/stack
> >  cat /proc/8381/stack
> > 
> > show??
> 
> dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
> 
> [root@dell-pr1700-02 ~]# cat /proc/8987/stack
> [<ffffffff810d4ea6>] io_schedule+0x16/0x40
> [<ffffffff811c66ae>] __lock_page+0x10e/0x160
> [<ffffffffa09b4ef0>] mpage_prepare_extent_to_map+0x290/0x310 [ext4]
> [<ffffffffa09ba007>] ext4_writepages+0x467/0xe80 [ext4]
> [<ffffffff811d6bec>] do_writepages+0x1c/0x70
> [<ffffffff811c7c66>] __filemap_fdatawrite_range+0xc6/0x100
> [<ffffffff811c7d6c>] filemap_flush+0x1c/0x20
> [<ffffffffa09b757c>] ext4_alloc_da_blocks+0x2c/0x70 [ext4]
> [<ffffffffa09a89a9>] ext4_release_file+0x79/0xc0 [ext4]
> [<ffffffff81263d67>] __fput+0xe7/0x210
> [<ffffffff81263ece>] ____fput+0xe/0x10
> [<ffffffff810c59c3>] task_work_run+0x83/0xb0
> [<ffffffff81003d64>] exit_to_usermode_loop+0x6c/0xa8
> [<ffffffff8100389a>] do_syscall_64+0x13a/0x150
> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> /usr/sbin/mdadm --grow --continue /dev/md0. Is this the reason for adding
> lockdep_assert_held(&mddev->reconfig_mutex)? (sketch of what I mean after the
> trace below)
> [root@dell-pr1700-02 ~]# cat /proc/8983/stack
> [<ffffffffa0a3464c>] mddev_suspend+0x12c/0x160 [md_mod]
> [<ffffffffa0a379ec>] suspend_lo_store+0x7c/0xe0 [md_mod]
> [<ffffffffa0a3b7d0>] md_attr_store+0x80/0xc0 [md_mod]
> [<ffffffff812ec8da>] sysfs_kf_write+0x3a/0x50
> [<ffffffff812ec39f>] kernfs_fop_write+0xff/0x180
> [<ffffffff81260457>] __vfs_write+0x37/0x170
> [<ffffffff812619e2>] vfs_write+0xb2/0x1b0
> [<ffffffff81263015>] SyS_write+0x55/0xc0
> [<ffffffff810037c7>] do_syscall_64+0x67/0x150
> [<ffffffff81777527>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
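> 
> (For reference, the assertion I mean -- roughly what I understand patch 1/4
> to add; a sketch rather than the exact diff:)
> 
> void mddev_suspend(struct mddev *mddev)
> {
> 	lockdep_assert_held(&mddev->reconfig_mutex);
> 	/* ... existing suspend logic ... */
> }
> 
> With lockdep enabled this would warn whenever mddev_suspend() is called
> without reconfig_mutex held.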
> 
> [jbd2/md0-8]
> [root@dell-pr1700-02 ~]# cat /proc/8966/stack
> [<ffffffffa0a39b20>] md_write_start+0xf0/0x220 [md_mod]
> [<ffffffffa0972b49>] raid5_make_request+0x89/0x8b0 [raid456]
> [<ffffffffa0a34175>] md_make_request+0xf5/0x260 [md_mod]
> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
> [<ffffffff81376675>] submit_bio+0x75/0x150
> [<ffffffff8129e0b0>] submit_bh_wbc+0x140/0x170
> [<ffffffff8129e683>] submit_bh+0x13/0x20
> [<ffffffffa0957e29>] jbd2_write_superblock+0x109/0x230 [jbd2]
> [<ffffffffa0957f8b>] jbd2_journal_update_sb_log_tail+0x3b/0x80 [jbd2]
> [<ffffffffa09517ff>] jbd2_journal_commit_transaction+0x16ef/0x19e0 [jbd2]
> [<ffffffffa0955d02>] kjournald2+0xd2/0x260 [jbd2]
> [<ffffffff810c73f9>] kthread+0x109/0x140
> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> [kworker/u8:1]
> [root@dell-pr1700-02 ~]# cat /proc/8381/stack
> [<ffffffffa0a34131>] md_make_request+0xb1/0x260 [md_mod]
> [<ffffffff81376427>] generic_make_request+0x117/0x2f0
> [<ffffffff81376675>] submit_bio+0x75/0x150
> [<ffffffffa09d421c>] ext4_io_submit+0x4c/0x60 [ext4]
> [<ffffffffa09d43f4>] ext4_bio_write_page+0x1a4/0x3b0 [ext4]
> [<ffffffffa09b44f7>] mpage_submit_page+0x57/0x70 [ext4]
> [<ffffffffa09b4778>] mpage_map_and_submit_buffers+0x168/0x290 [ext4]
> [<ffffffffa09ba3f2>] ext4_writepages+0x852/0xe80 [ext4]
> [<ffffffff811d6bec>] do_writepages+0x1c/0x70
> [<ffffffff81293895>] __writeback_single_inode+0x45/0x320
> [<ffffffff812940c0>] writeback_sb_inodes+0x280/0x570
> [<ffffffff8129443c>] __writeback_inodes_wb+0x8c/0xc0
> [<ffffffff812946e6>] wb_writeback+0x276/0x310
> [<ffffffff81294f9c>] wb_workfn+0x19c/0x3b0
> [<ffffffff810c0ff9>] process_one_work+0x149/0x360
> [<ffffffff810c177d>] worker_thread+0x4d/0x3c0
> [<ffffffff810c73f9>] kthread+0x109/0x140
> [<ffffffff817776c5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> If these don't give useful hints, I can try printing more information and run
> the test again.

Hi Neil

I added some code to print debug information.

[13404.528231] mddev->suspended : 1
[13404.531170] mddev->active_io : 1
[13404.533774] conf->quiesce 0

MD_SB_CHANGE_PENDING in mddev->sb_flags is not set.
MD_UPDATING_SB in mddev->flags is not set.

It's stuck in mddev_suspend() at
wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);

while md_write_start() is waiting at
wait_event(mddev->sb_wait,
      !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
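
(For reference, as I read the 4.13-era drivers/md/md.c -- a rough sketch, details
approximate -- the active_io reference that mddev_suspend() is waiting for is the
one md_make_request() takes before calling into the personality:)

/* in md_make_request(), which is in pid 8966's call path */
atomic_inc(&mddev->active_io);
mddev->pers->make_request(mddev, bio);  /* raid5_make_request() -> md_write_start() */
/* the reference is only dropped after make_request() returns: */
if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
	wake_up(&mddev->sb_wait);

So the jbd2 thread holds an active_io reference while md_write_start() waits for
!mddev->suspended, and mddev_suspend() keeps suspended set while it waits for
active_io to reach 0; neither side can make progress, which matches
suspended == 1 and active_io == 1 above.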

Best Regards
Xiao

> 
> Best Regards
> Xiao
> > 
> > NeilBrown
> > 

Thread overview: 27+ messages
2017-09-12  1:49 [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without NeilBrown
2017-09-12  1:49 ` [PATCH 4/4] md: allow metadata update while suspending NeilBrown
2017-09-12  1:49 ` [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce() NeilBrown
2017-09-12  1:49 ` [PATCH 2/4] md: don't call bitmap_create() while array is quiesced NeilBrown
2017-09-12  1:49 ` [PATCH 1/4] md: always hold reconfig_mutex when calling mddev_suspend() NeilBrown
2017-09-12  2:51 ` [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without Xiao Ni
2017-09-13  2:11 ` Xiao Ni
2017-09-13 15:09   ` Xiao Ni
2017-09-13 23:05     ` NeilBrown
2017-09-14  4:55       ` Xiao Ni
2017-09-14  5:32         ` NeilBrown
2017-09-14  7:57           ` Xiao Ni
2017-09-16 13:15             ` Xiao Ni [this message]
2017-10-05  5:17             ` NeilBrown
2017-10-06  3:53               ` Xiao Ni
2017-10-06  4:32                 ` NeilBrown
2017-10-09  1:21                   ` Xiao Ni
2017-10-09  4:57                     ` NeilBrown
2017-10-09  5:32                       ` Xiao Ni
2017-10-09  5:52                         ` NeilBrown
2017-10-10  6:05                           ` Xiao Ni
2017-10-10 21:20                             ` NeilBrown
     [not found]                               ` <960568852.19225619.1507689864371.JavaMail.zimbra@redhat.com>
2017-10-13  3:48                                 ` NeilBrown
2017-10-16  4:43                                   ` Xiao Ni
2017-09-30  9:46 ` Xiao Ni
2017-10-05  5:03   ` NeilBrown
2017-10-06  3:40     ` Xiao Ni
