linux-rt-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dmitry Monakhov <dmonlist-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: Chris Friesen
	<chris.friesen-CWA4WttNNZF54TAoqtyWWQ@public.gmane.org>,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	"J. Bruce Fields"
	<bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	tytso-3s7WtUTddSA@public.gmane.org,
	adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org,
	linux-rt-users-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D
Date: Fri, 17 Oct 2014 23:12:09 +0400	[thread overview]
Message-ID: <87k33yik8m.fsf@openvz.org> (raw)
In-Reply-To: <544156FE.7070905-CWA4WttNNZF54TAoqtyWWQ@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 10178 bytes --]

Chris Friesen <chris.friesen-CWA4WttNNZF54TAoqtyWWQ@public.gmane.org> writes:

> Hi,
>
> We've run into an odd bug and I'm looking for some advice on how to 
> proceed.  Please excuse the cross-posting, I wasn't sure where this 
> belongs.  Any guidance would be greatly appreciated.
>
> The problem we're seeing is that by copying a large (10GB) file to an 
> nfs-mounted directory we can cause all the nfsd threads to hang in D 
> state on the server.
>
> Background:
>
> We've got a couple of systems running 3.4.97, with the "rt120" patch for 
> CONFIG_PREEMPT_RT.  On the server we've got an nfs-exported ext4 
> filesystem, on top of drbd (used for mirroring over the network via 
> protocol C), on top of LVM (simple vg on top of pv, nothing fancy), on 
> top of a SATA SSD.
>
> Stuff we've ruled out:
>
> With the above system within 1-3 iterations of copying the file we can 
> cause a hang.  If we switch to spinning SATA disks we were unable to 
> trigger the problem.  However, if we run within virtualbox with virtual 
> PIIX4 IDE disks then we can reproduce the problem.  This seems to point 
> to a timing issue as opposed to a hardware issue.
>
> We've confirmed that we can reproduce the problem with drbd out of the 
> picture (NFS-exported ext4 on LVM on virtual disks in virtualbox) but it 
> takes longer than with the drbd disks.
>
> We've confirmed that we can reproduce the problem with both "deadline" 
> and "noop" I/O schedulers.
>
>
> Stuff still to try:
>
> We're working on setting up a minimal test-case with both the RT and 
> standard kernels.  As it stands there are a lot of other components 
> running that make it harder to rebuild with different kernel options.
>
>
> Details:
>
> These are the processes we saw stuck in disk sleep:
>
> $ ps aux | grep D
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root      1797  4.2  0.0      0     0 ?        D    19:46   9:51 [nfsd]
> root      1798  4.2  0.0      0     0 ?        D    19:46   9:54 [nfsd]
> root      1799  4.2  0.0      0     0 ?        D    19:46   9:54 [nfsd]
> root      1800  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
> root      1801  4.2  0.0      0     0 ?        D    19:46   9:53 [nfsd]
> root      1802  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
> root      1803  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
> root      1804  4.2  0.0      0     0 ?        D    19:46   9:52 [nfsd]
> root      5305  0.0  0.0      0     0 ?        D    19:49   0:03 
> [jbd2/drbd3-8]
> root      5634  0.0  0.0      0     0 ?        D    19:49   0:05 
> [flush-147:3]
>
> Most of the nfsd threads are stuck waiting for the mutex on an inode in 
> order to do a write to the disk. For example, the traceback for pid 1797 
> [nfsd]:
>
> [<ffffffff8110fd46>] generic_file_aio_write+0x56/0xe0
> [<ffffffff8120591f>] ext4_file_write+0xbf/0x250
> [<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
> [<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
> [<ffffffff81168dcc>] vfs_writev+0x3c/0x50
> [<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
> [<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
> [<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
> [<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
> [<ffffffff81871679>] svc_process+0x499/0x790
> [<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> Where generic_file_aio_write() is calling mutex_lock(&inode->i_mutex).
>
> Looking through the other tracebacks, it appears that pid 1803 [nfsd] 
> probably has this mutex. Also, looking at the block_start in 
> /proc/1803/sched (, it appears that this was the first process to block:
>
> se.statistics.block_start          :      13948189.066634
>
> Its traceback looks like this:
> [<ffffffff8125dc55>] jbd2_log_wait_commit+0xc5/0x150
> [<ffffffff8125f628>] jbd2_complete_transaction+0xb8/0xd0
> [<ffffffff81205dcd>] ext4_sync_file+0x1fd/0x4b0
> [<ffffffff81197ad5>] generic_write_sync+0x55/0x70
> [<ffffffff8110fdb6>] generic_file_aio_write+0xc6/0xe0
> [<ffffffff8120591f>] ext4_file_write+0xbf/0x250
> [<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
> [<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
> [<ffffffff81168dcc>] vfs_writev+0x3c/0x50
> [<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
> [<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
> [<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
> [<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
> [<ffffffff81871679>] svc_process+0x499/0x790
> [<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> The inode-i_mutex seems to be taken in ext4_sync_file() before the call 
> to jbd2_complete_transaction().
>
> It looks like jbd2_log_wait_commit() has just called schedule() in 
> wait_event() in the code below:
>
> while (tid_gt(tid, journal->j_commit_sequence)) {
> 	jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
> 		tid, journal->j_commit_sequence);
> 	wake_up(&journal->j_wait_commit);
> 	read_unlock(&journal->j_state_lock);
> 	wait_event(journal->j_wait_done_commit,
> 		!tid_gt(tid, journal->j_commit_sequence));
> 	read_lock(&journal->j_state_lock);
> }
>
>
> The kjournald2 thread (pid 5305) blocks significantly later:
> se.statistics.block_start          :      13953712.751778
>
> Here is the traceback:
> [<ffffffff8110e3ae>] sleep_on_page+0xe/0x20
> [<ffffffff8110e5b8>] wait_on_page_bit+0x78/0x80
> [<ffffffff8110e81c>] filemap_fdatawait_range+0x10c/0x1b0
> [<ffffffff8110e8eb>] filemap_fdatawait+0x2b/0x30
> [<ffffffff8125832c>] jbd2_journal_commit_transaction+0x8dc/0x1a70
> 	This calls journal_finish_inode_data_buffers which calls
> 	filemap_fdatawait()
> [<ffffffff8125ddaf>] kjournald2+0xbf/0x280
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> It is stuck in jbd2_journal_commit_transaction, apparently waiting for 
> inodes to be written to disk? I'm not sure what would be preventing that 
> from happening.
>
>
> One of the nfsd processes is stuck in a different spot:
> root      1804     2  3 Oct15 ?        00:09:52 [nfsd]
>
> This is the second process to block:
> se.statistics.block_start          :      13948189.759783
>
> And the traceback:
> [<ffffffff81256f85>] do_get_write_access+0x2b5/0x4e0
> [<ffffffff81257300>] jbd2_journal_get_write_access+0x30/0x50
> [<ffffffff8123848e>] __ext4_journal_get_write_access+0x3e/0x80
> [<ffffffff8123efaf>] ext4_mb_mark_diskspace_used+0xff/0x530
> [<ffffffff8124074f>] ext4_mb_new_blocks+0x1ef/0x5f0
> [<ffffffff812361d5>] ext4_ext_map_blocks+0xc05/0x1e60
> [<ffffffff8120ae15>] ext4_map_blocks+0x1c5/0x270
> [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
> [<ffffffff8120fd5e>] ext4_da_writepages+0x36e/0x720
> [<ffffffff8111a5f4>] do_writepages+0x24/0x40
> [<ffffffff8110f5db>] __filemap_fdatawrite_range+0x5b/0x60
> [<ffffffff8110f6ea>] filemap_write_and_wait_range+0x5a/0x80
> [<ffffffff81205c48>] ext4_sync_file+0x78/0x4b0
> [<ffffffff81197ad5>] generic_write_sync+0x55/0x70
> [<ffffffff8110fdb6>] generic_file_aio_write+0xc6/0xe0
> [<ffffffff8120591f>] ext4_file_write+0xbf/0x250
> [<ffffffff811689ab>] do_sync_readv_writev+0xdb/0x120
> [<ffffffff81168c94>] do_readv_writev+0xd4/0x1d0
> [<ffffffff81168dcc>] vfs_writev+0x3c/0x50
> [<ffffffffa01701b8>] nfsd_vfs_write.isra.12+0xf8/0x2e0 [nfsd]
> [<ffffffffa0172888>] nfsd_write+0xf8/0x110 [nfsd]
> [<ffffffffa0178b5f>] nfsd3_proc_write+0x9f/0xe0 [nfsd]
> [<ffffffffa016cade>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
> [<ffffffff81871679>] svc_process+0x499/0x790
> [<ffffffffa016c125>] nfsd+0xc5/0x1a0 [nfsd]
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> I'm pretty sure it called schedule() below in the code from 
> do_get_write_access():
> if (jh->b_jlist == BJ_Shadow) {
> 	DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Unshadow);
> 	wait_queue_head_t *wqh;
> 	wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);
> 	JBUFFER_TRACE(jh, "on shadow: sleep");
> 	jbd_unlock_bh_state(bh);
> 	/* commit wakes up all shadow buffers after IO */
> 	for ( ; ; ) {
> 		prepare_to_wait(wqh, &wait.wait,
> 		TASK_UNINTERRUPTIBLE);
> 		if (jh->b_jlist != BJ_Shadow)
> 			break;
> 		schedule();
> 	}
> 	finish_wait(wqh, &wait.wait);
> 	goto repeat;
> }
>
>
> And finally, there is a flush stuck as well:
> root      5634     2  0 Oct15 ?        00:00:05 [flush-147:3]
Please check that you have this path in your tree
https://lkml.org/lkml/2014/3/7/612
There are also other bug which was fixed long time ago which was
related with incorrect barrier handling. Please disable barriers
temporarily (mount_opt="-onobarrier") and try to reproduce the issue.
>
> This is the third process to block:
> se.statistics.block_start          :      13948189.814929
>
> And the traceback:
> [<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
> [<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
> [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
> [<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
> [<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
> [<ffffffff8111a5f4>] do_writepages+0x24/0x40
> [<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
> [<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
> [<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
> [<ffffffff811928e3>] wb_writeback+0x223/0x3f0
> [<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
> [<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
> [<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
> [<ffffffff810609bb>] kthread+0xdb/0xe0
> [<ffffffff8193b904>] kernel_thread_helper+0x4/0x10
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Type: application/pgp-signature, Size: 818 bytes --]

  parent reply	other threads:[~2014-10-17 19:12 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-17 17:50 Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D Chris Friesen
2014-10-17 18:01 ` Pavel Vasilyev
     [not found]   ` <CANGgnMbQmsdMDJUx7Bop9Xs=jQMmAJgWRjhXVFUGx-DwF=inYw@mail.gmail.com>
2014-10-23 17:54     ` RT/ext4/jbd2 circular dependency (was: Re: Hang writing to nfs-mounted filesystem from client) Chris Friesen
2014-10-26 14:25       ` Thomas Gleixner
2014-10-27 16:22         ` RT/ext4/jbd2 circular dependency Chris Friesen
2014-10-29 18:05           ` Thomas Gleixner
2014-10-29 19:11             ` Chris Friesen
2014-10-29 19:26               ` Thomas Gleixner
2014-10-29 20:17                 ` Chris Friesen
2014-10-29 20:31                   ` Thomas Gleixner
2014-10-29 23:19                 ` Theodore Ts'o
2014-10-29 23:37                   ` Chris Friesen
2014-10-30  1:44                     ` Theodore Ts'o
2014-10-30  8:15                       ` Kevin Liao
2014-10-30 12:24                         ` Theodore Ts'o
2014-10-30 21:11                   ` Thomas Gleixner
2014-10-30 23:24                     ` Theodore Ts'o
2014-10-31  0:08                       ` Chris Friesen
2014-10-31  0:16                       ` Thomas Gleixner
2014-11-13 19:06                       ` Jan Kara
2014-10-27 19:57       ` Chris Friesen
     [not found] ` <544156FE.7070905-CWA4WttNNZF54TAoqtyWWQ@public.gmane.org>
2014-10-17 18:58   ` Hang writing to nfs-mounted filesystem from client, all nfsd tasks on server blocked in D Austin Schuh
2014-10-17 19:12   ` Dmitry Monakhov [this message]
2014-10-18 17:05   ` Hang writing to nfs-mounted filesystem from client -- expected code path? Chris Friesen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87k33yik8m.fsf@openvz.org \
    --to=dmonlist-re5jqeeqqe8avxtiumwx3w@public.gmane.org \
    --cc=adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org \
    --cc=bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org \
    --cc=chris.friesen-CWA4WttNNZF54TAoqtyWWQ@public.gmane.org \
    --cc=linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-rt-users-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=tytso-3s7WtUTddSA@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).