Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "J.  Bruce Fields" <bfields@fieldses.org>
To: NeilBrown <neilb@suse.de>
Cc: Timothy Pearson <tpearson@raptorengineering.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	linux-nfs@vger.kernel.org, Trond Myklebust <trondmy@gmail.com>
Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
Date: Thu, 12 Aug 2021 10:44:28 -0400	[thread overview]
Message-ID: <20210812144428.GA9536@fieldses.org> (raw)
In-Reply-To: <162855621114.22632.14151019687856585770@noble.neil.brown.name>

On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> On Mon, 05 Jul 2021, Timothy Pearson wrote:
> > 
> > Sample backtrace below:
> > 
> > [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> > [16846.426202] rcu:     32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> > [16846.426241]  (t=5251 jiffies g=2720809 q=756724)
> > [16846.426273] NMI backtrace for cpu 32
> > [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> > [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > [16846.426406] Call Trace:
> > [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> > [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
> > [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> > [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
> > [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
> > [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
> > [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
> > [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
> > [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> > [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
> > [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
> > [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> > [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
> > [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
> > [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
> > [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> > [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> > [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> > [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]
> > [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0 [sunrpc]
> > [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760 [sunrpc]
> > [16846.427598] [c000200010823c30] [c0080000081a2b18] rpc_async_schedule+0x40/0x70 [sunrpc]
> 
> Time to play the sleuth .....
> "rpc_async_schedule" - so this is clearly an async task.
> It is running in __rpc_execute(), and calls rpc_exit_task().
> 
> rpc_exit_task() is a possible value for ->tk_action, which is set in
> several places.
> 1/ in call_bc_transmit_status(), but only after generating a kernel
>   message
>      RPC: Could not send backchannel reply......
>  You didn't report that message, so I'll assume it didn't happen.
> 
> 2/ In call_decode() if ->p_decode is NULL.  This implies a message
>    which didn't expect a reply.  All nfs4 callback procedures
>    (nfs4_cb_procedures[]) do set p_decode, so it cannot be here.
> 
> 3/ In call_decode() if the reply was successfully decoded.
>    I cannot rule this out quite so easily, but it seems unlikely as this
>    is a normal pattern and I wouldn't expect it to cause a soft-lockup.
> 
> 4/ In rpc_exit().  This is my guess.  All the places that rpc_exit() can
>    be called by nfsd (and nfsd appears in the call stack) are for handling 
>    errors.
> 
> So GUESS: rpc_exit() is getting called.
> Not only that, but it is getting called *often*.  The call to
> rpc_exit_task() (which is not the same as rpc_exit() - be careful) sets
> tk_action to NULL.  So rpc_exit() must get called again and again and
> again to keept setting tk_action back to rpc_exit_task, resulting in the
> soft lockup.
> 
> After setting ->tk_action to NULL, rpc_exit_task() calls
> ->rpc_call_done, which we see in the stack trace is nfsd4_cb_done().
> 
> nfsd4_cb_done() in turn calls ->done which is one of
>   nfsd4_cb_probe_done()
>   nfsd4_cb_sequence_done()
>   nfsd4_cb_layout_done()
> or
>   nfsd4_cb_notify_lock_done()
> 
> Some of these call rpc_delay(task,...) and return 0, causing
> nfsd4_cb_done() to call rpc_restart_call_prepare() This means the task
> can be requeued, but only after a delay.
> 
> This doesn't yet explain the spin, but now let's look back at
>  __rpc_execute().
> After calling do_action()  (which is rpc_exit_task() in the call trace)
> it checks if the task is queued.  If rpc_delay_task() wasn't call, it
> won't be queued and tk_action will be NULL, so it will loop around,
> do_action will be NULL, and the task aborts.
> 
> But if rpc_delay_task() was called, then the task will be queued (on the
> delay queue), and we continue in __rpc_execute().
> 
> The next test if is RPC_SIGNALLED().  If so, then rpc_exit() is called.
> Aha! We though that must have been getting called repeatedly.  It
> *might* not be here, but I think it is.  Let's assume so.
> rpc_exit() will set ->tk_action to rpc_exit_task, dequeue the task and
> (as it is async) schedule it for handling by rpc_async_schedule (that is
> in rpc_make_runnable()).
> 
> __rpc_execute_task continues down to
> 		if (task_is_async)
> 			return;
> and 
> rpc_async_schedule() returns.  But the task has already been queued to
> be handled again, so the whole process loops.
> 
> The problem here appears to be that a signalled task is being retried
> without clearing the SIGNALLED flag.  That is causing the infinite loop
> and the soft lockup.
> 
> This bug appears to have been introduced in Linux 5.2 by
> Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")

I wonder how we arrived here.  Does it require that an rpc task returns
from one of those rpc_delay() calls just as rpc_shutdown_client() is
signalling it?  That's the only way async tasks get signalled, I think.

> Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
> cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
> After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
> cleared.
> 
> A fix might be to clear RPC_TASK_SIGNALLED in
> rpc_reset_task_statistics(), but I'll leave that decision to someone
> else.

Might be worth testing with that change just to verify that this is
what's happening.

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c045f63d11fa..caa931888747 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -813,7 +813,8 @@ static void
 rpc_reset_task_statistics(struct rpc_task *task)
 {
 	task->tk_timeouts = 0;
-	task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
+	task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
+							RPC_TASK_SENT);
 	rpc_init_task_statistics(task);
 }
 

--b.

next prev parent reply	other threads:[~2021-08-12 14:44 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-05  9:44 CPU stall, eventual host hang with BTRFS + NFS under heavy load Timothy Pearson
2021-07-05  9:47 ` Timothy Pearson
2021-07-23 21:01   ` J. Bruce Fields
     [not found]   ` <B4D8C4B7-EE8C-456C-A6C5-D25FF1F3608E@rutgers.edu>
     [not found]     ` <3A4DF3BB-955C-4301-BBED-4D5F02959F71@rutgers.edu>
2021-08-09 17:06       ` Timothy Pearson
2021-08-09 17:15         ` hedrick
2021-08-09 17:25           ` Timothy Pearson
2021-08-09 17:37           ` Chuck Lever III
     [not found]             ` <F5179A41-FB9A-4AB1-BE58-C2859DB7EC06@rutgers.edu>
2021-08-09 18:30               ` Timothy Pearson
2021-08-09 18:38                 ` hedrick
2021-08-09 18:44                   ` Timothy Pearson
2021-08-09 18:49                   ` J. Bruce Fields
     [not found]                     ` <15AD846A-4638-4ACF-B47C-8EF655AD6E85@rutgers.edu>
2021-08-09 18:56                       ` Timothy Pearson
2021-08-09 20:54                         ` Charles Hedrick
2021-08-09 21:49                           ` Timothy Pearson
2021-08-09 22:01                             ` Charles Hedrick
     [not found]             ` <1119B476-171F-4C5A-9DEF-184F211A6A98@rutgers.edu>
2021-08-10 16:22               ` Timothy Pearson
2021-08-16 14:43                 ` hedrick
2021-08-09 18:30           ` J. Bruce Fields
2021-08-09 18:34             ` hedrick
     [not found]             ` <413163A6-8484-4170-9877-C0C2D50B13C0@rutgers.edu>
2021-08-10 14:58               ` J. Bruce Fields
2021-07-23 21:00 ` J. Bruce Fields
2021-07-23 21:22   ` Timothy Pearson
2021-07-28 19:51     ` Timothy Pearson
2021-08-02 19:28       ` J. Bruce Fields
2021-08-10  0:43 ` NeilBrown
2021-08-10  0:54   ` J.  Bruce Fields
2021-08-12 14:44   ` J.  Bruce Fields [this message]
2021-08-12 21:36     ` NeilBrown
2021-10-08 20:27       ` Scott Mayhew
2021-10-08 20:53         ` Timothy Pearson
2021-10-08 21:11         ` J.  Bruce Fields
2021-10-09 17:33         ` Chuck Lever III
2021-10-11 14:30           ` Bruce Fields
2021-10-11 16:36             ` Chuck Lever III
2021-10-11 21:57               ` NeilBrown
2021-10-14 22:36                 ` Trond Myklebust
2021-10-14 22:51                   ` NeilBrown
2021-10-15  8:03                     ` Trond Myklebust
2021-10-15  8:05                       ` Trond Myklebust
2021-12-01 18:36                         ` Scott Mayhew
2021-12-01 19:35                           ` Trond Myklebust
2021-12-01 20:13                             ` Scott Mayhew

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:c045f63d11f dfblob:caa93188874 )
 OR (
bs:"Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210812144428.GA9536@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=tpearson@raptorengineering.com \
    --cc=trondmy@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).