Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "J.  Bruce Fields" <bfields@fieldses.org>
To: NeilBrown <neilb@suse.de>
Cc: Timothy Pearson <tpearson@raptorengineering.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	linux-nfs@vger.kernel.org, Trond Myklebust <trondmy@gmail.com>
Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
Date: Thu, 12 Aug 2021 10:44:28 -0400	[thread overview]
Message-ID: <20210812144428.GA9536@fieldses.org> (raw)
In-Reply-To: <162855621114.22632.14151019687856585770@noble.neil.brown.name>

On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> On Mon, 05 Jul 2021, Timothy Pearson wrote:
> > 
> > Sample backtrace below:
> > 
> > [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> > [16846.426202] rcu:     32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> > [16846.426241]  (t=5251 jiffies g=2720809 q=756724)
> > [16846.426273] NMI backtrace for cpu 32
> > [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> > [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > [16846.426406] Call Trace:
> > [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> > [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
> > [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> > [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
> > [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
> > [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
> > [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
> > [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
> > [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> > [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
> > [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
> > [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> > [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
> > [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
> > [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
> > [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> > [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> > [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> > [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]
> > [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0 [sunrpc]
> > [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760 [sunrpc]
> > [16846.427598] [c000200010823c30] [c0080000081a2b18] rpc_async_schedule+0x40/0x70 [sunrpc]
> 
> Time to play the sleuth .....
> "rpc_async_schedule" - so this is clearly an async task.
> It is running in __rpc_execute(), and calls rpc_exit_task().
> 
> rpc_exit_task() is a possible value for ->tk_action, which is set in
> several places.
> 1/ in call_bc_transmit_status(), but only after generating a kernel
>   message
>      RPC: Could not send backchannel reply......
>  You didn't report that message, so I'll assume it didn't happen.
> 
> 2/ In call_decode() if ->p_decode is NULL.  This implies a message
>    which didn't expect a reply.  All nfs4 callback procedures
>    (nfs4_cb_procedures[]) do set p_decode, so it cannot be here.
> 
> 3/ In call_decode() if the reply was successfully decoded.
>    I cannot rule this out quite so easily, but it seems unlikely as this
>    is a normal pattern and I wouldn't expect it to cause a soft-lockup.
> 
> 4/ In rpc_exit().  This is my guess.  All the places that rpc_exit() can
>    be called by nfsd (and nfsd appears in the call stack) are for handling 
>    errors.
> 
> So GUESS: rpc_exit() is getting called.
> Not only that, but it is getting called *often*.  The call to
> rpc_exit_task() (which is not the same as rpc_exit() - be careful) sets
> tk_action to NULL.  So rpc_exit() must get called again and again and
> again to keept setting tk_action back to rpc_exit_task, resulting in the
> soft lockup.
> 
> After setting ->tk_action to NULL, rpc_exit_task() calls
> ->rpc_call_done, which we see in the stack trace is nfsd4_cb_done().
> 
> nfsd4_cb_done() in turn calls ->done which is one of
>   nfsd4_cb_probe_done()
>   nfsd4_cb_sequence_done()
>   nfsd4_cb_layout_done()
> or
>   nfsd4_cb_notify_lock_done()
> 
> Some of these call rpc_delay(task,...) and return 0, causing
> nfsd4_cb_done() to call rpc_restart_call_prepare() This means the task
> can be requeued, but only after a delay.
> 
> This doesn't yet explain the spin, but now let's look back at
>  __rpc_execute().
> After calling do_action()  (which is rpc_exit_task() in the call trace)
> it checks if the task is queued.  If rpc_delay_task() wasn't call, it
> won't be queued and tk_action will be NULL, so it will loop around,
> do_action will be NULL, and the task aborts.
> 
> But if rpc_delay_task() was called, then the task will be queued (on the
> delay queue), and we continue in __rpc_execute().
> 
> The next test if is RPC_SIGNALLED().  If so, then rpc_exit() is called.
> Aha! We though that must have been getting called repeatedly.  It
> *might* not be here, but I think it is.  Let's assume so.
> rpc_exit() will set ->tk_action to rpc_exit_task, dequeue the task and
> (as it is async) schedule it for handling by rpc_async_schedule (that is
> in rpc_make_runnable()).
> 
> __rpc_execute_task continues down to
> 		if (task_is_async)
> 			return;
> and 
> rpc_async_schedule() returns.  But the task has already been queued to
> be handled again, so the whole process loops.
> 
> The problem here appears to be that a signalled task is being retried
> without clearing the SIGNALLED flag.  That is causing the infinite loop
> and the soft lockup.
> 
> This bug appears to have been introduced in Linux 5.2 by
> Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")

I wonder how we arrived here.  Does it require that an rpc task returns
from one of those rpc_delay() calls just as rpc_shutdown_client() is
signalling it?  That's the only way async tasks get signalled, I think.

> Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
> cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
> After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
> cleared.
> 
> A fix might be to clear RPC_TASK_SIGNALLED in
> rpc_reset_task_statistics(), but I'll leave that decision to someone
> else.

Might be worth testing with that change just to verify that this is
what's happening.

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c045f63d11fa..caa931888747 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -813,7 +813,8 @@ static void
 rpc_reset_task_statistics(struct rpc_task *task)
 {
 	task->tk_timeouts = 0;
-	task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
+	task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
+							RPC_TASK_SENT);
 	rpc_init_task_statistics(task);
 }
 

--b.

next prev parent reply	other threads:[~2021-08-12 14:44 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-05  9:44 CPU stall, eventual host hang with BTRFS + NFS under heavy load Timothy Pearson
2021-07-05  9:47 ` Timothy Pearson
2021-07-23 21:01   ` J. Bruce Fields
     [not found]   ` <B4D8C4B7-EE8C-456C-A6C5-D25FF1F3608E@rutgers.edu>
     [not found]     ` <3A4DF3BB-955C-4301-BBED-4D5F02959F71@rutgers.edu>
2021-08-09 17:06       ` Timothy Pearson
2021-08-09 17:15         ` hedrick
2021-08-09 17:25           ` Timothy Pearson
2021-08-09 17:37           ` Chuck Lever III
     [not found]             ` <F5179A41-FB9A-4AB1-BE58-C2859DB7EC06@rutgers.edu>
2021-08-09 18:30               ` Timothy Pearson
2021-08-09 18:38                 ` hedrick
2021-08-09 18:44                   ` Timothy Pearson
2021-08-09 18:49                   ` J. Bruce Fields
     [not found]                     ` <15AD846A-4638-4ACF-B47C-8EF655AD6E85@rutgers.edu>
2021-08-09 18:56                       ` Timothy Pearson
2021-08-09 20:54                         ` Charles Hedrick
2021-08-09 21:49                           ` Timothy Pearson
2021-08-09 22:01                             ` Charles Hedrick
     [not found]             ` <1119B476-171F-4C5A-9DEF-184F211A6A98@rutgers.edu>
2021-08-10 16:22               ` Timothy Pearson
2021-08-16 14:43                 ` hedrick
2021-08-09 18:30           ` J. Bruce Fields
2021-08-09 18:34             ` hedrick
     [not found]             ` <413163A6-8484-4170-9877-C0C2D50B13C0@rutgers.edu>
2021-08-10 14:58               ` J. Bruce Fields
2021-07-23 21:00 ` J. Bruce Fields
2021-07-23 21:22   ` Timothy Pearson
2021-07-28 19:51     ` Timothy Pearson
2021-08-02 19:28       ` J. Bruce Fields
2021-08-10  0:43 ` NeilBrown
2021-08-10  0:54   ` J.  Bruce Fields
2021-08-12 14:44   ` J.  Bruce Fields [this message]
2021-08-12 21:36     ` NeilBrown
2021-10-08 20:27       ` Scott Mayhew
2021-10-08 20:53         ` Timothy Pearson
2021-10-08 21:11         ` J.  Bruce Fields
2021-10-09 17:33         ` Chuck Lever III
2021-10-11 14:30           ` Bruce Fields
2021-10-11 16:36             ` Chuck Lever III
2021-10-11 21:57               ` NeilBrown
2021-10-14 22:36                 ` Trond Myklebust
2021-10-14 22:51                   ` NeilBrown
2021-10-15  8:03                     ` Trond Myklebust
2021-10-15  8:05                       ` Trond Myklebust
2021-12-01 18:36                         ` Scott Mayhew
2021-12-01 19:35                           ` Trond Myklebust
2021-12-01 20:13                             ` Scott Mayhew

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:c045f63d11f dfblob:caa93188874 )
 OR (
bs:"Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210812144428.GA9536@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=tpearson@raptorengineering.com \
    --cc=trondmy@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.