Re: [rcu_sched stall] regression/miss-config ?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: linux-kernel@vger.kernel.org, Sasha Levin <sasha.levin@oracle.com>
Subject: Re: [rcu_sched stall] regression/miss-config ?
Date: Mon, 16 May 2016 17:58:20 -0700	[thread overview]
Message-ID: <20160517005820.GI3528@linux.vnet.ibm.com> (raw)
In-Reply-To: <67eb4bf6-c3d2-b9af-30ff-713a6d75e773@oracle.com>

On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:
> On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
> >On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
> >>On 5/16/2016 5:03 AM, Paul E. McKenney wrote:
> >>>On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@oracle.com wrote:
> >>>>On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
> >>>>>Hi Paul,
> >>>>>
> >>>>>I was asking Sasha about [1] since other folks in Oracle
> >>>>>also stumbled upon similar RCU stalls with v4.1 kernel in
> >>>>>different workloads. I was reported similar issue with
> >>>>>RDS as well and looking at [1], [2], [3] and [4], thought
> >>>>>of reaching out to see if you can help us to understand
> >>>>>this issue better.
> >>>>>
> >>>>>Have also included RCU specific config used in these
> >>>>>test(s). Its very hard to reproduce the issue but one of
> >>>>>the data point is, it reproduces on systems with larger
> >>>>>CPUs(64+). Same workload with less than 64 CPUs, don't
> >>>>>show the issue. Someone also told me, making use of
> >>>>>SLAB instead SLUB allocator makes difference but I
> >>>>>haven't verified that part for RDS.
> >>>>>
> >>>>>Let me know your thoughts. Thanks in advance !!
> >>>>>
> >>>>One of my colleague told me the pastebin server I used
> >>>>is Oracle internal only so adding the relevant logs along
> >>>>with email.
> >>>>
> >>
> >>[...]
> >>
> >>>>>[1] https://lkml.org/lkml/2014/12/14/304
> >>>>
> >>>>
> >>>>[2]  Log 1 snippet:
> >>>>-----------------------------------------------------------------
> >>>>INFO: rcu_sched self-detected stall on CPU
> >>>>INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
> >>>>g=66023 c=66022 q=0)
> >>>>Task dump for CPU 54:
> >>>>ksoftirqd/54    R  running task        0   389      2 0x00000008
> >>>> 0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
> >>>> ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
> >>>> ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
> >>>>Call Trace:
> >>>> <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
> >>>> [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
> >>>> [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
> >>>> [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
> >>>> [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
> >>>> [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
> >>>> [<ffffffff810e9772>] update_process_times+0x42/0x70
> >>>> [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
> >>>> [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
> >>>> [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
> >>>> [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
> >>>> [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
> >>>> [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
> >>>> [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
> >>>> [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
> >>>> <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
> >>>> [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
> >>>> [<ffffffff8118e775>] __free_pages+0x25/0x40
> >>>> [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
> >>>> [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
> >>>> [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
> >>>> [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
> >>>> [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
> >>>> [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
> >>>> [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
> >>>
> >>>The most likely possibility is that there is a 60-second-long loop in
> >>>one of the above functions.  This is within bottom-half execution, so
> >>>unfortunately the usual trick of placing cond_resched_rcu_qs() within this
> >>>loop, but outside of any RCU read-side critical section does not work.
> >>>
> >>First of all thanks for explanation.
> >>
> >>There is no loop which can last for 60 seconds in above code since
> >>its just completion queue handler used to free up buffers much like
> >>NIC
> >>drivers bottom half(NAPI). Its done in tasklet context for latency
> >>reasons which RDS care most. Just to get your attention, the RCU
> >>stall is also seen with XEN code too. Log for it end of the email.
> >>
> >>Another important observation is, for RDS if we avoid higher
> >>order page(s) allocation, issue is not reproducible so far.
> >>In other words, for PAGE_SIZE(4K, get_order(bytes) ==0) allocations,
> >>the system continues to run without any issue, so the loop scenario
> >>is ruled out more or less.
> >>
> >>To be specific, with PAGE_SIZE allocations, alloc_pages()
> >>is just allocating a page and __free_page() is used
> >>instead of __free_pages() from below snippet.
> >>
> >>------------------
> >>if (bytes >= PAGE_SIZE)
> >>	page = alloc_pages(gfp, get_order(bytes));
> >>
> >>.....
> >>
> >>(rm->data.op_sg[i].length <= PAGE_SIZE) ?
> >>__free_page(sg_page(&rm->data.op_sg[i])) :
> >>__free_pages(sg_page(&rm->data.op_sg[i]),
> >>get_order(rm->data.op_sg[i].length));
> >>----------------------------
> >
> >This sounds like something to take up with the mm folks.
> >
> Sure. Will do once the link between two issues is established.

Fair enough!

> >>>Therefore, if there really is a loop here, one fix would be to
> >>>periodically unwind back out to run_ksoftirqd(), but setting up so that
> >>>the work would be continued later.  Another fix might be to move this
> >>>from tasklet context to workqueue context, where cond_resched_rcu_qs()
> >>>can be used -- however, this looks a bit like networking code, which
> >>>does not always take kindly to being run in process context (though
> >>>careful use of local_bh_disable() and local_bh_enable() can sometimes
> >>>overcome this issue).  A third fix, which works only if this code does
> >>>not use RCU and does not invoke any code that does use RCU, is to tell
> >>>RCU that it should ignore this code (which will require a little work
> >>>on RCU, as it currently does not tolerate this sort of thing aside from
> >>>the idle threads).  In this last approach, event-tracing calls must use
> >>>the _nonidle suffix.
> >>>
> >>>I am not familiar with the RDS code, so I cannot be more specific.
> >>
> >>No worries. Since we saw the issue with XEN too, I was suspecting
> >>that somehow we didn't have RCU_TREE config setup correctly or
> >>some important RCU patch(s) missing in v4.1 which made it in
> >>later kernels.
> >>
> >>The only common denominator I saw between these two different
> >>usecases (RDS and XEN), was the 'hrtimer_interrupt()' chain
> >>which I believe triggers the rcu_sched() chain.
> >
> >Exactly!
> >
> >>I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can
> >>get more information out of the system. Do you suggest any other
> >>RCU option(s)/patch(s) which we can help us to capture more
> >>information to understand the issue better. I wasn't sure about
> >>options like "RCU_KTHREAD_PRIO", TASKS_RCU etc. Used RCU config
> >>is also mentioned end of the email.
> >
> >Hmmm...  I just now noticed the "All QSes seen" message below.
> >That can happen if the load is quite high, and could as you say
> >one thing to try would be to set CONFIG_RCU_KTHREAD_PRIO=1.
> >This is not free, as it will mean more context switches involving
> >the RCU grace-period kthreads.
> >
> >Another thing to try is to set CONFIG_RCU_CPU_STALL_TIMEOUT to something
> >smaller than 60, say, 21.  This will cause the stall warnings to leave
> >less time before splatting.
> >
> OK will try with CONFIG_RCU_KTHREAD_PRIO=1 &
> CONFIG_RCU_CPU_STALL_TIMEOUT =21 . Indeed the usecases generate very
> high load on the systems.
> 
> >Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
> >you need more housekeeping CPUs than you currently have configured.
> >
> Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
> book keeping. Seems like without that clock-event code will just use
> CPU0 for things like broadcasting which might become bottleneck.
> This could explain connect the hrtimer_interrupt() path getting slowed
> down because of book keeping bottleneck.
> 
> $cat .config | grep NO_HZ
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> # CONFIG_NO_HZ_FULL_ALL is not set
> # CONFIG_NO_HZ_FULL_SYSIDLE is not set
> CONFIG_NO_HZ=y
> # CONFIG_RCU_FAST_NO_HZ is not set

Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
housekeeping tasks, including the RCU grace-period kthreads.  So you are
booting without any nohz_full boot parameter?  You can end up with the
same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
that you can with CONFIG_NO_HZ_FULL_ALL=y.

							Thanx, Paul

next prev parent reply	other threads:[~2016-05-17  0:58 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-15 21:18 [rcu_sched stall] regression/miss-config ? Santosh Shilimkar
2016-05-16  4:35 ` santosh.shilimkar
2016-05-16 12:03   ` Paul E. McKenney
2016-05-16 16:33     ` Santosh Shilimkar
2016-05-16 17:34       ` Paul E. McKenney
2016-05-16 19:49         ` Santosh Shilimkar
2016-05-17  0:58           ` Paul E. McKenney [this message]
2016-05-17 13:46             ` santosh.shilimkar
2016-05-17 19:15               ` Paul E. McKenney
2016-05-19 23:45                 ` Santosh Shilimkar
2016-05-20  4:05                   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160517005820.GI3528@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=santosh.shilimkar@oracle.com \
    --cc=sasha.levin@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.