public inbox for netdev@vger.kernel.org
From: "Paul E. McKenney" <paulmck@kernel.org>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: SeongJae Park <sjpark@amazon.com>,
	Eric Dumazet <edumazet@google.com>,
	David Miller <davem@davemloft.net>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Jakub Kicinski <kuba@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	sj38.park@gmail.com, netdev <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	SeongJae Park <sjpark@amazon.de>,
	snu@amazon.com, amit@kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change
Date: Tue, 5 May 2020 10:23:58 -0700	[thread overview]
Message-ID: <20200505172358.GC2869@paulmck-ThinkPad-P72> (raw)
In-Reply-To: <05843a3c-eb9d-3a0d-f992-7e4b97cc1f19@gmail.com>

On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
> 
> 
> On 5/5/20 9:13 AM, SeongJae Park wrote:
> > On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet <edumazet@google.com> wrote:
> > 
> >> On Tue, May 5, 2020 at 8:47 AM SeongJae Park <sjpark@amazon.com> wrote:
> >>>
> >>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 5/5/20 8:07 AM, SeongJae Park wrote:
> >>>>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet <edumazet@google.com> wrote:
> >>>>>
> >>>>
> >>>>>> Why do we have 10,000,000 objects around ? Could this be because of
> >>>>>> some RCU problem ?
> >>>>>
> >>>>> Mainly because of a long RCU grace period, as you guessed.  I have no idea
> >>>>> how the grace period became so long in this case.
> >>>>>
> >>>>> As my test machine was a virtual machine instance, I guess an RCU reader
> >>>>> preemption problem like the one described in [1] might have affected this.
> >>>>>
> >>>>> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf

If this is the root cause of the problem, then it will be necessary to
provide a hint to the hypervisor.  Or, in the near term, avoid loading
the hypervisor to the point that vCPU preemption becomes so lengthy.

RCU could also provide some sort of pre-stall-warning notification that
some of the CPUs aren't passing through quiescent states, which might
allow the guest OS's userspace to take corrective action.

But first, what are you doing to either confirm or invalidate the
hypothesis that this might be due to vCPU preemption?

> >>>>>> Once Al's patches are reverted, do you have 10,000,000 sock_alloc objects around?
> >>>>>
> >>>>> Yes, neither the old kernel from before Al's patches nor the recent kernel
> >>>>> with Al's patches reverted reproduced the problem.
> >>>>>
> >>>>
> >>>> I repeat my question: do you have 10,000,000 (smaller) objects kept in slab caches?
> >>>>
> >>>> TCP sockets use the (very complex, error-prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq
> >>>> object that was allocated in sock_alloc_inode() before Al's patches.
> >>>>
> >>>> These objects should be visible in kmalloc-64 kmem cache.
> >>>
> >>> Not exactly 10,000,000, as that is only the theoretical upper bound, but I
> >>> was able to observe a clear exponential increase in the number of objects
> >>> using slabtop.  Before the problematic workload started, the number of
> >>> 'kmalloc-64' objects was 5760, and I watched it grow to 1,136,576.
> >>>
> >>>           OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>> before:   5760   5088  88%    0.06K     90       64       360K kmalloc-64
> >>> after:  1136576 1136576 100%    0.06K  17759       64     71036K kmalloc-64
> >>>
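[The slabtop counts quoted above can also be pulled straight from /proc/slabinfo.  A minimal sketch; the sample line below is illustrative, constructed from the numbers in the table, and reading the live file typically requires root:]

```shell
# Sketch: track kmalloc-64 growth without slabtop.
# Live monitoring would be something like:
#   watch -n 10 "grep '^kmalloc-64 ' /proc/slabinfo"
# Parsing a captured slabinfo line shows the fields of interest:
line="kmalloc-64 1136576 1136576 64 64 1 : tunables 0 0 0 : slabdata 17759 17759 0"
set -- $line            # split on whitespace into positional parameters
# Field 2 is <active_objs>, field 3 is <num_objs>, field 4 is <objsize>.
echo "active=$2 total=$3 objsize=${4}B"   # → active=1136576 total=1136576 objsize=64B
```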
> >>
> >> Great, thanks.
> >>
> >> How recent is the kernel you are running for your experiment ?
> > 
> > It's based on 5.4.35.

Is it possible to retest on v5.6?  I have been adding various mechanisms
to make RCU keep up better with heavy callback overload.

Also, could you please provide the .config?  If either NO_HZ_FULL or
RCU_NOCB_CPU is set, please also provide the kernel boot parameters.
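[For what it's worth, both questions can be answered quickly on the test machine; the config path below is a typical distro location and should be treated as an assumption:]

```shell
# Sketch: check the RCU-relevant build options and boot parameters.
# On a live system (config path is distro-dependent; /proc/config.gz is
# an alternative when CONFIG_IKCONFIG_PROC is set):
#   grep -E 'CONFIG_(NO_HZ_FULL|RCU_NOCB_CPU)=' /boot/config-$(uname -r)
#   cat /proc/cmdline
# Given a saved .config fragment, the same grep looks like this:
config='CONFIG_NO_HZ_FULL=y
CONFIG_RCU_NOCB_CPU=y'
echo "$config" | grep -cE 'CONFIG_(NO_HZ_FULL|RCU_NOCB_CPU)=y'   # → 2
```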

> >> Let's make sure the bug is not in RCU.
> > 
> > One thing I can say at this point is that the grace period does complete
> > eventually.  I modified the benchmark to repeat only 5,000 times instead of
> > 10,000, so the test runs without OOM but still creates easily observable
> > memory pressure.  As soon as the benchmark finished, the memory was freed.
> > 
> > If you need more tests, please let me know.
> 
> I would ask Paul opinion on this issue, because we have many objects
> being freed after RCU grace periods.

As always, "It depends."

o	If the problem is a too-long RCU reader, RCU is prohibited from
	ending the grace period.  The reader duration must be shortened,
	and until it is shortened, there is nothing RCU can do.

o	In some special cases of the above, RCU can and does help, for
	example, by enlisting the aid of cond_resched().  So perhaps
	there is a long in-kernel loop that needs a cond_resched().

	And perhaps RCU can help for some types of vCPU preemption.

o	As Al suggested offline and as has been discussed in the past,
	it would not be hard to cause RCU to burn CPU to attain faster
	grace periods during OOM events.  This could be helpful, but only
	given that RCU readers are completing in reasonable timeframes.

> If the RCU subsystem cannot keep up, I guess other workloads will also suffer.

If readers are not excessively long, RCU should be able to keep up.
(In the absence of misconfigurations, for example, both NO_HZ_FULL and
then binding all the rcuo kthreads to a single CPU on a 100-CPU system
or some such.)

> Sure, we can revert patches here and there to try to work around the issue,
> but for objects allocated from process context, we should not have these problems.

Agreed, let's get more info on what is happening to RCU.

One approach is to shorten the RCU CPU stall warning timeout
(rcupdate.rcu_cpu_stall_timeout=10 for 10 seconds).
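[For reference, the shortened timeout can be set either at boot or at runtime; the sysfs path assumes rcupdate's module parameters are exposed, which they normally are:]

```
# On the kernel command line:
rcupdate.rcu_cpu_stall_timeout=10

# Or at runtime, no reboot needed:
echo 10 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
```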

							Thanx, Paul

  parent reply	other threads:[~2020-05-05 17:24 UTC|newest]

Thread overview: 36+ messages
2020-05-05  8:10 [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change SeongJae Park
2020-05-05  8:10 ` [PATCH net v2 1/2] Revert "coallocate socket_wq with socket itself" SeongJae Park
2020-05-06  4:55   ` kbuild test robot
2020-05-05  8:10 ` [PATCH net v2 2/2] Revert "sockfs: switch to ->free_inode()" SeongJae Park
2020-05-05 11:54 ` [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change SeongJae Park
2020-05-05 12:31   ` Nuernberger, Stefan
2020-05-05 14:53   ` Eric Dumazet
2020-05-05 15:07     ` SeongJae Park
2020-05-05 15:20       ` Eric Dumazet
2020-05-05 15:46         ` SeongJae Park
2020-05-05 16:00           ` Eric Dumazet
2020-05-05 16:13             ` SeongJae Park
2020-05-05 16:25               ` Eric Dumazet
2020-05-05 16:31                 ` Eric Dumazet
2020-05-05 16:37                   ` Eric Dumazet
2020-05-05 17:05                     ` SeongJae Park
2020-05-05 17:30                       ` Paul E. McKenney
2020-05-05 17:56                         ` SeongJae Park
2020-05-05 18:17                           ` Paul E. McKenney
2020-05-05 18:34                             ` SeongJae Park
2020-05-05 18:49                               ` Paul E. McKenney
2020-05-06 12:59                                 ` SeongJae Park
2020-05-06 14:33                                   ` Eric Dumazet
2020-05-06 14:41                                   ` Paul E. McKenney
2020-05-06 15:20                                     ` SeongJae Park
2020-05-05 17:28                     ` Paul E. McKenney
2020-05-05 18:11                       ` SeongJae Park
2020-05-05 17:23                 ` Paul E. McKenney [this message]
2020-05-05 17:49                   ` SeongJae Park
2020-05-05 18:27                     ` Paul E. McKenney
2020-05-05 18:40                       ` SeongJae Park
2020-05-05 18:48                         ` Paul E. McKenney
2020-05-05 16:26             ` Al Viro
2020-05-05 18:48 ` David Miller
2020-05-05 19:00   ` David Miller
2020-05-06  6:24     ` SeongJae Park
