From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: "David S. Miller" <davem@davemloft.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.cz>, Tejun Heo <tj@kernel.org>,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Date: Mon, 2 Nov 2015 17:47:29 +0300
Message-ID: <20151102144729.GA17424@esperanza>
In-Reply-To: <20151029175228.GB9160@cmpxchg.org>

On Thu, Oct 29, 2015 at 10:52:28AM -0700, Johannes Weiner wrote:
...
> Now, you mentioned that you'd rather see the socket buffers accounted
> at the allocator level, but I looked at the different allocation paths
> and network protocols and I'm not convinced that this makes sense. We
> don't want to be in the hot path of every single packet when a lot of
> them are small, short-lived management blips that the kernel disposes
> of without ever involving user space.
> 
> __sk_mem_schedule() on the other hand is already wired up to exactly
> those consumers we are interested in for memory isolation: those with
> bigger chunks of data attached to them and those that have exploding
> receive queues when userspace fails to read(). UDP and TCP.
> 
> I mean, there is a reason why the global memory limits apply to only
> those types of packets in the first place: everything else is noise.
> 
> I agree that it's appealing to account at the allocator level and set
> page->mem_cgroup etc. but in this case we'd pay extra to capture a lot
> of noise, and I don't want to pay that just for aesthetics. In this
> case it's better to track ownership on the socket level and only count
> packets that can accumulate a significant amount of memory consumed.

Sigh, you seem to be right. Moreover, I can't even think of a neat way
to account skb pages to memcg, because rcv skbs are generated in device
drivers, where we don't yet know which socket/memcg they will go to. We
could recharge individual pages when an skb reaches the network or
transport layer, but that would result in unjustified overhead.
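
To make sure we mean the same thing, here is a toy model of the
accounting point you describe (plain userspace C, stand-in types and
names, not the actual patch): the charge is taken per-socket, in
page-sized chunks, when the protocol asks for more buffer space, so the
per-packet fast path is untouched; crossing the high boundary signals
pressure instead of failing outright.

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for struct mem_cgroup and struct sock. */
struct memcg    { long usage_pages, high_pages; };
struct sock_buf { struct memcg *owner; long charged; };

/*
 * Analogue of __sk_mem_schedule(): invoked with a whole chunk of pages
 * when a socket's send/receive buffer needs to grow, so it is never in
 * the per-packet hot path.
 */
static bool sk_mem_schedule_model(struct sock_buf *sk, long nr_pages)
{
	struct memcg *m = sk->owner;

	if (!m)
		return true;            /* unaccounted (root) socket */

	m->usage_pages += nr_pages;     /* charge the socket's owner */
	sk->charged += nr_pages;

	/*
	 * Over the high boundary: the charge still succeeds, but the
	 * caller is told to stop growing the window instead of hitting
	 * a hard limit (and eventually OOM) later.
	 */
	return m->usage_pages <= m->high_pages;
}

int main(void)
{
	struct memcg grp = { .usage_pages = 0, .high_pages = 1024 };
	struct sock_buf sk = { .owner = &grp, .charged = 0 };

	printf("within high: %d\n", sk_mem_schedule_model(&sk, 512));  /* 1 */
	printf("over high:   %d\n", sk_mem_schedule_model(&sk, 1024)); /* 0 */
	return 0;
}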

> 
> > > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > > for sure, but it's horrendous for network performance. There is no
> > > "stop growing" phase, it just keeps going full throttle until it hits
> > > the wall hard.
> > > 
> > > Now, we could probably try to replicate the global knobs and add a
> > > per-memcg soft limit. But you know better than anyone else how hard it
> > > is to estimate the overall workingset size of a workload, and the
> > > margins on containerized loads are razor-thin. Performance is much
> > > more sensitive to input errors, and often times parameters must be
> > > adjusted continuously during the runtime of a workload. It'd be
> > > disastrous to rely on yet more static, error-prone user input here.
> > 
> > Yeah, but the dynamic approach proposed in your patch set doesn't
> > guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> > reduces this possibility. Of course, a memcg OOM is not nearly as
> > disastrous as a global one, but it still usually means workload breakage.
> 
> Right now, the entire machine breaks. Confining it to the faulty memcg,
> as well as reducing the likelihood of that OOM in many cases, seems
> like a step in the right direction, no?

It does seem so. However, a memcg OOM is still bad, and we should
strive to avoid it if we can.

> 
> And how likely are memcg OOMs because of this anyway? There is of

Frankly, I've no idea. Your arguments below sound reassuring though.

> course a scenario imaginable where the packets pile up, followed by
> some *other* part of the workload, the one that doesn't read() and
> process packets, trying to expand--which then doesn't work and goes
> OOM. But that seems like a complete corner case. In the vast majority
> of cases, the application will be in full operation and just fail to
> read() fast enough--because the network bandwidth is enormous compared
> to the container's size, or because it shares the CPU with thousands
> of other workloads and there is scheduling latency.
> 
> This would be the perfect point to rein in the transmit window...
> 
> > The static approach is error-prone for sure, but it has existed for
> > years and worked satisfactorily AFAIK.
> 
> ...but that point is not a fixed amount of memory consumed. It depends
> on the workload and the random interactions it's having with thousands
> of other containers on that same machine.
> 
> The point of containers is to maximize utilization of your hardware
> and systematically eliminate slack in the system. But it's exactly
> that slack on dedicated bare-metal machines that allowed us to take a
> wild guess at the settings and then tune them based on observing a
> handful of workloads. This approach is not going to work anymore when
> we pack the machine to capacity and still expect every single
> container out of thousands to perform well. We need that automation.

But we do use a static approach when setting memory limits, no?
memory.{low,high,max} - they are all static.

I understand it's appealing to have just one knob - memory size - like
in the case of virtual machines, but that doesn't seem to work for
containers. You added the memory.low and memory.high knobs; VMs don't
have anything like them. How is one supposed to set them? It depends on
the workload, I guess. Also, there is the pids cgroup for limiting the
number of pids a cgroup can use, because pids turn out to be a resource
in the case of containers. Maybe the tcp window should likewise be
treated as a separate resource, as it is now, and shouldn't go into
memcg? I'm just wondering...
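
For concreteness, here is a minimal sketch of what setting those static
knobs amounts to from user space; the cgroup name "job" and the
/sys/fs/cgroup mount point are just assumptions for the example.

#include <stdio.h>

/* Write a single value into a cgroup2 control file. */
static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* Best-effort protection: reclaim tries not to go below this. */
	write_knob("/sys/fs/cgroup/job/memory.low",  "268435456");   /* 256M */
	/* Throttling boundary: allocations get reclaimed/slowed above it. */
	write_knob("/sys/fs/cgroup/job/memory.high", "1073741824");  /* 1G */
	/* Hard limit: exceeding it triggers the memcg OOM killer. */
	write_knob("/sys/fs/cgroup/job/memory.max",  "2147483648");  /* 2G */
	/* The pids controller mentioned above is configured the same way. */
	write_knob("/sys/fs/cgroup/job/pids.max",    "512");
	return 0;
}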

> 
> The static setting working okay on the global level is also why I'm
> not interested in starting to experiment with it. There is no reason
> to change it. It's much more likely that any attempt to change it will
> be shot down, not because of the approach chosen, but because there is
> no problem to solve there. I doubt we can get networking people to
> care about containers by screwing with things that work for them ;-)

Fair enough.

Thanks,
Vladimir

