From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: "David S. Miller" <davem@davemloft.net>,
Andrew Morton <akpm@linux-foundation.org>,
Michal Hocko <mhocko@suse.cz>, Tejun Heo <tj@kernel.org>,
netdev@vger.kernel.org, linux-mm@kvack.org,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Date: Thu, 29 Oct 2015 12:27:47 +0300 [thread overview]
Message-ID: <20151029092747.GR13221@esperanza> (raw)
In-Reply-To: <20151028185810.GA31488@cmpxchg.org>
On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> > Then you'd better not touch existing tcp limits at all, because they
> > just work, and the logic behind them is very close to that of global tcp
> > limits. I don't think one can simplify it somehow.
>
> Uhm, no, there is a crapload of boilerplate code and complication that
> seems entirely unnecessary. The only thing missing from my patch seems
> to be the part where it enters memory pressure state when the limit is
> hit. I'm adding this for completeness, but I doubt it even matters.
>
> > Moreover, frankly I still have my reservations about this vmpressure
> > propagation to skb you're proposing. It might work, but I doubt it
> > will allow us to throw away explicit tcp limit, as I explained
> > previously. So, even with your approach I think we can still need
> > per memcg tcp limit *unless* you get rid of global tcp limit
> > somehow.
>
> Having the hard limit as a failsafe (or a minimum for other consumers)
> is one thing, and certainly something I'm open to for cgroupv2, should
> we have problems with load startup up after a socket memory landgrab.
>
> That being said, if the VM is struggling to reclaim pages, or is even
> swapping, it makes perfect sense to let the socket memory scheduler
> know it shouldn't continue to increase its footprint until the VM
> recovers. Regardless of any hard limitations/minimum guarantees.
>
> This is what my patch does and it seems pretty straight-forward to
> me. I don't really understand why this is so controversial.
I'm not arguing that the idea behind this patch set is necessarily bad.
Quite the contrary, it does look interesting to me. I'm just saying that
IMO it can't replace hard/soft limits. It probably could if it was
possible to shrink buffers, but I don't think it's feasible, even
theoretically. That's why I propose not to change the behavior of the
existing per memcg tcp limit at all. And frankly I don't get why you are
so keen on simplifying it. You say it's a "crapload of boilerplate
code". Well, I don't see how it is - it just replicates global knobs and
I don't see how it could be done in a better way. The code is hidden
behind jump labels, so the overhead is zero if it isn't used. If you
really dislike this code, we can isolate it under a separate config
option. But all right, I don't rule out the possibility that the code
could be simplified. If you do that w/o breaking it, that'll be OK to
me, but I don't see why it should be related to this particular patch
set.
>
> The *next* step would be to figure out whether we can actually
> *reclaim* memory in the network subsystem--shrink windows and steal
> buffers back--and that might even be an avenue to replace tcp window
> limits. But it's not necessary for *this* patch series to be useful.
Again, I don't think we can *reclaim* network memory, but you're right.
>
> > > So this seemed like a good way to prove a new mechanism before rolling
> > > it out to every single Linux setup, rather than switch everybody over
> > > after the limited scope testing I can do as a developer on my own.
> > >
> > > Keep in mind that my patches are not committing anything in terms of
> > > interface, so we retain all the freedom to fix and tune the way this
> > > is implemented, including the freedom to re-add tcp window limits in
> > > case the pressure balancing is not a comprehensive solution.
> >
> > I really dislike this kind of proof. It looks like you're trying to
> > push something you think is right covertly, w/o having a proper
> > discussion with networking people and then say that it just works
> > and hence should be done globally, but what if it won't? Revert it?
> > We already have a lot of dubious stuff in memcg that should be
> > reverted, so let's please try to avoid this kind of mistakes in
> > future. Note, I say "w/o having a proper discussion with networking
> > people", because I don't think they will really care *unless* you
> > change the global logic, simply because most of them aren't very
> > interested in memcg AFAICS.
>
> Come on, Dave is the first To and netdev is CC'd. They might not care
> about memcg, but "pushing things covertly" is a bit of a stretch.
Sorry if it sounded rude to you. I just look back at my experience
patching slab internals to make kmem accountable, and AFAICS Christoph
didn't really care about *what* I was doing, he only cared about the
global case - if there was no performance degradation when kmemcg was
disabled, he was usually fine with it, even if from the memcg pov it was
a crap.
Anyway, I can't force you to patch the global case first or
simultaneously with the memcg case, so let's just hope I'm a bit too
overcautious.
>
> > That effectively means you loose a chance to listen to networking
> > experts, who could point you at design flaws and propose an improvement
> > right away. Let's please not miss such an opportunity. You said that
> > you'd seen this problem happen w/o cgroups, so you have a use case that
> > might need fixing at the global level. IMO it shouldn't be difficult to
> > prepare an RFC patch for the global case first and see what people think
> > about it.
>
> No, the problem we are running into is when network memory is not
> tracked per cgroup. The lack of containment means that the socket
> memory consumption of individual cgroups can trigger system OOM.
>
> We tried using the per-memcg tcp limits, and that prevents the OOMs
> for sure, but it's horrendous for network performance. There is no
> "stop growing" phase, it just keeps going full throttle until it hits
> the wall hard.
>
> Now, we could probably try to replicate the global knobs and add a
> per-memcg soft limit. But you know better than anyone else how hard it
> is to estimate the overall workingset size of a workload, and the
> margins on containerized loads are razor-thin. Performance is much
> more sensitive to input errors, and often times parameters must be
> adjusted continuously during the runtime of a workload. It'd be
> disasterous to rely on yet more static, error-prone user input here.
Yeah, but the dynamic approach proposed in your patch set doesn't
guarantee we won't hit OOM in memcg due to overgrown buffers. It just
reduces this possibility. Of course, memcg OOM is far not as disastrous
as the global one, but still it usually means the workload breakage.
The static approach is error-prone for sure, but it has existed for
years and worked satisfactory AFAIK.
>
> What all this means to me is that fixing it on the cgroup level has
> higher priority. But it also means that once we figured it out under
> such a high-pressure environment, it's much easier to apply to the
> global case and potentially replace the soft limit there.
>
> This seems like a better approach to me than starting globally, only
> to realize that the solution is not workable for cgroups and we need
> yet something else.
>
Are we in rush? I think if you try your approach at the global level and
fail, it's still good, because it will probably give us all a better
understanding of the problem. If you successfully fix the global case,
but then realize that it doesn't fit memcg, it's even better, because
you actually fixed a problem. If you patch both global and memcg cases,
it's perfect.
But of course, that's my understanding and I may be mistaken. Let's hope
you're right.
Thanks,
Vladimir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-10-29 9:28 UTC|newest]
Thread overview: 63+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-10-22 4:21 [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
2015-10-22 4:21 ` [PATCH 1/8] mm: page_counter: let page_counter_try_charge() return bool Johannes Weiner
2015-10-23 11:31 ` Michal Hocko
2015-10-22 4:21 ` [PATCH 2/8] mm: memcontrol: export root_mem_cgroup Johannes Weiner
2015-10-23 11:32 ` Michal Hocko
2015-10-22 4:21 ` [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Johannes Weiner
2015-10-22 18:46 ` Vladimir Davydov
2015-10-22 19:09 ` Johannes Weiner
2015-10-23 13:42 ` Vladimir Davydov
2015-10-23 12:38 ` Michal Hocko
2015-10-22 4:21 ` [PATCH 4/8] mm: memcontrol: prepare for unified hierarchy socket accounting Johannes Weiner
2015-10-23 12:39 ` Michal Hocko
2015-10-22 4:21 ` [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Johannes Weiner
2015-10-22 18:47 ` Vladimir Davydov
2015-10-23 13:19 ` Michal Hocko
2015-10-23 13:59 ` David Miller
2015-10-26 16:56 ` Johannes Weiner
2015-10-27 12:26 ` Michal Hocko
2015-10-27 13:49 ` David Miller
2015-10-27 15:41 ` Johannes Weiner
2015-10-27 16:15 ` Michal Hocko
2015-10-27 16:42 ` Johannes Weiner
2015-10-28 0:45 ` David Miller
2015-10-28 3:05 ` Johannes Weiner
2015-10-29 15:25 ` Michal Hocko
2015-10-29 16:10 ` Johannes Weiner
2015-11-04 10:42 ` Michal Hocko
2015-11-04 19:50 ` Johannes Weiner
2015-11-05 14:40 ` Michal Hocko
2015-11-05 16:16 ` David Miller
2015-11-05 16:28 ` Michal Hocko
2015-11-05 16:30 ` David Miller
2015-11-05 22:32 ` Johannes Weiner
2015-11-06 12:51 ` Michal Hocko
2015-11-05 20:55 ` Johannes Weiner
2015-11-05 22:52 ` Johannes Weiner
2015-11-06 10:57 ` Michal Hocko
2015-11-06 16:19 ` Johannes Weiner
2015-11-06 16:46 ` Michal Hocko
2015-11-06 17:45 ` Johannes Weiner
2015-11-07 3:45 ` David Miller
2015-11-12 18:36 ` Mel Gorman
2015-11-12 19:12 ` Johannes Weiner
2015-11-06 9:05 ` Vladimir Davydov
2015-11-06 13:29 ` Michal Hocko
2015-11-06 16:35 ` Johannes Weiner
2015-11-06 13:21 ` Michal Hocko
2015-10-22 4:21 ` [PATCH 6/8] mm: vmscan: simplify memcg vs. global shrinker invocation Johannes Weiner
2015-10-23 13:26 ` Michal Hocko
2015-10-22 4:21 ` [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Johannes Weiner
2015-10-22 18:48 ` Vladimir Davydov
2015-10-23 13:49 ` Michal Hocko
2015-10-22 4:21 ` [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure Johannes Weiner
2015-10-22 18:57 ` Vladimir Davydov
2015-10-22 18:45 ` [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Vladimir Davydov
2015-10-26 17:22 ` Johannes Weiner
2015-10-27 8:43 ` Vladimir Davydov
2015-10-27 16:01 ` Johannes Weiner
2015-10-28 8:20 ` Vladimir Davydov
2015-10-28 18:58 ` Johannes Weiner
2015-10-29 9:27 ` Vladimir Davydov [this message]
2015-10-29 17:52 ` Johannes Weiner
2015-11-02 14:47 ` Vladimir Davydov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151029092747.GR13221@esperanza \
--to=vdavydov@virtuozzo.com \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=davem@davemloft.net \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.cz \
--cc=netdev@vger.kernel.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).