From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Date: Thu, 22 Oct 2015 00:21:28 -0400
Message-ID: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
Return-path:
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: "David S. Miller" , Andrew Morton
Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

Hi,

this series adds socket buffer memory tracking and accounting to the
unified hierarchy memory cgroup controller.

[ Networking people, at this time please check the diffstat below to
  avoid going into convulsions. ]

Socket buffer memory can make up a significant share of a workload's
memory footprint, and so it needs to be accounted and tracked out of
the box, along with other types of memory that can be directly linked
to userspace activity, in order to provide useful resource isolation.

Historically, socket buffers were accounted in a separate counter,
without any pressure equalization between anonymous memory, page
cache, and the socket buffers. When the socket buffer pool was
exhausted, buffer allocations would fail hard and cause network
performance to tank, regardless of whether there was still memory
available to the group or not. Likewise, struggling anonymous or cache
workingsets could not dip into an idle socket memory pool. Because of
this, the feature was not usable for many real life applications.

To not repeat this mistake, the new memory controller will account all
types of memory pages it is tracking on behalf of a cgroup in a single
pool. And upon pressure, the VM reclaims and shrinks whatever memory
in that pool is within its reach.

These patches add accounting for memory consumed by sockets associated
with a cgroup to the existing pool of anonymous pages and page cache.

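To make the single-pool idea concrete, here is a rough userspace sketch.
It is only an illustration under stated assumptions: struct pool,
pool_charge() and pool_uncharge() are hypothetical names invented for
this example, not kernel interfaces, and reclaim is not modeled at all.

```c
#include <stdbool.h>

/* Illustrative memory types; the names are made up for this sketch. */
enum mem_type { MEM_ANON, MEM_CACHE, MEM_SOCK, MEM_NR_TYPES };

/* Hypothetical model of one cgroup's pool, not a kernel structure. */
struct pool {
	unsigned long usage;              /* pages charged, all types combined */
	unsigned long limit;              /* single limit shared by all types */
	unsigned long stat[MEM_NR_TYPES]; /* per-type breakdown, statistics only */
};

/* Charge nr_pages of the given type; fails only if the shared pool is full. */
static bool pool_charge(struct pool *p, enum mem_type type,
			unsigned long nr_pages)
{
	/* usage never exceeds limit here, so the subtraction cannot wrap */
	if (nr_pages > p->limit - p->usage)
		return false;
	p->usage += nr_pages;
	p->stat[type] += nr_pages;
	return true;
}

/* Return nr_pages of the given type to the shared pool. */
static void pool_uncharge(struct pool *p, enum mem_type type,
			  unsigned long nr_pages)
{
	p->usage -= nr_pages;
	p->stat[type] -= nr_pages;
}
```

With one shared counter, pages released by anonymous memory are
immediately available to socket buffers and vice versa, which a
separate per-type counter cannot offer.
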
Patch #3 reworks the existing memcg socket infrastructure. It has many
provisions for future plans that won't materialize, and much of this
simply evaporates. The networking people should be happy about this.

Patch #5 adds accounting and tracking of socket memory to the unified
hierarchy memory controller, as described above. It uses the existing
per-cpu charge caches and triggers high limit reclaim asynchronously.

Patch #8 uses the vmpressure extension to equalize pressure between
the pages tracked natively by the VM and socket buffer pages. As the
pool is shared, it makes sense that while natively tracked pages are
under duress the network transmit windows are also not increased.

As per above, this is an essential part of the new memory controller's
core functionality. With the unified hierarchy nearing release, please
consider this for 4.4.

 include/linux/memcontrol.h   |  90 +++++++++-------
 include/linux/page_counter.h |   6 +-
 include/net/sock.h           | 139 ++----------------------
 include/net/tcp.h            |   5 +-
 include/net/tcp_memcontrol.h |   7 --
 mm/backing-dev.c             |   2 +-
 mm/hugetlb_cgroup.c          |   3 +-
 mm/memcontrol.c              | 235 ++++++++++++++++++++++++++---------------
 mm/page_counter.c            |  14 +--
 mm/vmpressure.c              |  29 ++++-
 mm/vmscan.c                  |  41 +++----
 net/core/sock.c              |  78 ++++----------
 net/ipv4/sysctl_net_ipv4.c   |   1 -
 net/ipv4/tcp.c               |   3 +-
 net/ipv4/tcp_ipv4.c          |   9 +-
 net/ipv4/tcp_memcontrol.c    | 147 ++++----------------------
 net/ipv4/tcp_output.c        |   6 +-
 net/ipv6/tcp_ipv6.c          |   3 -
 18 files changed, 319 insertions(+), 499 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: [PATCH 1/8] mm: page_counter: let page_counter_try_charge() return bool
Date: Thu, 22 Oct 2015 00:21:29 -0400
Message-ID: <1445487696-21545-2-git-send-email-hannes@cmpxchg.org>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
Return-path:
In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: "David S. Miller" , Andrew Morton
Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

page_counter_try_charge() currently returns 0 on success and -ENOMEM
on failure, which is surprising behavior given the function name. Make
it follow the expected pattern of try_stuff() functions that return a
boolean true to indicate success, or false for failure.

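For readers unfamiliar with the calling convention, the following is a
simplified userspace sketch of a hierarchical try-charge that returns
bool. It is an assumption-laden illustration, not the kernel code:
struct counter and counter_try_charge() are hypothetical stand-ins, and
the sketch checks the limit before charging without any atomics, whereas
the real page_counter has to cope with concurrent updates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct page_counter; not the kernel type. */
struct counter {
	unsigned long usage;
	unsigned long limit;
	struct counter *parent;
};

/*
 * Charge nr_pages to @counter and every ancestor. Returns true on
 * success. On failure, returns false, stores the first counter that
 * would exceed its limit in *fail, and undoes the charges already
 * applied to the counters below it.
 */
static bool counter_try_charge(struct counter *counter,
			       unsigned long nr_pages,
			       struct counter **fail)
{
	struct counter *c;

	for (c = counter; c; c = c->parent) {
		/* usage never exceeds limit here, so this cannot wrap */
		if (nr_pages > c->limit - c->usage) {
			*fail = c;
			goto failed;
		}
		c->usage += nr_pages;
	}
	return true;

failed:
	for (c = counter; c != *fail; c = c->parent)
		c->usage -= nr_pages;
	return false;
}
```

A caller that still needs an errno maps the result itself, for example
"if (!counter_try_charge(...)) ret = -ENOMEM;", which mirrors the
conversions done in this patch.
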
Signed-off-by: Johannes Weiner --- include/linux/page_counter.h | 6 +++--- mm/hugetlb_cgroup.c | 3 ++- mm/memcontrol.c | 11 +++++------ mm/page_counter.c | 14 +++++++------- 4 files changed, 17 insertions(+), 17 deletions(-) diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 17fa4f8..7e62920 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -36,9 +36,9 @@ static inline unsigned long page_counter_read(struct page_counter *counter) void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages); void page_counter_charge(struct page_counter *counter, unsigned long nr_pages); -int page_counter_try_charge(struct page_counter *counter, - unsigned long nr_pages, - struct page_counter **fail); +bool page_counter_try_charge(struct page_counter *counter, + unsigned long nr_pages, + struct page_counter **fail); void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages); int page_counter_limit(struct page_counter *counter, unsigned long limit); int page_counter_memparse(const char *buf, const char *max, diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 6a44263..d8fb10d 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -186,7 +186,8 @@ again: } rcu_read_unlock(); - ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter); + if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter)) + ret = -ENOMEM; css_put(&h_cg->css); done: *ptr = h_cg; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c71fe40..a8ccdbc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2018,8 +2018,8 @@ retry: return 0; if (!do_swap_account || - !page_counter_try_charge(&memcg->memsw, batch, &counter)) { - if (!page_counter_try_charge(&memcg->memory, batch, &counter)) + page_counter_try_charge(&memcg->memsw, batch, &counter)) { + if (page_counter_try_charge(&memcg->memory, batch, &counter)) goto done_restock; if (do_swap_account) 
page_counter_uncharge(&memcg->memsw, batch); @@ -2383,14 +2383,13 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, { unsigned int nr_pages = 1 << order; struct page_counter *counter; - int ret = 0; + int ret; if (!memcg_kmem_is_active(memcg)) return 0; - ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter); - if (ret) - return ret; + if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) + return -ENOMEM; ret = try_charge(memcg, gfp, nr_pages); if (ret) { diff --git a/mm/page_counter.c b/mm/page_counter.c index 11b4bed..7c6a63d 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -56,12 +56,12 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages) * @nr_pages: number of pages to charge * @fail: points first counter to hit its limit, if any * - * Returns 0 on success, or -ENOMEM and @fail if the counter or one of - * its ancestors has hit its configured limit. + * Returns %true on success, or %false and @fail if the counter or one + * of its ancestors has hit its configured limit. */ -int page_counter_try_charge(struct page_counter *counter, - unsigned long nr_pages, - struct page_counter **fail) +bool page_counter_try_charge(struct page_counter *counter, + unsigned long nr_pages, + struct page_counter **fail) { struct page_counter *c; @@ -99,13 +99,13 @@ int page_counter_try_charge(struct page_counter *counter, if (new > c->watermark) c->watermark = new; } - return 0; + return true; failed: for (c = counter; c != *fail; c = c->parent) page_counter_cancel(c, nr_pages); - return -ENOMEM; + return false; } /** -- 2.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [PATCH 4/8] mm: memcontrol: prepare for unified hierarchy socket accounting Date: Thu, 22 Oct 2015 00:21:32 -0400 Message-ID: <1445487696-21545-5-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Return-path: In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: "David S. Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org The unified hierarchy memory controller will account socket memory. Move the infrastructure functions accordingly. Signed-off-by: Johannes Weiner --- mm/memcontrol.c | 136 ++++++++++++++++++++++++++++---------------------------- 1 file changed, 68 insertions(+), 68 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c41e6d7..3789050 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -287,74 +287,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) return mem_cgroup_from_css(css); } -/* Writing them here to avoid exposing memcg's inner layout */ -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) - -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); - -void sock_update_memcg(struct sock *sk) -{ - struct mem_cgroup *memcg; - /* - * Socket cloning can throw us here with sk_cgrp already - * filled. It won't however, necessarily happen from - * process context. So the test for root memcg given - * the current task's memcg won't help us in this case. - * - * Respecting the original socket's memcg is a better - * decision in this case. 
- */ - if (sk->sk_memcg) { - BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); - css_get(&sk->sk_memcg->css); - return; - } - - rcu_read_lock(); - memcg = mem_cgroup_from_task(current); - if (css_tryget_online(&memcg->css)) - sk->sk_memcg = memcg; - rcu_read_unlock(); -} -EXPORT_SYMBOL(sock_update_memcg); - -void sock_release_memcg(struct sock *sk) -{ - if (sk->sk_memcg) - css_put(&sk->sk_memcg->css); -} - -/** - * mem_cgroup_charge_skmem - charge socket memory - * @memcg: memcg to charge - * @nr_pages: number of pages to charge - * - * Charges @nr_pages to @memcg. Returns %true if the charge fit within - * the memcg's configured limit, %false if the charge had to be forced. - */ -bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) -{ - struct page_counter *counter; - - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) - return true; - - page_counter_charge(&memcg->skmem, nr_pages); - return false; -} - -/** - * mem_cgroup_uncharge_skmem - uncharge socket memory - * @memcg: memcg to uncharge - * @nr_pages: number of pages to uncharge - */ -void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) -{ - page_counter_uncharge(&memcg->skmem, nr_pages); -} - -#endif - #ifdef CONFIG_MEMCG_KMEM /* * This will be the memcg's index in each cache's ->memcg_params.memcg_caches. @@ -5521,6 +5453,74 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) commit_charge(newpage, memcg, true); } +/* Writing them here to avoid exposing memcg's inner layout */ +#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) + +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); + +void sock_update_memcg(struct sock *sk) +{ + struct mem_cgroup *memcg; + /* + * Socket cloning can throw us here with sk_cgrp already + * filled. It won't however, necessarily happen from + * process context. So the test for root memcg given + * the current task's memcg won't help us in this case. 
+ * + * Respecting the original socket's memcg is a better + * decision in this case. + */ + if (sk->sk_memcg) { + BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); + css_get(&sk->sk_memcg->css); + return; + } + + rcu_read_lock(); + memcg = mem_cgroup_from_task(current); + if (css_tryget_online(&memcg->css)) + sk->sk_memcg = memcg; + rcu_read_unlock(); +} +EXPORT_SYMBOL(sock_update_memcg); + +void sock_release_memcg(struct sock *sk) +{ + if (sk->sk_memcg) + css_put(&sk->sk_memcg->css); +} + +/** + * mem_cgroup_charge_skmem - charge socket memory + * @memcg: memcg to charge + * @nr_pages: number of pages to charge + * + * Charges @nr_pages to @memcg. Returns %true if the charge fit within + * the memcg's configured limit, %false if the charge had to be forced. + */ +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) +{ + struct page_counter *counter; + + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) + return true; + + page_counter_charge(&memcg->skmem, nr_pages); + return false; +} + +/** + * mem_cgroup_uncharge_skmem - uncharge socket memory + * @memcg: memcg to uncharge + * @nr_pages: number of pages to uncharge + */ +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) +{ + page_counter_uncharge(&memcg->skmem, nr_pages); +} + +#endif + /* * subsys_initcall() for memory controller. * -- 2.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [PATCH 2/8] mm: memcontrol: export root_mem_cgroup Date: Thu, 22 Oct 2015 00:21:30 -0400 Message-ID: <1445487696-21545-3-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Return-path: In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: "David S. Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org A later patch will need this symbol in files other than memcontrol.c, so export it now and replace mem_cgroup_root_css at the same time. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 3 ++- mm/backing-dev.c | 2 +- mm/memcontrol.c | 5 ++--- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 805da1f..19ff87b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -275,7 +275,8 @@ struct mem_cgroup { struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; -extern struct cgroup_subsys_state *mem_cgroup_root_css; + +extern struct mem_cgroup *root_mem_cgroup; /** * mem_cgroup_events - count memory events against a cgroup diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 095b23b..73ab967 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -702,7 +702,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi) ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL); if (!ret) { - bdi->wb.memcg_css = mem_cgroup_root_css; + bdi->wb.memcg_css = &root_mem_cgroup->css; bdi->wb.blkcg_css = blkcg_root_css; } return ret; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a8ccdbc..e54f434 100644 --- a/mm/memcontrol.c +++ 
b/mm/memcontrol.c @@ -76,9 +76,9 @@ struct cgroup_subsys memory_cgrp_subsys __read_mostly; EXPORT_SYMBOL(memory_cgrp_subsys); +struct mem_cgroup *root_mem_cgroup __read_mostly; + #define MEM_CGROUP_RECLAIM_RETRIES 5 -static struct mem_cgroup *root_mem_cgroup __read_mostly; -struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly; /* Whether the swap controller is active */ #ifdef CONFIG_MEMCG_SWAP @@ -4213,7 +4213,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) /* root ? */ if (parent_css == NULL) { root_mem_cgroup = memcg; - mem_cgroup_root_css = &memcg->css; page_counter_init(&memcg->memory, NULL); memcg->high = PAGE_COUNTER_MAX; memcg->soft_limit = PAGE_COUNTER_MAX; -- 2.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Date: Thu, 22 Oct 2015 00:21:31 -0400 Message-ID: <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Return-path: In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: "David S. Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org The tcp memory controller has extensive provisions for future memory accounting interfaces that won't materialize after all. Cut the code base down to what's actually used, now and in the likely future. - There won't be any different protocol counters in the future, so a direct sock->sk_memcg linkage is enough. 
This eliminates a lot of callback maze and boilerplate code, and restores most of the socket allocation code to pre-tcp_memcontrol state. - There won't be a tcp control soft limit, so integrating the memcg code into the global skmem limiting scheme complicates things unnecessarily. Replace all that with simple and clear charge and uncharge calls--hidden behind a jump label--to account skb memory. - The previous jump label code was an elaborate state machine that tracked the number of cgroups with an active socket limit in order to enable the skmem tracking and accounting code only when actively necessary. But this is overengineered: it was meant to protect the people who never use this feature in the first place. Simply enable the branches once when the first limit is set until the next reboot. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 64 ++++++++----------- include/net/sock.h | 135 +++------------------------------------ include/net/tcp.h | 3 - include/net/tcp_memcontrol.h | 7 --- mm/memcontrol.c | 101 +++++++++++++++-------------- net/core/sock.c | 78 ++++++----------------- net/ipv4/sysctl_net_ipv4.c | 1 - net/ipv4/tcp.c | 3 +- net/ipv4/tcp_ipv4.c | 9 +-- net/ipv4/tcp_memcontrol.c | 147 +++++++------------------------------------ net/ipv4/tcp_output.c | 6 +- net/ipv6/tcp_ipv6.c | 3 - 12 files changed, 136 insertions(+), 421 deletions(-) delete mode 100644 include/net/tcp_memcontrol.h diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 19ff87b..5b72f83 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -85,34 +85,6 @@ enum mem_cgroup_events_target { MEM_CGROUP_NTARGETS, }; -/* - * Bits in struct cg_proto.flags - */ -enum cg_proto_flags { - /* Currently active and new sockets should be assigned to cgroups */ - MEMCG_SOCK_ACTIVE, - /* It was ever activated; we must disarm static keys on destruction */ - MEMCG_SOCK_ACTIVATED, -}; - -struct cg_proto { - struct page_counter memory_allocated; /* 
Current allocated memory. */ - struct percpu_counter sockets_allocated; /* Current number of sockets. */ - int memory_pressure; - long sysctl_mem[3]; - unsigned long flags; - /* - * memcg field is used to find which memcg we belong directly - * Each memcg struct can hold more than one cg_proto, so container_of - * won't really cut. - * - * The elegant solution would be having an inverse function to - * proto_cgroup in struct proto, but that means polluting the structure - * for everybody, instead of just for memcg users. - */ - struct mem_cgroup *memcg; -}; - #ifdef CONFIG_MEMCG struct mem_cgroup_stat_cpu { long count[MEM_CGROUP_STAT_NSTATS]; @@ -185,8 +157,15 @@ struct mem_cgroup { /* Accounted resources */ struct page_counter memory; + + /* + * Legacy non-resource counters. In unified hierarchy, all + * memory is accounted and limited through memcg->memory. + * Consumer breakdown happens in the statistics. + */ struct page_counter memsw; struct page_counter kmem; + struct page_counter skmem; /* Normal memory consumption range */ unsigned long low; @@ -246,9 +225,6 @@ struct mem_cgroup { */ struct mem_cgroup_stat_cpu __percpu *stat; -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) - struct cg_proto tcp_mem; -#endif #if defined(CONFIG_MEMCG_KMEM) /* Index in the kmem_cache->memcg_params.memcg_caches array */ int kmemcg_id; @@ -676,12 +652,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx) } #endif /* CONFIG_MEMCG */ -enum { - UNDER_LIMIT, - SOFT_LIMIT, - OVER_LIMIT, -}; - #ifdef CONFIG_CGROUP_WRITEBACK struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg); @@ -707,15 +677,35 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, struct sock; #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) +extern struct static_key_false mem_cgroup_sockets; +static inline bool mem_cgroup_do_sockets(void) +{ + return static_branch_unlikely(&mem_cgroup_sockets); +} void sock_update_memcg(struct sock *sk); void 
sock_release_memcg(struct sock *sk); +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); #else +static inline bool mem_cgroup_do_sockets(void) +{ + return false; +} static inline void sock_update_memcg(struct sock *sk) { } static inline void sock_release_memcg(struct sock *sk) { } +static inline bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, + unsigned int nr_pages) +{ + return true; +} +static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, + unsigned int nr_pages) +{ +} #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */ #ifdef CONFIG_MEMCG_KMEM diff --git a/include/net/sock.h b/include/net/sock.h index 59a7196..67795fc 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -69,22 +69,6 @@ #include #include -struct cgroup; -struct cgroup_subsys; -#ifdef CONFIG_NET -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss); -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg); -#else -static inline -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss) -{ - return 0; -} -static inline -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) -{ -} -#endif /* * This structure really needs to be cleaned up. 
* Most of it is for TCP, and not used by any of @@ -243,7 +227,6 @@ struct sock_common { /* public: */ }; -struct cg_proto; /** * struct sock - network layer representation of sockets * @__sk_common: shared layout with inet_timewait_sock @@ -310,7 +293,7 @@ struct cg_proto; * @sk_security: used by security modules * @sk_mark: generic packet mark * @sk_classid: this socket's cgroup classid - * @sk_cgrp: this socket's cgroup-specific proto data + * @sk_memcg: this socket's memcg association * @sk_write_pending: a write to stream socket waits to start * @sk_state_change: callback to indicate change in the state of the sock * @sk_data_ready: callback to indicate there is data to be processed @@ -447,7 +430,7 @@ struct sock { #ifdef CONFIG_CGROUP_NET_CLASSID u32 sk_classid; #endif - struct cg_proto *sk_cgrp; + struct mem_cgroup *sk_memcg; void (*sk_state_change)(struct sock *sk); void (*sk_data_ready)(struct sock *sk); void (*sk_write_space)(struct sock *sk); @@ -1051,18 +1034,6 @@ struct proto { #ifdef SOCK_REFCNT_DEBUG atomic_t socks; #endif -#ifdef CONFIG_MEMCG_KMEM - /* - * cgroup specific init/deinit functions. Called once for all - * protocols that implement it, from cgroups populate function. - * This function has to setup any files the protocol want to - * appear in the kmem cgroup filesystem. 
- */ - int (*init_cgroup)(struct mem_cgroup *memcg, - struct cgroup_subsys *ss); - void (*destroy_cgroup)(struct mem_cgroup *memcg); - struct cg_proto *(*proto_cgroup)(struct mem_cgroup *memcg); -#endif }; int proto_register(struct proto *prot, int alloc_slab); @@ -1093,23 +1064,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk) #define sk_refcnt_debug_release(sk) do { } while (0) #endif /* SOCK_REFCNT_DEBUG */ -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET) -extern struct static_key memcg_socket_limit_enabled; -static inline struct cg_proto *parent_cg_proto(struct proto *proto, - struct cg_proto *cg_proto) -{ - return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg)); -} -#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled) -#else -#define mem_cgroup_sockets_enabled 0 -static inline struct cg_proto *parent_cg_proto(struct proto *proto, - struct cg_proto *cg_proto) -{ - return NULL; -} -#endif - static inline bool sk_stream_memory_free(const struct sock *sk) { if (sk->sk_wmem_queued >= sk->sk_sndbuf) @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) if (!sk->sk_prot->memory_pressure) return false; - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) - return !!sk->sk_cgrp->memory_pressure; - return !!*sk->sk_prot->memory_pressure; } @@ -1146,61 +1097,19 @@ static inline void sk_leave_memory_pressure(struct sock *sk) { int *memory_pressure = sk->sk_prot->memory_pressure; - if (!memory_pressure) - return; - - if (*memory_pressure) + if (memory_pressure && *memory_pressure) *memory_pressure = 0; - - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { - struct cg_proto *cg_proto = sk->sk_cgrp; - struct proto *prot = sk->sk_prot; - - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) - cg_proto->memory_pressure = 0; - } - } static inline void sk_enter_memory_pressure(struct sock *sk) { - if (!sk->sk_prot->enter_memory_pressure) - return; - - if 
(mem_cgroup_sockets_enabled && sk->sk_cgrp) { - struct cg_proto *cg_proto = sk->sk_cgrp; - struct proto *prot = sk->sk_prot; - - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) - cg_proto->memory_pressure = 1; - } - - sk->sk_prot->enter_memory_pressure(sk); + if (sk->sk_prot->enter_memory_pressure) + sk->sk_prot->enter_memory_pressure(sk); } static inline long sk_prot_mem_limits(const struct sock *sk, int index) { - long *prot = sk->sk_prot->sysctl_mem; - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) - prot = sk->sk_cgrp->sysctl_mem; - return prot[index]; -} - -static inline void memcg_memory_allocated_add(struct cg_proto *prot, - unsigned long amt, - int *parent_status) -{ - page_counter_charge(&prot->memory_allocated, amt); - - if (page_counter_read(&prot->memory_allocated) > - prot->memory_allocated.limit) - *parent_status = OVER_LIMIT; -} - -static inline void memcg_memory_allocated_sub(struct cg_proto *prot, - unsigned long amt) -{ - page_counter_uncharge(&prot->memory_allocated, amt); + return sk->sk_prot->sysctl_mem[index]; } static inline long @@ -1208,24 +1117,14 @@ sk_memory_allocated(const struct sock *sk) { struct proto *prot = sk->sk_prot; - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) - return page_counter_read(&sk->sk_cgrp->memory_allocated); - return atomic_long_read(prot->memory_allocated); } static inline long -sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status) +sk_memory_allocated_add(struct sock *sk, int amt) { struct proto *prot = sk->sk_prot; - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { - memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status); - /* update the root cgroup regardless */ - atomic_long_add_return(amt, prot->memory_allocated); - return page_counter_read(&sk->sk_cgrp->memory_allocated); - } - return atomic_long_add_return(amt, prot->memory_allocated); } @@ -1234,9 +1133,6 @@ sk_memory_allocated_sub(struct sock *sk, int amt) { struct proto *prot = sk->sk_prot; - if 
(mem_cgroup_sockets_enabled && sk->sk_cgrp) - memcg_memory_allocated_sub(sk->sk_cgrp, amt); - atomic_long_sub(amt, prot->memory_allocated); } @@ -1244,13 +1140,6 @@ static inline void sk_sockets_allocated_dec(struct sock *sk) { struct proto *prot = sk->sk_prot; - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { - struct cg_proto *cg_proto = sk->sk_cgrp; - - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) - percpu_counter_dec(&cg_proto->sockets_allocated); - } - percpu_counter_dec(prot->sockets_allocated); } @@ -1258,13 +1147,6 @@ static inline void sk_sockets_allocated_inc(struct sock *sk) { struct proto *prot = sk->sk_prot; - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { - struct cg_proto *cg_proto = sk->sk_cgrp; - - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) - percpu_counter_inc(&cg_proto->sockets_allocated); - } - percpu_counter_inc(prot->sockets_allocated); } @@ -1273,9 +1155,6 @@ sk_sockets_allocated_read_positive(struct sock *sk) { struct proto *prot = sk->sk_prot; - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) - return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated); - return percpu_counter_read_positive(prot->sockets_allocated); } diff --git a/include/net/tcp.h b/include/net/tcp.h index eed94fc..77b6c7e 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -291,9 +291,6 @@ extern int tcp_memory_pressure; /* optimized version of sk_under_memory_pressure() for TCP sockets */ static inline bool tcp_under_memory_pressure(const struct sock *sk) { - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) - return !!sk->sk_cgrp->memory_pressure; - return tcp_memory_pressure; } /* diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h deleted file mode 100644 index 05b94d9..0000000 --- a/include/net/tcp_memcontrol.h +++ /dev/null @@ -1,7 +0,0 @@ -#ifndef _TCP_MEMCG_H -#define _TCP_MEMCG_H - -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg); -int tcp_init_cgroup(struct mem_cgroup *memcg, 
struct cgroup_subsys *ss); -void tcp_destroy_cgroup(struct mem_cgroup *memcg); -#endif /* _TCP_MEMCG_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e54f434..c41e6d7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -66,7 +66,6 @@ #include "internal.h" #include #include -#include #include "slab.h" #include @@ -291,58 +290,68 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) /* Writing them here to avoid exposing memcg's inner layout */ #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); + void sock_update_memcg(struct sock *sk) { - if (mem_cgroup_sockets_enabled) { - struct mem_cgroup *memcg; - struct cg_proto *cg_proto; - - BUG_ON(!sk->sk_prot->proto_cgroup); - - /* Socket cloning can throw us here with sk_cgrp already - * filled. It won't however, necessarily happen from - * process context. So the test for root memcg given - * the current task's memcg won't help us in this case. - * - * Respecting the original socket's memcg is a better - * decision in this case. - */ - if (sk->sk_cgrp) { - BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg)); - css_get(&sk->sk_cgrp->memcg->css); - return; - } - - rcu_read_lock(); - memcg = mem_cgroup_from_task(current); - cg_proto = sk->sk_prot->proto_cgroup(memcg); - if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) && - css_tryget_online(&memcg->css)) { - sk->sk_cgrp = cg_proto; - } - rcu_read_unlock(); + struct mem_cgroup *memcg; + /* + * Socket cloning can throw us here with sk_cgrp already + * filled. It won't however, necessarily happen from + * process context. So the test for root memcg given + * the current task's memcg won't help us in this case. + * + * Respecting the original socket's memcg is a better + * decision in this case. 
+ */ + if (sk->sk_memcg) { + BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); + css_get(&sk->sk_memcg->css); + return; } + + rcu_read_lock(); + memcg = mem_cgroup_from_task(current); + if (css_tryget_online(&memcg->css)) + sk->sk_memcg = memcg; + rcu_read_unlock(); } EXPORT_SYMBOL(sock_update_memcg); void sock_release_memcg(struct sock *sk) { - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { - struct mem_cgroup *memcg; - WARN_ON(!sk->sk_cgrp->memcg); - memcg = sk->sk_cgrp->memcg; - css_put(&sk->sk_cgrp->memcg->css); - } + if (sk->sk_memcg) + css_put(&sk->sk_memcg->css); } -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) +/** + * mem_cgroup_charge_skmem - charge socket memory + * @memcg: memcg to charge + * @nr_pages: number of pages to charge + * + * Charges @nr_pages to @memcg. Returns %true if the charge fit within + * the memcg's configured limit, %false if the charge had to be forced. + */ +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { - if (!memcg || mem_cgroup_is_root(memcg)) - return NULL; + struct page_counter *counter; + + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) + return true; - return &memcg->tcp_mem; + page_counter_charge(&memcg->skmem, nr_pages); + return false; +} + +/** + * mem_cgroup_uncharge_skmem - uncharge socket memory + * @memcg: memcg to uncharge + * @nr_pages: number of pages to uncharge + */ +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) +{ + page_counter_uncharge(&memcg->skmem, nr_pages); } -EXPORT_SYMBOL(tcp_proto_cgroup); #endif @@ -3592,13 +3601,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, #ifdef CONFIG_MEMCG_KMEM static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) { - int ret; - - ret = memcg_propagate_kmem(memcg); - if (ret) - return ret; - - return mem_cgroup_sockets_init(memcg, ss); + return memcg_propagate_kmem(memcg); } static void memcg_deactivate_kmem(struct mem_cgroup 
*memcg) @@ -3654,7 +3657,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg) static_key_slow_dec(&memcg_kmem_enabled_key); WARN_ON(page_counter_read(&memcg->kmem)); } - mem_cgroup_sockets_destroy(memcg); } #else static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) @@ -4218,6 +4220,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) memcg->soft_limit = PAGE_COUNTER_MAX; page_counter_init(&memcg->memsw, NULL); page_counter_init(&memcg->kmem, NULL); + page_counter_init(&memcg->skmem, NULL); } memcg->last_scanned_node = MAX_NUMNODES; @@ -4266,6 +4269,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) memcg->soft_limit = PAGE_COUNTER_MAX; page_counter_init(&memcg->memsw, &parent->memsw); page_counter_init(&memcg->kmem, &parent->kmem); + page_counter_init(&memcg->skmem, &parent->skmem); /* * No need to take a reference to the parent because cgroup @@ -4277,6 +4281,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) memcg->soft_limit = PAGE_COUNTER_MAX; page_counter_init(&memcg->memsw, NULL); page_counter_init(&memcg->kmem, NULL); + page_counter_init(&memcg->skmem, NULL); /* * Deeper hierachy with use_hierarchy == false doesn't make * much sense so let cgroup subsystem know about this diff --git a/net/core/sock.c b/net/core/sock.c index 0fafd27..0debff5 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap) } EXPORT_SYMBOL(sk_net_capable); - -#ifdef CONFIG_MEMCG_KMEM -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss) -{ - struct proto *proto; - int ret = 0; - - mutex_lock(&proto_list_mutex); - list_for_each_entry(proto, &proto_list, node) { - if (proto->init_cgroup) { - ret = proto->init_cgroup(memcg, ss); - if (ret) - goto out; - } - } - - mutex_unlock(&proto_list_mutex); - return ret; -out: - list_for_each_entry_continue_reverse(proto, &proto_list, node) - if (proto->destroy_cgroup) - 
proto->destroy_cgroup(memcg); - mutex_unlock(&proto_list_mutex); - return ret; -} - -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) -{ - struct proto *proto; - - mutex_lock(&proto_list_mutex); - list_for_each_entry_reverse(proto, &proto_list, node) - if (proto->destroy_cgroup) - proto->destroy_cgroup(memcg); - mutex_unlock(&proto_list_mutex); -} -#endif - /* * Each address family might have different locking rules, so we have * one slock key per address family: @@ -239,11 +201,6 @@ void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) static struct lock_class_key af_family_keys[AF_MAX]; static struct lock_class_key af_family_slock_keys[AF_MAX]; -#if defined(CONFIG_MEMCG_KMEM) -struct static_key memcg_socket_limit_enabled; -EXPORT_SYMBOL(memcg_socket_limit_enabled); -#endif - /* * Make lock validator output more readable. (we pre-construct these * strings build-time, so that runtime initialization of socket @@ -1476,12 +1433,6 @@ void sk_free(struct sock *sk) } EXPORT_SYMBOL(sk_free); -static void sk_update_clone(const struct sock *sk, struct sock *newsk) -{ - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) - sock_update_memcg(newsk); -} - /** * sk_clone_lock - clone a socket, and lock its clone * @sk: the socket to clone @@ -1577,7 +1528,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority) sk_set_socket(newsk, NULL); newsk->sk_wq = NULL; - sk_update_clone(sk, newsk); + if (mem_cgroup_do_sockets()) + sock_update_memcg(newsk); if (newsk->sk_prot->sockets_allocated) sk_sockets_allocated_inc(newsk); @@ -2036,27 +1988,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind) struct proto *prot = sk->sk_prot; int amt = sk_mem_pages(size); long allocated; - int parent_status = UNDER_LIMIT; sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; - allocated = sk_memory_allocated_add(sk, amt, &parent_status); + allocated = sk_memory_allocated_add(sk, amt); + + if (mem_cgroup_do_sockets() && sk->sk_memcg && + 
!mem_cgroup_charge_skmem(sk->sk_memcg, amt)) + goto suppress_allocation; /* Under limit. */ - if (parent_status == UNDER_LIMIT && - allocated <= sk_prot_mem_limits(sk, 0)) { + if (allocated <= sk_prot_mem_limits(sk, 0)) { sk_leave_memory_pressure(sk); return 1; } - /* Under pressure. (we or our parents) */ - if ((parent_status > SOFT_LIMIT) || - allocated > sk_prot_mem_limits(sk, 1)) + /* Under pressure. */ + if (allocated > sk_prot_mem_limits(sk, 1)) sk_enter_memory_pressure(sk); - /* Over hard limit (we or our parents) */ - if ((parent_status == OVER_LIMIT) || - (allocated > sk_prot_mem_limits(sk, 2))) + /* Over hard limit. */ + if (allocated > sk_prot_mem_limits(sk, 2)) goto suppress_allocation; /* guarantee minimum buffer size under pressure */ @@ -2105,6 +2057,9 @@ suppress_allocation: sk_memory_allocated_sub(sk, amt); + if (mem_cgroup_do_sockets() && sk->sk_memcg) + mem_cgroup_uncharge_skmem(sk->sk_memcg, amt); + return 0; } EXPORT_SYMBOL(__sk_mem_schedule); @@ -2120,6 +2075,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount) sk_memory_allocated_sub(sk, amount); sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT; + if (mem_cgroup_do_sockets() && sk->sk_memcg) + mem_cgroup_uncharge_skmem(sk->sk_memcg, amount); + if (sk_under_memory_pressure(sk) && (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0))) sk_leave_memory_pressure(sk); diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 894da3a..1f00819 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -24,7 +24,6 @@ #include #include #include -#include static int zero; static int one = 1; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index ac1bdbb..ec931c0 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -421,7 +421,8 @@ void tcp_init_sock(struct sock *sk) sk->sk_rcvbuf = sysctl_tcp_rmem[1]; local_bh_disable(); - sock_update_memcg(sk); + if (mem_cgroup_do_sockets()) + sock_update_memcg(sk); sk_sockets_allocated_inc(sk); local_bh_enable(); } diff --git 
a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 30dd45c..bb5f4f2 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -73,7 +73,6 @@ #include #include #include -#include #include #include @@ -1808,7 +1807,8 @@ void tcp_v4_destroy_sock(struct sock *sk) tcp_saved_syn_free(tp); sk_sockets_allocated_dec(sk); - sock_release_memcg(sk); + if (mem_cgroup_do_sockets()) + sock_release_memcg(sk); } EXPORT_SYMBOL(tcp_v4_destroy_sock); @@ -2330,11 +2330,6 @@ struct proto tcp_prot = { .compat_setsockopt = compat_tcp_setsockopt, .compat_getsockopt = compat_tcp_getsockopt, #endif -#ifdef CONFIG_MEMCG_KMEM - .init_cgroup = tcp_init_cgroup, - .destroy_cgroup = tcp_destroy_cgroup, - .proto_cgroup = tcp_proto_cgroup, -#endif }; EXPORT_SYMBOL(tcp_prot); diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c index 2379c1b..09a37eb 100644 --- a/net/ipv4/tcp_memcontrol.c +++ b/net/ipv4/tcp_memcontrol.c @@ -1,107 +1,10 @@ -#include -#include -#include -#include -#include +#include #include +#include #include - -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss) -{ - /* - * The root cgroup does not use page_counters, but rather, - * rely on the data already collected by the network - * subsystem - */ - struct mem_cgroup *parent = parent_mem_cgroup(memcg); - struct page_counter *counter_parent = NULL; - struct cg_proto *cg_proto, *parent_cg; - - cg_proto = tcp_prot.proto_cgroup(memcg); - if (!cg_proto) - return 0; - - cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0]; - cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1]; - cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2]; - cg_proto->memory_pressure = 0; - cg_proto->memcg = memcg; - - parent_cg = tcp_prot.proto_cgroup(parent); - if (parent_cg) - counter_parent = &parent_cg->memory_allocated; - - page_counter_init(&cg_proto->memory_allocated, counter_parent); - percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL); - - return 0; -} -EXPORT_SYMBOL(tcp_init_cgroup); - -void tcp_destroy_cgroup(struct 
mem_cgroup *memcg) -{ - struct cg_proto *cg_proto; - - cg_proto = tcp_prot.proto_cgroup(memcg); - if (!cg_proto) - return; - - percpu_counter_destroy(&cg_proto->sockets_allocated); - - if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) - static_key_slow_dec(&memcg_socket_limit_enabled); - -} -EXPORT_SYMBOL(tcp_destroy_cgroup); - -static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages) -{ - struct cg_proto *cg_proto; - int i; - int ret; - - cg_proto = tcp_prot.proto_cgroup(memcg); - if (!cg_proto) - return -EINVAL; - - ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages); - if (ret) - return ret; - - for (i = 0; i < 3; i++) - cg_proto->sysctl_mem[i] = min_t(long, nr_pages, - sysctl_tcp_mem[i]); - - if (nr_pages == PAGE_COUNTER_MAX) - clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); - else { - /* - * The active bit needs to be written after the static_key - * update. This is what guarantees that the socket activation - * function is the last one to run. See sock_update_memcg() for - * details, and note that we don't mark any socket as belonging - * to this memcg until that flag is up. - * - * We need to do this, because static_keys will span multiple - * sites, but we can't control their order. If we mark a socket - * as accounted, but the accounting functions are not patched in - * yet, we'll lose accounting. - * - * We never race with the readers in sock_update_memcg(), - * because when this value change, the code to process it is not - * patched in yet. - * - * The activated bit is used to guarantee that no two writers - * will do the update in the same memcg. Without that, we can't - * properly shutdown the static key. 
- */ - if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) - static_key_slow_inc(&memcg_socket_limit_enabled); - set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); - } - - return 0; -} +#include +#include +#include enum { RES_USAGE, @@ -124,11 +27,17 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, switch (of_cft(of)->private) { case RES_LIMIT: /* see memcontrol.c */ + if (memcg == root_mem_cgroup) { + ret = -EINVAL; + break; + } ret = page_counter_memparse(buf, "-1", &nr_pages); if (ret) break; mutex_lock(&tcp_limit_mutex); - ret = tcp_update_limit(memcg, nr_pages); + ret = page_counter_limit(&memcg->skmem, nr_pages); + if (!ret) + static_branch_enable(&mem_cgroup_sockets); mutex_unlock(&tcp_limit_mutex); break; default: @@ -141,32 +50,28 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); - struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg); u64 val; switch (cft->private) { case RES_LIMIT: - if (!cg_proto) - return PAGE_COUNTER_MAX; - val = cg_proto->memory_allocated.limit; + val = memcg->skmem.limit; val *= PAGE_SIZE; break; case RES_USAGE: - if (!cg_proto) + if (memcg == root_mem_cgroup) val = atomic_long_read(&tcp_memory_allocated); else - val = page_counter_read(&cg_proto->memory_allocated); + val = page_counter_read(&memcg->skmem); val *= PAGE_SIZE; break; case RES_FAILCNT: - if (!cg_proto) - return 0; - val = cg_proto->memory_allocated.failcnt; + val = memcg->skmem.failcnt; break; case RES_MAX_USAGE: - if (!cg_proto) - return 0; - val = cg_proto->memory_allocated.watermark; + if (memcg == root_mem_cgroup) + val = 0; + else + val = memcg->skmem.watermark; val *= PAGE_SIZE; break; default: @@ -178,20 +83,14 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { 
- struct mem_cgroup *memcg; - struct cg_proto *cg_proto; - - memcg = mem_cgroup_from_css(of_css(of)); - cg_proto = tcp_prot.proto_cgroup(memcg); - if (!cg_proto) - return nbytes; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); switch (of_cft(of)->private) { case RES_MAX_USAGE: - page_counter_reset_watermark(&cg_proto->memory_allocated); + page_counter_reset_watermark(&memcg->skmem); break; case RES_FAILCNT: - cg_proto->memory_allocated.failcnt = 0; + memcg->skmem.failcnt = 0; break; } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 19adedb..b496fc9 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2819,13 +2819,15 @@ begin_fwd: */ void sk_forced_mem_schedule(struct sock *sk, int size) { - int amt, status; + int amt; if (size <= sk->sk_forward_alloc) return; amt = sk_mem_pages(size); sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; - sk_memory_allocated_add(sk, amt, &status); + sk_memory_allocated_add(sk, amt); + if (mem_cgroup_do_sockets() && sk->sk_memcg) + mem_cgroup_charge_skmem(sk->sk_memcg, amt); } /* Send a FIN. The caller locks the socket for us. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index f495d18..cf19e65 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1862,9 +1862,6 @@ struct proto tcpv6_prot = { .compat_setsockopt = compat_tcp_setsockopt, .compat_getsockopt = compat_tcp_getsockopt, #endif -#ifdef CONFIG_MEMCG_KMEM - .proto_cgroup = tcp_proto_cgroup, -#endif .clear_sk = tcp_v6_clear_sk, }; -- 2.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity
Date: Thu, 22 Oct 2015 00:21:35 -0400
Message-ID: <1445487696-21545-8-git-send-email-hannes@cmpxchg.org>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
Return-path:
In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: "David S. Miller" , Andrew Morton
Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

The vmpressure metric is based on reclaim efficiency, which in turn is
an attribute of the LRU. However, vmpressure events are currently
reported at the source of pressure rather than at the reclaim level.
Switch the reporting to the reclaim level to allow finer-grained
analysis of which memcg is having trouble reclaiming its pages.

As far as memory.pressure_level interface semantics go, events are
escalated up the hierarchy until a listener is found, so this won't
affect existing users that listen at higher levels.

This also prepares vmpressure for hooking it up to the networking
stack's memory pressure code.

Signed-off-by: Johannes Weiner --- mm/vmscan.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index ecc2125..50630c8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2404,6 +2404,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, memcg = mem_cgroup_iter(root, NULL, &reclaim); do { unsigned long lru_pages; + unsigned long reclaimed; unsigned long scanned; struct lruvec *lruvec; int swappiness; @@ -2416,6 +2417,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, lruvec = mem_cgroup_zone_lruvec(zone, memcg); swappiness = mem_cgroup_swappiness(memcg); + reclaimed = sc->nr_reclaimed; scanned = sc->nr_scanned; shrink_lruvec(lruvec, swappiness, sc, &lru_pages); @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, } } + vmpressure(sc->gfp_mask, memcg, + sc->nr_scanned - scanned, + sc->nr_reclaimed - reclaimed); + /* * Direct reclaim and kswapd have to scan all memory * cgroups to fulfill the overall scan target for the @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, - sc->nr_scanned - nr_scanned, - sc->nr_reclaimed - nr_reclaimed); - if (sc->nr_reclaimed - nr_reclaimed) reclaimable = true; -- 2.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 22 Oct 2015 00:21:33 -0400 Message-ID: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Return-path: In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: "David S. 
Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Socket memory can be a significant share of overall memory consumed by common workloads. In order to provide reasonable resource isolation out-of-the-box in the unified hierarchy, this type of memory needs to be accounted and tracked per default in the memory controller. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 16 ++++++-- mm/memcontrol.c | 95 ++++++++++++++++++++++++++++++++++++---------- 2 files changed, 87 insertions(+), 24 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5b72f83..6f1e0f8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -244,6 +244,10 @@ struct mem_cgroup { struct wb_domain cgwb_domain; #endif +#ifdef CONFIG_INET + struct work_struct socket_work; +#endif + /* List of events which userspace want to receive */ struct list_head event_list; spinlock_t event_list_lock; @@ -676,11 +680,15 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, #endif /* CONFIG_CGROUP_WRITEBACK */ struct sock; -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) -extern struct static_key_false mem_cgroup_sockets; +#ifdef CONFIG_INET +extern struct static_key_true mem_cgroup_sockets; static inline bool mem_cgroup_do_sockets(void) { - return static_branch_unlikely(&mem_cgroup_sockets); + if (mem_cgroup_disabled()) + return false; + if (!static_branch_likely(&mem_cgroup_sockets)) + return false; + return true; } void sock_update_memcg(struct sock *sk); void sock_release_memcg(struct sock *sk); @@ -706,7 +714,7 @@ static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { } -#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */ +#endif /* CONFIG_INET */ #ifdef CONFIG_MEMCG_KMEM extern struct static_key memcg_kmem_enabled_key; diff --git a/mm/memcontrol.c b/mm/memcontrol.c 
index 3789050..cb1d6aa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1916,6 +1916,18 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, return NOTIFY_OK; } +static void reclaim_high(struct mem_cgroup *memcg, + unsigned int nr_pages, + gfp_t gfp_mask) +{ + do { + if (page_counter_read(&memcg->memory) <= memcg->high) + continue; + mem_cgroup_events(memcg, MEMCG_HIGH, 1); + try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); + } while ((memcg = parent_mem_cgroup(memcg))); +} + /* * Scheduled by try_charge() to be executed from the userland return path * and reclaims memory over the high limit. @@ -1923,20 +1935,13 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, void mem_cgroup_handle_over_high(void) { unsigned int nr_pages = current->memcg_nr_pages_over_high; - struct mem_cgroup *memcg, *pos; + struct mem_cgroup *memcg; if (likely(!nr_pages)) return; - pos = memcg = get_mem_cgroup_from_mm(current->mm); - - do { - if (page_counter_read(&pos->memory) <= pos->high) - continue; - mem_cgroup_events(pos, MEMCG_HIGH, 1); - try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true); - } while ((pos = parent_mem_cgroup(pos))); - + memcg = get_mem_cgroup_from_mm(current->mm); + reclaim_high(memcg, nr_pages, GFP_KERNEL); css_put(&memcg->css); current->memcg_nr_pages_over_high = 0; } @@ -4129,6 +4134,8 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(parent_mem_cgroup); +static void socket_work_func(struct work_struct *work); + static struct cgroup_subsys_state * __ref mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) { @@ -4169,6 +4176,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); #endif +#ifdef CONFIG_INET + INIT_WORK(&memcg->socket_work, socket_work_func); +#endif return &memcg->css; free_out: @@ -4266,6 +4276,8 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) { struct 
mem_cgroup *memcg = mem_cgroup_from_css(css); + cancel_work_sync(&memcg->socket_work); + memcg_destroy_kmem(memcg); __mem_cgroup_free(memcg); } @@ -4948,10 +4960,15 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) * guarantees that @root doesn't have any children, so turning it * on for the root memcg is enough. */ - if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) { root_mem_cgroup->use_hierarchy = true; - else +#ifdef CONFIG_INET + /* unified hierarchy always counts skmem */ + static_branch_enable(&mem_cgroup_sockets); +#endif + } else { root_mem_cgroup->use_hierarchy = false; + } } static u64 memory_current_read(struct cgroup_subsys_state *css, @@ -5453,10 +5470,9 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) commit_charge(newpage, memcg, true); } -/* Writing them here to avoid exposing memcg's inner layout */ -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) +#ifdef CONFIG_INET -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); +DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets); void sock_update_memcg(struct sock *sk) { @@ -5490,6 +5506,14 @@ void sock_release_memcg(struct sock *sk) css_put(&sk->sk_memcg->css); } +static void socket_work_func(struct work_struct *work) +{ + struct mem_cgroup *memcg; + + memcg = container_of(work, struct mem_cgroup, socket_work); + reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL); +} + /** * mem_cgroup_charge_skmem - charge socket memory * @memcg: memcg to charge @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk) */ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { + unsigned int batch = max(CHARGE_BATCH, nr_pages); struct page_counter *counter; + bool force = false; - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) + return true; + page_counter_charge(&memcg->skmem, 
nr_pages); + return false; + } + + if (consume_stock(memcg, nr_pages)) return true; +retry: + if (page_counter_try_charge(&memcg->memory, batch, &counter)) + goto done; - page_counter_charge(&memcg->skmem, nr_pages); - return false; + if (batch > nr_pages) { + batch = nr_pages; + goto retry; + } + + force = true; + page_counter_charge(&memcg->memory, batch); +done: + css_get_many(&memcg->css, batch); + if (batch > nr_pages) + refill_stock(memcg, batch - nr_pages); + + schedule_work(&memcg->socket_work); + + return !force; } /** @@ -5516,10 +5565,16 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) */ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { - page_counter_uncharge(&memcg->skmem, nr_pages); + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { + page_counter_uncharge(&memcg->skmem, nr_pages); + return; + } + + page_counter_uncharge(&memcg->memory, nr_pages); + css_put_many(&memcg->css, nr_pages); } -#endif +#endif /* CONFIG_INET */ /* * subsys_initcall() for memory controller. -- 2.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [PATCH 6/8] mm: vmscan: simplify memcg vs. global shrinker invocation Date: Thu, 22 Oct 2015 00:21:34 -0400 Message-ID: <1445487696-21545-7-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Return-path: In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: "David S. 
Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Letting shrink_slab() handle the root_mem_cgroup, and implicitely the !CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers unconditionally from within the memcg iteration loop. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 2 ++ mm/vmscan.c | 31 ++++++++++++++++--------------- 2 files changed, 18 insertions(+), 15 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6f1e0f8..d66ae18 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -482,6 +482,8 @@ void mem_cgroup_split_huge_fixup(struct page *head); #else /* CONFIG_MEMCG */ struct mem_cgroup; +#define root_mem_cgroup NULL + static inline void mem_cgroup_events(struct mem_cgroup *memcg, enum mem_cgroup_events_index idx, unsigned int nr) diff --git a/mm/vmscan.c b/mm/vmscan.c index 9b52ecf..ecc2125 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct shrinker *shrinker; unsigned long freed = 0; + /* Global shrinker mode */ + if (memcg == root_mem_cgroup) + memcg = NULL; + if (memcg && !memcg_kmem_is_active(memcg)) return 0; @@ -2417,11 +2421,22 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, shrink_lruvec(lruvec, swappiness, sc, &lru_pages); zone_lru_pages += lru_pages; - if (memcg && is_classzone) + /* + * Shrink the slab caches in the same proportion that + * the eligible LRU pages were scanned. 
+ */ + if (is_classzone) { shrink_slab(sc->gfp_mask, zone_to_nid(zone), memcg, sc->nr_scanned - scanned, lru_pages); + if (reclaim_state) { + sc->nr_reclaimed += + reclaim_state->reclaimed_slab; + reclaim_state->reclaimed_slab = 0; + } + } + /* * Direct reclaim and kswapd have to scan all memory * cgroups to fulfill the overall scan target for the @@ -2439,20 +2454,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); - /* - * Shrink the slab caches in the same proportion that - * the eligible LRU pages were scanned. - */ - if (global_reclaim(sc) && is_classzone) - shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL, - sc->nr_scanned - nr_scanned, - zone_lru_pages); - - if (reclaim_state) { - sc->nr_reclaimed += reclaim_state->reclaimed_slab; - reclaim_state->reclaimed_slab = 0; - } - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, sc->nr_scanned - nr_scanned, sc->nr_reclaimed - nr_reclaimed); -- 2.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure Date: Thu, 22 Oct 2015 00:21:36 -0400 Message-ID: <1445487696-21545-9-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Return-path: In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: "David S. 
Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Let the networking stack know when a memcg is under reclaim pressure, so it can shrink its transmit windows accordingly. Whenever the reclaim efficiency of a memcg's LRU lists drops low enough for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state in the socket and tcp memory code that tells it to reduce memory usage in sockets associated with said memory cgroup. vmpressure events are edge triggered, so for hysteresis assert socket pressure for a second to allow for subsequent vmpressure events to occur before letting the socket code return to normal. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 9 +++++++++ include/net/sock.h | 4 ++++ include/net/tcp.h | 4 ++++ mm/memcontrol.c | 1 + mm/vmpressure.c | 29 ++++++++++++++++++++++++----- 5 files changed, 42 insertions(+), 5 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d66ae18..b9990f7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -246,6 +246,7 @@ struct mem_cgroup { #ifdef CONFIG_INET struct work_struct socket_work; + unsigned long socket_pressure; #endif /* List of events which userspace want to receive */ @@ -696,6 +697,10 @@ void sock_update_memcg(struct sock *sk); void sock_release_memcg(struct sock *sk); bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); +static inline bool mem_cgroup_socket_pressure(struct mem_cgroup *memcg) +{ + return time_before(jiffies, memcg->socket_pressure); +} #else static inline bool mem_cgroup_do_sockets(void) { @@ -716,6 +721,10 @@ static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { } +static inline bool mem_cgroup_socket_pressure(struct mem_cgroup *memcg) +{ + 
return false; +} #endif /* CONFIG_INET */ #ifdef CONFIG_MEMCG_KMEM diff --git a/include/net/sock.h b/include/net/sock.h index 67795fc..22bfb9c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1087,6 +1087,10 @@ static inline bool sk_has_memory_pressure(const struct sock *sk) static inline bool sk_under_memory_pressure(const struct sock *sk) { + if (mem_cgroup_do_sockets() && sk->sk_memcg && + mem_cgroup_socket_pressure(sk->sk_memcg)) + return true; + if (!sk->sk_prot->memory_pressure) return false; diff --git a/include/net/tcp.h b/include/net/tcp.h index 77b6c7e..c7d342c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -291,6 +291,10 @@ extern int tcp_memory_pressure; /* optimized version of sk_under_memory_pressure() for TCP sockets */ static inline bool tcp_under_memory_pressure(const struct sock *sk) { + if (mem_cgroup_do_sockets() && sk->sk_memcg && + mem_cgroup_socket_pressure(sk->sk_memcg)) + return true; + return tcp_memory_pressure; } /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cb1d6aa..2e09def 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4178,6 +4178,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) #endif #ifdef CONFIG_INET INIT_WORK(&memcg->socket_work, socket_work_func); + memcg->socket_pressure = jiffies; #endif return &memcg->css; diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 4c25e62..f64c0e1 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -137,14 +137,11 @@ struct vmpressure_event { }; static bool vmpressure_event(struct vmpressure *vmpr, - unsigned long scanned, unsigned long reclaimed) + enum vmpressure_levels level) { struct vmpressure_event *ev; - enum vmpressure_levels level; bool signalled = false; - level = vmpressure_calc_level(scanned, reclaimed); - mutex_lock(&vmpr->events_lock); list_for_each_entry(ev, &vmpr->events, node) { @@ -162,6 +159,7 @@ static bool vmpressure_event(struct vmpressure *vmpr, static void vmpressure_work_fn(struct work_struct *work) { struct 
vmpressure *vmpr = work_to_vmpressure(work); + enum vmpressure_levels level; unsigned long scanned; unsigned long reclaimed; @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work) vmpr->reclaimed = 0; spin_unlock(&vmpr->sr_lock); + level = vmpressure_calc_level(scanned, reclaimed); + + if (level > VMPRESSURE_LOW) { + struct mem_cgroup *memcg; + /* + * Let the socket buffer allocator know that we are + * having trouble reclaiming LRU pages. + * + * For hysteresis, keep the pressure state asserted + * for a second in which subsequent pressure events + * can occur. + * + * XXX: is vmpressure a global feature or part of + * memcg? There shouldn't be anything memcg-specific + * about exporting reclaim success ratios from the VM. + */ + memcg = container_of(vmpr, struct mem_cgroup, vmpressure); + if (memcg != root_mem_cgroup) + memcg->socket_pressure = jiffies + HZ; + } + do { - if (vmpressure_event(vmpr, scanned, reclaimed)) + if (vmpressure_event(vmpr, level)) break; /* * If not handled, propagate the event upward into the -- 2.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Thu, 22 Oct 2015 21:45:10 +0300 Message-ID: <20151022184509.GM18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Johannes,

On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
...
> Patch #5 adds accounting and tracking of socket memory to the unified
> hierarchy memory controller, as described above. It uses the existing
> per-cpu charge caches and triggers high limit reclaim asynchronously.
>
> Patch #8 uses the vmpressure extension to equalize pressure between
> the pages tracked natively by the VM and socket buffer pages. As the
> pool is shared, it makes sense that while natively tracked pages are
> under duress the network transmit windows are also not increased.

First of all, I've no experience in networking, so I'm likely to be
mistaken. Nevertheless I beg to disagree that this patch set is a step
in the right direction. Here goes why.

I admit that your idea to get rid of explicit tcp window control knobs
and size it dynamically based on memory pressure instead does sound
tempting, but I don't think it'd always work. The problem is that in
contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only
stop growing them.

Now suppose a system hasn't experienced memory pressure for a while. If
we don't have an explicit tcp window limit, tcp buffers on such a system
might have eaten almost all available memory (because of network
load/problems). If a user workload that needs a significant amount of
memory is then started suddenly, the network code will receive a
notification and surely stop growing buffers, but all those buffers
accumulated won't disappear instantly. As a result, the workload might
be unable to find enough free memory and have no choice but to invoke
the OOM killer. This looks unexpected from the user POV.

That said, I think we do need per memcg tcp window control similar to
what we have system-wide. In other words, Glauber's work makes sense to
me. You might want to point me at my RFC patch where I proposed to
revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my
mind since then. Now I think I was mistaken; luckily I was stopped.
However, I may be mistaken again :-)

Thanks,
Vladimir

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladimir Davydov
Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
Date: Thu, 22 Oct 2015 21:46:12 +0300
Message-ID: <20151022184612.GN18351@esperanza>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Return-path:
Content-Disposition: inline
In-Reply-To: <1445487696-21545-4-git-send-email-hannes@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Transfer-Encoding: 7bit
To: Johannes Weiner
Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> The tcp memory controller has extensive provisions for future memory
> accounting interfaces that won't materialize after all. Cut the code
> base down to what's actually used, now and in the likely future.
>
> - There won't be any different protocol counters in the future, so a
>   direct sock->sk_memcg linkage is enough. This eliminates a lot of
>   callback maze and boilerplate code, and restores most of the socket
>   allocation code to pre-tcp_memcontrol state.
>
> - There won't be a tcp control soft limit, so integrating the memcg

In fact, the code is ready for the "soft" limit (I mean min, pressure,
max tuple), it just lacks a knob.

> code into the global skmem limiting scheme complicates things > unnecessarily. Replace all that with simple and clear charge and > uncharge calls--hidden behind a jump label--to account skb memory. > > - The previous jump label code was an elaborate state machine that > tracked the number of cgroups with an active socket limit in order > to enable the skmem tracking and accounting code only when actively > necessary. But this is overengineered: it was meant to protect the > people who never use this feature in the first place. Simply enable > the branches once when the first limit is set until the next reboot. > ... > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > if (!sk->sk_prot->memory_pressure) > return false; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - AFAIU, now we won't shrink the window on hitting the limit, i.e. this patch subtly changes the behavior of the existing knobs, potentially breaking them. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 22 Oct 2015 21:47:57 +0300 Message-ID: <20151022184757.GO18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote:
...
> @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk)
> */
> bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> {
> + unsigned int batch = max(CHARGE_BATCH, nr_pages);
> struct page_counter *counter;
> + bool force = false;
>
> - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> + return true;
> + page_counter_charge(&memcg->skmem, nr_pages);
> + return false;
> + }
> +
> + if (consume_stock(memcg, nr_pages))
> return true;
> +retry:
> + if (page_counter_try_charge(&memcg->memory, batch, &counter))
> + goto done;

Currently, we use memcg->memory only for charging memory pages. Besides,
every page charged to this counter (including kmem) has its ->mem_cgroup
field set appropriately. This looks consistent and nice. As an extra
benefit, we can track all pages charged to a memory cgroup via
/proc/kpagecgroup.

Now, you charge "window size" to it, which AFAIU isn't necessarily equal
to the amount of memory actually consumed by the cgroup for socket
buffers. I think this looks ugly and inconsistent with the existing
behavior.

I agree that we need to charge socket buffers to ->memory, but IMO we
should do that for each skb page, using memcg_kmem_charge_kmem somewhere
in alloc_skb_with_frags and invoking the reclaimer just as we do for
kmalloc, while tcp window size control should stay aside.

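The charge path in the quoted hunk (try the stock, otherwise charge a whole batch) can be sketched in userspace C as follows. This is only an illustration under stated assumptions: `struct counter`, `struct group` and the three functions are hypothetical stand-ins, not the kernel's page_counter or per-cpu stock code; the stock is a single per-group variable rather than per-cpu; and the fallback to an exact-size charge when the batch does not fit is an assumption, since the quoted hunk is cut off before the retry path.

```c
#include <assert.h>
#include <stdbool.h>

#define CHARGE_BATCH 32U	/* hypothetical batch size, in pages */

/* Hypothetical stand-in for a page counter with a hard limit. */
struct counter {
	unsigned long usage;
	unsigned long limit;
};

/* Hypothetical group: one counter plus a stock of pre-charged pages. */
struct group {
	struct counter memory;
	unsigned int stock;
};

/* Charge nr_pages if the limit allows it; return true on success. */
static bool try_charge(struct counter *c, unsigned long nr_pages)
{
	assert(c->usage <= c->limit);
	if (nr_pages > c->limit - c->usage)
		return false;
	c->usage += nr_pages;
	return true;
}

/* Satisfy the request from pages that were already charged earlier. */
static bool consume_stock(struct group *g, unsigned int nr_pages)
{
	if (g->stock < nr_pages)
		return false;
	g->stock -= nr_pages;
	return true;
}

/*
 * Charge nr_pages to the group. Try the stock first, then charge a
 * whole batch and park the surplus in the stock. If the batch does not
 * fit, retry with the exact size (an assumption of this sketch).
 */
static bool charge_pages(struct group *g, unsigned int nr_pages)
{
	unsigned int batch = nr_pages > CHARGE_BATCH ? nr_pages : CHARGE_BATCH;

	if (consume_stock(g, nr_pages))
		return true;
	if (try_charge(&g->memory, batch)) {
		g->stock += batch - nr_pages;
		return true;
	}
	return try_charge(&g->memory, nr_pages);
}
```

With a limit of 40 pages, a first request for 4 pages charges a full batch of 32 and leaves 28 pages in the stock; a following request for 8 pages is then served from the stock without touching the counter.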
Thanks,
Vladimir

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladimir Davydov
Subject: Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity
Date: Thu, 22 Oct 2015 21:48:53 +0300
Message-ID: <20151022184852.GP18351@esperanza>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-8-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Return-path:
Content-Disposition: inline
In-Reply-To: <1445487696-21545-8-git-send-email-hannes@cmpxchg.org>
Sender: cgroups-owner@vger.kernel.org
List-ID:
Content-Transfer-Encoding: 7bit
To: Johannes Weiner
Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote:
...
> @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> }
> }
>
> + vmpressure(sc->gfp_mask, memcg,
> + sc->nr_scanned - scanned,
> + sc->nr_reclaimed - reclaimed);
> +
> /*
> * Direct reclaim and kswapd have to scan all memory
> * cgroups to fulfill the overall scan target for the
> @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> }
> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>
> - vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> - sc->nr_scanned - nr_scanned,
> - sc->nr_reclaimed - nr_reclaimed);
> -
> if (sc->nr_reclaimed - nr_reclaimed)
> reclaimable = true;
>

I may be mistaken, but AFAIU this patch subtly changes the behavior of
vmpressure visible from userspace: w/o this patch a userspace process
will receive a notification for a memory cgroup only if *this* memory
cgroup calls the reclaimer; with this patch a userspace notification
will be issued even if the reclaimer is invoked by any cgroup up the
hierarchy.

Thanks,
Vladimir

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladimir Davydov
Subject: Re: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure
Date: Thu, 22 Oct 2015 21:57:47 +0300
Message-ID: <20151022185747.GQ18351@esperanza>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-9-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Return-path:
Content-Disposition: inline
In-Reply-To: <1445487696-21545-9-git-send-email-hannes@cmpxchg.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
Content-Transfer-Encoding: 7bit
To: Johannes Weiner
Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Oct 22, 2015 at 12:21:36AM -0400, Johannes Weiner wrote:
...
> @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work) > vmpr->reclaimed = 0; > spin_unlock(&vmpr->sr_lock); > > + level = vmpressure_calc_level(scanned, reclaimed); > + > + if (level > VMPRESSURE_LOW) { So we start socket_pressure at MEDIUM. Why not at LOW or CRITICAL? > + struct mem_cgroup *memcg; > + /* > + * Let the socket buffer allocator know that we are > + * having trouble reclaiming LRU pages. > + * > + * For hysteresis, keep the pressure state asserted > + * for a second in which subsequent pressure events > + * can occur. > + * > + * XXX: is vmpressure a global feature or part of > + * memcg? There shouldn't be anything memcg-specific > + * about exporting reclaim success ratios from the VM. > + */ > + memcg = container_of(vmpr, struct mem_cgroup, vmpressure); > + if (memcg != root_mem_cgroup) > + memcg->socket_pressure = jiffies + HZ; Why 1 second? Thanks, Vladimir > + } > + > do { > - if (vmpressure_event(vmpr, scanned, reclaimed)) > + if (vmpressure_event(vmpr, level)) > break; > /* > * If not handled, propagate the event upward into the From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Date: Thu, 22 Oct 2015 15:09:43 -0400 Message-ID: <20151022190943.GA20871@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> <20151022184612.GN18351@esperanza> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151022184612.GN18351@esperanza> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Vladimir Davydov Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote:
> On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> > The tcp memory controller has extensive provisions for future memory
> > accounting interfaces that won't materialize after all. Cut the code
> > base down to what's actually used, now and in the likely future.
> >
> > - There won't be any different protocol counters in the future, so a
> >   direct sock->sk_memcg linkage is enough. This eliminates a lot of
> >   callback maze and boilerplate code, and restores most of the socket
> >   allocation code to pre-tcp_memcontrol state.
> >
> > - There won't be a tcp control soft limit, so integrating the memcg
>
> In fact, the code is ready for the "soft" limit (I mean min, pressure,
> max tuple), it just lacks a knob.

Yeah, but that's not going to materialize if the entire interface for
dedicated tcp throttling is considered obsolete.

> > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
> > if (!sk->sk_prot->memory_pressure)
> > return false;
> >
> > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> > - return !!sk->sk_cgrp->memory_pressure;
> > -
>
> AFAIU, now we won't shrink the window on hitting the limit, i.e. this
> patch subtly changes the behavior of the existing knobs, potentially
> breaking them.

Hm, but there is no grace period in which something meaningful could
happen with the window shrinking, is there? Any buffer allocation is
still going to fail hard. I don't see how this would change anything in
practice.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH 1/8] mm: page_counter: let page_counter_try_charge() return bool
Date: Fri, 23 Oct 2015 13:31:03 +0200
Message-ID: <20151023113103.GJ2410@dhcp22.suse.cz>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-2-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <1445487696-21545-2-git-send-email-hannes@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Johannes Weiner
Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu 22-10-15 00:21:29, Johannes Weiner wrote:
> page_counter_try_charge() currently returns 0 on success and -ENOMEM
> on failure, which is surprising behavior given the function name.
>
> Make it follow the expected pattern of try_stuff() functions that
> return a boolean true to indicate success, or false for failure.
> > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko > --- > include/linux/page_counter.h | 6 +++--- > mm/hugetlb_cgroup.c | 3 ++- > mm/memcontrol.c | 11 +++++------ > mm/page_counter.c | 14 +++++++------- > 4 files changed, 17 insertions(+), 17 deletions(-) > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h > index 17fa4f8..7e62920 100644 > --- a/include/linux/page_counter.h > +++ b/include/linux/page_counter.h > @@ -36,9 +36,9 @@ static inline unsigned long page_counter_read(struct page_counter *counter) > > void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages); > void page_counter_charge(struct page_counter *counter, unsigned long nr_pages); > -int page_counter_try_charge(struct page_counter *counter, > - unsigned long nr_pages, > - struct page_counter **fail); > +bool page_counter_try_charge(struct page_counter *counter, > + unsigned long nr_pages, > + struct page_counter **fail); > void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages); > int page_counter_limit(struct page_counter *counter, unsigned long limit); > int page_counter_memparse(const char *buf, const char *max, > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c > index 6a44263..d8fb10d 100644 > --- a/mm/hugetlb_cgroup.c > +++ b/mm/hugetlb_cgroup.c > @@ -186,7 +186,8 @@ again: > } > rcu_read_unlock(); > > - ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter); > + if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter)) > + ret = -ENOMEM; > css_put(&h_cg->css); > done: > *ptr = h_cg; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c71fe40..a8ccdbc 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2018,8 +2018,8 @@ retry: > return 0; > > if (!do_swap_account || > - !page_counter_try_charge(&memcg->memsw, batch, &counter)) { > - if (!page_counter_try_charge(&memcg->memory, batch, &counter)) > + page_counter_try_charge(&memcg->memsw, batch, &counter)) { > + if 
(page_counter_try_charge(&memcg->memory, batch, &counter)) > goto done_restock; > if (do_swap_account) > page_counter_uncharge(&memcg->memsw, batch); > @@ -2383,14 +2383,13 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, > { > unsigned int nr_pages = 1 << order; > struct page_counter *counter; > - int ret = 0; > + int ret; > > if (!memcg_kmem_is_active(memcg)) > return 0; > > - ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter); > - if (ret) > - return ret; > + if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) > + return -ENOMEM; > > ret = try_charge(memcg, gfp, nr_pages); > if (ret) { > diff --git a/mm/page_counter.c b/mm/page_counter.c > index 11b4bed..7c6a63d 100644 > --- a/mm/page_counter.c > +++ b/mm/page_counter.c > @@ -56,12 +56,12 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages) > * @nr_pages: number of pages to charge > * @fail: points first counter to hit its limit, if any > * > - * Returns 0 on success, or -ENOMEM and @fail if the counter or one of > - * its ancestors has hit its configured limit. > + * Returns %true on success, or %false and @fail if the counter or one > + * of its ancestors has hit its configured limit. > */ > -int page_counter_try_charge(struct page_counter *counter, > - unsigned long nr_pages, > - struct page_counter **fail) > +bool page_counter_try_charge(struct page_counter *counter, > + unsigned long nr_pages, > + struct page_counter **fail) > { > struct page_counter *c; > > @@ -99,13 +99,13 @@ int page_counter_try_charge(struct page_counter *counter, > if (new > c->watermark) > c->watermark = new; > } > - return 0; > + return true; > > failed: > for (c = counter; c != *fail; c = c->parent) > page_counter_cancel(c, nr_pages); > > - return -ENOMEM; > + return false; > } > > /** > -- > 2.6.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 2/8] mm: memcontrol: export root_mem_cgroup Date: Fri, 23 Oct 2015 13:32:38 +0200 Message-ID: <20151023113237.GK2410@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-3-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-3-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:30, Johannes Weiner wrote: > A later patch will need this symbol in files other than memcontrol.c, > so export it now and replace mem_cgroup_root_css at the same time. 
> > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko > --- > include/linux/memcontrol.h | 3 ++- > mm/backing-dev.c | 2 +- > mm/memcontrol.c | 5 ++--- > 3 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 805da1f..19ff87b 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -275,7 +275,8 @@ struct mem_cgroup { > struct mem_cgroup_per_node *nodeinfo[0]; > /* WARNING: nodeinfo must be the last member here */ > }; > -extern struct cgroup_subsys_state *mem_cgroup_root_css; > + > +extern struct mem_cgroup *root_mem_cgroup; > > /** > * mem_cgroup_events - count memory events against a cgroup > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index 095b23b..73ab967 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -702,7 +702,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi) > > ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL); > if (!ret) { > - bdi->wb.memcg_css = mem_cgroup_root_css; > + bdi->wb.memcg_css = &root_mem_cgroup->css; > bdi->wb.blkcg_css = blkcg_root_css; > } > return ret; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index a8ccdbc..e54f434 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -76,9 +76,9 @@ > struct cgroup_subsys memory_cgrp_subsys __read_mostly; > EXPORT_SYMBOL(memory_cgrp_subsys); > > +struct mem_cgroup *root_mem_cgroup __read_mostly; > + > #define MEM_CGROUP_RECLAIM_RETRIES 5 > -static struct mem_cgroup *root_mem_cgroup __read_mostly; > -struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly; > > /* Whether the swap controller is active */ > #ifdef CONFIG_MEMCG_SWAP > @@ -4213,7 +4213,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > /* root ? 
*/ > if (parent_css == NULL) { > root_mem_cgroup = memcg; > - mem_cgroup_root_css = &memcg->css; > page_counter_init(&memcg->memory, NULL); > memcg->high = PAGE_COUNTER_MAX; > memcg->soft_limit = PAGE_COUNTER_MAX; > -- > 2.6.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Date: Fri, 23 Oct 2015 14:38:30 +0200 Message-ID: <20151023123830.GL2410@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:31, Johannes Weiner wrote: > The tcp memory controller has extensive provisions for future memory > accounting interfaces that won't materialize after all. Cut the code > base down to what's actually used, now and in the likely future. > > - There won't be any different protocol counters in the future, so a > direct sock->sk_memcg linkage is enough. This eliminates a lot of > callback maze and boilerplate code, and restores most of the socket > allocation code to pre-tcp_memcontrol state. > > - There won't be a tcp control soft limit, so integrating the memcg > code into the global skmem limiting scheme complicates things > unnecessarily. 
Replace all that with simple and clear charge and > uncharge calls--hidden behind a jump label--to account skb memory. > > - The previous jump label code was an elaborate state machine that > tracked the number of cgroups with an active socket limit in order > to enable the skmem tracking and accounting code only when actively > necessary. But this is overengineered: it was meant to protect the > people who never use this feature in the first place. Simply enable > the branches once when the first limit is set until the next reboot. > > Signed-off-by: Johannes Weiner The changelog is certainly attractive. I have looked through the patch but my knowledge of the networking subsystem and its memory management is close to zero so I cannot really do a competent review. Anyway I support any simplification of the tcp kmem accounting. If networking people are OK with the changes, including reduction of the functionality as described by Vladimir then no objections from me for this to be merged. Thanks! 
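The "enable once, never disable" gating described in the quoted changelog can be pictured with a small userspace C sketch. Everything in it is a stand-in chosen for illustration: `sockets_enabled` plays the role of the jump label (a plain atomic flag, not a static key), and `struct memcg_sketch`, `set_limit()`, `charge_skmem()` and `uncharge_skmem()` are hypothetical names, not the kernel's functions. It shows only the control flow, assuming that a charge simply succeeds and is not recorded while accounting is disabled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the jump label: flipped on once, never flipped back. */
static atomic_bool sockets_enabled;

/* Hypothetical per-group state: pages charged and the configured limit. */
struct memcg_sketch {
	unsigned long charged;
	unsigned long limit;
};

/* Setting the first limit enables accounting for the rest of the run. */
static void set_limit(struct memcg_sketch *cg, unsigned long limit)
{
	assert(cg != NULL);
	cg->limit = limit;
	atomic_store(&sockets_enabled, true);
}

/* Returns true if the charge fits; always succeeds while disabled. */
static bool charge_skmem(struct memcg_sketch *cg, unsigned long nr_pages)
{
	if (!atomic_load(&sockets_enabled))
		return true;
	if (cg->charged > cg->limit || nr_pages > cg->limit - cg->charged)
		return false;
	cg->charged += nr_pages;
	return true;
}

/*
 * Undo a charge. The amount is clamped because charges made before
 * accounting was enabled were never recorded in this sketch.
 */
static void uncharge_skmem(struct memcg_sketch *cg, unsigned long nr_pages)
{
	if (!atomic_load(&sockets_enabled))
		return;
	cg->charged -= nr_pages < cg->charged ? nr_pages : cg->charged;
}
```

Until set_limit() is called for any group, both calls reduce to a single flag test, which is the property the changelog wants to keep for users who never set a limit.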
> --- > include/linux/memcontrol.h | 64 ++++++++----------- > include/net/sock.h | 135 +++------------------------------------ > include/net/tcp.h | 3 - > include/net/tcp_memcontrol.h | 7 --- > mm/memcontrol.c | 101 +++++++++++++++-------------- > net/core/sock.c | 78 ++++++----------------- > net/ipv4/sysctl_net_ipv4.c | 1 - > net/ipv4/tcp.c | 3 +- > net/ipv4/tcp_ipv4.c | 9 +-- > net/ipv4/tcp_memcontrol.c | 147 +++++++------------------------------------ > net/ipv4/tcp_output.c | 6 +- > net/ipv6/tcp_ipv6.c | 3 - > 12 files changed, 136 insertions(+), 421 deletions(-) > delete mode 100644 include/net/tcp_memcontrol.h > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 19ff87b..5b72f83 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -85,34 +85,6 @@ enum mem_cgroup_events_target { > MEM_CGROUP_NTARGETS, > }; > > -/* > - * Bits in struct cg_proto.flags > - */ > -enum cg_proto_flags { > - /* Currently active and new sockets should be assigned to cgroups */ > - MEMCG_SOCK_ACTIVE, > - /* It was ever activated; we must disarm static keys on destruction */ > - MEMCG_SOCK_ACTIVATED, > -}; > - > -struct cg_proto { > - struct page_counter memory_allocated; /* Current allocated memory. */ > - struct percpu_counter sockets_allocated; /* Current number of sockets. */ > - int memory_pressure; > - long sysctl_mem[3]; > - unsigned long flags; > - /* > - * memcg field is used to find which memcg we belong directly > - * Each memcg struct can hold more than one cg_proto, so container_of > - * won't really cut. > - * > - * The elegant solution would be having an inverse function to > - * proto_cgroup in struct proto, but that means polluting the structure > - * for everybody, instead of just for memcg users. 
> - */ > - struct mem_cgroup *memcg; > -}; > - > #ifdef CONFIG_MEMCG > struct mem_cgroup_stat_cpu { > long count[MEM_CGROUP_STAT_NSTATS]; > @@ -185,8 +157,15 @@ struct mem_cgroup { > > /* Accounted resources */ > struct page_counter memory; > + > + /* > + * Legacy non-resource counters. In unified hierarchy, all > + * memory is accounted and limited through memcg->memory. > + * Consumer breakdown happens in the statistics. > + */ > struct page_counter memsw; > struct page_counter kmem; > + struct page_counter skmem; > > /* Normal memory consumption range */ > unsigned long low; > @@ -246,9 +225,6 @@ struct mem_cgroup { > */ > struct mem_cgroup_stat_cpu __percpu *stat; > > -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) > - struct cg_proto tcp_mem; > -#endif > #if defined(CONFIG_MEMCG_KMEM) > /* Index in the kmem_cache->memcg_params.memcg_caches array */ > int kmemcg_id; > @@ -676,12 +652,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx) > } > #endif /* CONFIG_MEMCG */ > > -enum { > - UNDER_LIMIT, > - SOFT_LIMIT, > - OVER_LIMIT, > -}; > - > #ifdef CONFIG_CGROUP_WRITEBACK > > struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg); > @@ -707,15 +677,35 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, > > struct sock; > #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > +extern struct static_key_false mem_cgroup_sockets; > +static inline bool mem_cgroup_do_sockets(void) > +{ > + return static_branch_unlikely(&mem_cgroup_sockets); > +} > void sock_update_memcg(struct sock *sk); > void sock_release_memcg(struct sock *sk); > +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); > +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); > #else > +static inline bool mem_cgroup_do_sockets(void) > +{ > + return false; > +} > static inline void sock_update_memcg(struct sock *sk) > { > } > static inline void sock_release_memcg(struct sock *sk) > { > } > 
+static inline bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, > + unsigned int nr_pages) > +{ > + return true; > +} > +static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, > + unsigned int nr_pages) > +{ > +} > #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */ > > #ifdef CONFIG_MEMCG_KMEM > diff --git a/include/net/sock.h b/include/net/sock.h > index 59a7196..67795fc 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -69,22 +69,6 @@ > #include > #include > > -struct cgroup; > -struct cgroup_subsys; > -#ifdef CONFIG_NET > -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss); > -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg); > -#else > -static inline > -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > -{ > - return 0; > -} > -static inline > -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) > -{ > -} > -#endif > /* > * This structure really needs to be cleaned up. > * Most of it is for TCP, and not used by any of > @@ -243,7 +227,6 @@ struct sock_common { > /* public: */ > }; > > -struct cg_proto; > /** > * struct sock - network layer representation of sockets > * @__sk_common: shared layout with inet_timewait_sock > @@ -310,7 +293,7 @@ struct cg_proto; > * @sk_security: used by security modules > * @sk_mark: generic packet mark > * @sk_classid: this socket's cgroup classid > - * @sk_cgrp: this socket's cgroup-specific proto data > + * @sk_memcg: this socket's memcg association > * @sk_write_pending: a write to stream socket waits to start > * @sk_state_change: callback to indicate change in the state of the sock > * @sk_data_ready: callback to indicate there is data to be processed > @@ -447,7 +430,7 @@ struct sock { > #ifdef CONFIG_CGROUP_NET_CLASSID > u32 sk_classid; > #endif > - struct cg_proto *sk_cgrp; > + struct mem_cgroup *sk_memcg; > void (*sk_state_change)(struct sock *sk); > void (*sk_data_ready)(struct sock *sk); > void 
(*sk_write_space)(struct sock *sk); > @@ -1051,18 +1034,6 @@ struct proto { > #ifdef SOCK_REFCNT_DEBUG > atomic_t socks; > #endif > -#ifdef CONFIG_MEMCG_KMEM > - /* > - * cgroup specific init/deinit functions. Called once for all > - * protocols that implement it, from cgroups populate function. > - * This function has to setup any files the protocol want to > - * appear in the kmem cgroup filesystem. > - */ > - int (*init_cgroup)(struct mem_cgroup *memcg, > - struct cgroup_subsys *ss); > - void (*destroy_cgroup)(struct mem_cgroup *memcg); > - struct cg_proto *(*proto_cgroup)(struct mem_cgroup *memcg); > -#endif > }; > > int proto_register(struct proto *prot, int alloc_slab); > @@ -1093,23 +1064,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk) > #define sk_refcnt_debug_release(sk) do { } while (0) > #endif /* SOCK_REFCNT_DEBUG */ > > -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET) > -extern struct static_key memcg_socket_limit_enabled; > -static inline struct cg_proto *parent_cg_proto(struct proto *proto, > - struct cg_proto *cg_proto) > -{ > - return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg)); > -} > -#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled) > -#else > -#define mem_cgroup_sockets_enabled 0 > -static inline struct cg_proto *parent_cg_proto(struct proto *proto, > - struct cg_proto *cg_proto) > -{ > - return NULL; > -} > -#endif > - > static inline bool sk_stream_memory_free(const struct sock *sk) > { > if (sk->sk_wmem_queued >= sk->sk_sndbuf) > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > if (!sk->sk_prot->memory_pressure) > return false; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - > return !!*sk->sk_prot->memory_pressure; > } > > @@ -1146,61 +1097,19 @@ static inline void sk_leave_memory_pressure(struct sock *sk) > { > int *memory_pressure = sk->sk_prot->memory_pressure; > > - if 
(!memory_pressure) > - return; > - > - if (*memory_pressure) > + if (memory_pressure && *memory_pressure) > *memory_pressure = 0; > - > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct cg_proto *cg_proto = sk->sk_cgrp; > - struct proto *prot = sk->sk_prot; > - > - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) > - cg_proto->memory_pressure = 0; > - } > - > } > > static inline void sk_enter_memory_pressure(struct sock *sk) > { > - if (!sk->sk_prot->enter_memory_pressure) > - return; > - > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct cg_proto *cg_proto = sk->sk_cgrp; > - struct proto *prot = sk->sk_prot; > - > - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) > - cg_proto->memory_pressure = 1; > - } > - > - sk->sk_prot->enter_memory_pressure(sk); > + if (sk->sk_prot->enter_memory_pressure) > + sk->sk_prot->enter_memory_pressure(sk); > } > > static inline long sk_prot_mem_limits(const struct sock *sk, int index) > { > - long *prot = sk->sk_prot->sysctl_mem; > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - prot = sk->sk_cgrp->sysctl_mem; > - return prot[index]; > -} > - > -static inline void memcg_memory_allocated_add(struct cg_proto *prot, > - unsigned long amt, > - int *parent_status) > -{ > - page_counter_charge(&prot->memory_allocated, amt); > - > - if (page_counter_read(&prot->memory_allocated) > > - prot->memory_allocated.limit) > - *parent_status = OVER_LIMIT; > -} > - > -static inline void memcg_memory_allocated_sub(struct cg_proto *prot, > - unsigned long amt) > -{ > - page_counter_uncharge(&prot->memory_allocated, amt); > + return sk->sk_prot->sysctl_mem[index]; > } > > static inline long > @@ -1208,24 +1117,14 @@ sk_memory_allocated(const struct sock *sk) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return page_counter_read(&sk->sk_cgrp->memory_allocated); > - > return atomic_long_read(prot->memory_allocated); > } > > static inline long > 
-sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status) > +sk_memory_allocated_add(struct sock *sk, int amt) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status); > - /* update the root cgroup regardless */ > - atomic_long_add_return(amt, prot->memory_allocated); > - return page_counter_read(&sk->sk_cgrp->memory_allocated); > - } > - > return atomic_long_add_return(amt, prot->memory_allocated); > } > > @@ -1234,9 +1133,6 @@ sk_memory_allocated_sub(struct sock *sk, int amt) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - memcg_memory_allocated_sub(sk->sk_cgrp, amt); > - > atomic_long_sub(amt, prot->memory_allocated); > } > > @@ -1244,13 +1140,6 @@ static inline void sk_sockets_allocated_dec(struct sock *sk) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct cg_proto *cg_proto = sk->sk_cgrp; > - > - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) > - percpu_counter_dec(&cg_proto->sockets_allocated); > - } > - > percpu_counter_dec(prot->sockets_allocated); > } > > @@ -1258,13 +1147,6 @@ static inline void sk_sockets_allocated_inc(struct sock *sk) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct cg_proto *cg_proto = sk->sk_cgrp; > - > - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) > - percpu_counter_inc(&cg_proto->sockets_allocated); > - } > - > percpu_counter_inc(prot->sockets_allocated); > } > > @@ -1273,9 +1155,6 @@ sk_sockets_allocated_read_positive(struct sock *sk) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated); > - > return percpu_counter_read_positive(prot->sockets_allocated); > } > > diff --git a/include/net/tcp.h b/include/net/tcp.h > 
index eed94fc..77b6c7e 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -291,9 +291,6 @@ extern int tcp_memory_pressure; > /* optimized version of sk_under_memory_pressure() for TCP sockets */ > static inline bool tcp_under_memory_pressure(const struct sock *sk) > { > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - > return tcp_memory_pressure; > } > /* > diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h > deleted file mode 100644 > index 05b94d9..0000000 > --- a/include/net/tcp_memcontrol.h > +++ /dev/null > @@ -1,7 +0,0 @@ > -#ifndef _TCP_MEMCG_H > -#define _TCP_MEMCG_H > - > -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg); > -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss); > -void tcp_destroy_cgroup(struct mem_cgroup *memcg); > -#endif /* _TCP_MEMCG_H */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e54f434..c41e6d7 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -66,7 +66,6 @@ > #include "internal.h" > #include > #include > -#include > #include "slab.h" > > #include > @@ -291,58 +290,68 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) > /* Writing them here to avoid exposing memcg's inner layout */ > #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > > +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > + > void sock_update_memcg(struct sock *sk) > { > - if (mem_cgroup_sockets_enabled) { > - struct mem_cgroup *memcg; > - struct cg_proto *cg_proto; > - > - BUG_ON(!sk->sk_prot->proto_cgroup); > - > - /* Socket cloning can throw us here with sk_cgrp already > - * filled. It won't however, necessarily happen from > - * process context. So the test for root memcg given > - * the current task's memcg won't help us in this case. > - * > - * Respecting the original socket's memcg is a better > - * decision in this case. 
> - */ > - if (sk->sk_cgrp) { > - BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg)); > - css_get(&sk->sk_cgrp->memcg->css); > - return; > - } > - > - rcu_read_lock(); > - memcg = mem_cgroup_from_task(current); > - cg_proto = sk->sk_prot->proto_cgroup(memcg); > - if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) && > - css_tryget_online(&memcg->css)) { > - sk->sk_cgrp = cg_proto; > - } > - rcu_read_unlock(); > + struct mem_cgroup *memcg; > + /* > + * Socket cloning can throw us here with sk_cgrp already > + * filled. It won't however, necessarily happen from > + * process context. So the test for root memcg given > + * the current task's memcg won't help us in this case. > + * > + * Respecting the original socket's memcg is a better > + * decision in this case. > + */ > + if (sk->sk_memcg) { > + BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); > + css_get(&sk->sk_memcg->css); > + return; > } > + > + rcu_read_lock(); > + memcg = mem_cgroup_from_task(current); > + if (css_tryget_online(&memcg->css)) > + sk->sk_memcg = memcg; > + rcu_read_unlock(); > } > EXPORT_SYMBOL(sock_update_memcg); > > void sock_release_memcg(struct sock *sk) > { > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct mem_cgroup *memcg; > - WARN_ON(!sk->sk_cgrp->memcg); > - memcg = sk->sk_cgrp->memcg; > - css_put(&sk->sk_cgrp->memcg->css); > - } > + if (sk->sk_memcg) > + css_put(&sk->sk_memcg->css); > } > > -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) > +/** > + * mem_cgroup_charge_skmem - charge socket memory > + * @memcg: memcg to charge > + * @nr_pages: number of pages to charge > + * > + * Charges @nr_pages to @memcg. Returns %true if the charge fit within > + * the memcg's configured limit, %false if the charge had to be forced. 
> + */ > +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > { > - if (!memcg || mem_cgroup_is_root(memcg)) > - return NULL; > + struct page_counter *counter; > + > + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + return true; > > - return &memcg->tcp_mem; > + page_counter_charge(&memcg->skmem, nr_pages); > + return false; > +} > + > +/** > + * mem_cgroup_uncharge_skmem - uncharge socket memory > + * @memcg: memcg to uncharge > + * @nr_pages: number of pages to uncharge > + */ > +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > +{ > + page_counter_uncharge(&memcg->skmem, nr_pages); > } > -EXPORT_SYMBOL(tcp_proto_cgroup); > > #endif > > @@ -3592,13 +3601,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, > #ifdef CONFIG_MEMCG_KMEM > static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > { > - int ret; > - > - ret = memcg_propagate_kmem(memcg); > - if (ret) > - return ret; > - > - return mem_cgroup_sockets_init(memcg, ss); > + return memcg_propagate_kmem(memcg); > } > > static void memcg_deactivate_kmem(struct mem_cgroup *memcg) > @@ -3654,7 +3657,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg) > static_key_slow_dec(&memcg_kmem_enabled_key); > WARN_ON(page_counter_read(&memcg->kmem)); > } > - mem_cgroup_sockets_destroy(memcg); > } > #else > static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > @@ -4218,6 +4220,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > memcg->soft_limit = PAGE_COUNTER_MAX; > page_counter_init(&memcg->memsw, NULL); > page_counter_init(&memcg->kmem, NULL); > + page_counter_init(&memcg->skmem, NULL); > } > > memcg->last_scanned_node = MAX_NUMNODES; > @@ -4266,6 +4269,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > memcg->soft_limit = PAGE_COUNTER_MAX; > page_counter_init(&memcg->memsw, &parent->memsw); > page_counter_init(&memcg->kmem, 
&parent->kmem); > + page_counter_init(&memcg->skmem, &parent->skmem); > > /* > * No need to take a reference to the parent because cgroup > @@ -4277,6 +4281,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > memcg->soft_limit = PAGE_COUNTER_MAX; > page_counter_init(&memcg->memsw, NULL); > page_counter_init(&memcg->kmem, NULL); > + page_counter_init(&memcg->skmem, NULL); > /* > * Deeper hierachy with use_hierarchy == false doesn't make > * much sense so let cgroup subsystem know about this > diff --git a/net/core/sock.c b/net/core/sock.c > index 0fafd27..0debff5 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap) > } > EXPORT_SYMBOL(sk_net_capable); > > - > -#ifdef CONFIG_MEMCG_KMEM > -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > -{ > - struct proto *proto; > - int ret = 0; > - > - mutex_lock(&proto_list_mutex); > - list_for_each_entry(proto, &proto_list, node) { > - if (proto->init_cgroup) { > - ret = proto->init_cgroup(memcg, ss); > - if (ret) > - goto out; > - } > - } > - > - mutex_unlock(&proto_list_mutex); > - return ret; > -out: > - list_for_each_entry_continue_reverse(proto, &proto_list, node) > - if (proto->destroy_cgroup) > - proto->destroy_cgroup(memcg); > - mutex_unlock(&proto_list_mutex); > - return ret; > -} > - > -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) > -{ > - struct proto *proto; > - > - mutex_lock(&proto_list_mutex); > - list_for_each_entry_reverse(proto, &proto_list, node) > - if (proto->destroy_cgroup) > - proto->destroy_cgroup(memcg); > - mutex_unlock(&proto_list_mutex); > -} > -#endif > - > /* > * Each address family might have different locking rules, so we have > * one slock key per address family: > @@ -239,11 +201,6 @@ void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) > static struct lock_class_key af_family_keys[AF_MAX]; > static struct lock_class_key af_family_slock_keys[AF_MAX]; > > -#if 
defined(CONFIG_MEMCG_KMEM) > -struct static_key memcg_socket_limit_enabled; > -EXPORT_SYMBOL(memcg_socket_limit_enabled); > -#endif > - > /* > * Make lock validator output more readable. (we pre-construct these > * strings build-time, so that runtime initialization of socket > @@ -1476,12 +1433,6 @@ void sk_free(struct sock *sk) > } > EXPORT_SYMBOL(sk_free); > > -static void sk_update_clone(const struct sock *sk, struct sock *newsk) > -{ > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - sock_update_memcg(newsk); > -} > - > /** > * sk_clone_lock - clone a socket, and lock its clone > * @sk: the socket to clone > @@ -1577,7 +1528,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority) > sk_set_socket(newsk, NULL); > newsk->sk_wq = NULL; > > - sk_update_clone(sk, newsk); > + if (mem_cgroup_do_sockets()) > + sock_update_memcg(newsk); > > if (newsk->sk_prot->sockets_allocated) > sk_sockets_allocated_inc(newsk); > @@ -2036,27 +1988,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind) > struct proto *prot = sk->sk_prot; > int amt = sk_mem_pages(size); > long allocated; > - int parent_status = UNDER_LIMIT; > > sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; > > - allocated = sk_memory_allocated_add(sk, amt, &parent_status); > + allocated = sk_memory_allocated_add(sk, amt); > + > + if (mem_cgroup_do_sockets() && sk->sk_memcg && > + !mem_cgroup_charge_skmem(sk->sk_memcg, amt)) > + goto suppress_allocation; > > /* Under limit. */ > - if (parent_status == UNDER_LIMIT && > - allocated <= sk_prot_mem_limits(sk, 0)) { > + if (allocated <= sk_prot_mem_limits(sk, 0)) { > sk_leave_memory_pressure(sk); > return 1; > } > > - /* Under pressure. (we or our parents) */ > - if ((parent_status > SOFT_LIMIT) || > - allocated > sk_prot_mem_limits(sk, 1)) > + /* Under pressure. 
*/ > + if (allocated > sk_prot_mem_limits(sk, 1)) > sk_enter_memory_pressure(sk); > > - /* Over hard limit (we or our parents) */ > - if ((parent_status == OVER_LIMIT) || > - (allocated > sk_prot_mem_limits(sk, 2))) > + /* Over hard limit. */ > + if (allocated > sk_prot_mem_limits(sk, 2)) > goto suppress_allocation; > > /* guarantee minimum buffer size under pressure */ > @@ -2105,6 +2057,9 @@ suppress_allocation: > > sk_memory_allocated_sub(sk, amt); > > + if (mem_cgroup_do_sockets() && sk->sk_memcg) > + mem_cgroup_uncharge_skmem(sk->sk_memcg, amt); > + > return 0; > } > EXPORT_SYMBOL(__sk_mem_schedule); > @@ -2120,6 +2075,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount) > sk_memory_allocated_sub(sk, amount); > sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT; > > + if (mem_cgroup_do_sockets() && sk->sk_memcg) > + mem_cgroup_uncharge_skmem(sk->sk_memcg, amount); > + > if (sk_under_memory_pressure(sk) && > (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0))) > sk_leave_memory_pressure(sk); > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 894da3a..1f00819 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -24,7 +24,6 @@ > #include > #include > #include > -#include > > static int zero; > static int one = 1; > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index ac1bdbb..ec931c0 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -421,7 +421,8 @@ void tcp_init_sock(struct sock *sk) > sk->sk_rcvbuf = sysctl_tcp_rmem[1]; > > local_bh_disable(); > - sock_update_memcg(sk); > + if (mem_cgroup_do_sockets()) > + sock_update_memcg(sk); > sk_sockets_allocated_inc(sk); > local_bh_enable(); > } > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index 30dd45c..bb5f4f2 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -73,7 +73,6 @@ > #include > #include > #include > -#include > #include > > #include > @@ -1808,7 +1807,8 @@ void tcp_v4_destroy_sock(struct sock *sk) > 
tcp_saved_syn_free(tp); > > sk_sockets_allocated_dec(sk); > - sock_release_memcg(sk); > + if (mem_cgroup_do_sockets()) > + sock_release_memcg(sk); > } > EXPORT_SYMBOL(tcp_v4_destroy_sock); > > @@ -2330,11 +2330,6 @@ struct proto tcp_prot = { > .compat_setsockopt = compat_tcp_setsockopt, > .compat_getsockopt = compat_tcp_getsockopt, > #endif > -#ifdef CONFIG_MEMCG_KMEM > - .init_cgroup = tcp_init_cgroup, > - .destroy_cgroup = tcp_destroy_cgroup, > - .proto_cgroup = tcp_proto_cgroup, > -#endif > }; > EXPORT_SYMBOL(tcp_prot); > > diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c > index 2379c1b..09a37eb 100644 > --- a/net/ipv4/tcp_memcontrol.c > +++ b/net/ipv4/tcp_memcontrol.c > @@ -1,107 +1,10 @@ > -#include > -#include > -#include > -#include > -#include > +#include > #include > +#include > #include > - > -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > -{ > - /* > - * The root cgroup does not use page_counters, but rather, > - * rely on the data already collected by the network > - * subsystem > - */ > - struct mem_cgroup *parent = parent_mem_cgroup(memcg); > - struct page_counter *counter_parent = NULL; > - struct cg_proto *cg_proto, *parent_cg; > - > - cg_proto = tcp_prot.proto_cgroup(memcg); > - if (!cg_proto) > - return 0; > - > - cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0]; > - cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1]; > - cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2]; > - cg_proto->memory_pressure = 0; > - cg_proto->memcg = memcg; > - > - parent_cg = tcp_prot.proto_cgroup(parent); > - if (parent_cg) > - counter_parent = &parent_cg->memory_allocated; > - > - page_counter_init(&cg_proto->memory_allocated, counter_parent); > - percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL); > - > - return 0; > -} > -EXPORT_SYMBOL(tcp_init_cgroup); > - > -void tcp_destroy_cgroup(struct mem_cgroup *memcg) > -{ > - struct cg_proto *cg_proto; > - > - cg_proto = tcp_prot.proto_cgroup(memcg); > - if (!cg_proto) > - return; 
> - > - percpu_counter_destroy(&cg_proto->sockets_allocated); > - > - if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) > - static_key_slow_dec(&memcg_socket_limit_enabled); > - > -} > -EXPORT_SYMBOL(tcp_destroy_cgroup); > - > -static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages) > -{ > - struct cg_proto *cg_proto; > - int i; > - int ret; > - > - cg_proto = tcp_prot.proto_cgroup(memcg); > - if (!cg_proto) > - return -EINVAL; > - > - ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages); > - if (ret) > - return ret; > - > - for (i = 0; i < 3; i++) > - cg_proto->sysctl_mem[i] = min_t(long, nr_pages, > - sysctl_tcp_mem[i]); > - > - if (nr_pages == PAGE_COUNTER_MAX) > - clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); > - else { > - /* > - * The active bit needs to be written after the static_key > - * update. This is what guarantees that the socket activation > - * function is the last one to run. See sock_update_memcg() for > - * details, and note that we don't mark any socket as belonging > - * to this memcg until that flag is up. > - * > - * We need to do this, because static_keys will span multiple > - * sites, but we can't control their order. If we mark a socket > - * as accounted, but the accounting functions are not patched in > - * yet, we'll lose accounting. > - * > - * We never race with the readers in sock_update_memcg(), > - * because when this value change, the code to process it is not > - * patched in yet. > - * > - * The activated bit is used to guarantee that no two writers > - * will do the update in the same memcg. Without that, we can't > - * properly shutdown the static key. 
> - */ > - if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) > - static_key_slow_inc(&memcg_socket_limit_enabled); > - set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); > - } > - > - return 0; > -} > +#include > +#include > +#include > > enum { > RES_USAGE, > @@ -124,11 +27,17 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, > switch (of_cft(of)->private) { > case RES_LIMIT: > /* see memcontrol.c */ > + if (memcg == root_mem_cgroup) { > + ret = -EINVAL; > + break; > + } > ret = page_counter_memparse(buf, "-1", &nr_pages); > if (ret) > break; > mutex_lock(&tcp_limit_mutex); > - ret = tcp_update_limit(memcg, nr_pages); > + ret = page_counter_limit(&memcg->skmem, nr_pages); > + if (!ret) > + static_branch_enable(&mem_cgroup_sockets); > mutex_unlock(&tcp_limit_mutex); > break; > default: > @@ -141,32 +50,28 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, > static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(css); > - struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg); > u64 val; > > switch (cft->private) { > case RES_LIMIT: > - if (!cg_proto) > - return PAGE_COUNTER_MAX; > - val = cg_proto->memory_allocated.limit; > + val = memcg->skmem.limit; > val *= PAGE_SIZE; > break; > case RES_USAGE: > - if (!cg_proto) > + if (memcg == root_mem_cgroup) > val = atomic_long_read(&tcp_memory_allocated); > else > - val = page_counter_read(&cg_proto->memory_allocated); > + val = page_counter_read(&memcg->skmem); > val *= PAGE_SIZE; > break; > case RES_FAILCNT: > - if (!cg_proto) > - return 0; > - val = cg_proto->memory_allocated.failcnt; > + val = memcg->skmem.failcnt; > break; > case RES_MAX_USAGE: > - if (!cg_proto) > - return 0; > - val = cg_proto->memory_allocated.watermark; > + if (memcg == root_mem_cgroup) > + val = 0; > + else > + val = memcg->skmem.watermark; > val *= PAGE_SIZE; > break; > default: > @@ -178,20 +83,14 @@ static u64 
tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) > static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of, > char *buf, size_t nbytes, loff_t off) > { > - struct mem_cgroup *memcg; > - struct cg_proto *cg_proto; > - > - memcg = mem_cgroup_from_css(of_css(of)); > - cg_proto = tcp_prot.proto_cgroup(memcg); > - if (!cg_proto) > - return nbytes; > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > switch (of_cft(of)->private) { > case RES_MAX_USAGE: > - page_counter_reset_watermark(&cg_proto->memory_allocated); > + page_counter_reset_watermark(&memcg->skmem); > break; > case RES_FAILCNT: > - cg_proto->memory_allocated.failcnt = 0; > + memcg->skmem.failcnt = 0; > break; > } > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 19adedb..b496fc9 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -2819,13 +2819,15 @@ begin_fwd: > */ > void sk_forced_mem_schedule(struct sock *sk, int size) > { > - int amt, status; > + int amt; > > if (size <= sk->sk_forward_alloc) > return; > amt = sk_mem_pages(size); > sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; > - sk_memory_allocated_add(sk, amt, &status); > + sk_memory_allocated_add(sk, amt); > + if (mem_cgroup_do_sockets() && sk->sk_memcg) > + mem_cgroup_charge_skmem(sk->sk_memcg, amt); > } > > /* Send a FIN. The caller locks the socket for us. > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c > index f495d18..cf19e65 100644 > --- a/net/ipv6/tcp_ipv6.c > +++ b/net/ipv6/tcp_ipv6.c > @@ -1862,9 +1862,6 @@ struct proto tcpv6_prot = { > .compat_setsockopt = compat_tcp_setsockopt, > .compat_getsockopt = compat_tcp_getsockopt, > #endif > -#ifdef CONFIG_MEMCG_KMEM > - .proto_cgroup = tcp_proto_cgroup, > -#endif > .clear_sk = tcp_v6_clear_sk, > }; > > -- > 2.6.1 -- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 4/8] mm: memcontrol: prepare for unified hierarchy socket accounting Date: Fri, 23 Oct 2015 14:39:30 +0200 Message-ID: <20151023123930.GM2410@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-5-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-5-git-send-email-hannes@cmpxchg.org> Sender: netdev-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:32, Johannes Weiner wrote: > The unified hierarchy memory controller will account socket > memory. Move the infrastructure functions accordingly. > > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko > --- > mm/memcontrol.c | 136 ++++++++++++++++++++++++++++---------------------------- > 1 file changed, 68 insertions(+), 68 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c41e6d7..3789050 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -287,74 +287,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) > return mem_cgroup_from_css(css); > } > > -/* Writing them here to avoid exposing memcg's inner layout */ > -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > - > -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > - > -void sock_update_memcg(struct sock *sk) > -{ > - struct mem_cgroup *memcg; > - /* > - * Socket cloning can throw us here with sk_cgrp already > - * filled. It won't however, necessarily happen from > - * process context. So the test for root memcg given > - * the current task's memcg won't help us in this case.
> - * > - * Respecting the original socket's memcg is a better > - * decision in this case. > - */ > - if (sk->sk_memcg) { > - BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); > - css_get(&sk->sk_memcg->css); > - return; > - } > - > - rcu_read_lock(); > - memcg = mem_cgroup_from_task(current); > - if (css_tryget_online(&memcg->css)) > - sk->sk_memcg = memcg; > - rcu_read_unlock(); > -} > -EXPORT_SYMBOL(sock_update_memcg); > - > -void sock_release_memcg(struct sock *sk) > -{ > - if (sk->sk_memcg) > - css_put(&sk->sk_memcg->css); > -} > - > -/** > - * mem_cgroup_charge_skmem - charge socket memory > - * @memcg: memcg to charge > - * @nr_pages: number of pages to charge > - * > - * Charges @nr_pages to @memcg. Returns %true if the charge fit within > - * the memcg's configured limit, %false if the charge had to be forced. > - */ > -bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > -{ > - struct page_counter *counter; > - > - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > - return true; > - > - page_counter_charge(&memcg->skmem, nr_pages); > - return false; > -} > - > -/** > - * mem_cgroup_uncharge_skmem - uncharge socket memory > - * @memcg: memcg to uncharge > - * @nr_pages: number of pages to uncharge > - */ > -void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > -{ > - page_counter_uncharge(&memcg->skmem, nr_pages); > -} > - > -#endif > - > #ifdef CONFIG_MEMCG_KMEM > /* > * This will be the memcg's index in each cache's ->memcg_params.memcg_caches. 
> @@ -5521,6 +5453,74 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) > commit_charge(newpage, memcg, true); > } > > +/* Writing them here to avoid exposing memcg's inner layout */ > +#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > + > +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > + > +void sock_update_memcg(struct sock *sk) > +{ > + struct mem_cgroup *memcg; > + /* > + * Socket cloning can throw us here with sk_cgrp already > + * filled. It won't however, necessarily happen from > + * process context. So the test for root memcg given > + * the current task's memcg won't help us in this case. > + * > + * Respecting the original socket's memcg is a better > + * decision in this case. > + */ > + if (sk->sk_memcg) { > + BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); > + css_get(&sk->sk_memcg->css); > + return; > + } > + > + rcu_read_lock(); > + memcg = mem_cgroup_from_task(current); > + if (css_tryget_online(&memcg->css)) > + sk->sk_memcg = memcg; > + rcu_read_unlock(); > +} > +EXPORT_SYMBOL(sock_update_memcg); > + > +void sock_release_memcg(struct sock *sk) > +{ > + if (sk->sk_memcg) > + css_put(&sk->sk_memcg->css); > +} > + > +/** > + * mem_cgroup_charge_skmem - charge socket memory > + * @memcg: memcg to charge > + * @nr_pages: number of pages to charge > + * > + * Charges @nr_pages to @memcg. Returns %true if the charge fit within > + * the memcg's configured limit, %false if the charge had to be forced. 
> + */ > +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > +{ > + struct page_counter *counter; > + > + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + return true; > + > + page_counter_charge(&memcg->skmem, nr_pages); > + return false; > +} > + > +/** > + * mem_cgroup_uncharge_skmem - uncharge socket memory > + * @memcg: memcg to uncharge > + * @nr_pages: number of pages to uncharge > + */ > +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > +{ > + page_counter_uncharge(&memcg->skmem, nr_pages); > +} > + > +#endif > + > /* > * subsys_initcall() for memory controller. > * > -- > 2.6.1 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 23 Oct 2015 15:19:56 +0200 Message-ID: <20151023131956.GA15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-6-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu 22-10-15 00:21:33, Johannes Weiner wrote: > Socket memory can be a significant share of overall memory consumed by > common workloads. In order to provide reasonable resource isolation > out-of-the-box in the unified hierarchy, this type of memory needs to > be accounted and tracked per default in the memory controller. 
What about users who do not want to pay an additional overhead for the accounting? How can they disable it? > Signed-off-by: Johannes Weiner [...] > @@ -5453,10 +5470,9 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) > commit_charge(newpage, memcg, true); > } > > -/* Writing them here to avoid exposing memcg's inner layout */ > -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > +#ifdef CONFIG_INET > > -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > +DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets); AFAIU this means that the jump label is enabled by default. Is this intended when you enable it explicitly where needed? > > void sock_update_memcg(struct sock *sk) > { -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 6/8] mm: vmscan: simplify memcg vs. global shrinker invocation Date: Fri, 23 Oct 2015 15:26:22 +0200 Message-ID: <20151023132622.GB15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-7-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-7-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:34, Johannes Weiner wrote: > Letting shrink_slab() handle the root_mem_cgroup, and implicitely the > !CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers > unconditionally from within the memcg iteration loop. 
> > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko > --- > include/linux/memcontrol.h | 2 ++ > mm/vmscan.c | 31 ++++++++++++++++--------------- > 2 files changed, 18 insertions(+), 15 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 6f1e0f8..d66ae18 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -482,6 +482,8 @@ void mem_cgroup_split_huge_fixup(struct page *head); > #else /* CONFIG_MEMCG */ > struct mem_cgroup; > > +#define root_mem_cgroup NULL > + > static inline void mem_cgroup_events(struct mem_cgroup *memcg, > enum mem_cgroup_events_index idx, > unsigned int nr) > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 9b52ecf..ecc2125 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > struct shrinker *shrinker; > unsigned long freed = 0; > > + /* Global shrinker mode */ > + if (memcg == root_mem_cgroup) > + memcg = NULL; > + > if (memcg && !memcg_kmem_is_active(memcg)) > return 0; > > @@ -2417,11 +2421,22 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > shrink_lruvec(lruvec, swappiness, sc, &lru_pages); > zone_lru_pages += lru_pages; > > - if (memcg && is_classzone) > + /* > + * Shrink the slab caches in the same proportion that > + * the eligible LRU pages were scanned. 
> + */ > + if (is_classzone) { > shrink_slab(sc->gfp_mask, zone_to_nid(zone), > memcg, sc->nr_scanned - scanned, > lru_pages); > > + if (reclaim_state) { > + sc->nr_reclaimed += > + reclaim_state->reclaimed_slab; > + reclaim_state->reclaimed_slab = 0; > + } > + } > + > /* > * Direct reclaim and kswapd have to scan all memory > * cgroups to fulfill the overall scan target for the > @@ -2439,20 +2454,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > } > } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > > - /* > - * Shrink the slab caches in the same proportion that > - * the eligible LRU pages were scanned. > - */ > - if (global_reclaim(sc) && is_classzone) > - shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL, > - sc->nr_scanned - nr_scanned, > - zone_lru_pages); > - > - if (reclaim_state) { > - sc->nr_reclaimed += reclaim_state->reclaimed_slab; > - reclaim_state->reclaimed_slab = 0; > - } > - > vmpressure(sc->gfp_mask, sc->target_mem_cgroup, > sc->nr_scanned - nr_scanned, > sc->nr_reclaimed - nr_reclaimed); > -- > 2.6.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Date: Fri, 23 Oct 2015 16:42:56 +0300 Message-ID: <20151023134256.GS18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> <20151022184612.GN18351@esperanza> <20151022190943.GA20871@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: Content-Disposition: inline In-Reply-To: <20151022190943.GA20871@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 03:09:43PM -0400, Johannes Weiner wrote: > On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote: > > On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > > > The tcp memory controller has extensive provisions for future memory > > > accounting interfaces that won't materialize after all. Cut the code > > > base down to what's actually used, now and in the likely future. > > > > > > - There won't be any different protocol counters in the future, so a > > > direct sock->sk_memcg linkage is enough. This eliminates a lot of > > > callback maze and boilerplate code, and restores most of the socket > > > allocation code to pre-tcp_memcontrol state. > > > > > > - There won't be a tcp control soft limit, so integrating the memcg > > > > In fact, the code is ready for the "soft" limit (I mean min, pressure, > > max tuple), it just lacks a knob. > > Yeah, but that's not going to materialize if the entire interface for > dedicated tcp throttling is considered obsolete. May be, it shouldn't be. My current understanding is that per memcg tcp window control is necessary, because: - We need to be able to protect a containerized workload from its growing network buffers. Using vmpressure notifications for that does not look reassuring to me. - We need a way to limit network buffers of a particular container, otherwise it can fill the system-wide window throttling other containers, which is unfair. > > > > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > > > if (!sk->sk_prot->memory_pressure) > > > return false; > > > > > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > > > - return !!sk->sk_cgrp->memory_pressure; > > > - > > > > AFAIU, now we won't shrink the window on hitting the limit, i.e. 
this > > patch subtly changes the behavior of the existing knobs, potentially > > breaking them. > > Hm, but there is no grace period in which something meaningful could > happen with the window shrinking, is there? Any buffer allocation is > still going to fail hard. AFAIU when we hit the limit, we not only throttle the socket which allocates, but also try to release space reserved by other sockets. After your patch we won't. This looks unfair to me. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Miller Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 23 Oct 2015 06:59:57 -0700 (PDT) Message-ID: <20151023.065957.1690815054807881760.davem@davemloft.net> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20151023131956.GA15375-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: Text/Plain; charset="us-ascii" To: mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org From: Michal Hocko Date: Fri, 23 Oct 2015 15:19:56 +0200 > On Thu 22-10-15 00:21:33, Johannes Weiner wrote: >> Socket memory can be a significant share of overall memory consumed by >> common workloads. 
In order to provide reasonable resource isolation >> out-of-the-box in the unified hierarchy, this type of memory needs to >> be accounted and tracked per default in the memory controller. > > What about users who do not want to pay an additional overhead for the > accounting? How can they disable it? Yeah, this really cannot pass. This extra overhead will be seen by 99.9999% of users, since entities (especially distributions) just flip on all of these config options by default. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Date: Fri, 23 Oct 2015 15:49:57 +0200 Message-ID: <20151023134957.GC15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-8-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu 22-10-15 00:21:35, Johannes Weiner wrote: > The vmpressure metric is based on reclaim efficiency, which in turn is > an attribute of the LRU. However, vmpressure events are currently > reported at the source of pressure rather than at the reclaim level. > > Switch the reporting to the reclaim level to allow finer-grained > analysis of which memcg is having trouble reclaiming its pages. I can see how this can be useful.
> As far as memory.pressure_level interface semantics go, events are > escalated up the hierarchy until a listener is found, so this won't > affect existing users that listen at higher levels. This is true but the parent will not see cumulative events anymore. One memcg might be fighting and barely reclaim anything so it would report high pressure while other would be doing just fine. The parent will just see conflicting events in a short time period and cannot match them the source memcg. This sounds really confusing. Even more confusing than the current semantic which allows the same behavior under certain configurations. I dunno, have to think about it some more. Maybe we need to rethink the way how the pressure is signaled. If we want the breakdown of the particular memcgs then we should be able to identify them for this to be useful. [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Mon, 26 Oct 2015 12:56:19 -0400 Message-ID: <20151026165619.GB2214@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151023.065957.1690815054807881760.davem@davemloft.net> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: David Miller Cc: mhocko@kernel.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Fri, Oct 23, 2015 at 06:59:57AM -0700, David Miller wrote: > From: Michal Hocko > Date: Fri, 23 Oct 2015 15:19:56 +0200 > > > On Thu 22-10-15 00:21:33, Johannes Weiner wrote: > >> Socket memory can be a 
significant share of overall memory consumed by > >> common workloads. In order to provide reasonable resource isolation > >> out-of-the-box in the unified hierarchy, this type of memory needs to > >> be accounted and tracked per default in the memory controller. > > > > What about users who do not want to pay an additional overhead for the > > accounting? How can they disable it? > > Yeah, this really cannot pass. > > This extra overhead will be seen by %99.9999 of users, since entities > (especially distributions) just flip on all of these config options by > default. Okay, there are several layers to this issue. If you boot a machine with a CONFIG_MEMCG distribution kernel and don't create any cgroups, I agree there shouldn't be any overhead. I already sent a patch to generally remove memory accounting on the system or root level. I can easily update this patch here to not have any socket buffer accounting overhead for systems that don't actively use cgroups. Would you be okay with a branch on sk->sk_memcg in the network accounting path? I'd leave that NULL on the system level then. Then there is of course the case when you create cgroups for process organization but don't care about memory accounting. Systemd comes to mind. Or even if you create cgroups to track other resources like CPU but don't care about memory. The unified hierarchy no longer enables controllers on new cgroups per default, so unless you create a cgroup and specifically tell it to account and track memory, you won't have the socket memory accounting overhead, either. Then there is the third case, where you create a control group to specifically manage and limit the memory consumption of a workload. In that scenario, a major memory consumer like socket buffers, which can easily grow until OOM, should definitely be included in the tracking in order to properly contain both untrusted (possibly malicious) and trusted (possibly buggy) workloads. 
This is not a hole we can reasonably leave unpatched for general purpose resource management. Now you could argue that there might exist specialized workloads that need to account anonymous pages and page cache, but not socket memory buffers. Or any other combination of pick-and-choose consumers. But honestly, nowadays all our paths are lockless, and the counting is an atomic-add-return with a per-cpu batch cache. I don't think there is a compelling case for an elaborate interface to make individual memory consumers configurable inside the memory controller. So in summary, would you be okay with this patch if networking only called into the memory controller when you explicitly create a cgroup AND tell it to track the memory footprint of the workload in it? From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Mon, 26 Oct 2015 13:22:16 -0400 Message-ID: <20151026172216.GC2214@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151022184509.GM18351@esperanza> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Vladimir Davydov Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote: > Hi Johannes, > > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: > ...
> > Patch #5 adds accounting and tracking of socket memory to the unified > > hierarchy memory controller, as described above. It uses the existing > > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > > > Patch #8 uses the vmpressure extension to equalize pressure between > > the pages tracked natively by the VM and socket buffer pages. As the > > pool is shared, it makes sense that while natively tracked pages are > > under duress the network transmit windows are also not increased. > > First of all, I've no experience in networking, so I'm likely to be > mistaken. Nevertheless I beg to disagree that this patch set is a step > in the right direction. Here goes why. > > I admit that your idea to get rid of explicit tcp window control knobs > and size it dynamically basing on memory pressure instead does sound > tempting, but I don't think it'd always work. The problem is that in > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only > stop growing them. Now suppose a system hasn't experienced memory > pressure for a while. If we don't have explicit tcp window limit, tcp > buffers on such a system might have eaten almost all available memory > (because of network load/problems). If a user workload that needs a > significant amount of memory is started suddenly then, the network code > will receive a notification and surely stop growing buffers, but all > those buffers accumulated won't disappear instantly. As a result, the > workload might be unable to find enough free memory and have no choice > but invoke OOM killer. This looks unexpected from the user POV. I'm not getting rid of those knobs, I'm just reusing the old socket accounting infrastructure in an attempt to make the memory accounting feature useful to more people in cgroups v2 (unified hierarchy). We can always come back to think about per-cgroup tcp window limits in the unified hierarchy, my patches don't get in the way of this. 
I'm not removing the knobs in cgroups v1 and I'm not preventing them in v2. But regardless of tcp window control, we need to account socket memory in the main memory accounting pool where pressure is shared (to the best of our abilities) between all accounted memory consumers. From an interface standpoint alone, I don't think it's reasonable to ask users per default to limit different consumers on a case by case basis. I certainly have no problem with fine-tuning for scenarios you describe above, but with memory.current, memory.high, memory.max we are providing a generic interface to account and contain memory consumption of workloads. This has to include all major memory consumers to make semantical sense. But also, there are people right now for whom the socket buffers cause system OOM, but the memcg's existing hard tcp window limit absolutely wrecks network performance for them. It's not usable the way it is. It'd be much better to have the socket buffers exert pressure on the shared pool, and then propagate the overall pressure back to individual consumers with reclaim, shrinkers, vmpressure etc. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Tue, 27 Oct 2015 11:43:21 +0300 Message-ID: <20151027084320.GF13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: Content-Disposition: inline In-Reply-To: <20151026172216.GC2214@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S.
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote: > On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote: > > Hi Johannes, > > > > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: > > ... > > > Patch #5 adds accounting and tracking of socket memory to the unified > > > hierarchy memory controller, as described above. It uses the existing > > > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > > > > > Patch #8 uses the vmpressure extension to equalize pressure between > > > the pages tracked natively by the VM and socket buffer pages. As the > > > pool is shared, it makes sense that while natively tracked pages are > > > under duress the network transmit windows are also not increased. > > > > First of all, I've no experience in networking, so I'm likely to be > > mistaken. Nevertheless I beg to disagree that this patch set is a step > > in the right direction. Here goes why. > > > > I admit that your idea to get rid of explicit tcp window control knobs > > and size it dynamically basing on memory pressure instead does sound > > tempting, but I don't think it'd always work. The problem is that in > > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only > > stop growing them. Now suppose a system hasn't experienced memory > > pressure for a while. If we don't have explicit tcp window limit, tcp > > buffers on such a system might have eaten almost all available memory > > (because of network load/problems). If a user workload that needs a > > significant amount of memory is started suddenly then, the network code > > will receive a notification and surely stop growing buffers, but all > > those buffers accumulated won't disappear instantly. 
As a result, the > > workload might be unable to find enough free memory and have no choice > > but invoke OOM killer. This looks unexpected from the user POV. > > I'm not getting rid of those knobs, I'm just reusing the old socket > accounting infrastructure in an attempt to make the memory accounting > feature useful to more people in cgroups v2 (unified hierarchy). > My understanding is that in the meantime you effectively break the existing per memcg tcp window control logic. > We can always come back to think about per-cgroup tcp window limits in > the unified hierarchy, my patches don't get in the way of this. I'm > not removing the knobs in cgroups v1 and I'm not preventing them in v2. > > But regardless of tcp window control, we need to account socket memory > in the main memory accounting pool where pressure is shared (to the > best of our abilities) between all accounted memory consumers. > No objections to this point. However, I really don't like the idea to charge tcp window size to memory.current instead of charging individual pages consumed by the workload for storing socket buffers, because it is inconsistent with what we have now. Can't we charge individual skb pages as we do in case of other kmem allocations? > From an interface standpoint alone, I don't think it's reasonable to > ask users per default to limit different consumers on a case by case > basis. I certainly have no problem with finetuning for scenarios you > describe above, but with memory.current, memory.high, memory.max we > are providing a generic interface to account and contain memory > consumption of workloads. This has to include all major memory > consumers to make semantical sense. We can propose a reasonable default as we do in the global case. > > But also, there are people right now for whom the socket buffers cause > system OOM, but the existing memcg's hard tcp window limitq that > exists absolutely wrecks network performance for them. It's not usable > the way it is. 
It'd be much better to have the socket buffers exert > pressure on the shared pool, and then propagate the overall pressure > back to individual consumers with reclaim, shrinkers, vmpressure etc. > This might or might not work. I'm not an expert to judge. But if you do this only for memcg leaving the global case as it is, networking people won't budge IMO. So could you please start such a major rework from the global case? Could you please try to deprecate the tcp window limits not only in the legacy memcg hierarchy, but also system-wide in order to attract attention of networking experts? Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Tue, 27 Oct 2015 13:26:47 +0100 Message-ID: <20151027122647.GG9891@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151026165619.GB2214@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Mon 26-10-15 12:56:19, Johannes Weiner wrote: [...] > Now you could argue that there might exist specialized workloads that > need to account anonymous pages and page cache, but not socket memory > buffers. Exactly, and there are loads doing this. 
Memcg groups are also created to limit anon/page cache consumers so that they do not affect the others running on the system (basically in the root memcg context from the memcg POV), which don't care about tracking and definitely do not want to pay for additional overhead. We should definitely be able to offer a global disable knob for them. The same applies to kmem accounting in general. I do understand having the accounting enabled by default once we are reasonably sure that both kmem/tcp are stable enough (which I am not convinced about yet, to be honest), but there will always be special loads which simply do not care about kmem/tcp accounting and would rather pay a global balancing price (even OOM) than a permanent price. And they should get a way to opt out. > Or any other combination of pick-and-choose consumers. But > honestly, nowadays all our paths are lockless, and the counting is an > atomic-add-return with a per-cpu batch cache. You are still hooking into hot paths and there are users who want to squeeze every single cycle from the HW. > I don't think there is a compelling case for an elaborate interface > to make individual memory consumers configurable inside the memory > controller. I do not think we need an elaborate interface. We just want to have a global boot time knob to override the default behavior. This is a few lines of code and it should give sufficient flexibility.
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Miller Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Tue, 27 Oct 2015 06:49:16 -0700 (PDT) Message-ID: <20151027.064916.312540587298733586.davem@davemloft.net> References: <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20151027122647.GG9891-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: Text/Plain; charset="us-ascii" To: mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org From: Michal Hocko Date: Tue, 27 Oct 2015 13:26:47 +0100 > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > [...] >> Or any other combination of pick-and-choose consumers. But >> honestly, nowadays all our paths are lockless, and the counting is an >> atomic-add-return with a per-cpu batch cache. > > You are still hooking into hot paths and there are users who want to > squeeze every single cycle from the HW. Yeah, you're basically probably undoing a half year of work by another developer who was able to remove an atomic from these paths. 
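For readers following the overhead argument above, here is a minimal userspace sketch of what "an atomic-add-return with a per-cpu batch cache" could look like. It is not the kernel's code: `struct counter`, `struct stock`, `BATCH` and `charge()` are hypothetical names, a plain struct stands in for a real per-cpu variable, and C11 atomics stand in for the kernel's atomic helpers. The point it tries to show is that most charges are served from the local cache and only a cache miss touches the shared counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BATCH 32UL /* hypothetical batch size, in pages */

/* Shared counter: every update here is an atomic read-modify-write. */
struct counter {
	atomic_ulong usage;
	unsigned long limit;
};

/* Local cache of pre-charged pages; stands in for a per-cpu variable. */
struct stock {
	unsigned long cached;
};

/* Add nr pages to the shared counter unless that would exceed the limit. */
static bool counter_try_charge(struct counter *c, unsigned long nr)
{
	unsigned long total = atomic_fetch_add(&c->usage, nr) + nr;

	if (total > c->limit) {
		atomic_fetch_sub(&c->usage, nr); /* roll back */
		return false;
	}
	return true;
}

/* Charge nr pages, touching the shared counter only on a cache miss. */
static bool charge(struct counter *c, struct stock *s, unsigned long nr)
{
	unsigned long batch = nr > BATCH ? nr : BATCH;

	assert(nr > 0);
	if (s->cached >= nr) {		/* fast path: no atomic operation */
		s->cached -= nr;
		return true;
	}
	if (counter_try_charge(c, batch)) {	/* miss: charge a whole batch */
		s->cached += batch - nr;
		return true;
	}
	return counter_try_charge(c, nr);	/* batch did not fit, try exact */
}
```

With a limit of 40 pages, the first one-page charge moves the shared counter to 32 and leaves 31 pages cached, and the following charges are served locally until the cache runs dry. The slow path still costs an atomic operation, which is the cost the objection above is about.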
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Tue, 27 Oct 2015 11:41:38 -0400 Message-ID: <20151027154138.GA4665@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151027122647.GG9891-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: David Miller , akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote: > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > [...] > > Now you could argue that there might exist specialized workloads that > > need to account anonymous pages and page cache, but not socket memory > > buffers. > > Exactly, and there are loads doing this. Memcg groups are also created to > limit anon/page cache consumers to not affect the others running on > the system (basically in the root memcg context from memcg POV) which > don't care about tracking and they definitely do not want to pay for an > additional overhead. We should definitely be able to offer a global > disable knob for them. The same applies to kmem accounting in general. 
I don't see how you make such a clear distinction between, say, page cache and the dentry cache, and call one user memory and the other kernel memory. That just doesn't make sense to me. They're both kernel memory allocated on behalf of the user, the only difference being that one is tracked on the page level and the other on the slab level, and we started accounting one before the other. IMO that's an implementation detail and a historical artifact that should not be exposed to the user. And that's the thing I hate about the current opt-out knob. > > I don't think there is a compelling case for an elaborate interface > > to make individual memory consumers configurable inside the memory > > controller. > > I do not think we need an elaborate interface. We just want to have > a global boot time knob to overwrite the default behavior. This is > few lines of code and it should give the sufficient flexibility. Okay, then let's add this for the socket memory to start with. I'll have to think more about how to distinguish the slab-based consumers. Or maybe you have an idea. For now, something like this as a boot commandline? cgroup.memory=nosocket So again in summary, no default overhead until you create a cgroup to specifically track and account memory. And then, when you know what you are doing and have a specialized workload, you can disable socket memory as a specific consumer to remove that particular overhead while still being able to contain page cache, anon, kmem, whatever. Does that sound like reasonable userinterfacing to everyone? 
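The proposed `cgroup.memory=nosocket` boot option could be handled roughly as sketched below. This is a userspace illustration under stated assumptions, not the kernel implementation: the thread only names the single token `nosocket`, so treating the value as a comma-separated list is an assumption, and `parse_cgroup_memory()`, `socket_accounting_enabled()` and the `nosocket` flag are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag: true when socket memory accounting is switched off. */
static bool nosocket;

/*
 * Parse the value of a "cgroup.memory=" style option, assumed here to be
 * a comma-separated list of tokens. Unknown tokens are ignored.
 * Returns the number of recognized tokens.
 */
static int parse_cgroup_memory(const char *value)
{
	int recognized = 0;

	assert(value != NULL);
	while (*value) {
		size_t len = strcspn(value, ","); /* length of this token */

		if (len == strlen("nosocket") &&
		    !strncmp(value, "nosocket", len)) {
			nosocket = true;
			recognized++;
		}
		value += len;
		if (*value == ',')
			value++; /* skip the separator */
	}
	return recognized;
}

/* Gate for the accounting path: callers skip the charge when disabled. */
static bool socket_accounting_enabled(void)
{
	return !nosocket;
}
```

A caller would run `parse_cgroup_memory("nosocket")` once at startup and then check `socket_accounting_enabled()` before charging socket buffers, which keeps the opt-out to a single branch on the accounting path.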
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Tue, 27 Oct 2015 09:01:08 -0700 Message-ID: <20151027155833.GB4665@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151027084320.GF13221@esperanza> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Vladimir Davydov Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 11:43:21AM +0300, Vladimir Davydov wrote: > On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote: > > I'm not getting rid of those knobs, I'm just reusing the old socket > > accounting infrastructure in an attempt to make the memory accounting > > feature useful to more people in cgroups v2 (unified hierarchy). > > My understanding is that in the meantime you effectively break the > existing per memcg tcp window control logic. That's not my intention, this stuff has to keep working. I'm assuming you mean the changes to sk_enter_memory_pressure() when hitting the charge limit; let me address this in the other subthread. > > We can always come back to think about per-cgroup tcp window limits in > > the unified hierarchy, my patches don't get in the way of this. I'm > > not removing the knobs in cgroups v1 and I'm not preventing them in v2. > > > > But regardless of tcp window control, we need to account socket memory > > in the main memory accounting pool where pressure is shared (to the > > best of our abilities) between all accounted memory consumers. > > > > No objections to this point. 
However, I really don't like the idea to > charge tcp window size to memory.current instead of charging individual > pages consumed by the workload for storing socket buffers, because it is > inconsistent with what we have now. Can't we charge individual skb pages > as we do in case of other kmem allocations? Absolutely, both work for me. I chose that route because it's where the networking code already tracks and accounts memory consumed, so it seemed like a better site to hook into. But I understand your concerns. We want to track this stuff as close to the memory allocators as possible. > > But also, there are people right now for whom the socket buffers cause > > system OOM, but the existing memcg's hard tcp window limit that > > exists absolutely wrecks network performance for them. It's not usable > > the way it is. It'd be much better to have the socket buffers exert > > pressure on the shared pool, and then propagate the overall pressure > > back to individual consumers with reclaim, shrinkers, vmpressure etc. > > This might or might not work. I'm not an expert to judge. But if you do > this only for memcg leaving the global case as it is, networking people > won't budge IMO. So could you please start such a major rework from the > global case? Could you please try to deprecate the tcp window limits not > only in the legacy memcg hierarchy, but also system-wide in order to > attract attention of networking experts? I'm definitely interested in addressing this globally as well. The idea behind this was to use the memcg part as a testbed. cgroup2 is going to be new and people are prepared for hiccups when migrating their applications to it; and they can roll back to cgroup1 and tcp window limits at any time should they run into problems in production. So this seemed like a good way to prove a new mechanism before rolling it out to every single Linux setup, rather than switch everybody over after the limited scope testing I can do as a developer on my own.
Keep in mind that my patches are not committing anything in terms of interface, so we retain all the freedom to fix and tune the way this is implemented, including the freedom to re-add tcp window limits in case the pressure balancing is not a comprehensive solution. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Tue, 27 Oct 2015 17:15:54 +0100 Message-ID: <20151027161554.GJ9891@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151027154138.GA4665@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue 27-10-15 11:41:38, Johannes Weiner wrote: > On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote: > > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > > [...] > > > Now you could argue that there might exist specialized workloads that > > > need to account anonymous pages and page cache, but not socket memory > > > buffers. > > > > Exactly, and there are loads doing this. Memcg groups are also created to > > limit anon/page cache consumers to not affect the others running on > > the system (basically in the root memcg context from memcg POV) which > > don't care about tracking and they definitely do not want to pay for an > > additional overhead. 
We should definitely be able to offer a global > > disable knob for them. The same applies to kmem accounting in general. > > I don't see how you make such a clear distinction between, say, page > cache and the dentry cache, and call one user memory and the other > kernel memory. Because the kernel memory footprint would be so small that it simply doesn't change the picture at all. While the page cache or anonymous memory consumption might be so large it might be disruptive. I am talking about loads where good enough is better than "perfect" and ephemeral global memory pressure when kmem goes over expectations is better than a permanent cpu overhead. Whatever we do it will always be non-zero. Also kmem accounting will make the load more non-deterministic because many of the resources are shared between tasks in separate cgroups unless they are explicitly configured. E.g. [id]cache will be shared and first to touch gets charged so you would end up with more false sharing. Nevertheless, I do not want to shift the discussion from the topic. I just think that one-fits-all simply won't work. > That just doesn't make sense to me. They're both kernel > memory allocated on behalf of the user, the only difference being that > one is tracked on the page level and the other on the slab level, and > we started accounting one before the other. > > IMO that's an implementation detail and a historical artifact that > should not be exposed to the user. And that's the thing I hate about > the current opt-out knob. > > > > I don't think there is a compelling case for an elaborate interface > > > to make individual memory consumers configurable inside the memory > > > controller. > > > > I do not think we need an elaborate interface. We just want to have > > a global boot time knob to overwrite the default behavior. This is > > few lines of code and it should give the sufficient flexibility. > > Okay, then let's add this for the socket memory to start with. 
I'll > have to think more about how to distinguish the slab-based consumers. > Or maybe you have an idea. Isn't that as simple as enabling the jump label during the initialization depending on the knob value? All the charging paths should be disabled by default already. > For now, something like this as a boot commandline? > > cgroup.memory=nosocket That would work for me. I would even see a place to have CONFIG_MEMCG_TCP_KMEM_ENABLED config option for the default and [no]socket as a kernel parameter to override the configuration default. This would allow distributions to define their policy without enforcing it hard and those who compile the kernel to define their own policy. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Tue, 27 Oct 2015 09:42:27 -0700 Message-ID: <20151027164227.GB7749@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151027161554.GJ9891@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko
wrote: > On Tue 27-10-15 11:41:38, Johannes Weiner wrote: > > On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote: > > > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > > > [...] > > > > Now you could argue that there might exist specialized workloads that > > > > need to account anonymous pages and page cache, but not socket memory > > > > buffers. > > > > > > Exactly, and there are loads doing this. Memcg groups are also created to > > > limit anon/page cache consumers to not affect the others running on > > > the system (basically in the root memcg context from memcg POV) which > > > don't care about tracking and they definitely do not want to pay for an > > > additional overhead. We should definitely be able to offer a global > > > disable knob for them. The same applies to kmem accounting in general. > > > > I don't see how you make such a clear distinction between, say, page > > cache and the dentry cache, and call one user memory and the other > > kernel memory. > > Because the kernel memory footprint would be so small that it simply > doesn't change the picture at all. While the page cache or anonymous > memory consumption might be so large it might be disruptive. Or it could be exactly the other way around when you have a workload that is heavy on filesystem metadata. I don't see why any scenario would be more important than the other. I'm not saying that distinguishing between consumers is wrong, just that "user memory vs kernel memory" is a false classification. Why do you call page cache user memory but dentry cache kernel memory? It doesn't make any sense. > Also kmem accounting will make the load more non-deterministic because > many of the resources are shared between tasks in separate cgroups > unless they are explicitly configured. E.g. [id]cache will be shared > and first to touch gets charged so you would end up with more false > sharing. Exactly like page cache. This differentiation isn't based on reality. 
> Nevertheless, I do not want to shift the discussion from the topic. I > just think that one-fits-all simply won't work. Okay, this is something we can converge on. > > That just doesn't make sense to me. They're both kernel > > memory allocated on behalf of the user, the only difference being that > > one is tracked on the page level and the other on the slab level, and > > we started accounting one before the other. > > > > IMO that's an implementation detail and a historical artifact that > > should not be exposed to the user. And that's the thing I hate about > > the current opt-out knob. You carefully skipped over this part. We can ignore it for socket memory but it's something we need to figure out when it comes to slab accounting and tracking. > > > > I don't think there is a compelling case for an elaborate interface > > > > to make individual memory consumers configurable inside the memory > > > > controller. > > > > > > I do not think we need an elaborate interface. We just want to have > > > a global boot time knob to overwrite the default behavior. This is > > > few lines of code and it should give the sufficient flexibility. > > > > Okay, then let's add this for the socket memory to start with. I'll > > have to think more about how to distinguish the slab-based consumers. > > Or maybe you have an idea. > > Isn't that as simple as enabling the jump label during the > initialization depending on the knob value? All the charging paths > should be disabled by default already. You missed my point. It's not about the implementation, it's about how we present these choices to the user. Having page cache accounting built in while presenting dentry+inode cache as a configurable extension is completely random and doesn't make sense. They are both first class memory consumers. They're not separate categories. One isn't more "core" than the other. > > For now, something like this as a boot commandline? > > > > cgroup.memory=nosocket > > That would work for me. 
Okay, then I'll go that route for the socket stuff. Dave is that cool with you? From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Miller Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Tue, 27 Oct 2015 17:45:32 -0700 (PDT) Message-ID: <20151027.174532.469361008055673315.davem@davemloft.net> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20151027164227.GB7749-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: Text/Plain; charset="us-ascii" To: hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org Cc: mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org From: Johannes Weiner Date: Tue, 27 Oct 2015 09:42:27 -0700 > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: >> > For now, something like this as a boot commandline? >> > >> > cgroup.memory=nosocket >> >> That would work for me. > > Okay, then I'll go that route for the socket stuff. > > Dave is that cool with you? Depends upon the default. Until the user configures something explicitly into the memory controller, the networking bits should all evaluate to nothing. 
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Tue, 27 Oct 2015 20:05:19 -0700 Message-ID: <20151028030519.GA20789@cmpxchg.org> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151027.174532.469361008055673315.davem@davemloft.net> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151027.174532.469361008055673315.davem@davemloft.net> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: David Miller Cc: mhocko@kernel.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 05:45:32PM -0700, David Miller wrote: > From: Johannes Weiner > Date: Tue, 27 Oct 2015 09:42:27 -0700 > > > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: > >> > For now, something like this as a boot commandline? > >> > > >> > cgroup.memory=nosocket > >> > >> That would work for me. > > > > Okay, then I'll go that route for the socket stuff. > > > > Dave is that cool with you? > > Depends upon the default. > > Until the user configures something explicitly into the memory > controller, the networking bits should all evaluate to nothing. Yep, I'll stick them behind a default-off jump label again. This bootflag is only to override an active memory controller configuration and force-off that jump label permanently. 
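The gating agreed on in the message above (accounting off by default, switched on only once the memory controller is explicitly configured, and a cgroup.memory=nosocket boot flag that forces it off for good) can be modelled with a small sketch. This is a toy user-space model in Python, not kernel code and not taken from the patch series: the class, its methods and the parsing helper are hypothetical names chosen for illustration, and a plain boolean stands in for the jump label.

```python
# Toy user-space model of the socket accounting gate described in the thread.
# Every name below is hypothetical; none of this is code from the patches.

def parse_cgroup_memory(cmdline):
    """Collect the comma-separated options of each cgroup.memory= argument."""
    opts = set()
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key == "cgroup.memory" and sep:
            opts.update(tok for tok in value.split(",") if tok)
    return opts


class SocketAccounting:
    def __init__(self, cmdline=""):
        # Boot flag: overrides any later memory controller configuration.
        self.forced_off = "nosocket" in parse_cgroup_memory(cmdline)
        # Stand-in for the default-off jump label.
        self.enabled = False

    def memory_controller_configured(self):
        """Model of the user explicitly setting up memory accounting."""
        if not self.forced_off:
            self.enabled = True

    def charge(self, nr_pages):
        """Return the pages charged; nothing is charged while the gate is off."""
        return nr_pages if self.enabled else 0
```

In this model a system booted without the flag charges nothing until memory_controller_configured() is called, which matches the requirement that the networking bits "evaluate to nothing" by default; a system booted with cgroup.memory=nosocket never charges socket memory, whatever is configured later.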
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Wed, 28 Oct 2015 11:20:03 +0300 Message-ID: <20151028082003.GK13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: Content-Disposition: inline In-Reply-To: <20151027155833.GB4665@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 09:01:08AM -0700, Johannes Weiner wrote: ... > > > But regardless of tcp window control, we need to account socket memory > > > in the main memory accounting pool where pressure is shared (to the > > > best of our abilities) between all accounted memory consumers. > > > > > > > No objections to this point. However, I really don't like the idea to > > charge tcp window size to memory.current instead of charging individual > > pages consumed by the workload for storing socket buffers, because it is > > inconsistent with what we have now. Can't we charge individual skb pages > > as we do in case of other kmem allocations? > > Absolutely, both work for me. I chose that route because it's where > the networking code already tracks and accounts memory consumed, so it > seemed like a better site to hook into. > > But I understand your concerns. We want to track this stuff as close > to the memory allocators as possible. Exactly. 
> > > > But also, there are people right now for whom the socket buffers cause > > > system OOM, but the existing memcg's hard tcp window limit that > > > exists absolutely wrecks network performance for them. It's not usable > > > the way it is. It'd be much better to have the socket buffers exert > > > pressure on the shared pool, and then propagate the overall pressure > > > back to individual consumers with reclaim, shrinkers, vmpressure etc. > > > > This might or might not work. I'm not an expert to judge. But if you do > > this only for memcg leaving the global case as it is, networking people > > won't budge IMO. So could you please start such a major rework from the > > global case? Could you please try to deprecate the tcp window limits not > > only in the legacy memcg hierarchy, but also system-wide in order to > > attract attention of networking experts? > > I'm definitely interested in addressing this globally as well. > > The idea behind this was to use the memcg part as a testbed. cgroup2 > is going to be new and people are prepared for hiccups when migrating > their applications to it; and they can roll back to cgroup1 and tcp > window limits at any time should they run into problems in production. Then you'd better not touch existing tcp limits at all, because they just work, and the logic behind them is very close to that of global tcp limits. I don't think one can simplify it somehow. Moreover, frankly I still have my reservations about this vmpressure propagation to skb you're proposing. It might work, but I doubt it will allow us to throw away explicit tcp limit, as I explained previously. So, even with your approach I think we can still need per memcg tcp limit *unless* you get rid of global tcp limit somehow. > > So this seemed like a good way to prove a new mechanism before rolling > it out to every single Linux setup, rather than switch everybody over > after the limited scope testing I can do as a developer on my own.
> > Keep in mind that my patches are not committing anything in terms of > interface, so we retain all the freedom to fix and tune the way this > is implemented, including the freedom to re-add tcp window limits in > case the pressure balancing is not a comprehensive solution. > I really dislike this kind of proof. It looks like you're trying to push something you think is right covertly, w/o having a proper discussion with networking people and then say that it just works and hence should be done globally, but what if it won't? Revert it? We already have a lot of dubious stuff in memcg that should be reverted, so let's please try to avoid this kind of mistakes in future. Note, I say "w/o having a proper discussion with networking people", because I don't think they will really care *unless* you change the global logic, simply because most of them aren't very interested in memcg AFAICS. That effectively means you lose a chance to listen to networking experts, who could point you at design flaws and propose an improvement right away. Let's please not miss such an opportunity. You said that you'd seen this problem happen w/o cgroups, so you have a use case that might need fixing at the global level. IMO it shouldn't be difficult to prepare an RFC patch for the global case first and see what people think about it. Thanks, Vladimir
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Wed, 28 Oct 2015 11:58:10 -0700 Message-ID: <20151028185810.GA31488@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151028082003.GK13221@esperanza> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Vladimir Davydov Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote: > Then you'd better not touch existing tcp limits at all, because they > just work, and the logic behind them is very close to that of global tcp > limits. I don't think one can simplify it somehow. Uhm, no, there is a crapload of boilerplate code and complication that seems entirely unnecessary. The only thing missing from my patch seems to be the part where it enters memory pressure state when the limit is hit. I'm adding this for completeness, but I doubt it even matters. > Moreover, frankly I still have my reservations about this vmpressure > propagation to skb you're proposing. It might work, but I doubt it > will allow us to throw away explicit tcp limit, as I explained > previously. So, even with your approach I think we can still need > per memcg tcp limit *unless* you get rid of global tcp limit > somehow.
Having the hard limit as a failsafe (or a minimum for other consumers) is one thing, and certainly something I'm open to for cgroupv2, should we have problems with load startup after a socket memory landgrab. That being said, if the VM is struggling to reclaim pages, or is even swapping, it makes perfect sense to let the socket memory scheduler know it shouldn't continue to increase its footprint until the VM recovers. Regardless of any hard limitations/minimum guarantees. This is what my patch does and it seems pretty straight-forward to me. I don't really understand why this is so controversial. The *next* step would be to figure out whether we can actually *reclaim* memory in the network subsystem--shrink windows and steal buffers back--and that might even be an avenue to replace tcp window limits. But it's not necessary for *this* patch series to be useful. > > So this seemed like a good way to prove a new mechanism before rolling > > it out to every single Linux setup, rather than switch everybody over > > after the limited scope testing I can do as a developer on my own. > > > > Keep in mind that my patches are not committing anything in terms of > > interface, so we retain all the freedom to fix and tune the way this > > is implemented, including the freedom to re-add tcp window limits in > > case the pressure balancing is not a comprehensive solution. > > I really dislike this kind of proof. It looks like you're trying to > push something you think is right covertly, w/o having a proper > discussion with networking people and then say that it just works > and hence should be done globally, but what if it won't? Revert it? > We already have a lot of dubious stuff in memcg that should be > reverted, so let's please try to avoid this kind of mistakes in > future.
Note, I say "w/o having a proper discussion with networking > people", because I don't think they will really care *unless* you > change the global logic, simply because most of them aren't very > interested in memcg AFAICS. Come on, Dave is the first To and netdev is CC'd. They might not care about memcg, but "pushing things covertly" is a bit of a stretch. > That effectively means you lose a chance to listen to networking > experts, who could point you at design flaws and propose an improvement > right away. Let's please not miss such an opportunity. You said that > you'd seen this problem happen w/o cgroups, so you have a use case that > might need fixing at the global level. IMO it shouldn't be difficult to > prepare an RFC patch for the global case first and see what people think > about it. No, the problem we are running into is when network memory is not tracked per cgroup. The lack of containment means that the socket memory consumption of individual cgroups can trigger system OOM. We tried using the per-memcg tcp limits, and that prevents the OOMs for sure, but it's horrendous for network performance. There is no "stop growing" phase, it just keeps going full throttle until it hits the wall hard. Now, we could probably try to replicate the global knobs and add a per-memcg soft limit. But you know better than anyone else how hard it is to estimate the overall workingset size of a workload, and the margins on containerized loads are razor-thin. Performance is much more sensitive to input errors, and often times parameters must be adjusted continuously during the runtime of a workload. It'd be disastrous to rely on yet more static, error-prone user input here. What all this means to me is that fixing it on the cgroup level has higher priority. But it also means that once we figured it out under such a high-pressure environment, it's much easier to apply to the global case and potentially replace the soft limit there.
This seems like a better approach to me than starting globally, only to realize that the solution is not workable for cgroups and we need yet something else. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Thu, 29 Oct 2015 12:27:47 +0300 Message-ID: <20151029092747.GR13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> <20151028185810.GA31488@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: Content-Disposition: inline In-Reply-To: <20151028185810.GA31488@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote: > On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote: > > Then you'd better not touch existing tcp limits at all, because they > > just work, and the logic behind them is very close to that of global tcp > > limits. I don't think one can simplify it somehow. > > Uhm, no, there is a crapload of boilerplate code and complication that > seems entirely unnecessary. The only thing missing from my patch seems > to be the part where it enters memory pressure state when the limit is > hit. I'm adding this for completeness, but I doubt it even matters.
> > > Moreover, frankly I still have my reservations about this vmpressure > > propagation to skb you're proposing. It might work, but I doubt it > > will allow us to throw away explicit tcp limit, as I explained > > previously. So, even with your approach I think we can still need > > per memcg tcp limit *unless* you get rid of global tcp limit > > somehow. > > Having the hard limit as a failsafe (or a minimum for other consumers) > is one thing, and certainly something I'm open to for cgroupv2, should > we have problems with load startup up after a socket memory landgrab. > > That being said, if the VM is struggling to reclaim pages, or is even > swapping, it makes perfect sense to let the socket memory scheduler > know it shouldn't continue to increase its footprint until the VM > recovers. Regardless of any hard limitations/minimum guarantees. > > This is what my patch does and it seems pretty straight-forward to > me. I don't really understand why this is so controversial. I'm not arguing that the idea behind this patch set is necessarily bad. Quite the contrary, it does look interesting to me. I'm just saying that IMO it can't replace hard/soft limits. It probably could if it was possible to shrink buffers, but I don't think it's feasible, even theoretically. That's why I propose not to change the behavior of the existing per memcg tcp limit at all. And frankly I don't get why you are so keen on simplifying it. You say it's a "crapload of boilerplate code". Well, I don't see how it is - it just replicates global knobs and I don't see how it could be done in a better way. The code is hidden behind jump labels, so the overhead is zero if it isn't used. If you really dislike this code, we can isolate it under a separate config option. But all right, I don't rule out the possibility that the code could be simplified. If you do that w/o breaking it, that'll be OK to me, but I don't see why it should be related to this particular patch set. 
> > The *next* step would be to figure out whether we can actually > *reclaim* memory in the network subsystem--shrink windows and steal > buffers back--and that might even be an avenue to replace tcp window > limits. But it's not necessary for *this* patch series to be useful. Again, I don't think we can *reclaim* network memory, but you're right. > > > > So this seemed like a good way to prove a new mechanism before rolling > > > it out to every single Linux setup, rather than switch everybody over > > > after the limited scope testing I can do as a developer on my own. > > > > > > Keep in mind that my patches are not committing anything in terms of > > > interface, so we retain all the freedom to fix and tune the way this > > > is implemented, including the freedom to re-add tcp window limits in > > > case the pressure balancing is not a comprehensive solution. > > > > I really dislike this kind of proof. It looks like you're trying to > > push something you think is right covertly, w/o having a proper > > discussion with networking people and then say that it just works > > and hence should be done globally, but what if it won't? Revert it? > > We already have a lot of dubious stuff in memcg that should be > > reverted, so let's please try to avoid this kind of mistakes in > > future. Note, I say "w/o having a proper discussion with networking > > people", because I don't think they will really care *unless* you > > change the global logic, simply because most of them aren't very > > interested in memcg AFAICS. > > Come on, Dave is the first To and netdev is CC'd. They might not care > about memcg, but "pushing things covertly" is a bit of a stretch. Sorry if it sounded rude to you. 
I just look back at my experience patching slab internals to make kmem
accountable, and AFAICS Christoph didn't really care about *what* I was
doing, he only cared about the global case - if there was no performance
degradation when kmemcg was disabled, he was usually fine with it, even
if from the memcg pov it was crap. Anyway, I can't force you to patch
the global case first or simultaneously with the memcg case, so let's
just hope I'm being a bit overcautious.

>
> > That effectively means you lose a chance to listen to networking
> > experts, who could point you at design flaws and propose an improvement
> > right away. Let's please not miss such an opportunity. You said that
> > you'd seen this problem happen w/o cgroups, so you have a use case that
> > might need fixing at the global level. IMO it shouldn't be difficult to
> > prepare an RFC patch for the global case first and see what people think
> > about it.
>
> No, the problem we are running into is when network memory is not
> tracked per cgroup. The lack of containment means that the socket
> memory consumption of individual cgroups can trigger system OOM.
>
> We tried using the per-memcg tcp limits, and that prevents the OOMs
> for sure, but it's horrendous for network performance. There is no
> "stop growing" phase, it just keeps going full throttle until it hits
> the wall hard.
>
> Now, we could probably try to replicate the global knobs and add a
> per-memcg soft limit. But you know better than anyone else how hard it
> is to estimate the overall workingset size of a workload, and the
> margins on containerized loads are razor-thin. Performance is much
> more sensitive to input errors, and often times parameters must be
> adjusted continuously during the runtime of a workload. It'd be
> disastrous to rely on yet more static, error-prone user input here.

Yeah, but the dynamic approach proposed in your patch set doesn't
guarantee we won't hit OOM in memcg due to overgrown buffers.
It just reduces this possibility. Of course, memcg OOM is not nearly as
disastrous as the global one, but it still usually means workload
breakage.

The static approach is error-prone for sure, but it has existed for
years and worked satisfactorily AFAIK.

>
> What all this means to me is that fixing it on the cgroup level has
> higher priority. But it also means that once we figured it out under
> such a high-pressure environment, it's much easier to apply to the
> global case and potentially replace the soft limit there.
>
> This seems like a better approach to me than starting globally, only
> to realize that the solution is not workable for cgroups and we need
> yet something else.
>

Are we in a rush? I think if you try your approach at the global level
and fail, it's still good, because it will probably give us all a better
understanding of the problem. If you successfully fix the global case,
but then realize that it doesn't fit memcg, it's even better, because
you actually fixed a problem. If you patch both global and memcg cases,
it's perfect.

But of course, that's my understanding and I may be mistaken. Let's hope
you're right.

Thanks,
Vladimir

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
Date: Thu, 29 Oct 2015 16:25:46 +0100
Message-ID: <20151029152546.GG23598@dhcp22.suse.cz>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
 <1445487696-21545-6-git-send-email-hannes@cmpxchg.org>
 <20151023131956.GA15375@dhcp22.suse.cz>
 <20151023.065957.1690815054807881760.davem@davemloft.net>
 <20151026165619.GB2214@cmpxchg.org>
 <20151027122647.GG9891@dhcp22.suse.cz>
 <20151027154138.GA4665@cmpxchg.org>
 <20151027161554.GJ9891@dhcp22.suse.cz>
 <20151027164227.GB7749@cmpxchg.org>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20151027164227.GB7749@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Johannes Weiner
Cc: David Miller, akpm@linux-foundation.org, vdavydov@virtuozzo.com,
 tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> > On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
[...]
> Or it could be exactly the other way around when you have a workload
> that is heavy on filesystem metadata. I don't see why any scenario
> would be more important than the other.

Yes I definitely agree. No scenario is more important. We can only come
up with a default that makes more sense for the majority and allow the
minority to override. That was what I wanted to say basically.

> I'm not saying that distinguishing between consumers is wrong, just
> that "user memory vs kernel memory" is a false classification. Why do
> you call page cache user memory but dentry cache kernel memory? It
> doesn't make any sense.

We are not talking about dcache vs. page cache alone here, though.
We are talking about _all_ slab allocations vs. only user accessed
memory. The slab consumption is directly under kernel control. A great
pile of this logic is completely hidden from userspace. While the user
can estimate the user memory, it is hard (if possible at all) to do that
for the kernel memory footprint - not even mentioning that this is
variable and dependent on the particular kernel version.

> > Also kmem accounting will make the load more non-deterministic because
> > many of the resources are shared between tasks in separate cgroups
> > unless they are explicitly configured. E.g. [id]cache will be shared
> > and first to touch gets charged so you would end up with more false
> > sharing.
>
> Exactly like page cache. This differentiation isn't based on reality.

Yes, false sharing is an existing and long term problem already. I just
wanted to point out that the false sharing would be an even bigger
problem because some kernel tracked resources are shared more naturally
than file sharing.

> > > IMO that's an implementation detail and a historical artifact that
> > > should not be exposed to the user. And that's the thing I hate about
> > > the current opt-out knob.
>
> You carefully skipped over this part. We can ignore it for socket
> memory but it's something we need to figure out when it comes to slab
> accounting and tracking.

I am sorry, I didn't mean to skip this part, I thought it would be clear
from the previous text. I think kmem accounting falls into the same
category. Have a sane default and a global boottime knob to override it
for those that think differently - for whatever reason they might have.

[...]
> Having page cache accounting built in while presenting dentry+inode
> cache as a configurable extension is completely random and doesn't
> make sense. They are both first class memory consumers. They're not
> separate categories. One isn't more "core" than the other.

Again we are talking about all slab allocations, not just the dcache.
> > > For now, something like this as a boot commandline?
> > >
> > > cgroup.memory=nosocket
> >
> > That would work for me.
>
> Okay, then I'll go that route for the socket stuff.

Thanks!
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
Date: Thu, 29 Oct 2015 09:10:09 -0700
Message-ID: <20151029161009.GA9160@cmpxchg.org>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
 <1445487696-21545-6-git-send-email-hannes@cmpxchg.org>
 <20151023131956.GA15375@dhcp22.suse.cz>
 <20151023.065957.1690815054807881760.davem@davemloft.net>
 <20151026165619.GB2214@cmpxchg.org>
 <20151027122647.GG9891@dhcp22.suse.cz>
 <20151027154138.GA4665@cmpxchg.org>
 <20151027161554.GJ9891@dhcp22.suse.cz>
 <20151027164227.GB7749@cmpxchg.org>
 <20151029152546.GG23598@dhcp22.suse.cz>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20151029152546.GG23598-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Michal Hocko
Cc: David Miller, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
 vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org,
 tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
 netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
 cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> > > On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
> > > > IMO
that's an implementation detail and a historical artifact that
> > > > should not be exposed to the user. And that's the thing I hate about
> > > > the current opt-out knob.
> >
> > You carefully skipped over this part. We can ignore it for socket
> > memory but it's something we need to figure out when it comes to slab
> > accounting and tracking.
>
> I am sorry, I didn't mean to skip this part, I thought it would be clear
> from the previous text. I think kmem accounting falls into the same
> category. Have a sane default and a global boottime knob to override it
> for those that think differently - for whatever reason they might have.

Yes, that makes sense to me.

Like cgroup.memory=nosocket, do you think it makes sense to include
slab in the default for functional/semantical completeness and provide
a cgroup.memory=noslab for powerusers?

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Date: Thu, 29 Oct 2015 10:52:28 -0700
Message-ID: <20151029175228.GB9160@cmpxchg.org>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
 <20151022184509.GM18351@esperanza>
 <20151026172216.GC2214@cmpxchg.org>
 <20151027084320.GF13221@esperanza>
 <20151027155833.GB4665@cmpxchg.org>
 <20151028082003.GK13221@esperanza>
 <20151028185810.GA31488@cmpxchg.org>
 <20151029092747.GR13221@esperanza>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20151029092747.GR13221@esperanza>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Vladimir Davydov
Cc: "David S.
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Oct 29, 2015 at 12:27:47PM +0300, Vladimir Davydov wrote: > On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote: > > Having the hard limit as a failsafe (or a minimum for other consumers) > > is one thing, and certainly something I'm open to for cgroupv2, should > > we have problems with load startup up after a socket memory landgrab. > > > > That being said, if the VM is struggling to reclaim pages, or is even > > swapping, it makes perfect sense to let the socket memory scheduler > > know it shouldn't continue to increase its footprint until the VM > > recovers. Regardless of any hard limitations/minimum guarantees. > > > > This is what my patch does and it seems pretty straight-forward to > > me. I don't really understand why this is so controversial. > > I'm not arguing that the idea behind this patch set is necessarily bad. > Quite the contrary, it does look interesting to me. I'm just saying that > IMO it can't replace hard/soft limits. It probably could if it was > possible to shrink buffers, but I don't think it's feasible, even > theoretically. That's why I propose not to change the behavior of the > existing per memcg tcp limit at all. And frankly I don't get why you are > so keen on simplifying it. You say it's a "crapload of boilerplate > code". Well, I don't see how it is - it just replicates global knobs and > I don't see how it could be done in a better way. The code is hidden > behind jump labels, so the overhead is zero if it isn't used. If you > really dislike this code, we can isolate it under a separate config > option. But all right, I don't rule out the possibility that the code > could be simplified. 
If you do that w/o breaking it, that'll be OK to > me, but I don't see why it should be related to this particular patch > set. Okay, I see your concern. I'm not trying to change the behavior, just the implementation, because it's too complex for the functionality it actually provides. And the reason it's part of this patch set is because I'm using the same code to hook into the memory accounting, so it makes sense to refactor this stuff in the same go. There is also a niceness factor of not adding more memcg callbacks to the networking subsystem when there is an option to consolidate them. Now, you mentioned that you'd rather see the socket buffers accounted at the allocator level, but I looked at the different allocation paths and network protocols and I'm not convinced that this makes sense. We don't want to be in the hotpath of every single packet when a lot of them are small, short-lived management blips that don't involve user space to let the kernel dispose of them. __sk_mem_schedule() on the other hand is already wired up to exactly those consumers we are interested in for memory isolation: those with bigger chunks of data attached to them and those that have exploding receive queues when userspace fails to read(). UDP and TCP. I mean, there is a reason why the global memory limits apply to only those types of packets in the first place: everything else is noise. I agree that it's appealing to account at the allocator level and set page->mem_cgroup etc. but in this case we'd pay extra to capture a lot of noise, and I don't want to pay that just for aesthetics. In this case it's better to track ownership on the socket level and only count packets that can accumulate a significant amount of memory consumed. > > We tried using the per-memcg tcp limits, and that prevents the OOMs > > for sure, but it's horrendous for network performance. There is no > > "stop growing" phase, it just keeps going full throttle until it hits > > the wall hard. 
> >
> > Now, we could probably try to replicate the global knobs and add a
> > per-memcg soft limit. But you know better than anyone else how hard it
> > is to estimate the overall workingset size of a workload, and the
> > margins on containerized loads are razor-thin. Performance is much
> > more sensitive to input errors, and often times parameters must be
> > adjusted continuously during the runtime of a workload. It'd be
> > disastrous to rely on yet more static, error-prone user input here.
>
> Yeah, but the dynamic approach proposed in your patch set doesn't
> guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> reduces this possibility. Of course, memcg OOM is not nearly as disastrous
> as the global one, but it still usually means workload breakage.

Right now, the entire machine breaks. Confining it to a faulty memcg,
as well as reducing the likelihood of that OOM in many cases seems
like a good move in the right direction, no?

And how likely are memcg OOMs because of this anyway? There is of
course a scenario imaginable where the packets pile up, followed by
some *other* part of the workload, the one that doesn't read() and
process packets, trying to expand--which then doesn't work and goes
OOM. But that seems like a complete corner case. In the vast majority
of cases, the application will be in full operation and just fail to
read() fast enough--because the network bandwidth is enormous compared
to the container's size, or because it shares the CPU with thousands
of other workloads and there is scheduling latency.

This would be the perfect point to rein in the transmit window...

> The static approach is error-prone for sure, but it has existed for
> years and worked satisfactorily AFAIK.

...but that point is not a fixed amount of memory consumed. It depends
on the workload and the random interactions it's having with thousands
of other containers on that same machine.
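To make the distinction above a little more concrete, here is a rough
user-space sketch. It is not code from this series: struct pool,
hard_limit_charge() and may_grow_window() are made-up names, and the
single under_pressure flag is an assumption standing in for whatever
signal the VM would actually provide. It only contrasts a static limit,
where growth is unchecked until a charge fails, with a pressure signal
that stops window growth regardless of how much memory is in use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cgroup's memory pool (not a kernel type). */
struct pool {
	size_t usage;        /* pages currently charged */
	size_t limit;        /* static hard limit, in pages */
	bool under_pressure; /* assumed flag: set while reclaim struggles */
};

/*
 * Static hard limit: charges succeed at full speed until the limit is
 * reached, then fail outright. Assumes usage <= limit on entry.
 */
static bool hard_limit_charge(struct pool *p, size_t nr_pages)
{
	if (nr_pages > p->limit - p->usage)
		return false;
	p->usage += nr_pages;
	return true;
}

/*
 * Pressure feedback: the decision to grow a window depends on the
 * state of the pool, not on a fixed amount of memory consumed.
 */
static bool may_grow_window(const struct pool *p)
{
	return !p->under_pressure;
}
```

With a limit of 4 pages, a charge of 3 succeeds and a further charge of
2 fails, no matter whether the group is struggling or idle. The second
helper says "stop growing" as soon as the flag is set, even if usage is
far below any limit.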
The point of containers is to maximize utilization of your hardware and systematically eliminate slack in the system. But it's exactly that slack on dedicated bare-metal machines that allowed us to take a wild guess at the settings and then tune them based on observing a handful of workloads. This approach is not going to work anymore when we pack the machine to capacity and still expect every single container out of thousands to perform well. We need that automation. The static setting working okay on the global level is also why I'm not interested in starting to experiment with it. There is no reason to change it. It's much more likely that any attempt to change it will be shot down, not because of the approach chosen, but because there is no problem to solve there. I doubt we can get networking people to care about containers by screwing with things that work for them ;-) From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Mon, 2 Nov 2015 17:47:29 +0300 Message-ID: <20151102144729.GA17424@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> <20151028185810.GA31488@cmpxchg.org> <20151029092747.GR13221@esperanza> <20151029175228.GB9160@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: Content-Disposition: inline In-Reply-To: <20151029175228.GB9160@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 10:52:28AM -0700, Johannes Weiner wrote: ... 
> Now, you mentioned that you'd rather see the socket buffers accounted > at the allocator level, but I looked at the different allocation paths > and network protocols and I'm not convinced that this makes sense. We > don't want to be in the hotpath of every single packet when a lot of > them are small, short-lived management blips that don't involve user > space to let the kernel dispose of them. > > __sk_mem_schedule() on the other hand is already wired up to exactly > those consumers we are interested in for memory isolation: those with > bigger chunks of data attached to them and those that have exploding > receive queues when userspace fails to read(). UDP and TCP. > > I mean, there is a reason why the global memory limits apply to only > those types of packets in the first place: everything else is noise. > > I agree that it's appealing to account at the allocator level and set > page->mem_cgroup etc. but in this case we'd pay extra to capture a lot > of noise, and I don't want to pay that just for aesthetics. In this > case it's better to track ownership on the socket level and only count > packets that can accumulate a significant amount of memory consumed. Sigh, you seem to be right. Moreover, I can't even think of a neat way to account skb pages to memcg, because rcv skbs are generated in device drivers, where we don't know which socket/memcg it will go to. We could recharge individual pages when skb gets to the network or transport layer, but it would result in unjustified overhead. > > > > We tried using the per-memcg tcp limits, and that prevents the OOMs > > > for sure, but it's horrendous for network performance. There is no > > > "stop growing" phase, it just keeps going full throttle until it hits > > > the wall hard. > > > > > > Now, we could probably try to replicate the global knobs and add a > > > per-memcg soft limit. 
But you know better than anyone else how hard it > > > is to estimate the overall workingset size of a workload, and the > > > margins on containerized loads are razor-thin. Performance is much > > > more sensitive to input errors, and often times parameters must be > > > adjusted continuously during the runtime of a workload. It'd be > > > disasterous to rely on yet more static, error-prone user input here. > > > > Yeah, but the dynamic approach proposed in your patch set doesn't > > guarantee we won't hit OOM in memcg due to overgrown buffers. It just > > reduces this possibility. Of course, memcg OOM is far not as disastrous > > as the global one, but still it usually means the workload breakage. > > Right now, the entire machine breaks. Confining it to a faulty memcg, > as well as reducing the likelihood of that OOM in many cases seems > like a good move in the right direction, no? It seems. However, memcg OOM is also bad, we should strive to avoid it if we can. > > And how likely are memcg OOMs because of this anyway? There is of Frankly, I've no idea. Your arguments below sound reassuring though. > course a scenario imaginable where the packets pile up, followed by > some *other* part of the workload, the one that doesn't read() and > process packets, trying to expand--which then doesn't work and goes > OOM. But that seems like a complete corner case. In the vast majority > of cases, the application will be in full operation and just fail to > read() fast enough--because the network bandwidth is enormous compared > to the container's size, or because it shares the CPU with thousands > of other workloads and there is scheduling latency. > > This would be the perfect point to reign in the transmit window... > > > The static approach is error-prone for sure, but it has existed for > > years and worked satisfactory AFAIK. > > ...but that point is not a fixed amount of memory consumed. 
It depends
> on the workload and the random interactions it's having with thousands
> of other containers on that same machine.
>
> The point of containers is to maximize utilization of your hardware
> and systematically eliminate slack in the system. But it's exactly
> that slack on dedicated bare-metal machines that allowed us to take a
> wild guess at the settings and then tune them based on observing a
> handful of workloads. This approach is not going to work anymore when
> we pack the machine to capacity and still expect every single
> container out of thousands to perform well. We need that automation.

But we do use a static approach when setting memory limits, no?
memory.{low,high,max} - they are all static.

I understand it's appealing to have just one knob - memory size - like
in the case of virtual machines, but it doesn't seem to work with
containers. You added memory.low and memory.high knobs. VMs don't have
anything like that. How is one supposed to set them? Depends on the
workload, I guess. Also, there is the pids cgroup for limiting the
number of pids that can be used by a cgroup, because pid turns out to be
a resource in the case of containers. Maybe the tcp window should be
considered a separate resource as well, as it is now, and shouldn't go
to memcg? I'm just wondering...

>
> The static setting working okay on the global level is also why I'm
> not interested in starting to experiment with it. There is no reason
> to change it. It's much more likely that any attempt to change it will
> be shot down, not because of the approach chosen, but because there is
> no problem to solve there. I doubt we can get networking people to
> care about containers by screwing with things that work for them ;-)

Fair enough.
Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Wed, 4 Nov 2015 11:42:40 +0100 Message-ID: <20151104104239.GG29607@dhcp22.suse.cz> References: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151029161009.GA9160@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 29-10-15 09:10:09, Johannes Weiner wrote: > On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote: > > On Tue 27-10-15 09:42:27, Johannes Weiner wrote: [...] > > > You carefully skipped over this part. We can ignore it for socket > > > memory but it's something we need to figure out when it comes to slab > > > accounting and tracking. > > > > I am sorry, I didn't mean to skip this part, I though it would be clear > > from the previous text. I think kmem accounting falls into the same > > category. Have a sane default and a global boottime knob to override it > > for those that think differently - for whatever reason they might have. > > Yes, that makes sense to me. > > Like cgroup.memory=nosocket, would you think it makes sense to include > slab in the default for functional/semantical completeness and provide > a cgroup.memory=noslab for powerusers? 
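For illustration only, a boot option of the shape discussed above could
be parsed roughly as follows. This is a user-space sketch, not the
patch: struct memcg_boot_opts and parse_cgroup_memory() are hypothetical
names, and accepting a comma-separated list such as "nosocket,noslab" is
an assumption that the thread does not settle. Unknown tokens are
silently ignored here, which is also just a choice made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result of parsing "cgroup.memory=<list>". */
struct memcg_boot_opts {
	bool nosocket; /* opt out of socket memory accounting */
	bool noslab;   /* opt out of slab accounting */
};

/* True if the len-byte token at tok equals the string name. */
static bool token_is(const char *tok, size_t len, const char *name)
{
	return strlen(name) == len && strncmp(tok, name, len) == 0;
}

/* Parse the value after "cgroup.memory=", e.g. "nosocket,noslab". */
static void parse_cgroup_memory(const char *arg, struct memcg_boot_opts *o)
{
	o->nosocket = false;
	o->noslab = false;

	while (*arg) {
		size_t len = strcspn(arg, ","); /* length of this token */

		if (token_is(arg, len, "nosocket"))
			o->nosocket = true;
		else if (token_is(arg, len, "noslab"))
			o->noslab = true;

		arg += len;
		if (*arg == ',')
			arg++; /* skip the separator */
	}
}
```

Everything stays accounted by default in this sketch; a flag is only set
when the user explicitly opts out, which matches the "sane default plus
boot-time override" idea from the discussion.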
I am still not sure whether the kmem accounting is stable enough to be enabled by default. If for nothing else the allocation failures, which are not allowed for the global case and easily triggered by the hard limit, might be a big problem. My last attempts to allow GFP_NOFS to fail made me quite skeptical. I still believe this is something which will be solved in the long term but the current state might be still too fragile. So I would rather be conservative and have the kmem accounting disabled by default with a config option and boot parameter to override. If somebody is confident that the desired load is stable then the config can be enabled easily. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Wed, 4 Nov 2015 14:50:37 -0500 Message-ID: <20151104195037.GA6872@cmpxchg.org> References: <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151104104239.GG29607-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: David Miller , akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, 
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, Nov 04, 2015 at 11:42:40AM +0100, Michal Hocko wrote: > On Thu 29-10-15 09:10:09, Johannes Weiner wrote: > > On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote: > > > On Tue 27-10-15 09:42:27, Johannes Weiner wrote: > [...] > > > > You carefully skipped over this part. We can ignore it for socket > > > > memory but it's something we need to figure out when it comes to slab > > > > accounting and tracking. > > > > > > I am sorry, I didn't mean to skip this part, I though it would be clear > > > from the previous text. I think kmem accounting falls into the same > > > category. Have a sane default and a global boottime knob to override it > > > for those that think differently - for whatever reason they might have. > > > > Yes, that makes sense to me. > > > > Like cgroup.memory=nosocket, would you think it makes sense to include > > slab in the default for functional/semantical completeness and provide > > a cgroup.memory=noslab for powerusers? > > I am still not sure whether the kmem accounting is stable enough to be > enabled by default. If for nothing else the allocation failures, which > are not allowed for the global case and easily triggered by the hard > limit, might be a big problem. My last attempts to allow GFP_NOFS to > fail made me quite skeptical. I still believe this is something which > will be solved in the long term but the current state might be still too > fragile. So I would rather be conservative and have the kmem accounting > disabled by default with a config option and boot parameter to override. > If somebody is confident that the desired load is stable then the config > can be enabled easily. I agree with your assessment of the current kmem code state, but I think your conclusion is completely backwards here. The interface will be set in stone forever, whereas any stability issues will be transient and will have to be addressed in a finite amount of time anyway. 
It doesn't make sense to design an interface based on temporary
quality of implementation. Only one of those two can ever be changed.

Because it goes without saying that once the cgroupv2 interface is
released, and people use it in production, there is no way we can then
*add* dentry cache, inode cache, and others to memory.current. That
would be an unacceptable change in interface behavior. On the other
hand, people will be prepared for hiccups in the early stages of
cgroupv2 release, and we're providing cgroup.memory=noslab to let them
work around severe problems in production until we fix it without
forcing them to fully revert to cgroupv1.

So if we agree that there are no fundamental architectural concerns
with slab accounting, i.e. nothing that can't be addressed in the
implementation, we have to make the call now.

And I maintain that not accounting dentry cache and inode cache is a
gaping hole in memory isolation, so it should be included by default.
(The rest of the slabs is arguable, but IMO the risk of missing
something important is higher than the cost of including them.)

As far as your allocation failure concerns go, I think the kmem code
is currently not behaving as Glauber originally intended, which is to
force charge if reclaim and OOM killing weren't able to make enough
space. See this recently rewritten section of the kmem charge path:

-		/*
-		 * try_charge() chose to bypass to root due to OOM kill or
-		 * fatal signal. Since our only options are to either fail
-		 * the allocation or charge it to this cgroup, do it as a
-		 * temporary condition. But we can't fail. From a kmem/slab
-		 * perspective, the cache has already been selected, by
-		 * mem_cgroup_kmem_get_cache(), so it is too late to change
-		 * our minds.
-		 *
-		 * This condition will only trigger if the task entered
-		 * memcg_charge_kmem in a sane state, but was OOM-killed
-		 * during try_charge() above. Tasks that were already dying
-		 * when the allocation triggers should have been already
-		 * directed to the root cgroup in memcontrol.h
-		 */
-		page_counter_charge(&memcg->memory, nr_pages);
-		if (do_swap_account)
-			page_counter_charge(&memcg->memsw, nr_pages);

It could be that this never properly worked as it was tied to the
-EINTR bypass trick, but the idea was these charges never fail.

And this makes sense. If the allocator semantics are such that we
never fail these page allocations for slab, and the callsites rely on
that, surely we should not fail them in the memory controller, either.

And it makes a lot more sense to account them in excess of the limit
than pretend they don't exist. We might not be able to completely
fulfill the containment part of the memory controller (although these
slab charges will still create significant pressure before that), but
at least we don't fail the accounting part on top of it.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
Date: Thu, 5 Nov 2015 15:40:02 +0100
Message-ID: <20151105144002.GB15111@dhcp22.suse.cz>
References: <20151023.065957.1690815054807881760.davem@davemloft.net>
 <20151026165619.GB2214@cmpxchg.org>
 <20151027122647.GG9891@dhcp22.suse.cz>
 <20151027154138.GA4665@cmpxchg.org>
 <20151027161554.GJ9891@dhcp22.suse.cz>
 <20151027164227.GB7749@cmpxchg.org>
 <20151029152546.GG23598@dhcp22.suse.cz>
 <20151029161009.GA9160@cmpxchg.org>
 <20151104104239.GG29607@dhcp22.suse.cz>
 <20151104195037.GA6872@cmpxchg.org>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20151104195037.GA6872@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Johannes Weiner
Cc: David Miller, akpm@linux-foundation.org, vdavydov@virtuozzo.com,
 tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
On Wed 04-11-15 14:50:37, Johannes Weiner wrote: [...] > Because it goes without saying that once the cgroupv2 interface is > released, and people use it in production, there is no way we can then > *add* dentry cache, inode cache, and others to memory.current. That > would be an unacceptable change in interface behavior. They would still have to _enable_ the config option _explicitly_. make oldconfig wouldn't change it silently for them. I do not think it is an unacceptable change of behavior if the config is changed explicitly. > On the other > hand, people will be prepared for hiccups in the early stages of > cgroupv2 release, and we're providing cgroup.memory=noslab to let them > workaround severe problems in production until we fix it without > forcing them to fully revert to cgroupv1. This would be true if they moved on to the new cgroup API intentionally. The reality is more complicated though. AFAIK systemd is waiting for cgroup2 already, and privileged services enable all available resource controllers by default, as I've learned just recently. If we know that the interface is not stable enough then we are basically forcing _most_ users to use the kernel boot parameter if we stay with the current kmem semantics. More on that below. > So if we agree that there are no fundamental architectural concerns > with slab accounting, i.e. nothing that can't be addressed in the > implementation, we have to make the call now. We are on the same page here. > And I maintain that not accounting dentry cache and inode cache is a > gaping hole in memory isolation, so it should be included by default. > (The rest of the slabs is arguable, but IMO the risk of missing > something important is higher than the cost of including them.) More on that below. > As far as your allocation failure concerns go, I think the kmem code > is currently not behaving as Glauber originally intended, which is to > force charge if reclaim and OOM killing weren't able to make enough > space.
See this recently rewritten section of the kmem charge path: > > - /* > - * try_charge() chose to bypass to root due to OOM kill or > - * fatal signal. Since our only options are to either fail > - * the allocation or charge it to this cgroup, do it as a > - * temporary condition. But we can't fail. From a kmem/slab > - * perspective, the cache has already been selected, by > - * mem_cgroup_kmem_get_cache(), so it is too late to change > - * our minds. > - * > - * This condition will only trigger if the task entered > - * memcg_charge_kmem in a sane state, but was OOM-killed > - * during try_charge() above. Tasks that were already dying > - * when the allocation triggers should have been already > - * directed to the root cgroup in memcontrol.h > - */ > - page_counter_charge(&memcg->memory, nr_pages); > - if (do_swap_account) > - page_counter_charge(&memcg->memsw, nr_pages); > > It could be that this never properly worked as it was tied to the > -EINTR bypass trick, but the idea was these charges never fail. I have always understood this path as a corner case when the task is an OOM victim or exiting. So this would be only a temporary condition which cannot cause a complete runaway. > And this makes sense. If the allocator semantics are such that we > never fail these page allocations for slab, and the callsites rely on > that, surely we should not fail them in the memory controller, either. Then we can only bypass them or loop inside the charge code forever like we do in the page allocator. The latter is really fragile, and it would be even more so in the restricted environment, as we have learned with the memcg OOM killer in the past. > And it makes a lot more sense to account them in excess of the limit > than pretend they don't exist.
We might not be able to completely > fullfill the containment part of the memory controller (although these > slab charges will still create significant pressure before that), but > at least we don't fail the accounting part on top of it. Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any load could simply runaway via kernel allocations. What is even worse we might even not trigger memcg OOM killer before we hit the global OOM. So the whole containment goes straight to hell. I can see four options here: 1) enable kmem by default with the current semantic which we know can BUG_ON (at least btrfs is known to hit this) or lead to other issues. 2) enable kmem by default and change the semantic for cgroup2 to allow runaway charges above the hard limit which would defeat the whole purpose of the containment for cgroup2. This can be a temporary workaround until we can afford kmem failures. This has a big risk that we will end up with this permanently because there is a strong pressure that GFP_KERNEL allocations should never fail. Yet this is the most common type of request. Or do we change the consistency with the global case at some point? 3) keep only some (safe) cache types enabled by default with the current failing semantic and require an explicit enabling for the complete kmem accounting. [di]cache code paths should be quite robust to handle allocation failures. 4) disable kmem by default and change the config default later to signal the accounting is safe as far as we are aware and let people enable the functionality on those basis. We would keep the current failing semantic. To me 4) sounds like the safest option because it still keeps the functionality available to those who can benefit from it in v1 already while we are not exposing a potentially buggy behavior to the majority (many of them even unintentionally). Moreover we still allow to change the default later on an explicit basis. 
3) sounds like the second best option but I am not really sure whether we can do that very easily without bringing up a lot of unmaintainable mess. 2) sounds like the third best approach but I am afraid it would render the basic use cases unusable for a very long time and kill any interest in cgroup2 for even longer (cargo cults are really hard to get rid of). 1) sounds like a land mine approach which would force many/most users to simply keep using the boot option and force us to re-evaluate the default hard way. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Miller Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 05 Nov 2015 11:16:09 -0500 (EST) Message-ID: <20151105.111609.1695015438589063316.davem@davemloft.net> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20151105144002.GB15111@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: Text/Plain; charset="us-ascii" To: mhocko@kernel.org Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org From: Michal Hocko Date: Thu, 5 Nov 2015 15:40:02 +0100 > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: > [...] >> Because it goes without saying that once the cgroupv2 interface is >> released, and people use it in production, there is no way we can then >> *add* dentry cache, inode cache, and others to memory.current. That >> would be an unacceptable change in interface behavior. > > They would still have to _enable_ the config option _explicitly_. 
make > oldconfig wouldn't change it silently for them. I do not think > it is an unacceptable change of behavior if the config is changed > explicitly. Every user is going to get this config option when they update their distribution kernel or whatever. Then they will all wonder why their networking performance went down. This is why I do not want the networking accounting bits on by default even if the Kconfig option is enabled. They must be off by default and guarded by a static branch so the cost is exactly zero. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 5 Nov 2015 17:28:03 +0100 Message-ID: <20151105162803.GD15111@dhcp22.suse.cz> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151105.111609.1695015438589063316.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: David Miller Cc: hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu 05-11-15 11:16:09, David S. Miller wrote: > From: Michal Hocko > Date: Thu, 5 Nov 2015 15:40:02 +0100 > > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: > > [...]
> >> Because it goes without saying that once the cgroupv2 interface is > >> released, and people use it in production, there is no way we can then > >> *add* dentry cache, inode cache, and others to memory.current. That > >> would be an unacceptable change in interface behavior. > > > > They would still have to _enable_ the config option _explicitly_. make > > oldconfig wouldn't change it silently for them. I do not think > > it is an unacceptable change of behavior if the config is changed > > explicitly. > > Every user is going to get this config option when they update their > distibution kernel or whatever. > > Then they will all wonder why their networking performance went down. > > This is why I do not want the networking accounting bits on by default > even if the kconfig option is enabled. They must be off by default > and guarded by a static branch so the cost is exactly zero. Yes, that part is clear and Johannes made it clear that the kmem tcp part is disabled by default. Or are you considering also all the slab usage by the networking code as well? 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Miller Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 05 Nov 2015 11:30:12 -0500 (EST) Message-ID: <20151105.113012.433525933573324396.davem@davemloft.net> References: <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> <20151105162803.GD15111@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20151105162803.GD15111@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: Text/Plain; charset="us-ascii" To: mhocko@kernel.org Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org From: Michal Hocko Date: Thu, 5 Nov 2015 17:28:03 +0100 > Yes, that part is clear and Johannes made it clear that the kmem tcp > part is disabled by default. Or are you considering also all the slab > usage by the networking code as well? I'm still thinking about the implications of that aspect, and will comment when I have something coherent to say about it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 5 Nov 2015 15:55:22 -0500 Message-ID: <20151105205522.GA1067@cmpxchg.org> References: <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151105144002.GB15111@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: > [...] > > Because it goes without saying that once the cgroupv2 interface is > > released, and people use it in production, there is no way we can then > > *add* dentry cache, inode cache, and others to memory.current. That > > would be an unacceptable change in interface behavior. > > They would still have to _enable_ the config option _explicitly_. make > oldconfig wouldn't change it silently for them. I do not think > it is an unacceptable change of behavior if the config is changed > explicitly. Yeah, as Dave said these will all get turned on anyway, so there is no point in fragmenting the Kconfig space in the first place. 
> > On the other > > hand, people will be prepared for hiccups in the early stages of > > cgroupv2 release, and we're providing cgroup.memory=noslab to let them > > workaround severe problems in production until we fix it without > > forcing them to fully revert to cgroupv1. > > This would be true if they moved on to the new cgroup API intentionally. > The reality is more complicated though. AFAIK sysmted is waiting for > cgroup2 already and privileged services enable all available resource > controllers by default as I've learned just recently. Have you filed a report with them? I don't think they should turn them on unless users explicitly configure resource control for the unit. But what I said still holds: critical production machines don't just get rolling updates and "accidentally" switch to all this new code. And those that do take the plunge have the cmdline options. > > And it makes a lot more sense to account them in excess of the limit > > than pretend they don't exist. We might not be able to completely > > fullfill the containment part of the memory controller (although these > > slab charges will still create significant pressure before that), but > > at least we don't fail the accounting part on top of it. > > Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any > load could simply runaway via kernel allocations. What is even worse we > might even not trigger memcg OOM killer before we hit the global OOM. So > the whole containment goes straight to hell. > > I can see four options here: > 1) enable kmem by default with the current semantic which we know can > BUG_ON (at least btrfs is known to hit this) or lead to other issues. Can you point me to that report? That's not "semantics", that's a bug! Whether or not a feature is enabled by default, it cannot be allowed to crash the kernel. Presenting this as a choice is a bit of a strawman argument.
> 2) enable kmem by default and change the semantic for cgroup2 to allow > runaway charges above the hard limit which would defeat the whole > purpose of the containment for cgroup2. This can be a temporary > workaround until we can afford kmem failures. This has a big risk > that we will end up with this permanently because there is a strong > pressure that GFP_KERNEL allocations should never fail. Yet this is > the most common type of request. Or do we change the consistency with > the global case at some point? As per 1) we *have* to fail containment eventually if not doing so means crashes and lockups. That's not a choice of semantics. But that doesn't mean we have to give up *immediately* and allow unrestrained "runaway charges"--again, more of a strawman than a choice. We can still throttle the allocator and apply significant pressure on the memory pool, culminating in OOM kills eventually. Once we run out of available containment tools, however, we *have* to follow the semantics of the page and slab allocator and succeed the request. We can not just return -ENOMEM if that causes kernel bugs. That's the only thing we can do right now. In fact, it's likely going to be the best we will ever be able to do when it comes to kernel memory accounting. Linus made it clear where he stands on failing kernel allocations, so all we can do is continue to improve our containment tools and then give up on containment when they're exhausted and force the charge past the limit. > 3) keep only some (safe) cache types enabled by default with the current > failing semantic and require an explicit enabling for the complete > kmem accounting. [di]cache code paths should be quite robust to > handle allocation failures. Vladimir, what would be your opinion on this? > 4) disable kmem by default and change the config default later to signal > the accounting is safe as far as we are aware and let people enable > the functionality on those basis. 
We would keep the current failing > semantic. > > To me 4) sounds like the safest option because it still keeps the > functionality available to those who can benefit from it in v1 already > while we are not exposing a potentially buggy behavior to the majority > (many of them even unintentionally). Moreover we still allow to change > the default later on an explicit basis. I'm not interested in fragmenting the interface forever out of caution because there might be a bug in the implementation right now. As I said we have to fix any instability in the features we provide whether they are turned on by default or not. I don't see how this is relevant to the interface discussion. Also, there is no way we can later fundamentally change the semantics of memory.current, so it would have to remain configurable forever, forcing people forever to select multiple options in order to piece together a single logical kernel feature. This is really not an option, either. If there are show-stopping bugs in the implementation, I'd rather hold off the release of the unified hierarchy than commit to a half-assed interface right out of the gate. The point of v2 is sane interfaces. So let's please focus on fixing any problems that slab accounting may have, rather than designing complex config options and transition procedures whose sole purpose is to defer dealing with our issues. Please? 
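As an illustration of the charge semantics argued for in the reply above (apply pressure first, and only force the charge past the limit once the containment tools are exhausted), here is a minimal userspace sketch in C. It is not the kernel's charge path: struct counter_model and the model_*() functions are hypothetical names made up for this example, reclaim is reduced to a simple counter, and throttling and OOM killing are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a cgroup page counter; not kernel code. */
struct counter_model {
	size_t usage;       /* pages currently charged */
	size_t limit;       /* hard limit, in pages */
	size_t reclaimable; /* charged pages that reclaim could still free */
};

/* Charge nr_pages only if usage stays within the limit. */
bool model_try_charge(struct counter_model *c, size_t nr_pages)
{
	if (c->usage + nr_pages > c->limit)
		return false;
	c->usage += nr_pages;
	return true;
}

/* Stand-in for reclaim: free up to nr_pages and return how many were freed. */
size_t model_reclaim(struct counter_model *c, size_t nr_pages)
{
	size_t freed = nr_pages < c->reclaimable ? nr_pages : c->reclaimable;

	assert(c->reclaimable <= c->usage);
	c->reclaimable -= freed;
	c->usage -= freed;
	return freed;
}

/*
 * Charge nr_pages for an allocation that must not fail.
 * Returns true if the charge fit under the limit (possibly after reclaim),
 * false if it had to be forced past the limit.
 */
bool model_charge_nofail(struct counter_model *c, size_t nr_pages)
{
	assert(nr_pages > 0);

	if (model_try_charge(c, nr_pages))
		return true;

	/* Over the limit: apply pressure for the excess, then retry once. */
	model_reclaim(c, c->usage + nr_pages - c->limit);
	if (model_try_charge(c, nr_pages))
		return true;

	/* Containment tools exhausted: account the pages anyway. */
	c->usage += nr_pages;
	return false;
}
```

In this model the accounting never loses track of a page, even when containment fails: a forced charge leaves usage above limit, which is the state that would keep generating pressure on later allocations.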
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 5 Nov 2015 17:32:51 -0500 Message-ID: <20151105223251.GA4427@cmpxchg.org> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> <20151105162803.GD15111@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151105162803.GD15111-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: David Miller , akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote: > On Thu 05-11-15 11:16:09, David S. Miller wrote: > > From: Michal Hocko > > Date: Thu, 5 Nov 2015 15:40:02 +0100 > > > > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: > > > [...] > > >> Because it goes without saying that once the cgroupv2 interface is > > >> released, and people use it in production, there is no way we can then > > >> *add* dentry cache, inode cache, and others to memory.current. That > > >> would be an unacceptable change in interface behavior. > > > > > > They would still have to _enable_ the config option _explicitly_. make > > > oldconfig wouldn't change it silently for them. I do not think > > > it is an unacceptable change of behavior if the config is changed > > > explicitly. 
> > > > Every user is going to get this config option when they update their > > distibution kernel or whatever. > > > > Then they will all wonder why their networking performance went down. > > > > This is why I do not want the networking accounting bits on by default > > even if the kconfig option is enabled. They must be off by default > > and guarded by a static branch so the cost is exactly zero. > > Yes, that part is clear and Johannes made it clear that the kmem tcp > part is disabled by default. Or are you considering also all the slab > usage by the networking code as well? Michal, there shouldn't be any tracking or accounting going on per default when you boot into a fresh system. I removed all accounting and statistics on the system level in cgroupv2, so distribution kernels can compile-time enable a single, feature-complete CONFIG_MEMCG that provides a full memory controller while at the same time puts no overhead on users that don't benefit from mem control at all and just want to use the machine bare-metal. This is completely doable. My new series does it for skmem, but I also want to retrofit the code to eliminate that current overhead for page cache, anonymous memory, slab memory and so forth. This is the only sane way to make the memory controller powerful and generally useful without having to make unreasonable compromises with memory consumers. We shouldn't even be *having* the discussion about whether we should sacrifice the quality of our interface in order to compromise with a class of users that doesn't care about any of this in the first place. So let's eliminate the cost for non-users, but make the memory controller feature-complete and useful--with reasonable cost, implementation, and interface--for our actual userbase. Paying the necessary cost for a functionality you actually want is not the problem. Paying for something that doesn't benefit you is. 
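The "no cost for non-users" idea discussed above can be sketched in plain C as follows. This is only a userspace stand-in under stated assumptions: all sockmem_*() names and sock_alloc_pages() are hypothetical, and the enable check here is an ordinary counter test. In the kernel the same pattern would presumably use a static key from include/linux/jump_label.h (DEFINE_STATIC_KEY_FALSE, static_branch_unlikely, static_branch_inc/dec), so that the disabled case is a patched-out branch rather than a load and compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a static key: counts how many cgroups have
 * turned socket memory accounting on. Zero means nobody pays for it.
 */
static int sockmem_users;

/* Pages charged so far while accounting was enabled. */
static size_t sockmem_charged;

void sockmem_enable(void)
{
	sockmem_users++;
}

void sockmem_disable(void)
{
	assert(sockmem_users > 0);
	sockmem_users--;
}

bool sockmem_enabled(void)
{
	return sockmem_users > 0;
}

size_t sockmem_charged_pages(void)
{
	return sockmem_charged;
}

/* Hot path: called for every socket buffer allocation in this model. */
void sock_alloc_pages(size_t nr_pages)
{
	/* Disabled by default: bail out before touching any accounting state. */
	if (!sockmem_enabled())
		return;

	sockmem_charged += nr_pages;
}
```

The design point is that the check sits at the very top of the hot path and the default state is "off", so a machine that never configures memory control only ever executes the early return.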
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 5 Nov 2015 17:52:00 -0500 Message-ID: <20151105225200.GA5432@cmpxchg.org> References: <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151105205522.GA1067-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: David Miller , akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > This would be true if they moved on to the new cgroup API intentionally. > > The reality is more complicated though. AFAIK sysmted is waiting for > > cgroup2 already and privileged services enable all available resource > > controllers by default as I've learned just recently. > > Have you filed a report with them? I don't think they should turn them > on unless users explicitely configure resource control for the unit. Okay, verified with systemd people that they're not planning on enabling resource control per default. Inflammatory half-truths, man. 
This is not constructive. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 6 Nov 2015 12:05:55 +0300 Message-ID: <20151106090555.GK29259@esperanza> References: <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Return-path: Content-Disposition: inline In-Reply-To: <20151105205522.GA1067-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: Michal Hocko , David Miller , akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: ... > > 3) keep only some (safe) cache types enabled by default with the current > > failing semantic and require an explicit enabling for the complete > > kmem accounting. [di]cache code paths should be quite robust to > > handle allocation failures. > > Vladimir, what would be your opinion on this? I'm all for this option. Actually, I've been thinking about this since I introduced the __GFP_NOACCOUNT flag. Not because of the failing semantics, since we can always let kmem allocations breach the limit. 
This shouldn't be critical, because I don't think it's possible to issue a series of kmem allocations w/o a single user page allocation, which would reclaim/kill the excess. The point is there are allocations that are shared system-wide and therefore shouldn't go to any memcg. The most obvious examples are mempool users and radix_tree/idr preloads. Accounting them to memcg is likely to result in noticeable memory overhead as memory cgroups are created/destroyed, because they pin dead memory cgroups with all their kmem caches, which aren't tiny. Another funny example is objects destroyed lazily for performance reasons, e.g. vmap_area. Such objects are usually very small, so delaying destruction of a bunch of them will normally go unnoticed. However, if kmemcg is used, the effective memory consumption caused by such objects can be multiplied many times over due to dangling kmem caches. We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the problem is they are tricky to identify, because they are scattered all over the kernel source tree. E.g. Dave Chinner mentioned that XFS internals do a lot of allocations that are shared among all XFS filesystems and therefore should not be accounted (BTW that's why list_lru's used by XFS are not marked as memcg-aware). There must be more out there. Besides, kernel developers don't usually even know about kmemcg (they just write the code for their subsys, so why should they?) so they won't think about using __GFP_NOACCOUNT, and hence new falsely-accounted allocations are likely to appear. That said, by switching from a black-list (__GFP_NOACCOUNT) to a white-list (__GFP_ACCOUNT) kmem accounting policy we would make the system more predictable and robust IMO. OTOH what would we lose? Security? Well, containers aren't secure IMHO. In fact, I doubt they will ever be (as secure as VMs). Anyway, if a runaway allocation is reported, it should be trivial to fix by adding __GFP_ACCOUNT where appropriate.
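The difference between the two policies can be shown with a small userspace sketch in C. It only models the decision and is not kernel code: enum account_policy, the MODEL_GFP_* bits and model_should_account() are hypothetical names for this example, and the bit values are arbitrary rather than the real gfp flag values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy selector for this model. */
enum account_policy {
	POLICY_BLACKLIST, /* account everything unless the caller opts out */
	POLICY_WHITELIST  /* account nothing unless the caller opts in */
};

/* Arbitrary bit values standing in for the opt-out and opt-in flags. */
#define MODEL_GFP_NOACCOUNT 0x1u
#define MODEL_GFP_ACCOUNT   0x2u

/* Decide whether an allocation with these flags is charged to a memcg. */
bool model_should_account(enum account_policy policy, unsigned int flags)
{
	unsigned int both = MODEL_GFP_NOACCOUNT | MODEL_GFP_ACCOUNT;

	/* Asking for both at once is treated as a caller bug in this model. */
	assert((flags & both) != both);

	if (policy == POLICY_BLACKLIST)
		return !(flags & MODEL_GFP_NOACCOUNT);

	return (flags & MODEL_GFP_ACCOUNT) != 0;
}
```

The case that matters for the argument above is the caller that passes no flag at all, i.e. code written without any knowledge of kmemcg: under the black-list it is accounted by accident, under the white-list it is left alone.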
If there are no objections, I'll prepare a patch switching to the white-list approach. Let's start with obvious things like fs_struct, mm_struct, task_struct, signal_struct, dentry, inode, which can be easily allocated from user space. This should cover 90% of all allocations that should be accounted AFAICS. The rest will be added later if necessary. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 6 Nov 2015 11:57:24 +0100 Message-ID: <20151106105724.GG4390@dhcp22.suse.cz> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151105225200.GA5432@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 05-11-15 17:52:00, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > > This would be true if they moved on to the new cgroup API intentionally. > > > The reality is more complicated though. AFAIK sysmted is waiting for > > > cgroup2 already and privileged services enable all available resource > > > controllers by default as I've learned just recently. > > > > Have you filed a report with them?
I don't think they should turn them > > on unless users explicitely configure resource control for the unit. > > Okay, verified with systemd people that they're not planning on > enabling resource control per default. > > Inflammatory half-truths, man. This is not constructive. What about Delegate=yes feature then? We have just been burnt by this quite heavily. AFAIU nspawn@.service and nspawn@.service have this enabled by default http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 6 Nov 2015 13:51:40 +0100 Message-ID: <20151106125140.GI4390@dhcp22.suse.cz> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> <20151105162803.GD15111@dhcp22.suse.cz> <20151105223251.GA4427@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151105223251.GA4427@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 05-11-15 17:32:51, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote: [...] > > Yes, that part is clear and Johannes made it clear that the kmem tcp > > part is disabled by default. Or are you considering also all the slab > > usage by the networking code as well? 
>
> Michal, there shouldn't be any tracking or accounting going on per
> default when you boot into a fresh system.
>
> I removed all accounting and statistics on the system level in
> cgroupv2, so distribution kernels can compile-time enable a single,
> feature-complete CONFIG_MEMCG that provides a full memory controller
> while at the same time puts no overhead on users that don't benefit
> from mem control at all and just want to use the machine bare-metal.

Yes, that part is clear and I am not disputing it _at all_. It is just
that chances are high that the memory controller _will_ be enabled on
typical distribution systems. E.g. systemd _is_ enabling all resource
controllers by default for some services with the Delegate=yes option.

> This is completely doable. My new series does it for skmem, but I also
> want to retrofit the code to eliminate that current overhead for page
> cache, anonymous memory, slab memory and so forth.
>
> This is the only sane way to make the memory controller powerful and
> generally useful without having to make unreasonable compromises with
> memory consumers. We shouldn't even be *having* the discussion about
> whether we should sacrifice the quality of our interface in order to
> compromise with a class of users that doesn't care about any of this
> in the first place.
>
> So let's eliminate the cost for non-users, but make the memory
> controller feature-complete and useful--with reasonable cost,
> implementation, and interface--for our actual userbase.
>
> Paying the necessary cost for a functionality you actually want is not
> the problem. Paying for something that doesn't benefit you is.

I completely agree that a reasonable cost is fine for those who _want_
the functionality. It hasn't been shown in the past that people in the
wild actually lack kmem accounting in general. E.g. the kmem controller
is not even enabled in openSUSE or SLES kernels, and I do not remember
any big push to enable it.
I do understand that you want to have an out-of-the-box isolation
behavior, which I agree is a nice-to-have feature, especially with a
larger penetration of containerized workloads. But my point still holds.
This is not something everybody wants to have. So having a configuration
option and a boot time option to override it is the most reasonable way
to go. You can clearly see that there is already demand for this from
the tcp kmem extension, because they really _care_ about every single
cpu cycle even though some part of the userspace happens to have memcg
enabled.

The question about the configuration default is a different question and
we can discuss that, because this is not an easy one to decide right now
IMHO.
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
Date: Fri, 6 Nov 2015 14:21:02 +0100
Message-ID: <20151106132102.GJ4390@dhcp22.suse.cz>
References: <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <20151105205522.GA1067@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Johannes Weiner
Cc: David Miller, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu 05-11-15 15:55:22, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
[...]
> > This would be true if they moved on to the new cgroup API intentionally. > > The reality is more complicated though. AFAIK sysmted is waiting for > > cgroup2 already and privileged services enable all available resource > > controllers by default as I've learned just recently. > > Have you filed a report with them? I don't think they should turn them > on unless users explicitely configure resource control for the unit. We have just been bitten by this (aka Delegate=yes for some basic services) and our systemd people are supposed to bring this up upstream. I've mentioned that in other email where you accuse me from spreading a FUD. > But what I said still holds: critical production machines don't just > get rolling updates and "accidentally" switch to all this new > code. And those that do take the plunge have the cmdline options. That is exactly my point why I do not think re-evaluating the default config option is a problem at all. The default wouldn't matter for existing users. Those who care can have all the functionality they need right away - be it kmem enabled or disabled. > > > And it makes a lot more sense to account them in excess of the limit > > > than pretend they don't exist. We might not be able to completely > > > fullfill the containment part of the memory controller (although these > > > slab charges will still create significant pressure before that), but > > > at least we don't fail the accounting part on top of it. > > > > Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any > > load could simply runaway via kernel allocations. What is even worse we > > might even not trigger memcg OOM killer before we hit the global OOM. So > > the whole containment goes straight to hell. > > > > I can see four options here: > > 1) enable kmem by default with the current semantic which we know can > > BUG_ON (at least btrfs is known to hit this) or lead to other issues. > > Can you point me to that report? 
git grep "BUG_ON.*ENOMEM" -- fs/btrfs just to give you a picture. Not all of them are kmalloc and others are not annotated by ENOMEM comment. This came out as a result of my last attempt to allow GFP_NOFS fail (http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko%40kernel.org) > That's not "semantics", that's a bug! Whether or not a feature is > enabled by default, it can not be allowed to crash the kernel. Yes those are bugs and have to be fixed. Not an easy task but nothing which couldn't be solved. It just takes some time. They are not very likely right now because they are reduced to corner cases right now. But they are more visible with the current kmem accounting semantic. So either we change the semantic or wait until this gets fixed if the accoutning should be on by default. > Presenting this as a choice is a bit of a strawman argument. > > > 2) enable kmem by default and change the semantic for cgroup2 to allow > > runaway charges above the hard limit which would defeat the whole > > purpose of the containment for cgroup2. This can be a temporary > > workaround until we can afford kmem failures. This has a big risk > > that we will end up with this permanently because there is a strong > > pressure that GFP_KERNEL allocations should never fail. Yet this is > > the most common type of request. Or do we change the consistency with > > the global case at some point? > > As per 1) we *have* to fail containment eventually if not doing so > means crashes and lockups. That's not a choice of semantics. > > But that doesn't mean we have to give up *immediately* and allow > unrestrained "runaway charges"--again, more of a strawman than a > choice. We can still throttle the allocator and apply significant > pressure on the memory pool, culminating in OOM kills eventually. > > Once we run out of available containment tools, however, we *have* to > follow the semantics of the page and slab allocator and succeed the > request. 
We can not just return -ENOMEM if that causes kernel bugs. > > That's the only thing we can do right now. > > In fact, it's likely going to be the best we will ever be able to do > when it comes to kernel memory accounting. Linus made it clear where > he stands on failing kernel allocations, so all we can do is continue > to improve our containment tools and then give up on containment when > they're exhausted and force the charge past the limit. OK, then we need all the additional measures to keep the hard limit excess bound. > > 3) keep only some (safe) cache types enabled by default with the current > > failing semantic and require an explicit enabling for the complete > > kmem accounting. [di]cache code paths should be quite robust to > > handle allocation failures. > > Vladimir, what would be your opinion on this? > > > 4) disable kmem by default and change the config default later to signal > > the accounting is safe as far as we are aware and let people enable > > the functionality on those basis. We would keep the current failing > > semantic. > > > > To me 4) sounds like the safest option because it still keeps the > > functionality available to those who can benefit from it in v1 already > > while we are not exposing a potentially buggy behavior to the majority > > (many of them even unintentionally). Moreover we still allow to change > > the default later on an explicit basis. > > I'm not interested in fragmenting the interface forever out of caution > because there might be a bug in the implementation right now. As I > said we have to fix any instability in the features we provide whether > they are turned on by default or not. I don't see how this is relevant > to the interface discussion. > > Also, there is no way we can later fundamentally change the semantics > of memory.current, so it would have to remain configurable forever, > forcing people forever to select multiple options in order to piece > together a single logical kernel feature. 
>
> This is really not an option, either.

Why not? I can clearly see people who would really want to have this
disabled, and doing that by config is much easier than providing a
command line parameter. A config option doesn't add any more maintenance
burden than the boot time parameter.

> If there are show-stopping bugs in the implementation, I'd rather hold
> off the release of the unified hierarchy than commit to a half-assed
> interface right out of the gate.

If you are willing to postpone releasing cgroup2 until this gets
resolved - one way or another - then I have no objections. My impression
was that Tejun wanted to release it sooner rather than later. As this
very discussion shows, we are not even sure what the kmem failure
behavior should be.

> The point of v2 is sane interfaces.

And the sane interface to me is to use a single set of knobs regardless
of memory type. We are currently only discussing what should be
accounted by default. My understanding of what David said is that tcp
kmem should be enabled only when explicitly opted in. Did I get this
wrong?

> So let's please focus on fixing any problems that slab accounting may
> have, rather than designing complex config options and transition
> procedures whose sole purpose is to defer dealing with our issues.
>
> Please?

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 6 Nov 2015 14:29:40 +0100 Message-ID: <20151106132940.GK4390@dhcp22.suse.cz> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151106090555.GK29259@esperanza> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151106090555.GK29259@esperanza> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Vladimir Davydov Cc: Johannes Weiner , David Miller , akpm@linux-foundation.org, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Fri 06-11-15 12:05:55, Vladimir Davydov wrote: [...] > If there are no objections, I'll prepare a patch switching to the > white-list approach. Let's start from obvious things like fs_struct, > mm_struct, task_struct, signal_struct, dentry, inode, which can be > easily allocated from user space. pipe buffers, kernel stacks and who knows what more. > This should cover 90% of all > allocations that should be accounted AFAICS. The rest will be added > later if necessarily. The more I think about that the more I am convinced that is the only sane way forward. The only concerns I would have is how do we deal with the old interface in cgroup1? We do not want to break existing deployments which might depend on the current behavior. I doubt they are but... -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 6 Nov 2015 11:19:53 -0500 Message-ID: <20151106161953.GA7813@cmpxchg.org> References: <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151106105724.GG4390@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote: > On Thu 05-11-15 17:52:00, Johannes Weiner wrote: > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > > > This would be true if they moved on to the new cgroup API intentionally. > > > > The reality is more complicated though. AFAIK sysmted is waiting for > > > > cgroup2 already and privileged services enable all available resource > > > > controllers by default as I've learned just recently. > > > > > > Have you filed a report with them? I don't think they should turn them > > > on unless users explicitely configure resource control for the unit. > > > > Okay, verified with systemd people that they're not planning on > > enabling resource control per default. > > > > Inflammatory half-truths, man. 
This is not constructive. > > What about Delegate=yes feature then? We have just been burnt by this > quite heavily. AFAIU nspawn@.service and nspawn@.service have this > enabled by default > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html That's when you launch a *container* and want it to be able to use nested resource control. We're talking about actual container users here. It's not turning on resource control for all "privileged services", which is what we were worried about here. Can you at least admit that when you yourself link to the refuting evidence? And if you've been "burnt quite heavily" by this, where is your bug report to stop other users from getting "burnt quite heavily" as well? All I read here is vague inflammatory language to spread FUD. You might think sending these emails is helpful, but it really isn't. Not only is it not contributing code, insights, or solutions, you're now actively sabotaging someone else's effort to build something. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 6 Nov 2015 11:35:55 -0500 Message-ID: <20151106163555.GB7813@cmpxchg.org> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151106090555.GK29259@esperanza> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151106090555.GK29259@esperanza> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Vladimir Davydov Cc: Michal Hocko , David Miller , akpm@linux-foundation.org, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Fri, Nov 06, 2015 at 12:05:55PM +0300, Vladimir Davydov wrote: > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > ... > > > 3) keep only some (safe) cache types enabled by default with the current > > > failing semantic and require an explicit enabling for the complete > > > kmem accounting. [di]cache code paths should be quite robust to > > > handle allocation failures. > > > > Vladimir, what would be your opinion on this? > > I'm all for this option. Actually, I've been thinking about this since I > introduced the __GFP_NOACCOUNT flag. Not because of the failing > semantics, since we can always let kmem allocations breach the limit. > This shouldn't be critical, because I don't think it's possible to issue > a series of kmem allocations w/o a single user page allocation, which > would reclaim/kill the excess. 
> > The point is there are allocations that are shared system-wide and > therefore shouldn't go to any memcg. Most obvious examples are: mempool > users and radix_tree/idr preloads. Accounting them to memcg is likely to > result in noticeable memory overhead as memory cgroups are > created/destroyed, because they pin dead memory cgroups with all their > kmem caches, which aren't tiny. > > Another funny example is objects destroyed lazily for performance > reasons, e.g. vmap_area. Such objects are usually very small, so > delaying destruction of a bunch of them will normally go unnoticed. > However, if kmemcg is used the effective memory consumption caused by > such objects can be multiplied by many times due to dangling kmem > caches. > > We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the > problem is they are tricky to identify, because they are scattered all > over the kernel source tree. E.g. Dave Chinner mentioned that XFS > internals do a lot of allocations that are shared among all XFS > filesystems and therefore should not be accounted (BTW that's why > list_lru's used by XFS are not marked as memcg-aware). There must be > more out there. Besides, kernel developers don't usually even know about > kmemcg (they just write the code for their subsys, so why should they?) > so they won't care thinking about using __GFP_NOACCOUNT, and hence new > falsely-accounted allocations are likely to appear. > > That said, by switching from black-list (__GFP_NOACCOUNT) to white-list > (__GFP_ACCOUNT) kmem accounting policy we would make the system more > predictable and robust IMO. OTOH what would we lose? Security? Well, > containers aren't secure IMHO. In fact, I doubt they will ever be (as > secure as VMs). Anyway, if a runaway allocation is reported, it should > be trivial to fix by adding __GFP_ACCOUNT where appropriate. I wholeheartedly agree with all of this. 
> If there are no objections, I'll prepare a patch switching to the > white-list approach. Let's start from obvious things like fs_struct, > mm_struct, task_struct, signal_struct, dentry, inode, which can be > easily allocated from user space. This should cover 90% of all > allocations that should be accounted AFAICS. The rest will be added > later if necessarily. Awesome, I'm looking forward to that patch! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 6 Nov 2015 17:46:57 +0100 Message-ID: <20151106164657.GL4390@dhcp22.suse.cz> References: <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151106161953.GA7813-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: David Miller , akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri 06-11-15 11:19:53, Johannes Weiner wrote: > On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote: 
> > On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > > cgroup2 already and privileged services enable all available resource
> > > > > controllers by default as I've learned just recently.
> > > >
> > > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> > >
> > > Okay, verified with systemd people that they're not planning on
> > > enabling resource control per default.
> > >
> > > Inflammatory half-truths, man. This is not constructive.
> >
> > What about Delegate=yes feature then? We have just been burnt by this
> > quite heavily. AFAIU nspawn@.service and nspawn@.service have this
> > enabled by default
> > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
>
> That's when you launch a *container* and want it to be able to use
> nested resource control.

Oops, copy&paste error here. The second one was user@.service. So it is
not only about containers AFAIU but all user defined sessions.

> We're talking about actual container users here. It's not turning on
> resource control for all "privileged services", which is what we were
> worried about here. Can you at least admit that when you yourself link
> to the refuting evidence?

My bad, that was a misunderstanding of the changelog.

> And if you've been "burnt quite heavily" by this, where is your bug
> report to stop other users from getting "burnt quite heavily" as well?

The bug report is still internal because it is tracking an unreleased
product. We have ended up reverting the Delegate feature. Our systemd
developers are supposed to bring this up with the upstream.
The basic problem was that the Delegate feature has been backported to our systemd package without further consideration and that has invalidated a lot of performance testing because some resource controllers have measurable effects on those benchmarks. > All I read here is vague inflammatory language to spread FUD. I was merely pointing out that memory controller might be enabled without _user_ actually even noticing because the controller wasn't enabled explicitly. I haven't blamed anybody for that. > You might think sending these emails is helpful, but it really > isn't. Not only is it not contributing code, insights, or solutions, > you're now actively sabotaging someone else's effort to build something. Come on! Are you even serious? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Fri, 6 Nov 2015 12:45:17 -0500 Message-ID: <20151106174517.GA9315@cmpxchg.org> References: <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> <20151106164657.GL4390@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151106164657.GL4390-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: David Miller , akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, 
cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri, Nov 06, 2015 at 05:46:57PM +0100, Michal Hocko wrote: > The basic problem was that the Delegate feature has been backported to > our systemd package without further consideration and that has > invalidated a lot of performance testing because some resource > controllers have measurable effects on those benchmarks. You're talking about a userspace bug. No amount of fragmenting and layering and opt-in in the kernel's runtime configuration space is going to help you if you screw up and enable it all by accident. > > All I read here is vague inflammatory language to spread FUD. > > I was merely pointing out that memory controller might be enabled without > _user_ actually even noticing because the controller wasn't enabled > explicitly. I haven't blamed anybody for that. Why does that have anything to do with how we design our interface? We can't do more than present a sane interface in good faith and lobby userspace projects if we think they misuse it. 
From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Miller
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
Date: Fri, 06 Nov 2015 22:45:41 -0500 (EST)
Message-ID: <20151106.224541.1640743718816725953.davem@davemloft.net>
References: <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> <20151106164657.GL4390@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20151106164657.GL4390@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: Text/Plain; charset="us-ascii"
To: mhocko@kernel.org
Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

From: Michal Hocko
Date: Fri, 6 Nov 2015 17:46:57 +0100

> On Fri 06-11-15 11:19:53, Johannes Weiner wrote:
>> You might think sending these emails is helpful, but it really
>> isn't. Not only is it not contributing code, insights, or solutions,
>> you're now actively sabotaging someone else's effort to build something.
>
> Come on! Are you even serious?

He is, and I agree %100 with him FWIW.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 12 Nov 2015 18:36:20 +0000 Message-ID: <20151112183620.GC14880@techsingularity.net> References: <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151106161953.GA7813-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: Michal Hocko , David Miller , akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri, Nov 06, 2015 at 11:19:53AM -0500, Johannes Weiner wrote: > On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote: > > On Thu 05-11-15 17:52:00, Johannes Weiner wrote: > > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > > > > This would be true if they moved on to the new cgroup API intentionally. > > > > > The reality is more complicated though. AFAIK sysmted is waiting for > > > > > cgroup2 already and privileged services enable all available resource > > > > > controllers by default as I've learned just recently. 
> > > > > > > > Have you filed a report with them? I don't think they should turn them > > > > on unless users explicitly configure resource control for the unit. > > > > > > Okay, verified with systemd people that they're not planning on > > > enabling resource control per default. > > > > > > Inflammatory half-truths, man. This is not constructive. > > > > What about Delegate=yes feature then? We have just been burnt by this > > quite heavily. AFAIU nspawn@.service and nspawn@.service have this > > enabled by default > > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html > > That's when you launch a *container* and want it to be able to use > nested resource control. > > We're talking about actual container users here. It's not turning on > resource control for all "privileged services", which is what we were > worried about here. Can you at least admit that when you yourself link > to the refuting evidence? > > And if you've been "burnt quite heavily" by this, where is your bug > report to stop other users from getting "burnt quite heavily" as well? > I didn't read this thread in detail but the lack of public information on problems with cgroup controllers is partially my fault so I'd like to correct that. https://bugzilla.suse.com/show_bug.cgi?id=954765 This bug documents some of the impact that was incurred due to ssh sessions being resource controlled by default. It talks primarily about pipetest being impacted by cpu,cpuacct. It was also found in the recent past that dbench4 was impacted because the blkio controller was enabled. That bug is not public but basically dbench4 regressed 80% as the journal thread was in a different cgroup than dbench4. dbench4 would stall for 8ms in case more IO was issued before the journal thread could issue any IO. The opensuse bug 954765 is not affected by blkio because it's disabled by a distribution-specific patch. 
Mike Galbraith adds some additional information on why activating the cpu controller can have an impact on semantics even if the overhead was zero. It may be the case that it's an oversight by the systemd developers and the intent was only to affect containers. My experience was that everything was affected. It also may be the case that this is an opensuse-specific problem due to how the maintainers packaged systemd. I don't actually know and hopefully the bug will be able to determine if upstream is really affected or not. There is also a link to this bug on the upstream project so there is some chance they are aware: https://github.com/systemd/systemd/issues/1715 Bottom line, there is legitimate confusion over whether cgroup controllers are going to be enabled by default or not in the future. If they are enabled by default, there is a non-zero cost to that and a change in semantics that people may or may not be surprised by. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 12 Nov 2015 14:12:20 -0500 Message-ID: <20151112191220.GA25750@cmpxchg.org> References: <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> <20151112183620.GC14880@techsingularity.net> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20151112183620.GC14880@techsingularity.net> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Mel Gorman Cc: Michal Hocko , David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, 
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Nov 12, 2015 at 06:36:20PM +0000, Mel Gorman wrote: > Bottom line, there is legitimate confusion over whether cgroup controllers > are going to be enabled by default or not in the future. If they are > enabled by default, there is a non-zero cost to that and a change in > semantics that people may or may not be surprised by. Thanks for elaborating, Mel. My understanding is that this is a plain bug. I don't think anybody wants to put costs without benefits on their users. But I'll keep an eye on these reports, and I'll work with the systemd people should issues with the kernel interface materialize that would force them to enable resource control prematurely. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f175.google.com (mail-wi0-f175.google.com [209.85.212.175]) by kanga.kvack.org (Postfix) with ESMTP id A71CB82F65 for ; Thu, 22 Oct 2015 00:22:16 -0400 (EDT) Received: by wikq8 with SMTP id q8so12984550wik.1 for ; Wed, 21 Oct 2015 21:22:16 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id pu5si15988867wjc.50.2015.10.21.21.22.15 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 21 Oct 2015 21:22:15 -0700 (PDT) From: Johannes Weiner Subject: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Date: Thu, 22 Oct 2015 00:21:35 -0400 Message-Id: <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: "David S. Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org The vmpressure metric is based on reclaim efficiency, which in turn is an attribute of the LRU. 
However, vmpressure events are currently reported at the source of pressure rather than at the reclaim level. Switch the reporting to the reclaim level to allow finer-grained analysis of which memcg is having trouble reclaiming its pages. As far as memory.pressure_level interface semantics go, events are escalated up the hierarchy until a listener is found, so this won't affect existing users that listen at higher levels. This also prepares vmpressure for hooking it up to the networking stack's memory pressure code. Signed-off-by: Johannes Weiner --- mm/vmscan.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index ecc2125..50630c8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2404,6 +2404,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, memcg = mem_cgroup_iter(root, NULL, &reclaim); do { unsigned long lru_pages; + unsigned long reclaimed; unsigned long scanned; struct lruvec *lruvec; int swappiness; @@ -2416,6 +2417,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, lruvec = mem_cgroup_zone_lruvec(zone, memcg); swappiness = mem_cgroup_swappiness(memcg); + reclaimed = sc->nr_reclaimed; scanned = sc->nr_scanned; shrink_lruvec(lruvec, swappiness, sc, &lru_pages); @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, } } + vmpressure(sc->gfp_mask, memcg, + sc->nr_scanned - scanned, + sc->nr_reclaimed - reclaimed); + /* * Direct reclaim and kswapd have to scan all memory * cgroups to fulfill the overall scan target for the @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, - sc->nr_scanned - nr_scanned, - sc->nr_reclaimed - nr_reclaimed); - if (sc->nr_reclaimed - nr_reclaimed) reclaimable = true; -- 2.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to 
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f47.google.com (mail-lf0-f47.google.com [209.85.215.47]) by kanga.kvack.org (Postfix) with ESMTP id E488D6B0038 for ; Thu, 22 Oct 2015 14:48:12 -0400 (EDT) Received: by lfaz124 with SMTP id z124so58879707lfa.1 for ; Thu, 22 Oct 2015 11:48:12 -0700 (PDT) Received: from relay.parallels.com (relay.parallels.com. [195.214.232.42]) by mx.google.com with ESMTPS id xe5si10449102lbb.65.2015.10.22.11.48.11 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 22 Oct 2015 11:48:11 -0700 (PDT) Date: Thu, 22 Oct 2015 21:47:57 +0300 From: Vladimir Davydov Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151022184757.GO18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote: ... 
> @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk) > */ > bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > { > + unsigned int batch = max(CHARGE_BATCH, nr_pages); > struct page_counter *counter; > + bool force = false; > > - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { > + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + return true; > + page_counter_charge(&memcg->skmem, nr_pages); > + return false; > + } > + > + if (consume_stock(memcg, nr_pages)) > return true; > +retry: > + if (page_counter_try_charge(&memcg->memory, batch, &counter)) > + goto done; Currently, we use memcg->memory only for charging memory pages. Besides, every page charged to this counter (including kmem) has ->mem_cgroup field set appropriately. This looks consistent and nice. As an extra benefit, we can track all pages charged to a memory cgroup via /proc/kpagecgroup. Now, you charge "window size" to it, which AFAIU isn't necessarily equal to the amount of memory actually consumed by the cgroup for socket buffers. I think this looks ugly and inconsistent with the existing behavior. I agree that we need to charge socket buffers to ->memory, but IMO we should do that per each skb page, using memcg_kmem_charge_kmem somewhere in alloc_skb_with_frags invoking the reclaimer just as we do for kmalloc, while tcp window size control should stay aside. Thanks, Vladimir 
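[Annotation, not part of the thread] The quoted hunk charges in batches and serves later requests from a cached "stock". The following is a minimal user-space sketch of that control flow only. Everything in it is an assumption for illustration: `struct group`, the single (not per-cpu) `stock` field, the flat counter, the value 32 for `CHARGE_BATCH`, and the fallback and forced-charge steps after the `retry:` label, which the quoted hunk does not show.

```c
#include <assert.h>
#include <stdbool.h>

#define CHARGE_BATCH 32u /* assumed batch size for this sketch */

/* Hypothetical flat counter; the kernel's page_counter is hierarchical. */
struct counter {
	unsigned long usage;
	unsigned long limit;
};

/* Hypothetical stand-in for a memcg with one cached stock of pages. */
struct group {
	struct counter memory;
	unsigned int stock; /* pages already charged but not yet handed out */
};

static bool counter_try_charge(struct counter *c, unsigned long n)
{
	if (c->usage + n > c->limit)
		return false;
	c->usage += n;
	return true;
}

/* Serve the request from previously charged pages, if enough are cached. */
static bool consume_stock(struct group *g, unsigned int n)
{
	if (g->stock < n)
		return false;
	g->stock -= n;
	return true;
}

/*
 * Returns true if the charge fit within the limit, false if it was forced.
 * The steps after the first try_charge are assumed, not taken from the hunk.
 */
static bool charge_skmem(struct group *g, unsigned int nr_pages)
{
	unsigned int batch = nr_pages > CHARGE_BATCH ? nr_pages : CHARGE_BATCH;

	assert(nr_pages > 0);

	if (consume_stock(g, nr_pages))
		return true;

	if (counter_try_charge(&g->memory, batch)) {
		g->stock += batch - nr_pages; /* keep the surplus for later */
		return true;
	}

	/* Batch did not fit: retry with the exact amount. */
	if (batch > nr_pages && counter_try_charge(&g->memory, nr_pages))
		return true;

	g->memory.usage += nr_pages; /* forced charge, may exceed the limit */
	return false;
}
```

In this model a one-page request against an empty stock raises the counter by a full batch, so the counter can run ahead of the pages actually in use.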
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f47.google.com (mail-lf0-f47.google.com [209.85.215.47]) by kanga.kvack.org (Postfix) with ESMTP id 589066B0038 for ; Thu, 22 Oct 2015 14:49:08 -0400 (EDT) Received: by lffv3 with SMTP id v3so58827950lff.0 for ; Thu, 22 Oct 2015 11:49:07 -0700 (PDT) Received: from relay.parallels.com (relay.parallels.com. [195.214.232.42]) by mx.google.com with ESMTPS id jv8si10447147lbc.86.2015.10.22.11.49.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 22 Oct 2015 11:49:07 -0700 (PDT) Date: Thu, 22 Oct 2015 21:48:53 +0300 From: Vladimir Davydov Subject: Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Message-ID: <20151022184852.GP18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote: ... 
> @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > } > } > > + vmpressure(sc->gfp_mask, memcg, > + sc->nr_scanned - scanned, > + sc->nr_reclaimed - reclaimed); > + > /* > * Direct reclaim and kswapd have to scan all memory > * cgroups to fulfill the overall scan target for the > @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > } > } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > > - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, > - sc->nr_scanned - nr_scanned, > - sc->nr_reclaimed - nr_reclaimed); > - > if (sc->nr_reclaimed - nr_reclaimed) > reclaimable = true; > I may be mistaken, but AFAIU this patch subtly changes the behavior of vmpressure visible from userspace: w/o this patch a userspace process will receive a notification for a memory cgroup only if *this* memory cgroup calls the reclaimer; with this patch a userspace notification will be issued even if the reclaimer is invoked by any cgroup up the hierarchy. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f51.google.com (mail-lf0-f51.google.com [209.85.215.51]) by kanga.kvack.org (Postfix) with ESMTP id 4CB9A82F64 for ; Thu, 22 Oct 2015 14:58:04 -0400 (EDT) Received: by lfbn126 with SMTP id n126so24800921lfb.2 for ; Thu, 22 Oct 2015 11:58:03 -0700 (PDT) Received: from relay.parallels.com (relay.parallels.com. 
[195.214.232.42]) by mx.google.com with ESMTPS id k185si10474099lfe.96.2015.10.22.11.58.02 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 22 Oct 2015 11:58:02 -0700 (PDT) Date: Thu, 22 Oct 2015 21:57:47 +0300 From: Vladimir Davydov Subject: Re: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure Message-ID: <20151022185747.GQ18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-9-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <1445487696-21545-9-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 12:21:36AM -0400, Johannes Weiner wrote: ... > @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work) > vmpr->reclaimed = 0; > spin_unlock(&vmpr->sr_lock); > > + level = vmpressure_calc_level(scanned, reclaimed); > + > + if (level > VMPRESSURE_LOW) { So we start socket_pressure at MEDIUM. Why not at LOW or CRITICAL? > + struct mem_cgroup *memcg; > + /* > + * Let the socket buffer allocator know that we are > + * having trouble reclaiming LRU pages. > + * > + * For hysteresis, keep the pressure state asserted > + * for a second in which subsequent pressure events > + * can occur. > + * > + * XXX: is vmpressure a global feature or part of > + * memcg? There shouldn't be anything memcg-specific > + * about exporting reclaim success ratios from the VM. > + */ > + memcg = container_of(vmpr, struct mem_cgroup, vmpressure); > + if (memcg != root_mem_cgroup) > + memcg->socket_pressure = jiffies + HZ; Why 1 second? 
Thanks, Vladimir > + } > + > do { > - if (vmpressure_event(vmpr, scanned, reclaimed)) > + if (vmpressure_event(vmpr, level)) > break; > /* > * If not handled, propagate the event upward into the -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by kanga.kvack.org (Postfix) with ESMTP id F3B446B0038 for ; Fri, 23 Oct 2015 08:39:38 -0400 (EDT) Received: by wikq8 with SMTP id q8so75047175wik.1 for ; Fri, 23 Oct 2015 05:39:38 -0700 (PDT) Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com. [209.85.212.179]) by mx.google.com with ESMTPS id bz5si4792441wib.23.2015.10.23.05.39.32 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 23 Oct 2015 05:39:32 -0700 (PDT) Received: by wicfv8 with SMTP id fv8so30132872wic.0 for ; Fri, 23 Oct 2015 05:39:32 -0700 (PDT) Date: Fri, 23 Oct 2015 14:39:30 +0200 From: Michal Hocko Subject: Re: [PATCH 4/8] mm: memcontrol: prepare for unified hierarchy socket accounting Message-ID: <20151023123930.GM2410@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-5-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-5-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:32, Johannes Weiner wrote: > The unified hierarchy memory controller will account socket > memory. Move the infrastructure functions accordingly. 
> > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko > --- > mm/memcontrol.c | 136 ++++++++++++++++++++++++++++---------------------------- > 1 file changed, 68 insertions(+), 68 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c41e6d7..3789050 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -287,74 +287,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) > return mem_cgroup_from_css(css); > } > > -/* Writing them here to avoid exposing memcg's inner layout */ > -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > - > -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > - > -void sock_update_memcg(struct sock *sk) > -{ > - struct mem_cgroup *memcg; > - /* > - * Socket cloning can throw us here with sk_cgrp already > - * filled. It won't however, necessarily happen from > - * process context. So the test for root memcg given > - * the current task's memcg won't help us in this case. > - * > - * Respecting the original socket's memcg is a better > - * decision in this case. > - */ > - if (sk->sk_memcg) { > - BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); > - css_get(&sk->sk_memcg->css); > - return; > - } > - > - rcu_read_lock(); > - memcg = mem_cgroup_from_task(current); > - if (css_tryget_online(&memcg->css)) > - sk->sk_memcg = memcg; > - rcu_read_unlock(); > -} > -EXPORT_SYMBOL(sock_update_memcg); > - > -void sock_release_memcg(struct sock *sk) > -{ > - if (sk->sk_memcg) > - css_put(&sk->sk_memcg->css); > -} > - > -/** > - * mem_cgroup_charge_skmem - charge socket memory > - * @memcg: memcg to charge > - * @nr_pages: number of pages to charge > - * > - * Charges @nr_pages to @memcg. Returns %true if the charge fit within > - * the memcg's configured limit, %false if the charge had to be forced. 
> - */ > -bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > -{ > - struct page_counter *counter; > - > - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > - return true; > - > - page_counter_charge(&memcg->skmem, nr_pages); > - return false; > -} > - > -/** > - * mem_cgroup_uncharge_skmem - uncharge socket memory > - * @memcg: memcg to uncharge > - * @nr_pages: number of pages to uncharge > - */ > -void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > -{ > - page_counter_uncharge(&memcg->skmem, nr_pages); > -} > - > -#endif > - > #ifdef CONFIG_MEMCG_KMEM > /* > * This will be the memcg's index in each cache's ->memcg_params.memcg_caches. > @@ -5521,6 +5453,74 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) > commit_charge(newpage, memcg, true); > } > > +/* Writing them here to avoid exposing memcg's inner layout */ > +#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > + > +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > + > +void sock_update_memcg(struct sock *sk) > +{ > + struct mem_cgroup *memcg; > + /* > + * Socket cloning can throw us here with sk_cgrp already > + * filled. It won't however, necessarily happen from > + * process context. So the test for root memcg given > + * the current task's memcg won't help us in this case. > + * > + * Respecting the original socket's memcg is a better > + * decision in this case. 
> + */ > + if (sk->sk_memcg) { > + BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); > + css_get(&sk->sk_memcg->css); > + return; > + } > + > + rcu_read_lock(); > + memcg = mem_cgroup_from_task(current); > + if (css_tryget_online(&memcg->css)) > + sk->sk_memcg = memcg; > + rcu_read_unlock(); > +} > +EXPORT_SYMBOL(sock_update_memcg); > + > +void sock_release_memcg(struct sock *sk) > +{ > + if (sk->sk_memcg) > + css_put(&sk->sk_memcg->css); > +} > + > +/** > + * mem_cgroup_charge_skmem - charge socket memory > + * @memcg: memcg to charge > + * @nr_pages: number of pages to charge > + * > + * Charges @nr_pages to @memcg. Returns %true if the charge fit within > + * the memcg's configured limit, %false if the charge had to be forced. > + */ > +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > +{ > + struct page_counter *counter; > + > + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + return true; > + > + page_counter_charge(&memcg->skmem, nr_pages); > + return false; > +} > + > +/** > + * mem_cgroup_uncharge_skmem - uncharge socket memory > + * @memcg: memcg to uncharge > + * @nr_pages: number of pages to uncharge > + */ > +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > +{ > + page_counter_uncharge(&memcg->skmem, nr_pages); > +} > + > +#endif > + > /* > * subsys_initcall() for memory controller. > * > -- > 2.6.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
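[Annotation, not part of the thread] The kerneldoc in the moved code says mem_cgroup_charge_skmem() returns %true if the charge fit within the limit and %false if it had to be forced. A minimal user-space sketch of those semantics over a hierarchical counter follows. The `struct counter` type and all function names here are hypothetical stand-ins, not the kernel's page_counter implementation; the sketch only assumes that a try-charge walks up the parent chain, reports the counter that refused, and undoes partial charges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hierarchical usage counter. */
struct counter {
	unsigned long usage;
	unsigned long limit;
	struct counter *parent;
};

/* Unconditional charge of this counter and all ancestors. */
static void counter_charge(struct counter *c, unsigned long n)
{
	for (; c; c = c->parent)
		c->usage += n;
}

/* Uncharge this counter and all ancestors. */
static void counter_uncharge(struct counter *c, unsigned long n)
{
	for (; c; c = c->parent) {
		assert(c->usage >= n);
		c->usage -= n;
	}
}

/*
 * Charge only if every level stays within its limit. On failure, store the
 * refusing counter in *fail and roll back the levels already charged.
 */
static bool counter_try_charge(struct counter *counter, unsigned long n,
			       struct counter **fail)
{
	struct counter *c, *u;

	for (c = counter; c; c = c->parent) {
		if (c->usage + n > c->limit) {
			*fail = c;
			for (u = counter; u != c; u = u->parent)
				u->usage -= n;
			return false;
		}
		c->usage += n;
	}
	return true;
}

/* true: charge fit within the limits; false: charge was forced. */
static bool charge_skmem(struct counter *skmem, unsigned int nr_pages)
{
	struct counter *fail;

	if (counter_try_charge(skmem, nr_pages, &fail))
		return true;

	counter_charge(skmem, nr_pages);
	return false;
}
```

The forced path keeps the accounting accurate even when the limit is exceeded; the caller learns about the overrun from the return value.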
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id DE04A6B0038 for ; Fri, 23 Oct 2015 09:19:59 -0400 (EDT) Received: by wijp11 with SMTP id p11so77216614wij.0 for ; Fri, 23 Oct 2015 06:19:59 -0700 (PDT) Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com. [209.85.212.171]) by mx.google.com with ESMTPS id ee8si4983589wic.1.2015.10.23.06.19.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 23 Oct 2015 06:19:57 -0700 (PDT) Received: by wicfx6 with SMTP id fx6so31050223wic.1 for ; Fri, 23 Oct 2015 06:19:57 -0700 (PDT) Date: Fri, 23 Oct 2015 15:19:56 +0200 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151023131956.GA15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:33, Johannes Weiner wrote: > Socket memory can be a significant share of overall memory consumed by > common workloads. In order to provide reasonable resource isolation > out-of-the-box in the unified hierarchy, this type of memory needs to > be accounted and tracked per default in the memory controller. What about users who do not want to pay an additional overhead for the accounting? How can they disable it? > Signed-off-by: Johannes Weiner [...] 
> @@ -5453,10 +5470,9 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) > commit_charge(newpage, memcg, true); > } > > -/* Writing them here to avoid exposing memcg's inner layout */ > -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > +#ifdef CONFIG_INET > > -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > +DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets); AFAIU this means that the jump label is enabled by default. Is this intended when you enable it explicitly where needed? > > void sock_update_memcg(struct sock *sk) > { -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f174.google.com (mail-io0-f174.google.com [209.85.223.174]) by kanga.kvack.org (Postfix) with ESMTP id 879A26B0253 for ; Fri, 23 Oct 2015 09:43:33 -0400 (EDT) Received: by ioll68 with SMTP id l68so125107626iol.3 for ; Fri, 23 Oct 2015 06:43:33 -0700 (PDT) Received: from shards.monkeyblade.net (shards.monkeyblade.net. 
[2001:4f8:3:36:211:85ff:fe63:a549]) by mx.google.com with ESMTP id p6si3419354igj.14.2015.10.23.06.43.33 for ; Fri, 23 Oct 2015 06:43:33 -0700 (PDT) Date: Fri, 23 Oct 2015 06:59:57 -0700 (PDT) Message-Id: <20151023.065957.1690815054807881760.davem@davemloft.net> Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy From: David Miller In-Reply-To: <20151023131956.GA15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: mhocko@kernel.org Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org From: Michal Hocko Date: Fri, 23 Oct 2015 15:19:56 +0200 > On Thu 22-10-15 00:21:33, Johannes Weiner wrote: >> Socket memory can be a significant share of overall memory consumed by >> common workloads. In order to provide reasonable resource isolation >> out-of-the-box in the unified hierarchy, this type of memory needs to >> be accounted and tracked per default in the memory controller. > > What about users who do not want to pay an additional overhead for the > accounting? How can they disable it? Yeah, this really cannot pass. This extra overhead will be seen by %99.9999 of users, since entities (especially distributions) just flip on all of these config options by default. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com [209.85.212.173]) by kanga.kvack.org (Postfix) with ESMTP id 5D2476B0256 for ; Fri, 23 Oct 2015 09:50:00 -0400 (EDT) Received: by wicfx6 with SMTP id fx6so32253838wic.1 for ; Fri, 23 Oct 2015 06:49:59 -0700 (PDT) Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com. [209.85.212.182]) by mx.google.com with ESMTPS id v1si25059853wja.21.2015.10.23.06.49.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 23 Oct 2015 06:49:59 -0700 (PDT) Received: by wicfx6 with SMTP id fx6so32253311wic.1 for ; Fri, 23 Oct 2015 06:49:59 -0700 (PDT) Date: Fri, 23 Oct 2015 15:49:57 +0200 From: Michal Hocko Subject: Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Message-ID: <20151023134957.GC15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:35, Johannes Weiner wrote: > The vmpressure metric is based on reclaim efficiency, which in turn is > an attribute of the LRU. However, vmpressure events are currently > reported at the source of pressure rather than at the reclaim level. > > Switch the reporting to the reclaim level to allow finer-grained > analysis of which memcg is having trouble reclaiming its pages. I can see how this can be useful. 
> As far as memory.pressure_level interface semantics go, events are > escalated up the hierarchy until a listener is found, so this won't > affect existing users that listen at higher levels. This is true but the parent will not see cumulative events anymore. One memcg might be fighting and barely reclaiming anything so it would report high pressure while another would be doing just fine. The parent will just see conflicting events in a short time period and cannot match them to the source memcg. This sounds really confusing. Even more confusing than the current semantic which allows the same behavior under certain configurations. I dunno, have to think about it some more. Maybe we need to rethink how the pressure is signaled. If we want the breakdown of the particular memcgs then we should be able to identify them for this to be useful. [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by kanga.kvack.org (Postfix) with ESMTP id B69F282F64 for ; Mon, 26 Oct 2015 13:22:28 -0400 (EDT) Received: by wikq8 with SMTP id q8so174957321wik.1 for ; Mon, 26 Oct 2015 10:22:28 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. 
[85.214.110.215]) by mx.google.com with ESMTPS id o8si44338608wjx.66.2015.10.26.10.22.27 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 26 Oct 2015 10:22:27 -0700 (PDT) Date: Mon, 26 Oct 2015 13:22:16 -0400 From: Johannes Weiner Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151026172216.GC2214@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151022184509.GM18351@esperanza> Sender: owner-linux-mm@kvack.org List-ID: To: Vladimir Davydov Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote: > Hi Johannes, > > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: > ... > > Patch #5 adds accounting and tracking of socket memory to the unified > > hierarchy memory controller, as described above. It uses the existing > > per-cpu charge caches and triggers high limit reclaim asynchronously. > > > > Patch #8 uses the vmpressure extension to equalize pressure between > > the pages tracked natively by the VM and socket buffer pages. As the > > pool is shared, it makes sense that while natively tracked pages are > > under duress the network transmit windows are also not increased. > > First of all, I've no experience in networking, so I'm likely to be > mistaken. Nevertheless I beg to disagree that this patch set is a step > in the right direction. Here goes why. > > I admit that your idea to get rid of explicit tcp window control knobs > and size it dynamically based on memory pressure instead does sound > tempting, but I don't think it'd always work. 
The problem is that in > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only > stop growing them. Now suppose a system hasn't experienced memory > pressure for a while. If we don't have explicit tcp window limit, tcp > buffers on such a system might have eaten almost all available memory > (because of network load/problems). If a user workload that needs a > significant amount of memory is started suddenly then, the network code > will receive a notification and surely stop growing buffers, but all > those buffers accumulated won't disappear instantly. As a result, the > workload might be unable to find enough free memory and have no choice > but invoke OOM killer. This looks unexpected from the user POV. I'm not getting rid of those knobs, I'm just reusing the old socket accounting infrastructure in an attempt to make the memory accounting feature useful to more people in cgroups v2 (unified hierarchy). We can always come back to think about per-cgroup tcp window limits in the unified hierarchy, my patches don't get in the way of this. I'm not removing the knobs in cgroups v1 and I'm not preventing them in v2. But regardless of tcp window control, we need to account socket memory in the main memory accounting pool where pressure is shared (to the best of our abilities) between all accounted memory consumers. From an interface standpoint alone, I don't think it's reasonable to ask users by default to limit different consumers on a case by case basis. I certainly have no problem with fine-tuning for scenarios you describe above, but with memory.current, memory.high, memory.max we are providing a generic interface to account and contain memory consumption of workloads. This has to include all major memory consumers to make semantic sense. But also, there are people right now for whom the socket buffers cause system OOM, but the existing memcg's hard tcp window limit absolutely wrecks network performance for them. 
It's not usable the way it is. It'd be much better to have the socket buffers exert pressure on the shared pool, and then propagate the overall pressure back to individual consumers with reclaim, shrinkers, vmpressure etc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id F35516B0038 for ; Tue, 27 Oct 2015 08:26:49 -0400 (EDT) Received: by wikq8 with SMTP id q8so209118065wik.1 for ; Tue, 27 Oct 2015 05:26:49 -0700 (PDT) Received: from mail-wi0-f174.google.com (mail-wi0-f174.google.com. [209.85.212.174]) by mx.google.com with ESMTPS id a11si29834757wik.16.2015.10.27.05.26.48 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 27 Oct 2015 05:26:48 -0700 (PDT) Received: by wicll6 with SMTP id ll6so157122622wic.0 for ; Tue, 27 Oct 2015 05:26:48 -0700 (PDT) Date: Tue, 27 Oct 2015 13:26:47 +0100 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151027122647.GG9891@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151026165619.GB2214@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Mon 26-10-15 12:56:19, Johannes Weiner wrote: [...] 
> Now you could argue that there might exist specialized workloads that > need to account anonymous pages and page cache, but not socket memory > buffers. Exactly, and there are loads doing this. Memcg groups are also created to limit anon/page cache consumers to not affect the others running on the system (basically in the root memcg context from memcg POV) which don't care about tracking and they definitely do not want to pay for additional overhead. We should definitely be able to offer a global disable knob for them. The same applies to kmem accounting in general. I do understand having the accounting enabled by default after we are reasonably sure that both kmem/tcp are stable enough (which I am not convinced about yet to be honest) but there will always be special loads which simply do not care about kmem/tcp accounting and would rather pay a global balancing price (even OOM) than a permanent price. And they should get a way to opt out. > Or any other combination of pick-and-choose consumers. But > honestly, nowadays all our paths are lockless, and the counting is an > atomic-add-return with a per-cpu batch cache. You are still hooking into hot paths and there are users who want to squeeze every single cycle from the HW. > I don't think there is a compelling case for an elaborate interface > to make individual memory consumers configurable inside the memory > controller. I do not think we need an elaborate interface. We just want to have a global boot time knob to override the default behavior. This is a few lines of code and it should give sufficient flexibility. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f46.google.com (mail-oi0-f46.google.com [209.85.218.46]) by kanga.kvack.org (Postfix) with ESMTP id 183A382F64 for ; Tue, 27 Oct 2015 09:32:40 -0400 (EDT) Received: by oifu63 with SMTP id u63so77312030oif.2 for ; Tue, 27 Oct 2015 06:32:39 -0700 (PDT) Received: from shards.monkeyblade.net (shards.monkeyblade.net. [2001:4f8:3:36:211:85ff:fe63:a549]) by mx.google.com with ESMTP id o3si24360976obv.60.2015.10.27.06.32.39 for ; Tue, 27 Oct 2015 06:32:39 -0700 (PDT) Date: Tue, 27 Oct 2015 06:49:16 -0700 (PDT) Message-Id: <20151027.064916.312540587298733586.davem@davemloft.net> Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy From: David Miller In-Reply-To: <20151027122647.GG9891@dhcp22.suse.cz> References: <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: mhocko@kernel.org Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org From: Michal Hocko Date: Tue, 27 Oct 2015 13:26:47 +0100 > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > [...] >> Or any other combination of pick-and-choose consumers. But >> honestly, nowadays all our paths are lockless, and the counting is an >> atomic-add-return with a per-cpu batch cache. > > You are still hooking into hot paths and there are users who want to > squeeze every single cycle from the HW. Yeah, you're basically probably undoing a half year of work by another developer who was able to remove an atomic from these paths. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f176.google.com (mail-wi0-f176.google.com [209.85.212.176]) by kanga.kvack.org (Postfix) with ESMTP id 5A99C82F64 for ; Tue, 27 Oct 2015 11:41:52 -0400 (EDT) Received: by wicll6 with SMTP id ll6so166373765wic.1 for ; Tue, 27 Oct 2015 08:41:51 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id m137si3030573wmb.68.2015.10.27.08.41.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 27 Oct 2015 08:41:51 -0700 (PDT) Date: Tue, 27 Oct 2015 11:41:38 -0400 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151027154138.GA4665@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151027122647.GG9891@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote: > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > [...] > > Now you could argue that there might exist specialized workloads that > > need to account anonymous pages and page cache, but not socket memory > > buffers. > > Exactly, and there are loads doing this. 
Memcg groups are also created to > limit anon/page cache consumers to not affect the others running on > the system (basically in the root memcg context from memcg POV) which > don't care about tracking and they definitely do not want to pay for an > additional overhead. We should definitely be able to offer a global > disable knob for them. The same applies to kmem accounting in general. I don't see how you make such a clear distinction between, say, page cache and the dentry cache, and call one user memory and the other kernel memory. That just doesn't make sense to me. They're both kernel memory allocated on behalf of the user, the only difference being that one is tracked on the page level and the other on the slab level, and we started accounting one before the other. IMO that's an implementation detail and a historical artifact that should not be exposed to the user. And that's the thing I hate about the current opt-out knob. > > I don't think there is a compelling case for an elaborate interface > > to make individual memory consumers configurable inside the memory > > controller. > > I do not think we need an elaborate interface. We just want to have > a global boot time knob to overwrite the default behavior. This is > few lines of code and it should give the sufficient flexibility. Okay, then let's add this for the socket memory to start with. I'll have to think more about how to distinguish the slab-based consumers. Or maybe you have an idea. For now, something like this as a boot command line? cgroup.memory=nosocket So again in summary, no default overhead until you create a cgroup to specifically track and account memory. And then, when you know what you are doing and have a specialized workload, you can disable socket memory as a specific consumer to remove that particular overhead while still being able to contain page cache, anon, kmem, whatever. Does that sound like a reasonable user interface to everyone? 
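For illustration only, the behavior summarized above could be modeled roughly as follows. This is a userspace sketch, not the kernel implementation: the names parse_cgroup_memory_opt(), socket_accounting_active(), nosocket_requested and memcg_configured are hypothetical, and plain booleans stand in for whatever mechanism the kernel would actually use to keep the hot path free of overhead. The sketch assumes the option value is a comma-separated token list, which the thread does not specify.

```c
/* Userspace sketch only; every name below is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Set when "nosocket" appears in the cgroup.memory= option value. */
static bool nosocket_requested;

/* Set once the user has created a memory cgroup; off by default. */
static bool memcg_configured;

/*
 * Parse the value part of "cgroup.memory=<value>", e.g. "nosocket".
 * Tokens are assumed to be comma-separated; unknown tokens are ignored.
 */
static void parse_cgroup_memory_opt(const char *value)
{
	assert(value != NULL);

	while (*value != '\0') {
		size_t len = strcspn(value, ",");

		/* Match the whole token, not just a prefix of it. */
		if (len == strlen("nosocket") &&
		    strncmp(value, "nosocket", len) == 0)
			nosocket_requested = true;

		value += len;
		if (*value == ',')
			value++;	/* skip the separator */
	}
}

/*
 * Socket memory is accounted only when a cgroup has been configured
 * and the boot option has not forced the feature off.
 */
static bool socket_accounting_active(void)
{
	return memcg_configured && !nosocket_requested;
}
```

In this model a system that never creates a memory cgroup never enters the accounting path, and booting with cgroup.memory=nosocket keeps socket accounting off even after a cgroup is created, while the other consumers stay contained.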
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id 9A59482F64 for ; Tue, 27 Oct 2015 12:01:23 -0400 (EDT) Received: by wicll6 with SMTP id ll6so166438704wic.0 for ; Tue, 27 Oct 2015 09:01:23 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id qs10si50883036wjc.129.2015.10.27.09.01.22 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 27 Oct 2015 09:01:22 -0700 (PDT) Date: Tue, 27 Oct 2015 09:01:08 -0700 From: Johannes Weiner Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151027155833.GB4665@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151027084320.GF13221@esperanza> Sender: owner-linux-mm@kvack.org List-ID: To: Vladimir Davydov Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 11:43:21AM +0300, Vladimir Davydov wrote: > On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote: > > I'm not getting rid of those knobs, I'm just reusing the old socket > > accounting infrastructure in an attempt to make the memory accounting > > feature useful to more people in cgroups v2 (unified hierarchy). > > My understanding is that in the meantime you effectively break the > existing per memcg tcp window control logic. 
That's not my intention, this stuff has to keep working. I'm assuming you mean the changes to sk_enter_memory_pressure() when hitting the charge limit; let me address this in the other subthread. > > We can always come back to think about per-cgroup tcp window limits in > > the unified hierarchy, my patches don't get in the way of this. I'm > > not removing the knobs in cgroups v1 and I'm not preventing them in v2. > > > > But regardless of tcp window control, we need to account socket memory > > in the main memory accounting pool where pressure is shared (to the > > best of our abilities) between all accounted memory consumers. > > > > No objections to this point. However, I really don't like the idea to > charge tcp window size to memory.current instead of charging individual > pages consumed by the workload for storing socket buffers, because it is > inconsistent with what we have now. Can't we charge individual skb pages > as we do in case of other kmem allocations? Absolutely, both work for me. I chose that route because it's where the networking code already tracks and accounts memory consumed, so it seemed like a better site to hook into. But I understand your concerns. We want to track this stuff as close to the memory allocators as possible. > > But also, there are people right now for whom the socket buffers cause > > system OOM, but the existing memcg's hard tcp window limit > > absolutely wrecks network performance for them. It's not usable > > the way it is. It'd be much better to have the socket buffers exert > > pressure on the shared pool, and then propagate the overall pressure > > back to individual consumers with reclaim, shrinkers, vmpressure etc. > > This might or might not work. I'm not an expert to judge. But if you do > this only for memcg leaving the global case as it is, networking people > won't budge IMO. So could you please start such a major rework from the > global case? 
Could you please try to deprecate the tcp window limits not > only in the legacy memcg hierarchy, but also system-wide in order to > attract attention of networking experts? I'm definitely interested in addressing this globally as well. The idea behind this was to use the memcg part as a testbed. cgroup2 is going to be new and people are prepared for hiccups when migrating their applications to it; and they can roll back to cgroup1 and tcp window limits at any time should they run into problems in production. So this seemed like a good way to prove a new mechanism before rolling it out to every single Linux setup, rather than switch everybody over after the limited scope testing I can do as a developer on my own. Keep in mind that my patches are not committing anything in terms of interface, so we retain all the freedom to fix and tune the way this is implemented, including the freedom to re-add tcp window limits in case the pressure balancing is not a comprehensive solution. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f172.google.com (mail-wi0-f172.google.com [209.85.212.172]) by kanga.kvack.org (Postfix) with ESMTP id 1400382F64 for ; Tue, 27 Oct 2015 12:42:40 -0400 (EDT) Received: by wicll6 with SMTP id ll6so168174966wic.0 for ; Tue, 27 Oct 2015 09:42:39 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. 
[85.214.110.215]) by mx.google.com with ESMTPS id my2si322020wic.29.2015.10.27.09.42.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 27 Oct 2015 09:42:38 -0700 (PDT) Date: Tue, 27 Oct 2015 09:42:27 -0700 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151027164227.GB7749@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151027161554.GJ9891@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: > On Tue 27-10-15 11:41:38, Johannes Weiner wrote: > > On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote: > > > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > > > [...] > > > > Now you could argue that there might exist specialized workloads that > > > > need to account anonymous pages and page cache, but not socket memory > > > > buffers. > > > > > > Exactly, and there are loads doing this. Memcg groups are also created to > > > limit anon/page cache consumers to not affect the others running on > > > the system (basically in the root memcg context from memcg POV) which > > > don't care about tracking and they definitely do not want to pay for an > > > additional overhead. We should definitely be able to offer a global > > > disable knob for them. 
The same applies to kmem accounting in general. > > > > I don't see how you make such a clear distinction between, say, page > > cache and the dentry cache, and call one user memory and the other > > kernel memory. > > Because the kernel memory footprint would be so small that it simply > doesn't change the picture at all. While the page cache or anonymous > memory consumption might be so large it might be disruptive. Or it could be exactly the other way around when you have a workload that is heavy on filesystem metadata. I don't see why any scenario would be more important than the other. I'm not saying that distinguishing between consumers is wrong, just that "user memory vs kernel memory" is a false classification. Why do you call page cache user memory but dentry cache kernel memory? It doesn't make any sense. > Also kmem accounting will make the load more non-deterministic because > many of the resources are shared between tasks in separate cgroups > unless they are explicitly configured. E.g. [id]cache will be shared > and first to touch gets charged so you would end up with more false > sharing. Exactly like page cache. This differentiation isn't based on reality. > Nevertheless, I do not want to shift the discussion from the topic. I > just think that one-fits-all simply won't work. Okay, this is something we can converge on. > > That just doesn't make sense to me. They're both kernel > > memory allocated on behalf of the user, the only difference being that > > one is tracked on the page level and the other on the slab level, and > > we started accounting one before the other. > > > > IMO that's an implementation detail and a historical artifact that > > should not be exposed to the user. And that's the thing I hate about > > the current opt-out knob. You carefully skipped over this part. We can ignore it for socket memory but it's something we need to figure out when it comes to slab accounting and tracking. 
> > > > I don't think there is a compelling case for an elaborate interface > > > > to make individual memory consumers configurable inside the memory > > > > controller. > > > > > > I do not think we need an elaborate interface. We just want to have > > > a global boot time knob to overwrite the default behavior. This is > > > few lines of code and it should give the sufficient flexibility. > > > > Okay, then let's add this for the socket memory to start with. I'll > > have to think more about how to distinguish the slab-based consumers. > > Or maybe you have an idea. > > Isn't that as simple as enabling the jump label during the > initialization depending on the knob value? All the charging paths > should be disabled by default already. You missed my point. It's not about the implementation, it's about how we present these choices to the user. Having page cache accounting built in while presenting dentry+inode cache as a configurable extension is completely random and doesn't make sense. They are both first class memory consumers. They're not separate categories. One isn't more "core" than the other. > > For now, something like this as a boot commandline? > > > > cgroup.memory=nosocket > > That would work for me. Okay, then I'll go that route for the socket stuff. Dave is that cool with you? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f180.google.com (mail-ig0-f180.google.com [209.85.213.180]) by kanga.kvack.org (Postfix) with ESMTP id F044182F64 for ; Tue, 27 Oct 2015 20:28:55 -0400 (EDT) Received: by igbkq10 with SMTP id kq10so107039904igb.0 for ; Tue, 27 Oct 2015 17:28:55 -0700 (PDT) Received: from shards.monkeyblade.net (shards.monkeyblade.net. 
[2001:4f8:3:36:211:85ff:fe63:a549]) by mx.google.com with ESMTP id b65si12048215ioe.177.2015.10.27.17.28.54 for ; Tue, 27 Oct 2015 17:28:54 -0700 (PDT) Date: Tue, 27 Oct 2015 17:45:32 -0700 (PDT) Message-Id: <20151027.174532.469361008055673315.davem@davemloft.net> Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy From: David Miller In-Reply-To: <20151027164227.GB7749@cmpxchg.org> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: hannes@cmpxchg.org Cc: mhocko@kernel.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org From: Johannes Weiner Date: Tue, 27 Oct 2015 09:42:27 -0700 > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: >> > For now, something like this as a boot commandline? >> > >> > cgroup.memory=nosocket >> >> That would work for me. > > Okay, then I'll go that route for the socket stuff. > > Dave is that cool with you? Depends upon the default. Until the user configures something explicitly into the memory controller, the networking bits should all evaluate to nothing. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f169.google.com (mail-wi0-f169.google.com [209.85.212.169]) by kanga.kvack.org (Postfix) with ESMTP id CA4C482F64 for ; Tue, 27 Oct 2015 23:05:33 -0400 (EDT) Received: by wicfx6 with SMTP id fx6so183425656wic.1 for ; Tue, 27 Oct 2015 20:05:33 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. 
[85.214.110.215]) by mx.google.com with ESMTPS id p3si1239766wia.63.2015.10.27.20.05.32 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 27 Oct 2015 20:05:32 -0700 (PDT) Date: Tue, 27 Oct 2015 20:05:19 -0700 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151028030519.GA20789@cmpxchg.org> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151027.174532.469361008055673315.davem@davemloft.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151027.174532.469361008055673315.davem@davemloft.net> Sender: owner-linux-mm@kvack.org List-ID: To: David Miller Cc: mhocko@kernel.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 05:45:32PM -0700, David Miller wrote: > From: Johannes Weiner > Date: Tue, 27 Oct 2015 09:42:27 -0700 > > > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: > >> > For now, something like this as a boot commandline? > >> > > >> > cgroup.memory=nosocket > >> > >> That would work for me. > > > > Okay, then I'll go that route for the socket stuff. > > > > Dave is that cool with you? > > Depends upon the default. > > Until the user configures something explicitly into the memory > controller, the networking bits should all evaluate to nothing. Yep, I'll stick them behind a default-off jump label again. This bootflag is only to override an active memory controller configuration and force-off that jump label permanently. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by kanga.kvack.org (Postfix) with ESMTP id A1B9682F64 for ; Thu, 29 Oct 2015 12:10:28 -0400 (EDT) Received: by wicfv8 with SMTP id fv8so47736091wic.0 for ; Thu, 29 Oct 2015 09:10:28 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id j64si12253819wmd.123.2015.10.29.09.10.26 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 29 Oct 2015 09:10:27 -0700 (PDT) Date: Thu, 29 Oct 2015 09:10:09 -0700 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151029161009.GA9160@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151029152546.GG23598@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote: > On Tue 27-10-15 09:42:27, Johannes Weiner wrote: > > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: > > > On Tue 27-10-15 11:41:38, Johannes Weiner wrote: > > > > IMO that's an implementation detail and a historical artifact that > > > > should not be exposed to the user. 
And that's the thing I hate about > > > > the current opt-out knob. > > > > You carefully skipped over this part. We can ignore it for socket > > memory but it's something we need to figure out when it comes to slab > > accounting and tracking. > > I am sorry, I didn't mean to skip this part, I thought it would be clear > from the previous text. I think kmem accounting falls into the same > category. Have a sane default and a global boottime knob to override it > for those that think differently - for whatever reason they might have. Yes, that makes sense to me. Like cgroup.memory=nosocket, would you think it makes sense to include slab in the default for functional/semantic completeness and provide a cgroup.memory=noslab for power users? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by kanga.kvack.org (Postfix) with ESMTP id 50EF482F64 for ; Thu, 29 Oct 2015 13:52:45 -0400 (EDT) Received: by wijp11 with SMTP id p11so294592342wij.0 for ; Thu, 29 Oct 2015 10:52:44 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. 
[85.214.110.215]) by mx.google.com with ESMTPS id m139si12808747wmb.72.2015.10.29.10.52.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 29 Oct 2015 10:52:43 -0700 (PDT) Date: Thu, 29 Oct 2015 10:52:28 -0700 From: Johannes Weiner Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151029175228.GB9160@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> <20151028185810.GA31488@cmpxchg.org> <20151029092747.GR13221@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151029092747.GR13221@esperanza> Sender: owner-linux-mm@kvack.org List-ID: To: Vladimir Davydov Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 12:27:47PM +0300, Vladimir Davydov wrote: > On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote: > > Having the hard limit as a failsafe (or a minimum for other consumers) > > is one thing, and certainly something I'm open to for cgroupv2, should > > we have problems with load startup up after a socket memory landgrab. > > > > That being said, if the VM is struggling to reclaim pages, or is even > > swapping, it makes perfect sense to let the socket memory scheduler > > know it shouldn't continue to increase its footprint until the VM > > recovers. Regardless of any hard limitations/minimum guarantees. > > > > This is what my patch does and it seems pretty straight-forward to > > me. I don't really understand why this is so controversial. > > I'm not arguing that the idea behind this patch set is necessarily bad. > Quite the contrary, it does look interesting to me. 
I'm just saying that
> IMO it can't replace hard/soft limits. It probably could if it was
> possible to shrink buffers, but I don't think it's feasible, even
> theoretically. That's why I propose not to change the behavior of the
> existing per memcg tcp limit at all. And frankly I don't get why you are
> so keen on simplifying it. You say it's a "crapload of boilerplate
> code". Well, I don't see how it is - it just replicates global knobs and
> I don't see how it could be done in a better way. The code is hidden
> behind jump labels, so the overhead is zero if it isn't used. If you
> really dislike this code, we can isolate it under a separate config
> option. But all right, I don't rule out the possibility that the code
> could be simplified. If you do that w/o breaking it, that'll be OK to
> me, but I don't see why it should be related to this particular patch
> set.

Okay, I see your concern.

I'm not trying to change the behavior, just the implementation,
because it's too complex for the functionality it actually provides.

And the reason it's part of this patch set is because I'm using the
same code to hook into the memory accounting, so it makes sense to
refactor this stuff in the same go.

There is also a niceness factor of not adding more memcg callbacks to
the networking subsystem when there is an option to consolidate them.

Now, you mentioned that you'd rather see the socket buffers accounted
at the allocator level, but I looked at the different allocation paths
and network protocols and I'm not convinced that this makes sense. We
don't want to be in the hotpath of every single packet when a lot of
them are small, short-lived management blips that don't involve user
space to let the kernel dispose of them.

__sk_mem_schedule() on the other hand is already wired up to exactly
those consumers we are interested in for memory isolation: those with
bigger chunks of data attached to them and those that have exploding
receive queues when userspace fails to read(). UDP and TCP.

I mean, there is a reason why the global memory limits apply to only
those types of packets in the first place: everything else is noise.

I agree that it's appealing to account at the allocator level and set
page->mem_cgroup etc. but in this case we'd pay extra to capture a lot
of noise, and I don't want to pay that just for aesthetics. In this
case it's better to track ownership on the socket level and only count
packets that can accumulate a significant amount of memory consumed.

> > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > for sure, but it's horrendous for network performance. There is no
> > "stop growing" phase, it just keeps going full throttle until it hits
> > the wall hard.
> >
> > Now, we could probably try to replicate the global knobs and add a
> > per-memcg soft limit. But you know better than anyone else how hard it
> > is to estimate the overall workingset size of a workload, and the
> > margins on containerized loads are razor-thin. Performance is much
> > more sensitive to input errors, and often times parameters must be
> > adjusted continuously during the runtime of a workload. It'd be
> > disastrous to rely on yet more static, error-prone user input here.
>
> Yeah, but the dynamic approach proposed in your patch set doesn't
> guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> reduces this possibility. Of course, memcg OOM is far not as disastrous
> as the global one, but still it usually means the workload breakage.

Right now, the entire machine breaks. Confining it to a faulty memcg,
as well as reducing the likelihood of that OOM in many cases seems
like a good move in the right direction, no?

And how likely are memcg OOMs because of this anyway? There is of
course a scenario imaginable where the packets pile up, followed by
some *other* part of the workload, the one that doesn't read() and
process packets, trying to expand--which then doesn't work and goes
OOM. But that seems like a complete corner case. In the vast majority
of cases, the application will be in full operation and just fail to
read() fast enough--because the network bandwidth is enormous compared
to the container's size, or because it shares the CPU with thousands
of other workloads and there is scheduling latency.

This would be the perfect point to rein in the transmit window...

> The static approach is error-prone for sure, but it has existed for
> years and worked satisfactory AFAIK.

...but that point is not a fixed amount of memory consumed. It depends
on the workload and the random interactions it's having with thousands
of other containers on that same machine.

The point of containers is to maximize utilization of your hardware
and systematically eliminate slack in the system. But it's exactly
that slack on dedicated bare-metal machines that allowed us to take a
wild guess at the settings and then tune them based on observing a
handful of workloads. This approach is not going to work anymore when
we pack the machine to capacity and still expect every single
container out of thousands to perform well. We need that automation.

The static setting working okay on the global level is also why I'm
not interested in starting to experiment with it. There is no reason
to change it. It's much more likely that any attempt to change it will
be shot down, not because of the approach chosen, but because there is
no problem to solve there. I doubt we can get networking people to
care about containers by screwing with things that work for them ;-)
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f177.google.com (mail-lb0-f177.google.com [209.85.217.177]) by kanga.kvack.org (Postfix) with ESMTP id B95776B0038 for ; Mon, 2 Nov 2015 09:47:50 -0500 (EST) Received: by lbbwb3 with SMTP id wb3so89468106lbb.1 for ; Mon, 02 Nov 2015 06:47:50 -0800 (PST) Received: from relay.parallels.com (relay.parallels.com. [195.214.232.42]) by mx.google.com with ESMTPS id r123si14822607lfr.149.2015.11.02.06.47.48 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 02 Nov 2015 06:47:48 -0800 (PST) Date: Mon, 2 Nov 2015 17:47:29 +0300 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151102144729.GA17424@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> <20151028185810.GA31488@cmpxchg.org> <20151029092747.GR13221@esperanza> <20151029175228.GB9160@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151029175228.GB9160@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 10:52:28AM -0700, Johannes Weiner wrote: ... > Now, you mentioned that you'd rather see the socket buffers accounted > at the allocator level, but I looked at the different allocation paths > and network protocols and I'm not convinced that this makes sense. We > don't want to be in the hotpath of every single packet when a lot of > them are small, short-lived management blips that don't involve user > space to let the kernel dispose of them. 
> > __sk_mem_schedule() on the other hand is already wired up to exactly > those consumers we are interested in for memory isolation: those with > bigger chunks of data attached to them and those that have exploding > receive queues when userspace fails to read(). UDP and TCP. > > I mean, there is a reason why the global memory limits apply to only > those types of packets in the first place: everything else is noise. > > I agree that it's appealing to account at the allocator level and set > page->mem_cgroup etc. but in this case we'd pay extra to capture a lot > of noise, and I don't want to pay that just for aesthetics. In this > case it's better to track ownership on the socket level and only count > packets that can accumulate a significant amount of memory consumed. Sigh, you seem to be right. Moreover, I can't even think of a neat way to account skb pages to memcg, because rcv skbs are generated in device drivers, where we don't know which socket/memcg it will go to. We could recharge individual pages when skb gets to the network or transport layer, but it would result in unjustified overhead. > > > > We tried using the per-memcg tcp limits, and that prevents the OOMs > > > for sure, but it's horrendous for network performance. There is no > > > "stop growing" phase, it just keeps going full throttle until it hits > > > the wall hard. > > > > > > Now, we could probably try to replicate the global knobs and add a > > > per-memcg soft limit. But you know better than anyone else how hard it > > > is to estimate the overall workingset size of a workload, and the > > > margins on containerized loads are razor-thin. Performance is much > > > more sensitive to input errors, and often times parameters must be > > > adjusted continuously during the runtime of a workload. It'd be > > > disasterous to rely on yet more static, error-prone user input here. 
> > > > Yeah, but the dynamic approach proposed in your patch set doesn't > > guarantee we won't hit OOM in memcg due to overgrown buffers. It just > > reduces this possibility. Of course, memcg OOM is far not as disastrous > > as the global one, but still it usually means the workload breakage. > > Right now, the entire machine breaks. Confining it to a faulty memcg, > as well as reducing the likelihood of that OOM in many cases seems > like a good move in the right direction, no? It seems. However, memcg OOM is also bad, we should strive to avoid it if we can. > > And how likely are memcg OOMs because of this anyway? There is of Frankly, I've no idea. Your arguments below sound reassuring though. > course a scenario imaginable where the packets pile up, followed by > some *other* part of the workload, the one that doesn't read() and > process packets, trying to expand--which then doesn't work and goes > OOM. But that seems like a complete corner case. In the vast majority > of cases, the application will be in full operation and just fail to > read() fast enough--because the network bandwidth is enormous compared > to the container's size, or because it shares the CPU with thousands > of other workloads and there is scheduling latency. > > This would be the perfect point to reign in the transmit window... > > > The static approach is error-prone for sure, but it has existed for > > years and worked satisfactory AFAIK. > > ...but that point is not a fixed amount of memory consumed. It depends > on the workload and the random interactions it's having with thousands > of other containers on that same machine. > > The point of containers is to maximize utilization of your hardware > and systematically eliminate slack in the system. But it's exactly > that slack on dedicated bare-metal machines that allowed us to take a > wild guess at the settings and then tune them based on observing a > handful of workloads. 
This approach is not going to work anymore when
> we pack the machine to capacity and still expect every single
> container out of thousands to perform well. We need that automation.

But we do use a static approach when setting memory limits, no?
memory.{low,high,max} - they are all static.

I understand it's appealing to have just one knob - memory size - like
in the case of virtual machines, but it doesn't seem to work with
containers. You added the memory.low and memory.high knobs. VMs don't
have anything like that. How is one supposed to set them? Depends on
the workload, I guess.

Also, there is the pids cgroup for limiting the number of pids that
can be used by a cgroup, because a pid turns out to be a resource in
the case of containers. Maybe the tcp window should be considered a
separate resource too, as it is now, and shouldn't go to memcg? I'm
just wondering...

>
> The static setting working okay on the global level is also why I'm
> not interested in starting to experiment with it. There is no reason
> to change it. It's much more likely that any attempt to change it will
> be shot down, not because of the approach chosen, but because there is
> no problem to solve there. I doubt we can get networking people to
> care about containers by screwing with things that work for them ;-)

Fair enough.

Thanks,
Vladimir

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f42.google.com (mail-wm0-f42.google.com [74.125.82.42]) by kanga.kvack.org (Postfix) with ESMTP id 6B1646B0253 for ; Wed, 4 Nov 2015 05:42:43 -0500 (EST)
Received: by wmeg8 with SMTP id g8so37780528wme.1 for ; Wed, 04 Nov 2015 02:42:42 -0800 (PST)
Received: from mail-wm0-f52.google.com (mail-wm0-f52.google.com.
[74.125.82.52]) by mx.google.com with ESMTPS id 5si2869845wmw.58.2015.11.04.02.42.41 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Nov 2015 02:42:41 -0800 (PST) Received: by wmeg8 with SMTP id g8so37780016wme.1 for ; Wed, 04 Nov 2015 02:42:41 -0800 (PST) Date: Wed, 4 Nov 2015 11:42:40 +0100 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151104104239.GG29607@dhcp22.suse.cz> References: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151029161009.GA9160@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 29-10-15 09:10:09, Johannes Weiner wrote: > On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote: > > On Tue 27-10-15 09:42:27, Johannes Weiner wrote: [...] > > > You carefully skipped over this part. We can ignore it for socket > > > memory but it's something we need to figure out when it comes to slab > > > accounting and tracking. > > > > I am sorry, I didn't mean to skip this part, I though it would be clear > > from the previous text. I think kmem accounting falls into the same > > category. Have a sane default and a global boottime knob to override it > > for those that think differently - for whatever reason they might have. > > Yes, that makes sense to me. 
> > Like cgroup.memory=nosocket, would you think it makes sense to include > slab in the default for functional/semantical completeness and provide > a cgroup.memory=noslab for powerusers? I am still not sure whether the kmem accounting is stable enough to be enabled by default. If for nothing else the allocation failures, which are not allowed for the global case and easily triggered by the hard limit, might be a big problem. My last attempts to allow GFP_NOFS to fail made me quite skeptical. I still believe this is something which will be solved in the long term but the current state might be still too fragile. So I would rather be conservative and have the kmem accounting disabled by default with a config option and boot parameter to override. If somebody is confident that the desired load is stable then the config can be enabled easily. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com [209.85.212.173]) by kanga.kvack.org (Postfix) with ESMTP id 6865682F6A for ; Wed, 4 Nov 2015 14:50:53 -0500 (EST) Received: by wicll6 with SMTP id ll6so38964972wic.1 for ; Wed, 04 Nov 2015 11:50:52 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org. 
[85.214.110.215]) by mx.google.com with ESMTPS id l13si5405429wmg.29.2015.11.04.11.50.51 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Nov 2015 11:50:52 -0800 (PST) Date: Wed, 4 Nov 2015 14:50:37 -0500 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151104195037.GA6872@cmpxchg.org> References: <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151104104239.GG29607@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Wed, Nov 04, 2015 at 11:42:40AM +0100, Michal Hocko wrote: > On Thu 29-10-15 09:10:09, Johannes Weiner wrote: > > On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote: > > > On Tue 27-10-15 09:42:27, Johannes Weiner wrote: > [...] > > > > You carefully skipped over this part. We can ignore it for socket > > > > memory but it's something we need to figure out when it comes to slab > > > > accounting and tracking. > > > > > > I am sorry, I didn't mean to skip this part, I though it would be clear > > > from the previous text. I think kmem accounting falls into the same > > > category. Have a sane default and a global boottime knob to override it > > > for those that think differently - for whatever reason they might have. > > > > Yes, that makes sense to me. 
> >
> > Like cgroup.memory=nosocket, would you think it makes sense to include
> > slab in the default for functional/semantical completeness and provide
> > a cgroup.memory=noslab for powerusers?
>
> I am still not sure whether the kmem accounting is stable enough to be
> enabled by default. If for nothing else the allocation failures, which
> are not allowed for the global case and easily triggered by the hard
> limit, might be a big problem. My last attempts to allow GFP_NOFS to
> fail made me quite skeptical. I still believe this is something which
> will be solved in the long term but the current state might be still too
> fragile. So I would rather be conservative and have the kmem accounting
> disabled by default with a config option and boot parameter to override.
> If somebody is confident that the desired load is stable then the config
> can be enabled easily.

I agree with your assessment of the current kmem code state, but I
think your conclusion is completely backwards here.

The interface will be set in stone forever, whereas any stability
issues will be transient and will have to be addressed in a finite
amount of time anyway. It doesn't make sense to design an interface
based on temporary quality of implementation. Only one of those two
can ever be changed.

Because it goes without saying that once the cgroupv2 interface is
released, and people use it in production, there is no way we can then
*add* dentry cache, inode cache, and others to memory.current. That
would be an unacceptable change in interface behavior. On the other
hand, people will be prepared for hiccups in the early stages of
cgroupv2 release, and we're providing cgroup.memory=noslab to let them
work around severe problems in production until we fix it without
forcing them to fully revert to cgroupv1.

So if we agree that there are no fundamental architectural concerns
with slab accounting, i.e. nothing that can't be addressed in the
implementation, we have to make the call now.

And I maintain that not accounting dentry cache and inode cache is a
gaping hole in memory isolation, so it should be included by default.
(The rest of the slabs is arguable, but IMO the risk of missing
something important is higher than the cost of including them.)

As far as your allocation failure concerns go, I think the kmem code
is currently not behaving as Glauber originally intended, which is to
force charge if reclaim and OOM killing weren't able to make enough
space. See this recently rewritten section of the kmem charge path:

-	/*
-	 * try_charge() chose to bypass to root due to OOM kill or
-	 * fatal signal. Since our only options are to either fail
-	 * the allocation or charge it to this cgroup, do it as a
-	 * temporary condition. But we can't fail. From a kmem/slab
-	 * perspective, the cache has already been selected, by
-	 * mem_cgroup_kmem_get_cache(), so it is too late to change
-	 * our minds.
-	 *
-	 * This condition will only trigger if the task entered
-	 * memcg_charge_kmem in a sane state, but was OOM-killed
-	 * during try_charge() above. Tasks that were already dying
-	 * when the allocation triggers should have been already
-	 * directed to the root cgroup in memcontrol.h
-	 */
-	page_counter_charge(&memcg->memory, nr_pages);
-	if (do_swap_account)
-		page_counter_charge(&memcg->memsw, nr_pages);

It could be that this never properly worked as it was tied to the
-EINTR bypass trick, but the idea was these charges never fail.

And this makes sense. If the allocator semantics are such that we
never fail these page allocations for slab, and the callsites rely on
that, surely we should not fail them in the memory controller, either.

And it makes a lot more sense to account them in excess of the limit
than pretend they don't exist.
We might not be able to completely
fulfill the containment part of the memory controller (although these
slab charges will still create significant pressure before that), but
at least we don't fail the accounting part on top of it.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id 15C2082F64 for ; Thu, 5 Nov 2015 11:16:14 -0500 (EST)
Received: by pabfh17 with SMTP id fh17so91511163pab.0 for ; Thu, 05 Nov 2015 08:16:13 -0800 (PST)
Received: from shards.monkeyblade.net (shards.monkeyblade.net. [2001:4f8:3:36:211:85ff:fe63:a549]) by mx.google.com with ESMTP id jw6si11064267pbc.214.2015.11.05.08.16.12 for ; Thu, 05 Nov 2015 08:16:13 -0800 (PST)
Date: Thu, 05 Nov 2015 11:16:09 -0500 (EST)
Message-Id: <20151105.111609.1695015438589063316.davem@davemloft.net>
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
From: David Miller
In-Reply-To: <20151105144002.GB15111@dhcp22.suse.cz>
References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: mhocko@kernel.org
Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

From: Michal Hocko
Date: Thu, 5 Nov 2015 15:40:02 +0100

> On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> [...]
>> Because it goes without saying that once the cgroupv2 interface is >> released, and people use it in production, there is no way we can then >> *add* dentry cache, inode cache, and others to memory.current. That >> would be an unacceptable change in interface behavior. > > They would still have to _enable_ the config option _explicitly_. make > oldconfig wouldn't change it silently for them. I do not think > it is an unacceptable change of behavior if the config is changed > explicitly. Every user is going to get this config option when they update their distibution kernel or whatever. Then they will all wonder why their networking performance went down. This is why I do not want the networking accounting bits on by default even if the kconfig option is enabled. They must be off by default and guarded by a static branch so the cost is exactly zero. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f41.google.com (mail-wm0-f41.google.com [74.125.82.41]) by kanga.kvack.org (Postfix) with ESMTP id 2219E82F64 for ; Thu, 5 Nov 2015 11:28:06 -0500 (EST) Received: by wmww144 with SMTP id w144so11299412wmw.1 for ; Thu, 05 Nov 2015 08:28:05 -0800 (PST) Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com. 
[209.85.212.173]) by mx.google.com with ESMTPS id e16si8909079wjz.164.2015.11.05.08.28.05 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 05 Nov 2015 08:28:05 -0800 (PST) Received: by wikq8 with SMTP id q8so14045104wik.1 for ; Thu, 05 Nov 2015 08:28:04 -0800 (PST) Date: Thu, 5 Nov 2015 17:28:03 +0100 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151105162803.GD15111@dhcp22.suse.cz> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105.111609.1695015438589063316.davem@davemloft.net> Sender: owner-linux-mm@kvack.org List-ID: To: David Miller Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 05-11-15 11:16:09, David S. Miller wrote: > From: Michal Hocko > Date: Thu, 5 Nov 2015 15:40:02 +0100 > > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: > > [...] > >> Because it goes without saying that once the cgroupv2 interface is > >> released, and people use it in production, there is no way we can then > >> *add* dentry cache, inode cache, and others to memory.current. That > >> would be an unacceptable change in interface behavior. > > > > They would still have to _enable_ the config option _explicitly_. make > > oldconfig wouldn't change it silently for them. I do not think > > it is an unacceptable change of behavior if the config is changed > > explicitly. > > Every user is going to get this config option when they update their > distibution kernel or whatever. > > Then they will all wonder why their networking performance went down. 
> > This is why I do not want the networking accounting bits on by default > even if the kconfig option is enabled. They must be off by default > and guarded by a static branch so the cost is exactly zero. Yes, that part is clear and Johannes made it clear that the kmem tcp part is disabled by default. Or are you considering also all the slab usage by the networking code as well? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f175.google.com (mail-wi0-f175.google.com [209.85.212.175]) by kanga.kvack.org (Postfix) with ESMTP id E9F2A82F64 for ; Thu, 5 Nov 2015 15:55:36 -0500 (EST) Received: by wimw2 with SMTP id w2so18159725wim.1 for ; Thu, 05 Nov 2015 12:55:36 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id 17si12326554wmg.112.2015.11.05.12.55.35 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 05 Nov 2015 12:55:35 -0800 (PST) Date: Thu, 5 Nov 2015 15:55:22 -0500 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151105205522.GA1067@cmpxchg.org> References: <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105144002.GB15111@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, 
 vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> [...]
> > Because it goes without saying that once the cgroupv2 interface is
> > released, and people use it in production, there is no way we can then
> > *add* dentry cache, inode cache, and others to memory.current. That
> > would be an unacceptable change in interface behavior.
>
> They would still have to _enable_ the config option _explicitly_. make
> oldconfig wouldn't change it silently for them. I do not think
> it is an unacceptable change of behavior if the config is changed
> explicitly.

Yeah, as Dave said these will all get turned on anyway, so there is no
point in fragmenting the Kconfig space in the first place.

> > On the other
> > hand, people will be prepared for hiccups in the early stages of
> > cgroupv2 release, and we're providing cgroup.memory=noslab to let them
> > work around severe problems in production until we fix it without
> > forcing them to fully revert to cgroupv1.
>
> This would be true if they moved on to the new cgroup API intentionally.
> The reality is more complicated though. AFAIK systemd is waiting for
> cgroup2 already and privileged services enable all available resource
> controllers by default as I've learned just recently.

Have you filed a report with them? I don't think they should turn them
on unless users explicitly configure resource control for the unit.

But what I said still holds: critical production machines don't just
get rolling updates and "accidentally" switch to all this new
code. And those that do take the plunge have the cmdline options.

> > And it makes a lot more sense to account them in excess of the limit
> > than pretend they don't exist. We might not be able to completely
> > fulfill the containment part of the memory controller (although these
> > slab charges will still create significant pressure before that), but
> > at least we don't fail the accounting part on top of it.
>
> Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
> load could simply runaway via kernel allocations. What is even worse we
> might even not trigger memcg OOM killer before we hit the global OOM. So
> the whole containment goes straight to hell.
>
> I can see four options here:
> 1) enable kmem by default with the current semantic which we know can
>    BUG_ON (at least btrfs is known to hit this) or lead to other issues.

Can you point me to that report?

That's not "semantics", that's a bug! Whether or not a feature is
enabled by default, it can not be allowed to crash the kernel.

Presenting this as a choice is a bit of a strawman argument.

> 2) enable kmem by default and change the semantic for cgroup2 to allow
>    runaway charges above the hard limit which would defeat the whole
>    purpose of the containment for cgroup2. This can be a temporary
>    workaround until we can afford kmem failures. This has a big risk
>    that we will end up with this permanently because there is a strong
>    pressure that GFP_KERNEL allocations should never fail. Yet this is
>    the most common type of request. Or do we change the consistency with
>    the global case at some point?

As per 1) we *have* to fail containment eventually if not doing so
means crashes and lockups. That's not a choice of semantics.

But that doesn't mean we have to give up *immediately* and allow
unrestrained "runaway charges"--again, more of a strawman than a
choice. We can still throttle the allocator and apply significant
pressure on the memory pool, culminating in OOM kills eventually.

Once we run out of available containment tools, however, we *have* to
follow the semantics of the page and slab allocator and succeed the
request. We can not just return -ENOMEM if that causes kernel bugs.

That's the only thing we can do right now.

In fact, it's likely going to be the best we will ever be able to do
when it comes to kernel memory accounting. Linus made it clear where
he stands on failing kernel allocations, so all we can do is continue
to improve our containment tools and then give up on containment when
they're exhausted and force the charge past the limit.

> 3) keep only some (safe) cache types enabled by default with the current
>    failing semantic and require an explicit enabling for the complete
>    kmem accounting. [di]cache code paths should be quite robust to
>    handle allocation failures.

Vladimir, what would be your opinion on this?

> 4) disable kmem by default and change the config default later to signal
>    the accounting is safe as far as we are aware and let people enable
>    the functionality on those basis. We would keep the current failing
>    semantic.
>
> To me 4) sounds like the safest option because it still keeps the
> functionality available to those who can benefit from it in v1 already
> while we are not exposing a potentially buggy behavior to the majority
> (many of them even unintentionally). Moreover we still allow to change
> the default later on an explicit basis.

I'm not interested in fragmenting the interface forever out of caution
because there might be a bug in the implementation right now. As I
said we have to fix any instability in the features we provide whether
they are turned on by default or not. I don't see how this is relevant
to the interface discussion.

Also, there is no way we can later fundamentally change the semantics
of memory.current, so it would have to remain configurable forever,
forcing people forever to select multiple options in order to piece
together a single logical kernel feature.

This is really not an option, either.
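
The charge policy argued for earlier in this message (apply pressure
first, and only then account the pages past the limit rather than fail
the allocation) can be sketched in a few lines of userspace C. This is
an illustrative model only, under stated assumptions: counter_model,
model_try_charge(), model_reclaim() and model_charge_nofail() are
made-up names, not kernel interfaces, and "reclaim" is reduced to
decrementing a counter of reclaimable pages.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct counter_model {
	unsigned long usage;	/* pages currently charged */
	unsigned long limit;	/* hard limit in pages */
};

/* Charge nr_pages only if the result stays within the limit. */
bool model_try_charge(struct counter_model *c, unsigned long nr_pages)
{
	if (nr_pages > c->limit || c->usage > c->limit - nr_pages)
		return false;
	c->usage += nr_pages;
	return true;
}

/* Stand-in for reclaim: uncharge up to 'want' reclaimable pages. */
unsigned long model_reclaim(struct counter_model *c,
			    unsigned long *reclaimable,
			    unsigned long want)
{
	unsigned long freed = want < *reclaimable ? want : *reclaimable;

	if (freed > c->usage)
		freed = c->usage;
	*reclaimable -= freed;
	c->usage -= freed;
	return freed;
}

/*
 * Charge that never fails: try, apply pressure, retry, and only then
 * account the pages in excess of the limit. Returns true if the charge
 * stayed within the limit, false if it had to be forced past it.
 */
bool model_charge_nofail(struct counter_model *c,
			 unsigned long *reclaimable,
			 unsigned long nr_pages)
{
	if (model_try_charge(c, nr_pages))
		return true;

	model_reclaim(c, reclaimable, nr_pages);
	if (model_try_charge(c, nr_pages))
		return true;

	c->usage += nr_pages;	/* forced: usage may now exceed limit */
	return false;
}
```

In this model the accounting stays accurate even when containment is
lost: usage ends up above the limit, so every later
model_try_charge() fails and keeps pushing the caller into reclaim.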
If there are show-stopping bugs in the implementation, I'd rather hold off the release of the unified hierarchy than commit to a half-assed interface right out of the gate. The point of v2 is sane interfaces. So let's please focus on fixing any problems that slab accounting may have, rather than designing complex config options and transition procedures whose sole purpose is to defer dealing with our issues. Please? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f44.google.com (mail-wm0-f44.google.com [74.125.82.44]) by kanga.kvack.org (Postfix) with ESMTP id 3C9C382F64 for ; Thu, 5 Nov 2015 17:33:05 -0500 (EST) Received: by wmll128 with SMTP id l128so26362529wml.0 for ; Thu, 05 Nov 2015 14:33:04 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id 71si394401wmm.27.2015.11.05.14.33.03 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 05 Nov 2015 14:33:03 -0800 (PST) Date: Thu, 5 Nov 2015 17:32:51 -0500 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151105223251.GA4427@cmpxchg.org> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> <20151105162803.GD15111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105162803.GD15111@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Nov 05, 2015 
at 05:28:03PM +0100, Michal Hocko wrote: > On Thu 05-11-15 11:16:09, David S. Miller wrote: > > From: Michal Hocko > > Date: Thu, 5 Nov 2015 15:40:02 +0100 > > > > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: > > > [...] > > >> Because it goes without saying that once the cgroupv2 interface is > > >> released, and people use it in production, there is no way we can then > > >> *add* dentry cache, inode cache, and others to memory.current. That > > >> would be an unacceptable change in interface behavior. > > > > > > They would still have to _enable_ the config option _explicitly_. make > > > oldconfig wouldn't change it silently for them. I do not think > > > it is an unacceptable change of behavior if the config is changed > > > explicitly. > > > > Every user is going to get this config option when they update their > > distibution kernel or whatever. > > > > Then they will all wonder why their networking performance went down. > > > > This is why I do not want the networking accounting bits on by default > > even if the kconfig option is enabled. They must be off by default > > and guarded by a static branch so the cost is exactly zero. > > Yes, that part is clear and Johannes made it clear that the kmem tcp > part is disabled by default. Or are you considering also all the slab > usage by the networking code as well? Michal, there shouldn't be any tracking or accounting going on per default when you boot into a fresh system. I removed all accounting and statistics on the system level in cgroupv2, so distribution kernels can compile-time enable a single, feature-complete CONFIG_MEMCG that provides a full memory controller while at the same time puts no overhead on users that don't benefit from mem control at all and just want to use the machine bare-metal. This is completely doable. My new series does it for skmem, but I also want to retrofit the code to eliminate that current overhead for page cache, anonymous memory, slab memory and so forth. 
This is the only sane way to make the memory controller powerful and generally useful without having to make unreasonable compromises with memory consumers. We shouldn't even be *having* the discussion about whether we should sacrifice the quality of our interface in order to compromise with a class of users that doesn't care about any of this in the first place. So let's eliminate the cost for non-users, but make the memory controller feature-complete and useful--with reasonable cost, implementation, and interface--for our actual userbase. Paying the necessary cost for a functionality you actually want is not the problem. Paying for something that doesn't benefit you is. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f52.google.com (mail-wm0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 0E13782F64 for ; Thu, 5 Nov 2015 17:52:11 -0500 (EST) Received: by wmww144 with SMTP id w144so18213309wmw.1 for ; Thu, 05 Nov 2015 14:52:10 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org. 
[85.214.110.215]) by mx.google.com with ESMTPS id hi4si234548wjc.65.2015.11.05.14.52.09 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 05 Nov 2015 14:52:09 -0800 (PST) Date: Thu, 5 Nov 2015 17:52:00 -0500 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151105225200.GA5432@cmpxchg.org> References: <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105205522.GA1067@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > This would be true if they moved on to the new cgroup API intentionally. > > The reality is more complicated though. AFAIK sysmted is waiting for > > cgroup2 already and privileged services enable all available resource > > controllers by default as I've learned just recently. > > Have you filed a report with them? I don't think they should turn them > on unless users explicitely configure resource control for the unit. Okay, verified with systemd people that they're not planning on enabling resource control per default. Inflammatory half-truths, man. This is not constructive. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f176.google.com (mail-lb0-f176.google.com [209.85.217.176]) by kanga.kvack.org (Postfix) with ESMTP id 03D8E82F64 for ; Fri, 6 Nov 2015 04:06:17 -0500 (EST) Received: by lbbkw15 with SMTP id kw15so47346889lbb.0 for ; Fri, 06 Nov 2015 01:06:16 -0800 (PST) Received: from relay.parallels.com (relay.parallels.com. [195.214.232.42]) by mx.google.com with ESMTPS id zz10si7421035lbb.56.2015.11.06.01.06.14 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 06 Nov 2015 01:06:15 -0800 (PST) Date: Fri, 6 Nov 2015 12:05:55 +0300 From: Vladimir Davydov Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106090555.GK29259@esperanza> References: <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151105205522.GA1067@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Michal Hocko , David Miller , akpm@linux-foundation.org, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: ... > > 3) keep only some (safe) cache types enabled by default with the current > > failing semantic and require an explicit enabling for the complete > > kmem accounting. 
[di]cache code paths should be quite robust to > > handle allocation failures. > > Vladimir, what would be your opinion on this? I'm all for this option. Actually, I've been thinking about this since I introduced the __GFP_NOACCOUNT flag. Not because of the failing semantics, since we can always let kmem allocations breach the limit. This shouldn't be critical, because I don't think it's possible to issue a series of kmem allocations w/o a single user page allocation, which would reclaim/kill the excess. The point is there are allocations that are shared system-wide and therefore shouldn't go to any memcg. Most obvious examples are: mempool users and radix_tree/idr preloads. Accounting them to memcg is likely to result in noticeable memory overhead as memory cgroups are created/destroyed, because they pin dead memory cgroups with all their kmem caches, which aren't tiny. Another funny example is objects destroyed lazily for performance reasons, e.g. vmap_area. Such objects are usually very small, so delaying destruction of a bunch of them will normally go unnoticed. However, if kmemcg is used the effective memory consumption caused by such objects can be multiplied by many times due to dangling kmem caches. We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the problem is they are tricky to identify, because they are scattered all over the kernel source tree. E.g. Dave Chinner mentioned that XFS internals do a lot of allocations that are shared among all XFS filesystems and therefore should not be accounted (BTW that's why list_lru's used by XFS are not marked as memcg-aware). There must be more out there. Besides, kernel developers don't usually even know about kmemcg (they just write the code for their subsys, so why should they?) so they won't care thinking about using __GFP_NOACCOUNT, and hence new falsely-accounted allocations are likely to appear. 
That said, by switching from a black-list (__GFP_NOACCOUNT) to a
white-list (__GFP_ACCOUNT) kmem accounting policy we would make the
system more predictable and robust IMO. OTOH what would we lose?
Security? Well, containers aren't secure IMHO. In fact, I doubt they
will ever be (as secure as VMs). Anyway, if a runaway allocation is
reported, it should be trivial to fix by adding __GFP_ACCOUNT where
appropriate.

If there are no objections, I'll prepare a patch switching to the
white-list approach. Let's start from obvious things like fs_struct,
mm_struct, task_struct, signal_struct, dentry, inode, which can be
easily allocated from user space. This should cover 90% of all
allocations that should be accounted AFAICS. The rest will be added
later if necessary.

Thanks,
Vladimir

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com [209.85.212.182]) by kanga.kvack.org (Postfix) with ESMTP id 6548282F64 for ; Fri, 6 Nov 2015 07:51:43 -0500 (EST) Received: by wicll6 with SMTP id ll6so29503474wic.0 for ; Fri, 06 Nov 2015 04:51:42 -0800 (PST) Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com.
[209.85.212.177]) by mx.google.com with ESMTPS id m79si731576wmg.42.2015.11.06.04.51.41 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 06 Nov 2015 04:51:41 -0800 (PST) Received: by wicll6 with SMTP id ll6so29503213wic.0 for ; Fri, 06 Nov 2015 04:51:41 -0800 (PST) Date: Fri, 6 Nov 2015 13:51:40 +0100 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106125140.GI4390@dhcp22.suse.cz> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> <20151105162803.GD15111@dhcp22.suse.cz> <20151105223251.GA4427@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105223251.GA4427@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 05-11-15 17:32:51, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote: [...] > > Yes, that part is clear and Johannes made it clear that the kmem tcp > > part is disabled by default. Or are you considering also all the slab > > usage by the networking code as well? > > Michal, there shouldn't be any tracking or accounting going on per > default when you boot into a fresh system. > > I removed all accounting and statistics on the system level in > cgroupv2, so distribution kernels can compile-time enable a single, > feature-complete CONFIG_MEMCG that provides a full memory controller > while at the same time puts no overhead on users that don't benefit > from mem control at all and just want to use the machine bare-metal. Yes that part is clear and I am not disputing it _at all_. 
It is just that chances are high that the memory controller _will_ be
enabled on typical distribution systems. E.g. systemd _is_ enabling all
resource controllers by default for some services with the Delegate=yes
option.

> This is completely doable. My new series does it for skmem, but I also
> want to retrofit the code to eliminate that current overhead for page
> cache, anonymous memory, slab memory and so forth.
>
> This is the only sane way to make the memory controller powerful and
> generally useful without having to make unreasonable compromises with
> memory consumers. We shouldn't even be *having* the discussion about
> whether we should sacrifice the quality of our interface in order to
> compromise with a class of users that doesn't care about any of this
> in the first place.
>
> So let's eliminate the cost for non-users, but make the memory
> controller feature-complete and useful--with reasonable cost,
> implementation, and interface--for our actual userbase.
>
> Paying the necessary cost for a functionality you actually want is not
> the problem. Paying for something that doesn't benefit you is.

I completely agree that a reasonable cost is acceptable for those who
_want_ the functionality. It hasn't been shown, in general, that people
have actually missed kmem accounting in the wild. E.g. the kmem
controller is not even enabled in openSUSE or SLES kernels, and I do
not remember any huge push to enable it.

I do understand that you want to have an out-of-the-box isolation
behavior, which I agree is a nice-to-have feature, especially with a
larger penetration of containerized workloads. But my point still
holds. This is not something everybody wants to have. So having a
configuration option and a boot-time option to override it is the most
reasonable way to go. You can clearly see that this is already demanded
for the tcp kmem extension, because they really _care_ about every
single cpu cycle even though some part of the userspace happens to have
memcg enabled.
The question about the configuration default is a different question and we can discuss that because this is not an easy one to decide right now IMHO. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id 9B9A682F64 for ; Fri, 6 Nov 2015 11:47:00 -0500 (EST) Received: by wicfv8 with SMTP id fv8so32409237wic.0 for ; Fri, 06 Nov 2015 08:47:00 -0800 (PST) Received: from mail-wm0-f44.google.com (mail-wm0-f44.google.com. [74.125.82.44]) by mx.google.com with ESMTPS id h136si1916854wmd.98.2015.11.06.08.46.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 06 Nov 2015 08:46:59 -0800 (PST) Received: by wmww144 with SMTP id w144so34405894wmw.1 for ; Fri, 06 Nov 2015 08:46:59 -0800 (PST) Date: Fri, 6 Nov 2015 17:46:57 +0100 From: Michal Hocko Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106164657.GL4390@dhcp22.suse.cz> References: <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151106161953.GA7813@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, 
linux-kernel@vger.kernel.org

On Fri 06-11-15 11:19:53, Johannes Weiner wrote:
> On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> > On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK sysmted is waiting for
> > > > > cgroup2 already and privileged services enable all available resource
> > > > > controllers by default as I've learned just recently.
> > > >
> > > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitely configure resource control for the unit.
> > >
> > > Okay, verified with systemd people that they're not planning on
> > > enabling resource control per default.
> > >
> > > Inflammatory half-truths, man. This is not constructive.
> >
> > What about Delegate=yes feature then? We have just been burnt by this
> > quite heavily. AFAIU nspawn@.service and nspawn@.service have this
> > enabled by default
> > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
>
> That's when you launch a *container* and want it to be able to use
> nested resource control.

Oops, copy&paste error here. The second one was user@.service. So it is
not only about containers AFAIU but about all user-defined sessions.

> We're talking about actual container users here. It's not turning on
> resource control for all "privileged services", which is what we were
> worried about here. Can you at least admit that when you yourself link
> to the refuting evidence?

My bad, that was a misunderstanding of the changelog.

> And if you've been "burnt quite heavily" by this, where is your bug
> report to stop other users from getting "burnt quite heavily" as well?

The bug report is still internal because it is tracking an unreleased product.
We have ended up reverting the Delegate feature. Our systemd developers
are supposed to bring this up with upstream.

The basic problem was that the Delegate feature has been backported to
our systemd package without further consideration and that has
invalidated a lot of performance testing because some resource
controllers have measurable effects on those benchmarks.

> All I read here is vague inflammatory language to spread FUD.

I was merely pointing out that the memory controller might be enabled
without the _user_ even noticing, because the controller wasn't enabled
explicitly. I haven't blamed anybody for that.

> You might think sending these emails is helpful, but it really
> isn't. Not only is it not contributing code, insights, or solutions,
> you're now actively sabotaging someone else's effort to build something.

Come on! Are you even serious?

-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f41.google.com (mail-wm0-f41.google.com [74.125.82.41]) by kanga.kvack.org (Postfix) with ESMTP id B187982F64 for ; Fri, 6 Nov 2015 12:45:47 -0500 (EST) Received: by wmec201 with SMTP id c201so24444107wme.0 for ; Fri, 06 Nov 2015 09:45:47 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org.
[85.214.110.215]) by mx.google.com with ESMTPS id p10si1577759wjo.3.2015.11.06.09.45.46 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 06 Nov 2015 09:45:46 -0800 (PST) Date: Fri, 6 Nov 2015 12:45:17 -0500 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106174517.GA9315@cmpxchg.org> References: <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> <20151106164657.GL4390@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151106164657.GL4390@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Fri, Nov 06, 2015 at 05:46:57PM +0100, Michal Hocko wrote: > The basic problem was that the Delegate feature has been backported to > our systemd package without further consideration and that has > invalidated a lot of performance testing because some resource > controllers have measurable effects on those benchmarks. You're talking about a userspace bug. No amount of fragmenting and layering and opt-in in the kernel's runtime configuration space is going to help you if you screw up and enable it all by accident. > > All I read here is vague inflammatory language to spread FUD. > > I was merely pointing out that memory controller might be enabled without > _user_ actually even noticing because the controller wasn't enabled > explicitly. I haven't blamed anybody for that. 
Why does that have anything to do with how we design our interface? We can't do more than present a sane interface in good faith and lobby userspace projects if we think they misuse it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f42.google.com (mail-wm0-f42.google.com [74.125.82.42]) by kanga.kvack.org (Postfix) with ESMTP id E7B1B6B0038 for ; Thu, 12 Nov 2015 13:36:24 -0500 (EST) Received: by wmec201 with SMTP id c201so105169273wme.1 for ; Thu, 12 Nov 2015 10:36:24 -0800 (PST) Received: from outbound-smtp03.blacknight.com (outbound-smtp03.blacknight.com. [81.17.249.16]) by mx.google.com with ESMTPS id kj9si20125158wjb.72.2015.11.12.10.36.23 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 12 Nov 2015 10:36:23 -0800 (PST) Received: from mail.blacknight.com (pemlinmail05.blacknight.ie [81.17.254.26]) by outbound-smtp03.blacknight.com (Postfix) with ESMTPS id C4D6B989E3 for ; Thu, 12 Nov 2015 18:36:22 +0000 (UTC) Date: Thu, 12 Nov 2015 18:36:20 +0000 From: Mel Gorman Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151112183620.GC14880@techsingularity.net> References: <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20151106161953.GA7813@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Michal Hocko , David Miller , akpm@linux-foundation.org, 
vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Fri, Nov 06, 2015 at 11:19:53AM -0500, Johannes Weiner wrote: > On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote: > > On Thu 05-11-15 17:52:00, Johannes Weiner wrote: > > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > > > > This would be true if they moved on to the new cgroup API intentionally. > > > > > The reality is more complicated though. AFAIK sysmted is waiting for > > > > > cgroup2 already and privileged services enable all available resource > > > > > controllers by default as I've learned just recently. > > > > > > > > Have you filed a report with them? I don't think they should turn them > > > > on unless users explicitely configure resource control for the unit. > > > > > > Okay, verified with systemd people that they're not planning on > > > enabling resource control per default. > > > > > > Inflammatory half-truths, man. This is not constructive. > > > > What about Delegate=yes feature then? We have just been burnt by this > > quite heavily. AFAIU nspawn@.service and nspawn@.service have this > > enabled by default > > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html > > That's when you launch a *container* and want it to be able to use > nested resource control. > > We're talking about actual container users here. It's not turning on > resource control for all "privileged services", which is what we were > worried about here. Can you at least admit that when you yourself link > to the refuting evidence? > > And if you've been "burnt quite heavily" by this, where is your bug > report to stop other users from getting "burnt quite heavily" as well? 
>

I didn't read this thread in detail but the lack of public information
on problems with cgroup controllers is partially my fault so I'd like to
correct that.

https://bugzilla.suse.com/show_bug.cgi?id=954765

This bug documents some of the impact that was incurred due to ssh
sessions being resource controlled by default. It talks primarily about
pipetest being impacted by cpu,cpuacct. It was also found in the recent
past that dbench4 was impacted because the blkio controller was enabled.
That bug is not public but basically dbench4 regressed 80% as the
journal thread was in a different cgroup than dbench4. dbench4 would
stall for 8ms in case more IO was issued before the journal thread could
issue any IO. The openSUSE bug 954765 is not affected by blkio because
it's disabled by a distribution-specific patch. Mike Galbraith adds some
additional information on why activating the cpu controller can have an
impact on semantics even if the overhead was zero.

It may be the case that it's an oversight by the systemd developers and
the intent was only to affect containers. My experience was that
everything was affected. It also may be the case that this is an
openSUSE-specific problem due to how the maintainers packaged systemd. I
don't actually know and hopefully the bug will be able to determine if
upstream is really affected or not.

There is also a link to this bug on the upstream project so there is
some chance they are aware https://github.com/systemd/systemd/issues/1715

Bottom line, there is legitimate confusion over whether cgroup
controllers are going to be enabled by default or not in the future. If
they are enabled by default, there is a non-zero cost to that and a
change in semantics that people may or may not be surprised by.

-- 
Mel Gorman
SUSE Labs

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f41.google.com (mail-wm0-f41.google.com [74.125.82.41]) by kanga.kvack.org (Postfix) with ESMTP id 9B8CD6B0038 for ; Thu, 12 Nov 2015 14:12:37 -0500 (EST) Received: by wmww144 with SMTP id w144so531992wmw.0 for ; Thu, 12 Nov 2015 11:12:37 -0800 (PST) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id ci5si20261785wjc.170.2015.11.12.11.12.35 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 12 Nov 2015 11:12:36 -0800 (PST) Date: Thu, 12 Nov 2015 14:12:20 -0500 From: Johannes Weiner Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151112191220.GA25750@cmpxchg.org> References: <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> <20151112183620.GC14880@techsingularity.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151112183620.GC14880@techsingularity.net> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Michal Hocko , David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Nov 12, 2015 at 06:36:20PM +0000, Mel Gorman wrote: > Bottom line, there is legimate confusion over whether cgroup controllers > are going to be enabled by default or not in the future. If they are > enabled by default, there is a non-zero cost to that and a change in > semantics that people may or may not be surprised by. Thanks for elaborating, Mel. 
My understanding is that this is a plain bug. I don't think anybody
wants to put costs without benefits on their users. But I'll keep an
eye on these reports, and I'll work with the systemd people should
issues with the kernel interface materialize that would force them to
enable resource control prematurely.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756736AbbJVEZF (ORCPT ); Thu, 22 Oct 2015 00:25:05 -0400
Received: from gum.cmpxchg.org ([85.214.110.215]:39248 "EHLO gum.cmpxchg.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751123AbbJVEV6 (ORCPT ); Thu, 22 Oct 2015 00:21:58 -0400
From: Johannes Weiner
To: "David S. Miller" , Andrew Morton
Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 1/8] mm: page_counter: let page_counter_try_charge() return bool
Date: Thu, 22 Oct 2015 00:21:29 -0400
Message-Id: <1445487696-21545-2-git-send-email-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.6.1
In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

page_counter_try_charge() currently returns 0 on success and -ENOMEM on
failure, which is surprising behavior given the function name.

Make it follow the expected pattern of try_stuff() functions that
return a boolean true to indicate success, or false for failure.
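To illustrate the calling convention, here is a minimal userspace sketch of a hierarchical try-charge helper that returns a boolean. It is not the kernel implementation: struct counter, counter_try_charge() and charge_or_enomem() are hypothetical names made up for this example, and the check-then-add logic is not atomic the way the real page_counter is.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct page_counter. */
struct counter {
	unsigned long count;	/* pages currently charged */
	unsigned long limit;	/* maximum pages allowed */
	struct counter *parent;	/* NULL at the root of the hierarchy */
};

/*
 * Try to charge nr_pages to c and all of its ancestors.
 * Returns true on success. On failure, returns false, stores the
 * counter that hit its limit in *fail, and undoes partial charges.
 */
static bool counter_try_charge(struct counter *c, unsigned long nr_pages,
			       struct counter **fail)
{
	struct counter *pos;

	for (pos = c; pos; pos = pos->parent) {
		if (pos->count + nr_pages > pos->limit) {
			*fail = pos;
			goto failed;
		}
		pos->count += nr_pages;
	}
	return true;

failed:
	/* Roll back every level that was charged before the failing one. */
	for (pos = c; pos != *fail; pos = pos->parent)
		pos->count -= nr_pages;
	return false;
}

/* A caller that needs an errno maps the boolean itself. */
static int charge_or_enomem(struct counter *c, unsigned long nr_pages)
{
	struct counter *fail;

	if (!counter_try_charge(c, nr_pages, &fail))
		return -ENOMEM;
	return 0;
}
```

Call sites that only care about success can test the return value directly, while the few that must propagate an error code translate false to -ENOMEM at the point of use.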
Signed-off-by: Johannes Weiner
---
 include/linux/page_counter.h |  6 +++---
 mm/hugetlb_cgroup.c          |  3 ++-
 mm/memcontrol.c              | 11 +++++------
 mm/page_counter.c            | 14 +++++++-------
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 17fa4f8..7e62920 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -36,9 +36,9 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
 
 void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
-int page_counter_try_charge(struct page_counter *counter,
-			    unsigned long nr_pages,
-			    struct page_counter **fail);
+bool page_counter_try_charge(struct page_counter *counter,
+			     unsigned long nr_pages,
+			     struct page_counter **fail);
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_limit(struct page_counter *counter, unsigned long limit);
 int page_counter_memparse(const char *buf, const char *max,
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 6a44263..d8fb10d 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -186,7 +186,8 @@ again:
 	}
 	rcu_read_unlock();
 
-	ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter);
+	if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
+		ret = -ENOMEM;
 	css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c71fe40..a8ccdbc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2018,8 +2018,8 @@ retry:
 		return 0;
 
 	if (!do_swap_account ||
-	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
-		if (!page_counter_try_charge(&memcg->memory, batch, &counter))
+	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
+		if (page_counter_try_charge(&memcg->memory, batch, &counter))
 			goto done_restock;
 		if (do_swap_account)
 			page_counter_uncharge(&memcg->memsw, batch);
@@ -2383,14 +2383,13 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 {
 	unsigned int nr_pages = 1 << order;
 	struct page_counter *counter;
-	int ret = 0;
+	int ret;
 
 	if (!memcg_kmem_is_active(memcg))
 		return 0;
 
-	ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
-	if (ret)
-		return ret;
+	if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter))
+		return -ENOMEM;
 
 	ret = try_charge(memcg, gfp, nr_pages);
 	if (ret) {
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 11b4bed..7c6a63d 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -56,12 +56,12 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
  * @nr_pages: number of pages to charge
  * @fail: points first counter to hit its limit, if any
  *
- * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
- * its ancestors has hit its configured limit.
+ * Returns %true on success, or %false and @fail if the counter or one
+ * of its ancestors has hit its configured limit.
  */
-int page_counter_try_charge(struct page_counter *counter,
-			    unsigned long nr_pages,
-			    struct page_counter **fail)
+bool page_counter_try_charge(struct page_counter *counter,
+			     unsigned long nr_pages,
+			     struct page_counter **fail)
 {
 	struct page_counter *c;
 
@@ -99,13 +99,13 @@ int page_counter_try_charge(struct page_counter *counter,
 		if (new > c->watermark)
 			c->watermark = new;
 	}
-	return 0;
+	return true;
 
 failed:
 	for (c = counter; c != *fail; c = c->parent)
 		page_counter_cancel(c, nr_pages);
 
-	return -ENOMEM;
+	return false;
 }
 
 /**
-- 
2.6.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932395AbbJVEWH (ORCPT ); Thu, 22 Oct 2015 00:22:07 -0400
Received: from gum.cmpxchg.org ([85.214.110.215]:39262 "EHLO gum.cmpxchg.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751506AbbJVEWA (ORCPT ); Thu, 22 Oct 2015 00:22:00 -0400
From: Johannes Weiner
To: "David S. Miller" , Andrew Morton
Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 2/8] mm: memcontrol: export root_mem_cgroup
Date: Thu, 22 Oct 2015 00:21:30 -0400
Message-Id: <1445487696-21545-3-git-send-email-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.6.1
In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

A later patch will need this symbol in files other than memcontrol.c,
so export it now and replace mem_cgroup_root_css at the same time.
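As a rough illustration of why a single exported pointer is enough, the sketch below shows a css embedded in its owning structure, so that the css can be derived from the root object and the object recovered from the css. All names here (struct css, struct memcg, root_memcg, root_css, memcg_from_css) are simplified, hypothetical stand-ins and not the kernel definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structures carry many more fields. */
struct css {
	int refcnt;
};

struct memcg {
	unsigned long usage;
	struct css css;		/* embedded member, not a pointer */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* One exported root object, in the spirit of root_mem_cgroup. */
static struct memcg root_memcg_storage;
struct memcg *root_memcg = &root_memcg_storage;

/* Users that want the root css take the address of the embedded member. */
static struct css *root_css(void)
{
	return &root_memcg->css;
}

/* Users holding a css can get back to the owning memcg. */
static struct memcg *memcg_from_css(struct css *css)
{
	return container_of(css, struct memcg, css);
}
```

With this layout a separate root css variable carries no extra information, which is why the call site in mm/backing-dev.c can switch to &root_mem_cgroup->css.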
Signed-off-by: Johannes Weiner
---
 include/linux/memcontrol.h | 3 ++-
 mm/backing-dev.c           | 2 +-
 mm/memcontrol.c            | 5 ++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 805da1f..19ff87b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -275,7 +275,8 @@ struct mem_cgroup {
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
-extern struct cgroup_subsys_state *mem_cgroup_root_css;
+
+extern struct mem_cgroup *root_mem_cgroup;
 
 /**
  * mem_cgroup_events - count memory events against a cgroup
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 095b23b..73ab967 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -702,7 +702,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
-		bdi->wb.memcg_css = mem_cgroup_root_css;
+		bdi->wb.memcg_css = &root_mem_cgroup->css;
 		bdi->wb.blkcg_css = blkcg_root_css;
 	}
 	return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a8ccdbc..e54f434 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -76,9 +76,9 @@
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
+struct mem_cgroup *root_mem_cgroup __read_mostly;
+
 #define MEM_CGROUP_RECLAIM_RETRIES	5
-static struct mem_cgroup *root_mem_cgroup __read_mostly;
-struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
@@ -4213,7 +4213,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
-		mem_cgroup_root_css = &memcg->css;
 		page_counter_init(&memcg->memory, NULL);
 		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
-- 
2.6.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756731AbbJVEYe (ORCPT ); Thu, 22 Oct 2015 00:24:34 -0400
Received: from gum.cmpxchg.org ([85.214.110.215]:39278 "EHLO gum.cmpxchg.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751523AbbJVEWE (ORCPT ); Thu, 22 Oct 2015 00:22:04 -0400
From: Johannes Weiner
To: "David S. Miller" , Andrew Morton
Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
Date: Thu, 22 Oct 2015 00:21:31 -0400
Message-Id: <1445487696-21545-4-git-send-email-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.6.1
In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

The tcp memory controller has extensive provisions for future memory
accounting interfaces that won't materialize after all. Cut the code
base down to what's actually used, now and in the likely future.

- There won't be any different protocol counters in the future, so a
  direct sock->sk_memcg linkage is enough. This eliminates a lot of
  callback maze and boilerplate code, and restores most of the socket
  allocation code to pre-tcp_memcontrol state.

- There won't be a tcp control soft limit, so integrating the memcg
  code into the global skmem limiting scheme complicates things
  unnecessarily. Replace all that with simple and clear charge and
  uncharge calls--hidden behind a jump label--to account skb memory.

- The previous jump label code was an elaborate state machine that
  tracked the number of cgroups with an active socket limit in order
  to enable the skmem tracking and accounting code only when actively
  necessary. But this is overengineered: it was meant to protect the
  people who never use this feature in the first place. Simply enable
  the branches once when the first limit is set until the next reboot.

Signed-off-by: Johannes Weiner
---
 include/linux/memcontrol.h   |  64 ++++++++-----------
 include/net/sock.h           | 135 +++------------------------------------
 include/net/tcp.h            |   3 -
 include/net/tcp_memcontrol.h |   7 ---
 mm/memcontrol.c              | 101 +++++++++++++++--------------
 net/core/sock.c              |  78 ++++++-----------------
 net/ipv4/sysctl_net_ipv4.c   |   1 -
 net/ipv4/tcp.c               |   3 +-
 net/ipv4/tcp_ipv4.c          |   9 +--
 net/ipv4/tcp_memcontrol.c    | 147 +++++++------------------------------------
 net/ipv4/tcp_output.c        |   6 +-
 net/ipv6/tcp_ipv6.c          |   3 -
 12 files changed, 136 insertions(+), 421 deletions(-)
 delete mode 100644 include/net/tcp_memcontrol.h

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 19ff87b..5b72f83 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -85,34 +85,6 @@ enum mem_cgroup_events_target {
 	MEM_CGROUP_NTARGETS,
 };
 
-/*
- * Bits in struct cg_proto.flags
- */
-enum cg_proto_flags {
-	/* Currently active and new sockets should be assigned to cgroups */
-	MEMCG_SOCK_ACTIVE,
-	/* It was ever activated; we must disarm static keys on destruction */
-	MEMCG_SOCK_ACTIVATED,
-};
-
-struct cg_proto {
-	struct page_counter	memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
-	int			memory_pressure;
-	long			sysctl_mem[3];
-	unsigned long		flags;
-	/*
-	 * memcg field is used to find which memcg we belong directly
-	 * Each memcg struct can hold more than one cg_proto, so container_of
-	 * won't really cut.
-	 *
-	 * The elegant solution would be having an inverse function to
-	 * proto_cgroup in struct proto, but that means polluting the structure
-	 * for everybody, instead of just for memcg users.
-	 */
-	struct mem_cgroup	*memcg;
-};
-
 #ifdef CONFIG_MEMCG
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -185,8 +157,15 @@ struct mem_cgroup {
 
 	/* Accounted resources */
 	struct page_counter memory;
+
+	/*
+	 * Legacy non-resource counters. In unified hierarchy, all
+	 * memory is accounted and limited through memcg->memory.
+	 * Consumer breakdown happens in the statistics.
+	 */
 	struct page_counter memsw;
 	struct page_counter kmem;
+	struct page_counter skmem;
 
 	/* Normal memory consumption range */
 	unsigned long low;
@@ -246,9 +225,6 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu __percpu *stat;
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
-	struct cg_proto tcp_mem;
-#endif
 #if defined(CONFIG_MEMCG_KMEM)
 	/* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
@@ -676,12 +652,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 }
 #endif /* CONFIG_MEMCG */
 
-enum {
-	UNDER_LIMIT,
-	SOFT_LIMIT,
-	OVER_LIMIT,
-};
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
@@ -707,15 +677,35 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 
 struct sock;
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+extern struct static_key_false mem_cgroup_sockets;
+static inline bool mem_cgroup_do_sockets(void)
+{
+	return static_branch_unlikely(&mem_cgroup_sockets);
+}
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 #else
+static inline bool mem_cgroup_do_sockets(void)
+{
+	return false;
+}
 static inline void sock_update_memcg(struct sock *sk)
 {
 }
 static inline void sock_release_memcg(struct sock *sk)
 {
 }
+static inline bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg,
+					   unsigned int nr_pages)
+{
+	return true;
+}
+static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
+					     unsigned int nr_pages)
+{
+}
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index 59a7196..67795fc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -69,22 +69,6 @@
 #include
 #include
 
-struct cgroup;
-struct cgroup_subsys;
-#ifdef CONFIG_NET
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
-#else
-static inline
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	return 0;
-}
-static inline
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-}
-#endif
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -243,7 +227,6 @@ struct sock_common {
 	/* public: */
 };
 
-struct cg_proto;
 /**
   *	struct sock - network layer representation of sockets
   *	@__sk_common: shared layout with inet_timewait_sock
@@ -310,7 +293,7 @@ struct cg_proto;
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_classid: this socket's cgroup classid
-  *	@sk_cgrp: this socket's cgroup-specific proto data
+  *	@sk_memcg: this socket's memcg association
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
   *	@sk_data_ready: callback to indicate there is data to be processed
@@ -447,7 +430,7 @@ struct sock {
 #ifdef CONFIG_CGROUP_NET_CLASSID
 	u32			sk_classid;
 #endif
-	struct cg_proto		*sk_cgrp;
+	struct mem_cgroup	*sk_memcg;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk);
 	void			(*sk_write_space)(struct sock *sk);
@@ -1051,18 +1034,6 @@ struct proto {
 #ifdef SOCK_REFCNT_DEBUG
 	atomic_t		socks;
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	/*
-	 * cgroup specific init/deinit functions. Called once for all
-	 * protocols that implement it, from cgroups populate function.
-	 * This function has to setup any files the protocol want to
-	 * appear in the kmem cgroup filesystem.
-	 */
-	int			(*init_cgroup)(struct mem_cgroup *memcg,
-					       struct cgroup_subsys *ss);
-	void			(*destroy_cgroup)(struct mem_cgroup *memcg);
-	struct cg_proto		*(*proto_cgroup)(struct mem_cgroup *memcg);
-#endif
 };
 
 int proto_register(struct proto *prot, int alloc_slab);
@@ -1093,23 +1064,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
-extern struct static_key memcg_socket_limit_enabled;
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-					       struct cg_proto *cg_proto)
-{
-	return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg));
-}
-#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
-#else
-#define mem_cgroup_sockets_enabled 0
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-					       struct cg_proto *cg_proto)
-{
-	return NULL;
-}
-#endif
-
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
 	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
@@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
-
 	return !!*sk->sk_prot->memory_pressure;
 }
 
@@ -1146,61 +1097,19 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
 {
 	int *memory_pressure = sk->sk_prot->memory_pressure;
 
-	if (!memory_pressure)
-		return;
-
-	if (*memory_pressure)
+	if (memory_pressure && *memory_pressure)
 		*memory_pressure = 0;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-		struct proto *prot = sk->sk_prot;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			cg_proto->memory_pressure = 0;
-	}
-
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
 {
-	if (!sk->sk_prot->enter_memory_pressure)
-		return;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-		struct proto *prot = sk->sk_prot;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			cg_proto->memory_pressure = 1;
-	}
-
-	sk->sk_prot->enter_memory_pressure(sk);
+	if (sk->sk_prot->enter_memory_pressure)
+		sk->sk_prot->enter_memory_pressure(sk);
 }
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-	long *prot = sk->sk_prot->sysctl_mem;
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		prot = sk->sk_cgrp->sysctl_mem;
-	return prot[index];
-}
-
-static inline void memcg_memory_allocated_add(struct cg_proto *prot,
-					      unsigned long amt,
-					      int *parent_status)
-{
-	page_counter_charge(&prot->memory_allocated, amt);
-
-	if (page_counter_read(&prot->memory_allocated) >
-	    prot->memory_allocated.limit)
-		*parent_status = OVER_LIMIT;
-}
-
-static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
-					      unsigned long amt)
-{
-	page_counter_uncharge(&prot->memory_allocated, amt);
+	return sk->sk_prot->sysctl_mem[index];
 }
 
 static inline long
@@ -1208,24 +1117,14 @@ sk_memory_allocated(const struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-
 	return atomic_long_read(prot->memory_allocated);
 }
 
 static inline long
-sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
+sk_memory_allocated_add(struct sock *sk, int amt)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
-		/* update the root cgroup regardless */
-		atomic_long_add_return(amt, prot->memory_allocated);
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-	}
-
 	return atomic_long_add_return(amt, prot->memory_allocated);
 }
 
@@ -1234,9 +1133,6 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		memcg_memory_allocated_sub(sk->sk_cgrp, amt);
-
 	atomic_long_sub(amt, prot->memory_allocated);
 }
 
@@ -1244,13 +1140,6 @@ static inline void sk_sockets_allocated_dec(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			percpu_counter_dec(&cg_proto->sockets_allocated);
-	}
-
 	percpu_counter_dec(prot->sockets_allocated);
 }
 
@@ -1258,13 +1147,6 @@ static inline void sk_sockets_allocated_inc(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			percpu_counter_inc(&cg_proto->sockets_allocated);
-	}
-
 	percpu_counter_inc(prot->sockets_allocated);
 }
 
@@ -1273,9 +1155,6 @@ sk_sockets_allocated_read_positive(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated);
-
 	return percpu_counter_read_positive(prot->sockets_allocated);
 }
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index eed94fc..77b6c7e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -291,9 +291,6 @@ extern int tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
-
 	return tcp_memory_pressure;
 }
 /*
diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h
deleted file mode 100644
index 05b94d9..0000000
--- a/include/net/tcp_memcontrol.h
+++ /dev/null
@@ -1,7 +0,0 @@
-#ifndef _TCP_MEMCG_H
-#define _TCP_MEMCG_H
-
-struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
-int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
-void tcp_destroy_cgroup(struct mem_cgroup *memcg);
-#endif /* _TCP_MEMCG_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e54f434..c41e6d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,7 +66,6 @@
 #include "internal.h"
 #include
 #include
-#include
 #include "slab.h"
 
 #include
@@ -291,58 +290,68 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
+DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
+
 void sock_update_memcg(struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled) {
-		struct mem_cgroup *memcg;
-		struct cg_proto *cg_proto;
-
-		BUG_ON(!sk->sk_prot->proto_cgroup);
-
-		/* Socket cloning can throw us here with sk_cgrp already
-		 * filled. It won't however, necessarily happen from
-		 * process context. So the test for root memcg given
-		 * the current task's memcg won't help us in this case.
-		 *
-		 * Respecting the original socket's memcg is a better
-		 * decision in this case.
-		 */
-		if (sk->sk_cgrp) {
-			BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
-			css_get(&sk->sk_cgrp->memcg->css);
-			return;
-		}
-
-		rcu_read_lock();
-		memcg = mem_cgroup_from_task(current);
-		cg_proto = sk->sk_prot->proto_cgroup(memcg);
-		if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) &&
-		    css_tryget_online(&memcg->css)) {
-			sk->sk_cgrp = cg_proto;
-		}
-		rcu_read_unlock();
+	struct mem_cgroup *memcg;
+	/*
+	 * Socket cloning can throw us here with sk_cgrp already
+	 * filled. It won't however, necessarily happen from
+	 * process context. So the test for root memcg given
+	 * the current task's memcg won't help us in this case.
+	 *
+	 * Respecting the original socket's memcg is a better
+	 * decision in this case.
+	 */
+	if (sk->sk_memcg) {
+		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
+		css_get(&sk->sk_memcg->css);
+		return;
 	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (css_tryget_online(&memcg->css))
+		sk->sk_memcg = memcg;
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(sock_update_memcg);
 
 void sock_release_memcg(struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct mem_cgroup *memcg;
-		WARN_ON(!sk->sk_cgrp->memcg);
-		memcg = sk->sk_cgrp->memcg;
-		css_put(&sk->sk_cgrp->memcg->css);
-	}
+	if (sk->sk_memcg)
+		css_put(&sk->sk_memcg->css);
 }
 
-struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * the memcg's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	if (!memcg || mem_cgroup_is_root(memcg))
-		return NULL;
+	struct page_counter *counter;
+
+	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
+		return true;
 
-	return &memcg->tcp_mem;
+	page_counter_charge(&memcg->skmem, nr_pages);
+	return false;
+}
+
+/**
+ * mem_cgroup_uncharge_skmem - uncharge socket memory
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	page_counter_uncharge(&memcg->skmem, nr_pages);
 }
-EXPORT_SYMBOL(tcp_proto_cgroup);
 
 #endif
 
@@ -3592,13 +3601,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
-	int ret;
-
-	ret = memcg_propagate_kmem(memcg);
-	if (ret)
-		return ret;
-
-	return mem_cgroup_sockets_init(memcg, ss);
+	return memcg_propagate_kmem(memcg);
 }
 
 static void memcg_deactivate_kmem(struct mem_cgroup *memcg)
@@ -3654,7 +3657,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 		static_key_slow_dec(&memcg_kmem_enabled_key);
 		WARN_ON(page_counter_read(&memcg->kmem));
 	}
-	mem_cgroup_sockets_destroy(memcg);
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
@@ -4218,6 +4220,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->skmem, NULL);
 	}
 
 	memcg->last_scanned_node = MAX_NUMNODES;
@@ -4266,6 +4269,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, &parent->memsw);
 		page_counter_init(&memcg->kmem, &parent->kmem);
+		page_counter_init(&memcg->skmem, &parent->skmem);
 
 		/*
 		 * No need to take a reference to the parent because cgroup
@@ -4277,6 +4281,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->skmem, NULL);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
diff --git a/net/core/sock.c b/net/core/sock.c
index 0fafd27..0debff5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap)
 }
 EXPORT_SYMBOL(sk_net_capable);
 
-
-#ifdef CONFIG_MEMCG_KMEM
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	struct proto *proto;
-	int ret = 0;
-
-	mutex_lock(&proto_list_mutex);
-	list_for_each_entry(proto, &proto_list, node) {
-		if (proto->init_cgroup) {
-			ret = proto->init_cgroup(memcg, ss);
-			if (ret)
-				goto out;
-		}
-	}
-
-	mutex_unlock(&proto_list_mutex);
-	return ret;
-out:
-	list_for_each_entry_continue_reverse(proto, &proto_list, node)
-		if (proto->destroy_cgroup)
-			proto->destroy_cgroup(memcg);
-	mutex_unlock(&proto_list_mutex);
-	return ret;
-}
-
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-	struct proto *proto;
-
-	mutex_lock(&proto_list_mutex);
-	list_for_each_entry_reverse(proto, &proto_list, node)
-		if (proto->destroy_cgroup)
-			proto->destroy_cgroup(memcg);
-	mutex_unlock(&proto_list_mutex);
-}
-#endif
-
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
@@ -239,11 +201,6 @@ void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
 static struct lock_class_key af_family_keys[AF_MAX];
 static struct lock_class_key af_family_slock_keys[AF_MAX];
 
-#if defined(CONFIG_MEMCG_KMEM)
-struct static_key memcg_socket_limit_enabled;
-EXPORT_SYMBOL(memcg_socket_limit_enabled);
-#endif
-
 /*
  * Make lock validator output more readable. (we pre-construct these
  * strings build-time, so that runtime initialization of socket
@@ -1476,12 +1433,6 @@ void sk_free(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_free);
 
-static void sk_update_clone(const struct sock *sk, struct sock *newsk)
-{
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		sock_update_memcg(newsk);
-}
-
 /**
  *	sk_clone_lock - clone a socket, and lock its clone
  *	@sk: the socket to clone
@@ -1577,7 +1528,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		sk_set_socket(newsk, NULL);
 		newsk->sk_wq = NULL;
 
-		sk_update_clone(sk, newsk);
+		if (mem_cgroup_do_sockets())
+			sock_update_memcg(newsk);
 
 		if (newsk->sk_prot->sockets_allocated)
 			sk_sockets_allocated_inc(newsk);
@@ -2036,27 +1988,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
-	int parent_status = UNDER_LIMIT;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
 
-	allocated = sk_memory_allocated_add(sk, amt, &parent_status);
+	allocated = sk_memory_allocated_add(sk, amt);
+
+	if (mem_cgroup_do_sockets() && sk->sk_memcg &&
+	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
+		goto suppress_allocation;
 
 	/* Under limit. */
-	if (parent_status == UNDER_LIMIT &&
-			allocated <= sk_prot_mem_limits(sk, 0)) {
+	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
 		return 1;
 	}
 
-	/* Under pressure. (we or our parents) */
-	if ((parent_status > SOFT_LIMIT) ||
-			allocated > sk_prot_mem_limits(sk, 1))
+	/* Under pressure. */
+	if (allocated > sk_prot_mem_limits(sk, 1))
 		sk_enter_memory_pressure(sk);
 
-	/* Over hard limit (we or our parents) */
-	if ((parent_status == OVER_LIMIT) ||
-			(allocated > sk_prot_mem_limits(sk, 2)))
+	/* Over hard limit. */
+	if (allocated > sk_prot_mem_limits(sk, 2))
 		goto suppress_allocation;
 
 	/* guarantee minimum buffer size under pressure */
@@ -2105,6 +2057,9 @@ suppress_allocation:
 
 	sk_memory_allocated_sub(sk, amt);
 
+	if (mem_cgroup_do_sockets() && sk->sk_memcg)
+		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -2120,6 +2075,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
 	sk_memory_allocated_sub(sk, amount);
 	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
 
+	if (mem_cgroup_do_sockets() && sk->sk_memcg)
+		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
+
 	if (sk_under_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 894da3a..1f00819 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -24,7 +24,6 @@
 #include
 #include
 #include
-#include
 
 static int zero;
 static int one = 1;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ac1bdbb..ec931c0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -421,7 +421,8 @@ void tcp_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	sock_update_memcg(sk);
+	if (mem_cgroup_do_sockets())
+		sock_update_memcg(sk);
 	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 30dd45c..bb5f4f2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -73,7 +73,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 
@@ -1808,7 +1807,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_saved_syn_free(tp);
 
 	sk_sockets_allocated_dec(sk);
-	sock_release_memcg(sk);
+	if (mem_cgroup_do_sockets())
+		sock_release_memcg(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
@@ -2330,11 +2330,6 @@ struct proto tcp_prot = {
 	.compat_setsockopt	= compat_tcp_setsockopt,
 	.compat_getsockopt	= compat_tcp_getsockopt,
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	.init_cgroup		= tcp_init_cgroup,
-	.destroy_cgroup		= tcp_destroy_cgroup,
-	.proto_cgroup		= tcp_proto_cgroup,
-#endif
 };
 EXPORT_SYMBOL(tcp_prot);
 
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 2379c1b..09a37eb 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -1,107 +1,10 @@
-#include
-#include
-#include
-#include
-#include
+#include
 #include
+#include
 #include
-
-int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	/*
-	 * The root cgroup does not use page_counters, but rather,
-	 * rely on the data already collected by the network
-	 * subsystem
-	 */
-	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-	struct page_counter *counter_parent = NULL;
-	struct cg_proto *cg_proto, *parent_cg;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return 0;
-
-	cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0];
-	cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1];
-	cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2];
-	cg_proto->memory_pressure = 0;
-	cg_proto->memcg = memcg;
-
-	parent_cg = tcp_prot.proto_cgroup(parent);
-	if (parent_cg)
-		counter_parent = &parent_cg->memory_allocated;
-
-	page_counter_init(&cg_proto->memory_allocated, counter_parent);
-	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
-
-	return 0;
-}
-EXPORT_SYMBOL(tcp_init_cgroup);
-
-void tcp_destroy_cgroup(struct mem_cgroup *memcg)
-{
-	struct cg_proto *cg_proto;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return;
-
-	percpu_counter_destroy(&cg_proto->sockets_allocated);
-
-	if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
-		static_key_slow_dec(&memcg_socket_limit_enabled);
-
-}
-EXPORT_SYMBOL(tcp_destroy_cgroup);
-
-static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
-{
-	struct cg_proto *cg_proto;
-	int i;
-	int ret;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return -EINVAL;
-
-	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
-	if (ret)
-		return ret;
-
-	for (i = 0; i < 3; i++)
-		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
-						sysctl_tcp_mem[i]);
-
-	if (nr_pages == PAGE_COUNTER_MAX)
-		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
-	else {
-		/*
-		 * The active bit needs to be written after the static_key
-		 * update. This is what guarantees that the socket activation
-		 * function is the last one to run. See sock_update_memcg() for
-		 * details, and note that we don't mark any socket as belonging
-		 * to this memcg until that flag is up.
-		 *
-		 * We need to do this, because static_keys will span multiple
-		 * sites, but we can't control their order. If we mark a socket
-		 * as accounted, but the accounting functions are not patched in
-		 * yet, we'll lose accounting.
-		 *
-		 * We never race with the readers in sock_update_memcg(),
-		 * because when this value change, the code to process it is not
-		 * patched in yet.
-		 *
-		 * The activated bit is used to guarantee that no two writers
-		 * will do the update in the same memcg. Without that, we can't
-		 * properly shutdown the static key.
- */ - if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) - static_key_slow_inc(&memcg_socket_limit_enabled); - set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); - } - - return 0; -} +#include +#include +#include enum { RES_USAGE, @@ -124,11 +27,17 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, switch (of_cft(of)->private) { case RES_LIMIT: /* see memcontrol.c */ + if (memcg == root_mem_cgroup) { + ret = -EINVAL; + break; + } ret = page_counter_memparse(buf, "-1", &nr_pages); if (ret) break; mutex_lock(&tcp_limit_mutex); - ret = tcp_update_limit(memcg, nr_pages); + ret = page_counter_limit(&memcg->skmem, nr_pages); + if (!ret) + static_branch_enable(&mem_cgroup_sockets); mutex_unlock(&tcp_limit_mutex); break; default: @@ -141,32 +50,28 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); - struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg); u64 val; switch (cft->private) { case RES_LIMIT: - if (!cg_proto) - return PAGE_COUNTER_MAX; - val = cg_proto->memory_allocated.limit; + val = memcg->skmem.limit; val *= PAGE_SIZE; break; case RES_USAGE: - if (!cg_proto) + if (memcg == root_mem_cgroup) val = atomic_long_read(&tcp_memory_allocated); else - val = page_counter_read(&cg_proto->memory_allocated); + val = page_counter_read(&memcg->skmem); val *= PAGE_SIZE; break; case RES_FAILCNT: - if (!cg_proto) - return 0; - val = cg_proto->memory_allocated.failcnt; + val = memcg->skmem.failcnt; break; case RES_MAX_USAGE: - if (!cg_proto) - return 0; - val = cg_proto->memory_allocated.watermark; + if (memcg == root_mem_cgroup) + val = 0; + else + val = memcg->skmem.watermark; val *= PAGE_SIZE; break; default: @@ -178,20 +83,14 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { 
- struct mem_cgroup *memcg; - struct cg_proto *cg_proto; - - memcg = mem_cgroup_from_css(of_css(of)); - cg_proto = tcp_prot.proto_cgroup(memcg); - if (!cg_proto) - return nbytes; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); switch (of_cft(of)->private) { case RES_MAX_USAGE: - page_counter_reset_watermark(&cg_proto->memory_allocated); + page_counter_reset_watermark(&memcg->skmem); break; case RES_FAILCNT: - cg_proto->memory_allocated.failcnt = 0; + memcg->skmem.failcnt = 0; break; } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 19adedb..b496fc9 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2819,13 +2819,15 @@ begin_fwd: */ void sk_forced_mem_schedule(struct sock *sk, int size) { - int amt, status; + int amt; if (size <= sk->sk_forward_alloc) return; amt = sk_mem_pages(size); sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; - sk_memory_allocated_add(sk, amt, &status); + sk_memory_allocated_add(sk, amt); + if (mem_cgroup_do_sockets() && sk->sk_memcg) + mem_cgroup_charge_skmem(sk->sk_memcg, amt); } /* Send a FIN. The caller locks the socket for us. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index f495d18..cf19e65 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1862,9 +1862,6 @@ struct proto tcpv6_prot = { .compat_setsockopt = compat_tcp_setsockopt, .compat_getsockopt = compat_tcp_getsockopt, #endif -#ifdef CONFIG_MEMCG_KMEM - .proto_cgroup = tcp_proto_cgroup, -#endif .clear_sk = tcp_v6_clear_sk, }; -- 2.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751835AbbJVEWJ (ORCPT ); Thu, 22 Oct 2015 00:22:09 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:39284 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751117AbbJVEWG (ORCPT ); Thu, 22 Oct 2015 00:22:06 -0400 From: Johannes Weiner To: "David S. 
Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 4/8] mm: memcontrol: prepare for unified hierarchy socket accounting Date: Thu, 22 Oct 2015 00:21:32 -0400 Message-Id: <1445487696-21545-5-git-send-email-hannes@cmpxchg.org> X-Mailer: git-send-email 2.6.1 In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The unified hierarchy memory controller will account socket memory. Move the infrastructure functions accordingly. Signed-off-by: Johannes Weiner --- mm/memcontrol.c | 136 ++++++++++++++++++++++++++++---------------------------- 1 file changed, 68 insertions(+), 68 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c41e6d7..3789050 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -287,74 +287,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) return mem_cgroup_from_css(css); } -/* Writing them here to avoid exposing memcg's inner layout */ -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) - -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); - -void sock_update_memcg(struct sock *sk) -{ - struct mem_cgroup *memcg; - /* - * Socket cloning can throw us here with sk_cgrp already - * filled. It won't however, necessarily happen from - * process context. So the test for root memcg given - * the current task's memcg won't help us in this case. - * - * Respecting the original socket's memcg is a better - * decision in this case. 
- */ - if (sk->sk_memcg) { - BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); - css_get(&sk->sk_memcg->css); - return; - } - - rcu_read_lock(); - memcg = mem_cgroup_from_task(current); - if (css_tryget_online(&memcg->css)) - sk->sk_memcg = memcg; - rcu_read_unlock(); -} -EXPORT_SYMBOL(sock_update_memcg); - -void sock_release_memcg(struct sock *sk) -{ - if (sk->sk_memcg) - css_put(&sk->sk_memcg->css); -} - -/** - * mem_cgroup_charge_skmem - charge socket memory - * @memcg: memcg to charge - * @nr_pages: number of pages to charge - * - * Charges @nr_pages to @memcg. Returns %true if the charge fit within - * the memcg's configured limit, %false if the charge had to be forced. - */ -bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) -{ - struct page_counter *counter; - - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) - return true; - - page_counter_charge(&memcg->skmem, nr_pages); - return false; -} - -/** - * mem_cgroup_uncharge_skmem - uncharge socket memory - * @memcg: memcg to uncharge - * @nr_pages: number of pages to uncharge - */ -void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) -{ - page_counter_uncharge(&memcg->skmem, nr_pages); -} - -#endif - #ifdef CONFIG_MEMCG_KMEM /* * This will be the memcg's index in each cache's ->memcg_params.memcg_caches. @@ -5521,6 +5453,74 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) commit_charge(newpage, memcg, true); } +/* Writing them here to avoid exposing memcg's inner layout */ +#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) + +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); + +void sock_update_memcg(struct sock *sk) +{ + struct mem_cgroup *memcg; + /* + * Socket cloning can throw us here with sk_cgrp already + * filled. It won't however, necessarily happen from + * process context. So the test for root memcg given + * the current task's memcg won't help us in this case. 
+ * + * Respecting the original socket's memcg is a better + * decision in this case. + */ + if (sk->sk_memcg) { + BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); + css_get(&sk->sk_memcg->css); + return; + } + + rcu_read_lock(); + memcg = mem_cgroup_from_task(current); + if (css_tryget_online(&memcg->css)) + sk->sk_memcg = memcg; + rcu_read_unlock(); +} +EXPORT_SYMBOL(sock_update_memcg); + +void sock_release_memcg(struct sock *sk) +{ + if (sk->sk_memcg) + css_put(&sk->sk_memcg->css); +} + +/** + * mem_cgroup_charge_skmem - charge socket memory + * @memcg: memcg to charge + * @nr_pages: number of pages to charge + * + * Charges @nr_pages to @memcg. Returns %true if the charge fit within + * the memcg's configured limit, %false if the charge had to be forced. + */ +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) +{ + struct page_counter *counter; + + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) + return true; + + page_counter_charge(&memcg->skmem, nr_pages); + return false; +} + +/** + * mem_cgroup_uncharge_skmem - uncharge socket memory + * @memcg: memcg to uncharge + * @nr_pages: number of pages to uncharge + */ +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) +{ + page_counter_uncharge(&memcg->skmem, nr_pages); +} + +#endif + /* * subsys_initcall() for memory controller. * -- 2.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754125AbbJVEWO (ORCPT ); Thu, 22 Oct 2015 00:22:14 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:39304 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751791AbbJVEWK (ORCPT ); Thu, 22 Oct 2015 00:22:10 -0400 From: Johannes Weiner To: "David S. 
Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Date: Thu, 22 Oct 2015 00:21:33 -0400 Message-Id: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> X-Mailer: git-send-email 2.6.1 In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Socket memory can be a significant share of overall memory consumed by common workloads. In order to provide reasonable resource isolation out-of-the-box in the unified hierarchy, this type of memory needs to be accounted and tracked per default in the memory controller. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 16 ++++++-- mm/memcontrol.c | 95 ++++++++++++++++++++++++++++++++++++---------- 2 files changed, 87 insertions(+), 24 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5b72f83..6f1e0f8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -244,6 +244,10 @@ struct mem_cgroup { struct wb_domain cgwb_domain; #endif +#ifdef CONFIG_INET + struct work_struct socket_work; +#endif + /* List of events which userspace want to receive */ struct list_head event_list; spinlock_t event_list_lock; @@ -676,11 +680,15 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, #endif /* CONFIG_CGROUP_WRITEBACK */ struct sock; -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) -extern struct static_key_false mem_cgroup_sockets; +#ifdef CONFIG_INET +extern struct static_key_true mem_cgroup_sockets; static inline bool mem_cgroup_do_sockets(void) { - return static_branch_unlikely(&mem_cgroup_sockets); + if (mem_cgroup_disabled()) + return false; + if 
(!static_branch_likely(&mem_cgroup_sockets)) + return false; + return true; } void sock_update_memcg(struct sock *sk); void sock_release_memcg(struct sock *sk); @@ -706,7 +714,7 @@ static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { } -#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */ +#endif /* CONFIG_INET */ #ifdef CONFIG_MEMCG_KMEM extern struct static_key memcg_kmem_enabled_key; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3789050..cb1d6aa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1916,6 +1916,18 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, return NOTIFY_OK; } +static void reclaim_high(struct mem_cgroup *memcg, + unsigned int nr_pages, + gfp_t gfp_mask) +{ + do { + if (page_counter_read(&memcg->memory) <= memcg->high) + continue; + mem_cgroup_events(memcg, MEMCG_HIGH, 1); + try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); + } while ((memcg = parent_mem_cgroup(memcg))); +} + /* * Scheduled by try_charge() to be executed from the userland return path * and reclaims memory over the high limit. 
@@ -1923,20 +1935,13 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, void mem_cgroup_handle_over_high(void) { unsigned int nr_pages = current->memcg_nr_pages_over_high; - struct mem_cgroup *memcg, *pos; + struct mem_cgroup *memcg; if (likely(!nr_pages)) return; - pos = memcg = get_mem_cgroup_from_mm(current->mm); - - do { - if (page_counter_read(&pos->memory) <= pos->high) - continue; - mem_cgroup_events(pos, MEMCG_HIGH, 1); - try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true); - } while ((pos = parent_mem_cgroup(pos))); - + memcg = get_mem_cgroup_from_mm(current->mm); + reclaim_high(memcg, nr_pages, GFP_KERNEL); css_put(&memcg->css); current->memcg_nr_pages_over_high = 0; } @@ -4129,6 +4134,8 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(parent_mem_cgroup); +static void socket_work_func(struct work_struct *work); + static struct cgroup_subsys_state * __ref mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) { @@ -4169,6 +4176,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); #endif +#ifdef CONFIG_INET + INIT_WORK(&memcg->socket_work, socket_work_func); +#endif return &memcg->css; free_out: @@ -4266,6 +4276,8 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); + cancel_work_sync(&memcg->socket_work); + memcg_destroy_kmem(memcg); __mem_cgroup_free(memcg); } @@ -4948,10 +4960,15 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) * guarantees that @root doesn't have any children, so turning it * on for the root memcg is enough. 
*/ - if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) { root_mem_cgroup->use_hierarchy = true; - else +#ifdef CONFIG_INET + /* unified hierarchy always counts skmem */ + static_branch_enable(&mem_cgroup_sockets); +#endif + } else { root_mem_cgroup->use_hierarchy = false; + } } static u64 memory_current_read(struct cgroup_subsys_state *css, @@ -5453,10 +5470,9 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) commit_charge(newpage, memcg, true); } -/* Writing them here to avoid exposing memcg's inner layout */ -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) +#ifdef CONFIG_INET -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); +DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets); void sock_update_memcg(struct sock *sk) { @@ -5490,6 +5506,14 @@ void sock_release_memcg(struct sock *sk) css_put(&sk->sk_memcg->css); } +static void socket_work_func(struct work_struct *work) +{ + struct mem_cgroup *memcg; + + memcg = container_of(work, struct mem_cgroup, socket_work); + reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL); +} + /** * mem_cgroup_charge_skmem - charge socket memory * @memcg: memcg to charge @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk) */ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { + unsigned int batch = max(CHARGE_BATCH, nr_pages); struct page_counter *counter; + bool force = false; - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) + return true; + page_counter_charge(&memcg->skmem, nr_pages); + return false; + } + + if (consume_stock(memcg, nr_pages)) return true; +retry: + if (page_counter_try_charge(&memcg->memory, batch, &counter)) + goto done; - page_counter_charge(&memcg->skmem, nr_pages); - return false; + if (batch > nr_pages) { + batch = nr_pages; + goto retry; + } + + force = true; + 
page_counter_charge(&memcg->memory, batch); +done: + css_get_many(&memcg->css, batch); + if (batch > nr_pages) + refill_stock(memcg, batch - nr_pages); + + schedule_work(&memcg->socket_work); + + return !force; } /** @@ -5516,10 +5565,16 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) */ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { - page_counter_uncharge(&memcg->skmem, nr_pages); + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { + page_counter_uncharge(&memcg->skmem, nr_pages); + return; + } + + page_counter_uncharge(&memcg->memory, nr_pages); + css_put_many(&memcg->css, nr_pages); } -#endif +#endif /* CONFIG_INET */ /* * subsys_initcall() for memory controller. -- 2.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756040AbbJVEXL (ORCPT ); Thu, 22 Oct 2015 00:23:11 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:39312 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751886AbbJVEWN (ORCPT ); Thu, 22 Oct 2015 00:22:13 -0400 From: Johannes Weiner To: "David S. Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 6/8] mm: vmscan: simplify memcg vs. global shrinker invocation Date: Thu, 22 Oct 2015 00:21:34 -0400 Message-Id: <1445487696-21545-7-git-send-email-hannes@cmpxchg.org> X-Mailer: git-send-email 2.6.1 In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Letting shrink_slab() handle the root_mem_cgroup, and implicitly the !CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers unconditionally from within the memcg iteration loop. 
Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 2 ++ mm/vmscan.c | 31 ++++++++++++++++--------------- 2 files changed, 18 insertions(+), 15 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6f1e0f8..d66ae18 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -482,6 +482,8 @@ void mem_cgroup_split_huge_fixup(struct page *head); #else /* CONFIG_MEMCG */ struct mem_cgroup; +#define root_mem_cgroup NULL + static inline void mem_cgroup_events(struct mem_cgroup *memcg, enum mem_cgroup_events_index idx, unsigned int nr) diff --git a/mm/vmscan.c b/mm/vmscan.c index 9b52ecf..ecc2125 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct shrinker *shrinker; unsigned long freed = 0; + /* Global shrinker mode */ + if (memcg == root_mem_cgroup) + memcg = NULL; + if (memcg && !memcg_kmem_is_active(memcg)) return 0; @@ -2417,11 +2421,22 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, shrink_lruvec(lruvec, swappiness, sc, &lru_pages); zone_lru_pages += lru_pages; - if (memcg && is_classzone) + /* + * Shrink the slab caches in the same proportion that + * the eligible LRU pages were scanned. + */ + if (is_classzone) { shrink_slab(sc->gfp_mask, zone_to_nid(zone), memcg, sc->nr_scanned - scanned, lru_pages); + if (reclaim_state) { + sc->nr_reclaimed += + reclaim_state->reclaimed_slab; + reclaim_state->reclaimed_slab = 0; + } + } + /* * Direct reclaim and kswapd have to scan all memory * cgroups to fulfill the overall scan target for the @@ -2439,20 +2454,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); - /* - * Shrink the slab caches in the same proportion that - * the eligible LRU pages were scanned. 
- */ - if (global_reclaim(sc) && is_classzone) - shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL, - sc->nr_scanned - nr_scanned, - zone_lru_pages); - - if (reclaim_state) { - sc->nr_reclaimed += reclaim_state->reclaimed_slab; - reclaim_state->reclaimed_slab = 0; - } - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, sc->nr_scanned - nr_scanned, sc->nr_reclaimed - nr_reclaimed); -- 2.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964780AbbJVEWg (ORCPT ); Thu, 22 Oct 2015 00:22:36 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:39340 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754795AbbJVEWT (ORCPT ); Thu, 22 Oct 2015 00:22:19 -0400 From: Johannes Weiner To: "David S. Miller" , Andrew Morton Cc: Michal Hocko , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure Date: Thu, 22 Oct 2015 00:21:36 -0400 Message-Id: <1445487696-21545-9-git-send-email-hannes@cmpxchg.org> X-Mailer: git-send-email 2.6.1 In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Let the networking stack know when a memcg is under reclaim pressure, so it can shrink its transmit windows accordingly. Whenever the reclaim efficiency of a memcg's LRU lists drops low enough for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state in the socket and tcp memory code that tells it to reduce memory usage in sockets associated with said memory cgroup. vmpressure events are edge triggered, so for hysteresis assert socket pressure for a second to allow for subsequent vmpressure events to occur before letting the socket code return to normal. 
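The hysteresis described above can be illustrated with a small userspace sketch. This is not the kernel code: tick_t, TICKS_PER_SEC, struct group_pressure and the three helper functions are hypothetical names chosen for the example. In the patch itself these roles are played by jiffies, HZ, memcg->socket_pressure and time_before(). The idea is that an edge-triggered event only pushes a deadline one second into the future, and the consumer turns that into a level by comparing the current time against the deadline.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's jiffies tick counter. */
typedef uint32_t tick_t;

/* Plays the role of HZ (ticks per second); the value is an assumption. */
#define TICKS_PER_SEC 1000u

/* Wrap-safe "a is before b", in the spirit of the kernel's time_before(). */
static bool tick_before(tick_t a, tick_t b)
{
	return (int32_t)(uint32_t)(a - b) < 0;
}

/* Hypothetical per-group state, analogous to memcg->socket_pressure. */
struct group_pressure {
	tick_t pressure_until;	/* pressure is asserted while now < this */
};

/* Start with no pressure asserted: the deadline is "now". */
static void pressure_init(struct group_pressure *gp, tick_t now)
{
	gp->pressure_until = now;
}

/* Edge-triggered event: (re)arm the one-second pressure window. */
static void pressure_event(struct group_pressure *gp, tick_t now)
{
	gp->pressure_until = now + TICKS_PER_SEC;
}

/* Level query, as a consumer such as the socket code would use it. */
static bool under_pressure(const struct group_pressure *gp, tick_t now)
{
	return tick_before(now, gp->pressure_until);
}
```

A repeated event inside the window simply extends the deadline, so a burst of events keeps the state asserted without any explicit "clear" operation; the state lapses on its own once events stop for a second.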
Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 9 +++++++++ include/net/sock.h | 4 ++++ include/net/tcp.h | 4 ++++ mm/memcontrol.c | 1 + mm/vmpressure.c | 29 ++++++++++++++++++++++++----- 5 files changed, 42 insertions(+), 5 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d66ae18..b9990f7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -246,6 +246,7 @@ struct mem_cgroup { #ifdef CONFIG_INET struct work_struct socket_work; + unsigned long socket_pressure; #endif /* List of events which userspace want to receive */ @@ -696,6 +697,10 @@ void sock_update_memcg(struct sock *sk); void sock_release_memcg(struct sock *sk); bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); +static inline bool mem_cgroup_socket_pressure(struct mem_cgroup *memcg) +{ + return time_before(jiffies, memcg->socket_pressure); +} #else static inline bool mem_cgroup_do_sockets(void) { @@ -716,6 +721,10 @@ static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { } +static inline bool mem_cgroup_socket_pressure(struct mem_cgroup *memcg) +{ + return false; +} #endif /* CONFIG_INET */ #ifdef CONFIG_MEMCG_KMEM diff --git a/include/net/sock.h b/include/net/sock.h index 67795fc..22bfb9c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1087,6 +1087,10 @@ static inline bool sk_has_memory_pressure(const struct sock *sk) static inline bool sk_under_memory_pressure(const struct sock *sk) { + if (mem_cgroup_do_sockets() && sk->sk_memcg && + mem_cgroup_socket_pressure(sk->sk_memcg)) + return true; + if (!sk->sk_prot->memory_pressure) return false; diff --git a/include/net/tcp.h b/include/net/tcp.h index 77b6c7e..c7d342c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -291,6 +291,10 @@ extern int tcp_memory_pressure; /* optimized version of 
sk_under_memory_pressure() for TCP sockets */ static inline bool tcp_under_memory_pressure(const struct sock *sk) { + if (mem_cgroup_do_sockets() && sk->sk_memcg && + mem_cgroup_socket_pressure(sk->sk_memcg)) + return true; + return tcp_memory_pressure; } /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cb1d6aa..2e09def 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4178,6 +4178,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) #endif #ifdef CONFIG_INET INIT_WORK(&memcg->socket_work, socket_work_func); + memcg->socket_pressure = jiffies; #endif return &memcg->css; diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 4c25e62..f64c0e1 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -137,14 +137,11 @@ struct vmpressure_event { }; static bool vmpressure_event(struct vmpressure *vmpr, - unsigned long scanned, unsigned long reclaimed) + enum vmpressure_levels level) { struct vmpressure_event *ev; - enum vmpressure_levels level; bool signalled = false; - level = vmpressure_calc_level(scanned, reclaimed); - mutex_lock(&vmpr->events_lock); list_for_each_entry(ev, &vmpr->events, node) { @@ -162,6 +159,7 @@ static bool vmpressure_event(struct vmpressure *vmpr, static void vmpressure_work_fn(struct work_struct *work) { struct vmpressure *vmpr = work_to_vmpressure(work); + enum vmpressure_levels level; unsigned long scanned; unsigned long reclaimed; @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work) vmpr->reclaimed = 0; spin_unlock(&vmpr->sr_lock); + level = vmpressure_calc_level(scanned, reclaimed); + + if (level > VMPRESSURE_LOW) { + struct mem_cgroup *memcg; + /* + * Let the socket buffer allocator know that we are + * having trouble reclaiming LRU pages. + * + * For hysteresis, keep the pressure state asserted + * for a second in which subsequent pressure events + * can occur. + * + * XXX: is vmpressure a global feature or part of + * memcg? 
There shouldn't be anything memcg-specific + * about exporting reclaim success ratios from the VM. + */ + memcg = container_of(vmpr, struct mem_cgroup, vmpressure); + if (memcg != root_mem_cgroup) + memcg->socket_pressure = jiffies + HZ; + } + do { - if (vmpressure_event(vmpr, scanned, reclaimed)) + if (vmpressure_event(vmpr, level)) break; /* * If not handled, propagate the event upward into the -- 2.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758159AbbJVSpn (ORCPT ); Thu, 22 Oct 2015 14:45:43 -0400 Received: from relay.parallels.com ([195.214.232.42]:47464 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758048AbbJVSp2 (ORCPT ); Thu, 22 Oct 2015 14:45:28 -0400 Date: Thu, 22 Oct 2015 21:45:10 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151022184509.GM18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> X-ClientProxiedBy: US-EXCH2.sw.swsoft.com (10.255.249.46) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Johannes, On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: ... > Patch #5 adds accounting and tracking of socket memory to the unified > hierarchy memory controller, as described above. It uses the existing > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > Patch #8 uses the vmpressure extension to equalize pressure between > the pages tracked natively by the VM and socket buffer pages. 
As the
> pool is shared, it makes sense that while natively tracked pages are
> under duress the network transmit windows are also not increased.

First of all, I've no experience in networking, so I'm likely to be
mistaken. Nevertheless I beg to disagree that this patch set is a step
in the right direction. Here is why.

I admit that your idea to get rid of explicit tcp window control knobs
and size the window dynamically based on memory pressure instead does
sound tempting, but I don't think it'd always work. The problem is that
in contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can
only stop growing them. Now suppose a system hasn't experienced memory
pressure for a while. If we don't have an explicit tcp window limit, tcp
buffers on such a system might have eaten almost all available memory
(because of network load/problems). If a user workload that needs a
significant amount of memory is then started suddenly, the network code
will receive a notification and surely stop growing buffers, but all
those accumulated buffers won't disappear instantly. As a result, the
workload might be unable to find enough free memory and have no choice
but to invoke the OOM killer. This looks unexpected from the user POV.

That said, I think we do need per memcg tcp window control similar to
what we have system-wide. In other words, Glauber's work makes sense to
me. You might want to point me at my RFC patch where I proposed to
revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my
mind since then. Now I think I was mistaken; luckily, I was stopped.
However, I may be mistaken again :-) Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758172AbbJVSqa (ORCPT ); Thu, 22 Oct 2015 14:46:30 -0400 Received: from relay.parallels.com ([195.214.232.42]:47497 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756473AbbJVSq0 (ORCPT ); Thu, 22 Oct 2015 14:46:26 -0400 Date: Thu, 22 Oct 2015 21:46:12 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Message-ID: <20151022184612.GN18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> X-ClientProxiedBy: US-EXCH.sw.swsoft.com (10.255.249.47) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > The tcp memory controller has extensive provisions for future memory > accounting interfaces that won't materialize after all. Cut the code > base down to what's actually used, now and in the likely future. > > - There won't be any different protocol counters in the future, so a > direct sock->sk_memcg linkage is enough. This eliminates a lot of > callback maze and boilerplate code, and restores most of the socket > allocation code to pre-tcp_memcontrol state. > > - There won't be a tcp control soft limit, so integrating the memcg In fact, the code is ready for the "soft" limit (I mean min, pressure, max tuple), it just lacks a knob. > code into the global skmem limiting scheme complicates things > unnecessarily. 
Replace all that with simple and clear charge and > uncharge calls--hidden behind a jump label--to account skb memory. > > - The previous jump label code was an elaborate state machine that > tracked the number of cgroups with an active socket limit in order > to enable the skmem tracking and accounting code only when actively > necessary. But this is overengineered: it was meant to protect the > people who never use this feature in the first place. Simply enable > the branches once when the first limit is set until the next reboot. > ... > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > if (!sk->sk_prot->memory_pressure) > return false; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - AFAIU, now we won't shrink the window on hitting the limit, i.e. this patch subtly changes the behavior of the existing knobs, potentially breaking them. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965573AbbJVSsR (ORCPT ); Thu, 22 Oct 2015 14:48:17 -0400 Received: from relay.parallels.com ([195.214.232.42]:47539 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965394AbbJVSsM (ORCPT ); Thu, 22 Oct 2015 14:48:12 -0400 Date: Thu, 22 Oct 2015 21:47:57 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151022184757.GO18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> X-ClientProxiedBy: US-EXCH2.sw.swsoft.com (10.255.249.46) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote: ... > @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk) > */ > bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > { > + unsigned int batch = max(CHARGE_BATCH, nr_pages); > struct page_counter *counter; > + bool force = false; > > - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { > + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + return true; > + page_counter_charge(&memcg->skmem, nr_pages); > + return false; > + } > + > + if (consume_stock(memcg, nr_pages)) > return true; > +retry: > + if (page_counter_try_charge(&memcg->memory, batch, &counter)) > + goto done; Currently, we use memcg->memory only for charging memory pages. Besides, every page charged to this counter (including kmem) has its ->mem_cgroup field set appropriately. This looks consistent and nice. As an extra benefit, we can track all pages charged to a memory cgroup via /proc/kpagecgroup. Now, you charge "window size" to it, which AFAIU isn't necessarily equal to the amount of memory actually consumed by the cgroup for socket buffers. I think this looks ugly and inconsistent with the existing behavior.
I agree that we need to charge socket buffers to ->memory, but IMO we should do that for each skb page, using memcg_kmem_charge_kmem somewhere in alloc_skb_with_frags, invoking the reclaimer just as we do for kmalloc, while tcp window size control should stay aside. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965647AbbJVStL (ORCPT ); Thu, 22 Oct 2015 14:49:11 -0400 Received: from relay.parallels.com ([195.214.232.42]:47586 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965394AbbJVStI (ORCPT ); Thu, 22 Oct 2015 14:49:08 -0400 Date: Thu, 22 Oct 2015 21:48:53 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Message-ID: <20151022184852.GP18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> X-ClientProxiedBy: US-EXCH.sw.swsoft.com (10.255.249.47) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote: ...
> @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > } > } > > + vmpressure(sc->gfp_mask, memcg, > + sc->nr_scanned - scanned, > + sc->nr_reclaimed - reclaimed); > + > /* > * Direct reclaim and kswapd have to scan all memory > * cgroups to fulfill the overall scan target for the > @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > } > } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > > - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, > - sc->nr_scanned - nr_scanned, > - sc->nr_reclaimed - nr_reclaimed); > - > if (sc->nr_reclaimed - nr_reclaimed) > reclaimable = true; > I may be mistaken, but AFAIU this patch subtly changes the behavior of vmpressure visible from userspace: without this patch, a userspace process receives a notification for a memory cgroup only if *this* memory cgroup invokes the reclaimer; with this patch, a userspace notification is issued even if the reclaimer is invoked by any cgroup up the hierarchy. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965602AbbJVS6K (ORCPT ); Thu, 22 Oct 2015 14:58:10 -0400 Received: from relay.parallels.com ([195.214.232.42]:47976 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965180AbbJVS6E (ORCPT ); Thu, 22 Oct 2015 14:58:04 -0400 Date: Thu, 22 Oct 2015 21:57:47 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S.
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure Message-ID: <20151022185747.GQ18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-9-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <1445487696-21545-9-git-send-email-hannes@cmpxchg.org> X-ClientProxiedBy: US-EXCH2.sw.swsoft.com (10.255.249.46) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 12:21:36AM -0400, Johannes Weiner wrote: ... > @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work) > vmpr->reclaimed = 0; > spin_unlock(&vmpr->sr_lock); > > + level = vmpressure_calc_level(scanned, reclaimed); > + > + if (level > VMPRESSURE_LOW) { So we start socket_pressure at MEDIUM. Why not at LOW or CRITICAL? > + struct mem_cgroup *memcg; > + /* > + * Let the socket buffer allocator know that we are > + * having trouble reclaiming LRU pages. > + * > + * For hysteresis, keep the pressure state asserted > + * for a second in which subsequent pressure events > + * can occur. > + * > + * XXX: is vmpressure a global feature or part of > + * memcg? There shouldn't be anything memcg-specific > + * about exporting reclaim success ratios from the VM. > + */ > + memcg = container_of(vmpr, struct mem_cgroup, vmpressure); > + if (memcg != root_mem_cgroup) > + memcg->socket_pressure = jiffies + HZ; Why 1 second? 
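For reference, the hysteresis described in the quoted comment can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: struct group, note_pressure(), under_socket_pressure() and tick_before() are hypothetical names, HZ is fixed at 100 for the sketch, and the consumer-side check (comparing the current tick against the stored deadline) is an assumption about how the socket code would read the field, since that part is not shown in the hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HZ 100UL	/* assumed tick rate for this sketch only */

/* Hypothetical stand-in for the memcg field added by the patch. */
struct group {
	unsigned long socket_pressure;	/* tick until which pressure is asserted */
};

/* Wraparound-safe "a is before b", in the spirit of the kernel's time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Reclaim side: on a pressure event above LOW, assert pressure for one second. */
static void note_pressure(struct group *g, unsigned long now)
{
	assert(g != NULL);
	g->socket_pressure = now + HZ;
}

/* Socket side (assumed): pressure holds while the deadline lies in the future. */
static bool under_socket_pressure(const struct group *g, unsigned long now)
{
	return tick_before(now, g->socket_pressure);
}
```

Each new event pushes the deadline forward, so a burst of events keeps the state asserted, and it clears on its own one second after the last one. The length of that window is exactly the tunable being questioned here.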
Thanks, Vladimir > + } > + > do { > - if (vmpressure_event(vmpr, scanned, reclaimed)) > + if (vmpressure_event(vmpr, level)) > break; > /* > * If not handled, propagate the event upward into the From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965401AbbJVTKB (ORCPT ); Thu, 22 Oct 2015 15:10:01 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:39502 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753688AbbJVTJ7 (ORCPT ); Thu, 22 Oct 2015 15:09:59 -0400 Date: Thu, 22 Oct 2015 15:09:43 -0400 From: Johannes Weiner To: Vladimir Davydov Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Message-ID: <20151022190943.GA20871@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> <20151022184612.GN18351@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151022184612.GN18351@esperanza> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote: > On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > > The tcp memory controller has extensive provisions for future memory > > accounting interfaces that won't materialize after all. Cut the code > > base down to what's actually used, now and in the likely future. > > > > - There won't be any different protocol counters in the future, so a > > direct sock->sk_memcg linkage is enough. This eliminates a lot of > > callback maze and boilerplate code, and restores most of the socket > > allocation code to pre-tcp_memcontrol state. 
> > > > - There won't be a tcp control soft limit, so integrating the memcg > > In fact, the code is ready for the "soft" limit (I mean min, pressure, > max tuple), it just lacks a knob. Yeah, but that's not going to materialize if the entire interface for dedicated tcp throttling is considered obsolete. > > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > > if (!sk->sk_prot->memory_pressure) > > return false; > > > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > > - return !!sk->sk_cgrp->memory_pressure; > > - > > AFAIU, now we won't shrink the window on hitting the limit, i.e. this > patch subtly changes the behavior of the existing knobs, potentially > breaking them. Hm, but there is no grace period in which something meaningful could happen with the window shrinking, is there? Any buffer allocation is still going to fail hard. I don't see how this would change anything in practice. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753256AbbJWLbL (ORCPT ); Fri, 23 Oct 2015 07:31:11 -0400 Received: from mail-wi0-f180.google.com ([209.85.212.180]:34832 "EHLO mail-wi0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753198AbbJWLbG (ORCPT ); Fri, 23 Oct 2015 07:31:06 -0400 Date: Fri, 23 Oct 2015 13:31:03 +0200 From: Michal Hocko To: Johannes Weiner Cc: "David S. 
Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/8] mm: page_counter: let page_counter_try_charge() return bool Message-ID: <20151023113103.GJ2410@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-2-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-2-git-send-email-hannes@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:29, Johannes Weiner wrote: > page_counter_try_charge() currently returns 0 on success and -ENOMEM > on failure, which is surprising behavior given the function name. > > Make it follow the expected pattern of try_stuff() functions that > return a boolean true to indicate success, or false for failure. 
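As a rough illustration of the try_stuff() convention described in the changelog, here is a minimal userspace sketch of a hierarchical counter whose try-charge returns a bool. It is a simplification under stated assumptions: struct counter, counter_cancel() and counter_try_charge() are hypothetical stand-ins for the page_counter API, there are no atomics, and the limit is checked before charging rather than charging first and cancelling as the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct page_counter (no atomics). */
struct counter {
	unsigned long count;
	unsigned long limit;
	struct counter *parent;
};

/* Undo a charge of nr_pages on a single level. */
static void counter_cancel(struct counter *c, unsigned long nr_pages)
{
	c->count -= nr_pages;
}

/*
 * Charge nr_pages to counter and all of its ancestors.
 * Returns true on success. On failure, returns false, stores the first
 * counter that would exceed its limit in *fail, and rolls back the
 * levels that were already charged.
 */
static bool counter_try_charge(struct counter *counter,
			       unsigned long nr_pages,
			       struct counter **fail)
{
	struct counter *c;

	assert(fail != NULL);

	for (c = counter; c; c = c->parent) {
		if (c->count + nr_pages > c->limit) {
			*fail = c;
			goto failed;
		}
		c->count += nr_pages;
	}
	return true;

failed:
	for (c = counter; c != *fail; c = c->parent)
		counter_cancel(c, nr_pages);
	return false;
}
```

With a bool return, a caller that needs an errno writes "if (!counter_try_charge(...)) ret = -ENOMEM;", which is the shape the hugetlb_cgroup.c and memcontrol.c hunks in this patch take.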
> > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko > --- > include/linux/page_counter.h | 6 +++--- > mm/hugetlb_cgroup.c | 3 ++- > mm/memcontrol.c | 11 +++++------ > mm/page_counter.c | 14 +++++++------- > 4 files changed, 17 insertions(+), 17 deletions(-) > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h > index 17fa4f8..7e62920 100644 > --- a/include/linux/page_counter.h > +++ b/include/linux/page_counter.h > @@ -36,9 +36,9 @@ static inline unsigned long page_counter_read(struct page_counter *counter) > > void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages); > void page_counter_charge(struct page_counter *counter, unsigned long nr_pages); > -int page_counter_try_charge(struct page_counter *counter, > - unsigned long nr_pages, > - struct page_counter **fail); > +bool page_counter_try_charge(struct page_counter *counter, > + unsigned long nr_pages, > + struct page_counter **fail); > void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages); > int page_counter_limit(struct page_counter *counter, unsigned long limit); > int page_counter_memparse(const char *buf, const char *max, > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c > index 6a44263..d8fb10d 100644 > --- a/mm/hugetlb_cgroup.c > +++ b/mm/hugetlb_cgroup.c > @@ -186,7 +186,8 @@ again: > } > rcu_read_unlock(); > > - ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter); > + if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter)) > + ret = -ENOMEM; > css_put(&h_cg->css); > done: > *ptr = h_cg; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c71fe40..a8ccdbc 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2018,8 +2018,8 @@ retry: > return 0; > > if (!do_swap_account || > - !page_counter_try_charge(&memcg->memsw, batch, &counter)) { > - if (!page_counter_try_charge(&memcg->memory, batch, &counter)) > + page_counter_try_charge(&memcg->memsw, batch, &counter)) { > + if 
(page_counter_try_charge(&memcg->memory, batch, &counter)) > goto done_restock; > if (do_swap_account) > page_counter_uncharge(&memcg->memsw, batch); > @@ -2383,14 +2383,13 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, > { > unsigned int nr_pages = 1 << order; > struct page_counter *counter; > - int ret = 0; > + int ret; > > if (!memcg_kmem_is_active(memcg)) > return 0; > > - ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter); > - if (ret) > - return ret; > + if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) > + return -ENOMEM; > > ret = try_charge(memcg, gfp, nr_pages); > if (ret) { > diff --git a/mm/page_counter.c b/mm/page_counter.c > index 11b4bed..7c6a63d 100644 > --- a/mm/page_counter.c > +++ b/mm/page_counter.c > @@ -56,12 +56,12 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages) > * @nr_pages: number of pages to charge > * @fail: points first counter to hit its limit, if any > * > - * Returns 0 on success, or -ENOMEM and @fail if the counter or one of > - * its ancestors has hit its configured limit. > + * Returns %true on success, or %false and @fail if the counter or one > + * of its ancestors has hit its configured limit. 
> */ > -int page_counter_try_charge(struct page_counter *counter, > - unsigned long nr_pages, > - struct page_counter **fail) > +bool page_counter_try_charge(struct page_counter *counter, > + unsigned long nr_pages, > + struct page_counter **fail) > { > struct page_counter *c; > > @@ -99,13 +99,13 @@ int page_counter_try_charge(struct page_counter *counter, > if (new > c->watermark) > c->watermark = new; > } > - return 0; > + return true; > > failed: > for (c = counter; c != *fail; c = c->parent) > page_counter_cancel(c, nr_pages); > > - return -ENOMEM; > + return false; > } > > /** > -- > 2.6.1 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753289AbbJWLcn (ORCPT ); Fri, 23 Oct 2015 07:32:43 -0400 Received: from mail-wi0-f175.google.com ([209.85.212.175]:33403 "EHLO mail-wi0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751969AbbJWLck (ORCPT ); Fri, 23 Oct 2015 07:32:40 -0400 Date: Fri, 23 Oct 2015 13:32:38 +0200 From: Michal Hocko To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/8] mm: memcontrol: export root_mem_cgroup Message-ID: <20151023113237.GK2410@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-3-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-3-git-send-email-hannes@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:30, Johannes Weiner wrote: > A later patch will need this symbol in files other than memcontrol.c, > so export it now and replace mem_cgroup_root_css at the same time. 
> > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko > --- > include/linux/memcontrol.h | 3 ++- > mm/backing-dev.c | 2 +- > mm/memcontrol.c | 5 ++--- > 3 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 805da1f..19ff87b 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -275,7 +275,8 @@ struct mem_cgroup { > struct mem_cgroup_per_node *nodeinfo[0]; > /* WARNING: nodeinfo must be the last member here */ > }; > -extern struct cgroup_subsys_state *mem_cgroup_root_css; > + > +extern struct mem_cgroup *root_mem_cgroup; > > /** > * mem_cgroup_events - count memory events against a cgroup > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index 095b23b..73ab967 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -702,7 +702,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi) > > ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL); > if (!ret) { > - bdi->wb.memcg_css = mem_cgroup_root_css; > + bdi->wb.memcg_css = &root_mem_cgroup->css; > bdi->wb.blkcg_css = blkcg_root_css; > } > return ret; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index a8ccdbc..e54f434 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -76,9 +76,9 @@ > struct cgroup_subsys memory_cgrp_subsys __read_mostly; > EXPORT_SYMBOL(memory_cgrp_subsys); > > +struct mem_cgroup *root_mem_cgroup __read_mostly; > + > #define MEM_CGROUP_RECLAIM_RETRIES 5 > -static struct mem_cgroup *root_mem_cgroup __read_mostly; > -struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly; > > /* Whether the swap controller is active */ > #ifdef CONFIG_MEMCG_SWAP > @@ -4213,7 +4213,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > /* root ? 
*/ > if (parent_css == NULL) { > root_mem_cgroup = memcg; > - mem_cgroup_root_css = &memcg->css; > page_counter_init(&memcg->memory, NULL); > memcg->high = PAGE_COUNTER_MAX; > memcg->soft_limit = PAGE_COUNTER_MAX; > -- > 2.6.1 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753049AbbJWMik (ORCPT ); Fri, 23 Oct 2015 08:38:40 -0400 Received: from mail-wi0-f172.google.com ([209.85.212.172]:34301 "EHLO mail-wi0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751478AbbJWMie (ORCPT ); Fri, 23 Oct 2015 08:38:34 -0400 Date: Fri, 23 Oct 2015 14:38:30 +0200 From: Michal Hocko To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Message-ID: <20151023123830.GL2410@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:31, Johannes Weiner wrote: > The tcp memory controller has extensive provisions for future memory > accounting interfaces that won't materialize after all. Cut the code > base down to what's actually used, now and in the likely future. > > - There won't be any different protocol counters in the future, so a > direct sock->sk_memcg linkage is enough. This eliminates a lot of > callback maze and boilerplate code, and restores most of the socket > allocation code to pre-tcp_memcontrol state. 
> > - There won't be a tcp control soft limit, so integrating the memcg > code into the global skmem limiting scheme complicates things > unnecessarily. Replace all that with simple and clear charge and > uncharge calls--hidden behind a jump label--to account skb memory. > > - The previous jump label code was an elaborate state machine that > tracked the number of cgroups with an active socket limit in order > to enable the skmem tracking and accounting code only when actively > necessary. But this is overengineered: it was meant to protect the > people who never use this feature in the first place. Simply enable > the branches once when the first limit is set until the next reboot. > > Signed-off-by: Johannes Weiner The changelog is certainly attractive. I have looked through the patch but my knowledge of the networking subsystem and its memory management is close to zero so I cannot really do a competent review. Anyway I support any simplification of the tcp kmem accounting. If networking people are OK with the changes, including reduction of the functionality as described by Vladimir then no objections from me for this to be merged. Thanks! 
> --- > include/linux/memcontrol.h | 64 ++++++++----------- > include/net/sock.h | 135 +++------------------------------------ > include/net/tcp.h | 3 - > include/net/tcp_memcontrol.h | 7 --- > mm/memcontrol.c | 101 +++++++++++++++-------------- > net/core/sock.c | 78 ++++++----------------- > net/ipv4/sysctl_net_ipv4.c | 1 - > net/ipv4/tcp.c | 3 +- > net/ipv4/tcp_ipv4.c | 9 +-- > net/ipv4/tcp_memcontrol.c | 147 +++++++------------------------------------ > net/ipv4/tcp_output.c | 6 +- > net/ipv6/tcp_ipv6.c | 3 - > 12 files changed, 136 insertions(+), 421 deletions(-) > delete mode 100644 include/net/tcp_memcontrol.h > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 19ff87b..5b72f83 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -85,34 +85,6 @@ enum mem_cgroup_events_target { > MEM_CGROUP_NTARGETS, > }; > > -/* > - * Bits in struct cg_proto.flags > - */ > -enum cg_proto_flags { > - /* Currently active and new sockets should be assigned to cgroups */ > - MEMCG_SOCK_ACTIVE, > - /* It was ever activated; we must disarm static keys on destruction */ > - MEMCG_SOCK_ACTIVATED, > -}; > - > -struct cg_proto { > - struct page_counter memory_allocated; /* Current allocated memory. */ > - struct percpu_counter sockets_allocated; /* Current number of sockets. */ > - int memory_pressure; > - long sysctl_mem[3]; > - unsigned long flags; > - /* > - * memcg field is used to find which memcg we belong directly > - * Each memcg struct can hold more than one cg_proto, so container_of > - * won't really cut. > - * > - * The elegant solution would be having an inverse function to > - * proto_cgroup in struct proto, but that means polluting the structure > - * for everybody, instead of just for memcg users. 
> - */ > - struct mem_cgroup *memcg; > -}; > - > #ifdef CONFIG_MEMCG > struct mem_cgroup_stat_cpu { > long count[MEM_CGROUP_STAT_NSTATS]; > @@ -185,8 +157,15 @@ struct mem_cgroup { > > /* Accounted resources */ > struct page_counter memory; > + > + /* > + * Legacy non-resource counters. In unified hierarchy, all > + * memory is accounted and limited through memcg->memory. > + * Consumer breakdown happens in the statistics. > + */ > struct page_counter memsw; > struct page_counter kmem; > + struct page_counter skmem; > > /* Normal memory consumption range */ > unsigned long low; > @@ -246,9 +225,6 @@ struct mem_cgroup { > */ > struct mem_cgroup_stat_cpu __percpu *stat; > > -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) > - struct cg_proto tcp_mem; > -#endif > #if defined(CONFIG_MEMCG_KMEM) > /* Index in the kmem_cache->memcg_params.memcg_caches array */ > int kmemcg_id; > @@ -676,12 +652,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx) > } > #endif /* CONFIG_MEMCG */ > > -enum { > - UNDER_LIMIT, > - SOFT_LIMIT, > - OVER_LIMIT, > -}; > - > #ifdef CONFIG_CGROUP_WRITEBACK > > struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg); > @@ -707,15 +677,35 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, > > struct sock; > #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > +extern struct static_key_false mem_cgroup_sockets; > +static inline bool mem_cgroup_do_sockets(void) > +{ > + return static_branch_unlikely(&mem_cgroup_sockets); > +} > void sock_update_memcg(struct sock *sk); > void sock_release_memcg(struct sock *sk); > +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); > +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages); > #else > +static inline bool mem_cgroup_do_sockets(void) > +{ > + return false; > +} > static inline void sock_update_memcg(struct sock *sk) > { > } > static inline void sock_release_memcg(struct sock *sk) > { > } > 
+static inline bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, > + unsigned int nr_pages) > +{ > + return true; > +} > +static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, > + unsigned int nr_pages) > +{ > +} > #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */ > > #ifdef CONFIG_MEMCG_KMEM > diff --git a/include/net/sock.h b/include/net/sock.h > index 59a7196..67795fc 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -69,22 +69,6 @@ > #include > #include > > -struct cgroup; > -struct cgroup_subsys; > -#ifdef CONFIG_NET > -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss); > -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg); > -#else > -static inline > -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > -{ > - return 0; > -} > -static inline > -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) > -{ > -} > -#endif > /* > * This structure really needs to be cleaned up. > * Most of it is for TCP, and not used by any of > @@ -243,7 +227,6 @@ struct sock_common { > /* public: */ > }; > > -struct cg_proto; > /** > * struct sock - network layer representation of sockets > * @__sk_common: shared layout with inet_timewait_sock > @@ -310,7 +293,7 @@ struct cg_proto; > * @sk_security: used by security modules > * @sk_mark: generic packet mark > * @sk_classid: this socket's cgroup classid > - * @sk_cgrp: this socket's cgroup-specific proto data > + * @sk_memcg: this socket's memcg association > * @sk_write_pending: a write to stream socket waits to start > * @sk_state_change: callback to indicate change in the state of the sock > * @sk_data_ready: callback to indicate there is data to be processed > @@ -447,7 +430,7 @@ struct sock { > #ifdef CONFIG_CGROUP_NET_CLASSID > u32 sk_classid; > #endif > - struct cg_proto *sk_cgrp; > + struct mem_cgroup *sk_memcg; > void (*sk_state_change)(struct sock *sk); > void (*sk_data_ready)(struct sock *sk); > void 
(*sk_write_space)(struct sock *sk); > @@ -1051,18 +1034,6 @@ struct proto { > #ifdef SOCK_REFCNT_DEBUG > atomic_t socks; > #endif > -#ifdef CONFIG_MEMCG_KMEM > - /* > - * cgroup specific init/deinit functions. Called once for all > - * protocols that implement it, from cgroups populate function. > - * This function has to setup any files the protocol want to > - * appear in the kmem cgroup filesystem. > - */ > - int (*init_cgroup)(struct mem_cgroup *memcg, > - struct cgroup_subsys *ss); > - void (*destroy_cgroup)(struct mem_cgroup *memcg); > - struct cg_proto *(*proto_cgroup)(struct mem_cgroup *memcg); > -#endif > }; > > int proto_register(struct proto *prot, int alloc_slab); > @@ -1093,23 +1064,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk) > #define sk_refcnt_debug_release(sk) do { } while (0) > #endif /* SOCK_REFCNT_DEBUG */ > > -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET) > -extern struct static_key memcg_socket_limit_enabled; > -static inline struct cg_proto *parent_cg_proto(struct proto *proto, > - struct cg_proto *cg_proto) > -{ > - return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg)); > -} > -#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled) > -#else > -#define mem_cgroup_sockets_enabled 0 > -static inline struct cg_proto *parent_cg_proto(struct proto *proto, > - struct cg_proto *cg_proto) > -{ > - return NULL; > -} > -#endif > - > static inline bool sk_stream_memory_free(const struct sock *sk) > { > if (sk->sk_wmem_queued >= sk->sk_sndbuf) > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > if (!sk->sk_prot->memory_pressure) > return false; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - > return !!*sk->sk_prot->memory_pressure; > } > > @@ -1146,61 +1097,19 @@ static inline void sk_leave_memory_pressure(struct sock *sk) > { > int *memory_pressure = sk->sk_prot->memory_pressure; > > - if 
(!memory_pressure) > - return; > - > - if (*memory_pressure) > + if (memory_pressure && *memory_pressure) > *memory_pressure = 0; > - > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct cg_proto *cg_proto = sk->sk_cgrp; > - struct proto *prot = sk->sk_prot; > - > - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) > - cg_proto->memory_pressure = 0; > - } > - > } > > static inline void sk_enter_memory_pressure(struct sock *sk) > { > - if (!sk->sk_prot->enter_memory_pressure) > - return; > - > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct cg_proto *cg_proto = sk->sk_cgrp; > - struct proto *prot = sk->sk_prot; > - > - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) > - cg_proto->memory_pressure = 1; > - } > - > - sk->sk_prot->enter_memory_pressure(sk); > + if (sk->sk_prot->enter_memory_pressure) > + sk->sk_prot->enter_memory_pressure(sk); > } > > static inline long sk_prot_mem_limits(const struct sock *sk, int index) > { > - long *prot = sk->sk_prot->sysctl_mem; > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - prot = sk->sk_cgrp->sysctl_mem; > - return prot[index]; > -} > - > -static inline void memcg_memory_allocated_add(struct cg_proto *prot, > - unsigned long amt, > - int *parent_status) > -{ > - page_counter_charge(&prot->memory_allocated, amt); > - > - if (page_counter_read(&prot->memory_allocated) > > - prot->memory_allocated.limit) > - *parent_status = OVER_LIMIT; > -} > - > -static inline void memcg_memory_allocated_sub(struct cg_proto *prot, > - unsigned long amt) > -{ > - page_counter_uncharge(&prot->memory_allocated, amt); > + return sk->sk_prot->sysctl_mem[index]; > } > > static inline long > @@ -1208,24 +1117,14 @@ sk_memory_allocated(const struct sock *sk) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return page_counter_read(&sk->sk_cgrp->memory_allocated); > - > return atomic_long_read(prot->memory_allocated); > } > > static inline long > 
-sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status) > +sk_memory_allocated_add(struct sock *sk, int amt) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status); > - /* update the root cgroup regardless */ > - atomic_long_add_return(amt, prot->memory_allocated); > - return page_counter_read(&sk->sk_cgrp->memory_allocated); > - } > - > return atomic_long_add_return(amt, prot->memory_allocated); > } > > @@ -1234,9 +1133,6 @@ sk_memory_allocated_sub(struct sock *sk, int amt) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - memcg_memory_allocated_sub(sk->sk_cgrp, amt); > - > atomic_long_sub(amt, prot->memory_allocated); > } > > @@ -1244,13 +1140,6 @@ static inline void sk_sockets_allocated_dec(struct sock *sk) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct cg_proto *cg_proto = sk->sk_cgrp; > - > - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) > - percpu_counter_dec(&cg_proto->sockets_allocated); > - } > - > percpu_counter_dec(prot->sockets_allocated); > } > > @@ -1258,13 +1147,6 @@ static inline void sk_sockets_allocated_inc(struct sock *sk) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct cg_proto *cg_proto = sk->sk_cgrp; > - > - for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) > - percpu_counter_inc(&cg_proto->sockets_allocated); > - } > - > percpu_counter_inc(prot->sockets_allocated); > } > > @@ -1273,9 +1155,6 @@ sk_sockets_allocated_read_positive(struct sock *sk) > { > struct proto *prot = sk->sk_prot; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated); > - > return percpu_counter_read_positive(prot->sockets_allocated); > } > > diff --git a/include/net/tcp.h b/include/net/tcp.h > 
index eed94fc..77b6c7e 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -291,9 +291,6 @@ extern int tcp_memory_pressure; > /* optimized version of sk_under_memory_pressure() for TCP sockets */ > static inline bool tcp_under_memory_pressure(const struct sock *sk) > { > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - > return tcp_memory_pressure; > } > /* > diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h > deleted file mode 100644 > index 05b94d9..0000000 > --- a/include/net/tcp_memcontrol.h > +++ /dev/null > @@ -1,7 +0,0 @@ > -#ifndef _TCP_MEMCG_H > -#define _TCP_MEMCG_H > - > -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg); > -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss); > -void tcp_destroy_cgroup(struct mem_cgroup *memcg); > -#endif /* _TCP_MEMCG_H */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e54f434..c41e6d7 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -66,7 +66,6 @@ > #include "internal.h" > #include > #include > -#include > #include "slab.h" > > #include > @@ -291,58 +290,68 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) > /* Writing them here to avoid exposing memcg's inner layout */ > #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > > +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > + > void sock_update_memcg(struct sock *sk) > { > - if (mem_cgroup_sockets_enabled) { > - struct mem_cgroup *memcg; > - struct cg_proto *cg_proto; > - > - BUG_ON(!sk->sk_prot->proto_cgroup); > - > - /* Socket cloning can throw us here with sk_cgrp already > - * filled. It won't however, necessarily happen from > - * process context. So the test for root memcg given > - * the current task's memcg won't help us in this case. > - * > - * Respecting the original socket's memcg is a better > - * decision in this case. 
> - */ > - if (sk->sk_cgrp) { > - BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg)); > - css_get(&sk->sk_cgrp->memcg->css); > - return; > - } > - > - rcu_read_lock(); > - memcg = mem_cgroup_from_task(current); > - cg_proto = sk->sk_prot->proto_cgroup(memcg); > - if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) && > - css_tryget_online(&memcg->css)) { > - sk->sk_cgrp = cg_proto; > - } > - rcu_read_unlock(); > + struct mem_cgroup *memcg; > + /* > + * Socket cloning can throw us here with sk_cgrp already > + * filled. It won't however, necessarily happen from > + * process context. So the test for root memcg given > + * the current task's memcg won't help us in this case. > + * > + * Respecting the original socket's memcg is a better > + * decision in this case. > + */ > + if (sk->sk_memcg) { > + BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); > + css_get(&sk->sk_memcg->css); > + return; > } > + > + rcu_read_lock(); > + memcg = mem_cgroup_from_task(current); > + if (css_tryget_online(&memcg->css)) > + sk->sk_memcg = memcg; > + rcu_read_unlock(); > } > EXPORT_SYMBOL(sock_update_memcg); > > void sock_release_memcg(struct sock *sk) > { > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { > - struct mem_cgroup *memcg; > - WARN_ON(!sk->sk_cgrp->memcg); > - memcg = sk->sk_cgrp->memcg; > - css_put(&sk->sk_cgrp->memcg->css); > - } > + if (sk->sk_memcg) > + css_put(&sk->sk_memcg->css); > } > > -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) > +/** > + * mem_cgroup_charge_skmem - charge socket memory > + * @memcg: memcg to charge > + * @nr_pages: number of pages to charge > + * > + * Charges @nr_pages to @memcg. Returns %true if the charge fit within > + * the memcg's configured limit, %false if the charge had to be forced. 
> + */ > +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > { > - if (!memcg || mem_cgroup_is_root(memcg)) > - return NULL; > + struct page_counter *counter; > + > + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + return true; > > - return &memcg->tcp_mem; > + page_counter_charge(&memcg->skmem, nr_pages); > + return false; > +} > + > +/** > + * mem_cgroup_uncharge_skmem - uncharge socket memory > + * @memcg: memcg to uncharge > + * @nr_pages: number of pages to uncharge > + */ > +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > +{ > + page_counter_uncharge(&memcg->skmem, nr_pages); > } > -EXPORT_SYMBOL(tcp_proto_cgroup); > > #endif > > @@ -3592,13 +3601,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, > #ifdef CONFIG_MEMCG_KMEM > static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > { > - int ret; > - > - ret = memcg_propagate_kmem(memcg); > - if (ret) > - return ret; > - > - return mem_cgroup_sockets_init(memcg, ss); > + return memcg_propagate_kmem(memcg); > } > > static void memcg_deactivate_kmem(struct mem_cgroup *memcg) > @@ -3654,7 +3657,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg) > static_key_slow_dec(&memcg_kmem_enabled_key); > WARN_ON(page_counter_read(&memcg->kmem)); > } > - mem_cgroup_sockets_destroy(memcg); > } > #else > static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > @@ -4218,6 +4220,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > memcg->soft_limit = PAGE_COUNTER_MAX; > page_counter_init(&memcg->memsw, NULL); > page_counter_init(&memcg->kmem, NULL); > + page_counter_init(&memcg->skmem, NULL); > } > > memcg->last_scanned_node = MAX_NUMNODES; > @@ -4266,6 +4269,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > memcg->soft_limit = PAGE_COUNTER_MAX; > page_counter_init(&memcg->memsw, &parent->memsw); > page_counter_init(&memcg->kmem, 
&parent->kmem); > + page_counter_init(&memcg->skmem, &parent->skmem); > > /* > * No need to take a reference to the parent because cgroup > @@ -4277,6 +4281,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > memcg->soft_limit = PAGE_COUNTER_MAX; > page_counter_init(&memcg->memsw, NULL); > page_counter_init(&memcg->kmem, NULL); > + page_counter_init(&memcg->skmem, NULL); > /* > * Deeper hierachy with use_hierarchy == false doesn't make > * much sense so let cgroup subsystem know about this > diff --git a/net/core/sock.c b/net/core/sock.c > index 0fafd27..0debff5 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap) > } > EXPORT_SYMBOL(sk_net_capable); > > - > -#ifdef CONFIG_MEMCG_KMEM > -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > -{ > - struct proto *proto; > - int ret = 0; > - > - mutex_lock(&proto_list_mutex); > - list_for_each_entry(proto, &proto_list, node) { > - if (proto->init_cgroup) { > - ret = proto->init_cgroup(memcg, ss); > - if (ret) > - goto out; > - } > - } > - > - mutex_unlock(&proto_list_mutex); > - return ret; > -out: > - list_for_each_entry_continue_reverse(proto, &proto_list, node) > - if (proto->destroy_cgroup) > - proto->destroy_cgroup(memcg); > - mutex_unlock(&proto_list_mutex); > - return ret; > -} > - > -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) > -{ > - struct proto *proto; > - > - mutex_lock(&proto_list_mutex); > - list_for_each_entry_reverse(proto, &proto_list, node) > - if (proto->destroy_cgroup) > - proto->destroy_cgroup(memcg); > - mutex_unlock(&proto_list_mutex); > -} > -#endif > - > /* > * Each address family might have different locking rules, so we have > * one slock key per address family: > @@ -239,11 +201,6 @@ void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg) > static struct lock_class_key af_family_keys[AF_MAX]; > static struct lock_class_key af_family_slock_keys[AF_MAX]; > > -#if 
defined(CONFIG_MEMCG_KMEM) > -struct static_key memcg_socket_limit_enabled; > -EXPORT_SYMBOL(memcg_socket_limit_enabled); > -#endif > - > /* > * Make lock validator output more readable. (we pre-construct these > * strings build-time, so that runtime initialization of socket > @@ -1476,12 +1433,6 @@ void sk_free(struct sock *sk) > } > EXPORT_SYMBOL(sk_free); > > -static void sk_update_clone(const struct sock *sk, struct sock *newsk) > -{ > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - sock_update_memcg(newsk); > -} > - > /** > * sk_clone_lock - clone a socket, and lock its clone > * @sk: the socket to clone > @@ -1577,7 +1528,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority) > sk_set_socket(newsk, NULL); > newsk->sk_wq = NULL; > > - sk_update_clone(sk, newsk); > + if (mem_cgroup_do_sockets()) > + sock_update_memcg(newsk); > > if (newsk->sk_prot->sockets_allocated) > sk_sockets_allocated_inc(newsk); > @@ -2036,27 +1988,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind) > struct proto *prot = sk->sk_prot; > int amt = sk_mem_pages(size); > long allocated; > - int parent_status = UNDER_LIMIT; > > sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; > > - allocated = sk_memory_allocated_add(sk, amt, &parent_status); > + allocated = sk_memory_allocated_add(sk, amt); > + > + if (mem_cgroup_do_sockets() && sk->sk_memcg && > + !mem_cgroup_charge_skmem(sk->sk_memcg, amt)) > + goto suppress_allocation; > > /* Under limit. */ > - if (parent_status == UNDER_LIMIT && > - allocated <= sk_prot_mem_limits(sk, 0)) { > + if (allocated <= sk_prot_mem_limits(sk, 0)) { > sk_leave_memory_pressure(sk); > return 1; > } > > - /* Under pressure. (we or our parents) */ > - if ((parent_status > SOFT_LIMIT) || > - allocated > sk_prot_mem_limits(sk, 1)) > + /* Under pressure. 
*/ > + if (allocated > sk_prot_mem_limits(sk, 1)) > sk_enter_memory_pressure(sk); > > - /* Over hard limit (we or our parents) */ > - if ((parent_status == OVER_LIMIT) || > - (allocated > sk_prot_mem_limits(sk, 2))) > + /* Over hard limit. */ > + if (allocated > sk_prot_mem_limits(sk, 2)) > goto suppress_allocation; > > /* guarantee minimum buffer size under pressure */ > @@ -2105,6 +2057,9 @@ suppress_allocation: > > sk_memory_allocated_sub(sk, amt); > > + if (mem_cgroup_do_sockets() && sk->sk_memcg) > + mem_cgroup_uncharge_skmem(sk->sk_memcg, amt); > + > return 0; > } > EXPORT_SYMBOL(__sk_mem_schedule); > @@ -2120,6 +2075,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount) > sk_memory_allocated_sub(sk, amount); > sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT; > > + if (mem_cgroup_do_sockets() && sk->sk_memcg) > + mem_cgroup_uncharge_skmem(sk->sk_memcg, amount); > + > if (sk_under_memory_pressure(sk) && > (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0))) > sk_leave_memory_pressure(sk); > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 894da3a..1f00819 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -24,7 +24,6 @@ > #include > #include > #include > -#include > > static int zero; > static int one = 1; > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index ac1bdbb..ec931c0 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -421,7 +421,8 @@ void tcp_init_sock(struct sock *sk) > sk->sk_rcvbuf = sysctl_tcp_rmem[1]; > > local_bh_disable(); > - sock_update_memcg(sk); > + if (mem_cgroup_do_sockets()) > + sock_update_memcg(sk); > sk_sockets_allocated_inc(sk); > local_bh_enable(); > } > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index 30dd45c..bb5f4f2 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -73,7 +73,6 @@ > #include > #include > #include > -#include > #include > > #include > @@ -1808,7 +1807,8 @@ void tcp_v4_destroy_sock(struct sock *sk) > 
tcp_saved_syn_free(tp); > > sk_sockets_allocated_dec(sk); > - sock_release_memcg(sk); > + if (mem_cgroup_do_sockets()) > + sock_release_memcg(sk); > } > EXPORT_SYMBOL(tcp_v4_destroy_sock); > > @@ -2330,11 +2330,6 @@ struct proto tcp_prot = { > .compat_setsockopt = compat_tcp_setsockopt, > .compat_getsockopt = compat_tcp_getsockopt, > #endif > -#ifdef CONFIG_MEMCG_KMEM > - .init_cgroup = tcp_init_cgroup, > - .destroy_cgroup = tcp_destroy_cgroup, > - .proto_cgroup = tcp_proto_cgroup, > -#endif > }; > EXPORT_SYMBOL(tcp_prot); > > diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c > index 2379c1b..09a37eb 100644 > --- a/net/ipv4/tcp_memcontrol.c > +++ b/net/ipv4/tcp_memcontrol.c > @@ -1,107 +1,10 @@ > -#include > -#include > -#include > -#include > -#include > +#include > #include > +#include > #include > - > -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > -{ > - /* > - * The root cgroup does not use page_counters, but rather, > - * rely on the data already collected by the network > - * subsystem > - */ > - struct mem_cgroup *parent = parent_mem_cgroup(memcg); > - struct page_counter *counter_parent = NULL; > - struct cg_proto *cg_proto, *parent_cg; > - > - cg_proto = tcp_prot.proto_cgroup(memcg); > - if (!cg_proto) > - return 0; > - > - cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0]; > - cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1]; > - cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2]; > - cg_proto->memory_pressure = 0; > - cg_proto->memcg = memcg; > - > - parent_cg = tcp_prot.proto_cgroup(parent); > - if (parent_cg) > - counter_parent = &parent_cg->memory_allocated; > - > - page_counter_init(&cg_proto->memory_allocated, counter_parent); > - percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL); > - > - return 0; > -} > -EXPORT_SYMBOL(tcp_init_cgroup); > - > -void tcp_destroy_cgroup(struct mem_cgroup *memcg) > -{ > - struct cg_proto *cg_proto; > - > - cg_proto = tcp_prot.proto_cgroup(memcg); > - if (!cg_proto) > - return; 
> - > - percpu_counter_destroy(&cg_proto->sockets_allocated); > - > - if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) > - static_key_slow_dec(&memcg_socket_limit_enabled); > - > -} > -EXPORT_SYMBOL(tcp_destroy_cgroup); > - > -static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages) > -{ > - struct cg_proto *cg_proto; > - int i; > - int ret; > - > - cg_proto = tcp_prot.proto_cgroup(memcg); > - if (!cg_proto) > - return -EINVAL; > - > - ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages); > - if (ret) > - return ret; > - > - for (i = 0; i < 3; i++) > - cg_proto->sysctl_mem[i] = min_t(long, nr_pages, > - sysctl_tcp_mem[i]); > - > - if (nr_pages == PAGE_COUNTER_MAX) > - clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); > - else { > - /* > - * The active bit needs to be written after the static_key > - * update. This is what guarantees that the socket activation > - * function is the last one to run. See sock_update_memcg() for > - * details, and note that we don't mark any socket as belonging > - * to this memcg until that flag is up. > - * > - * We need to do this, because static_keys will span multiple > - * sites, but we can't control their order. If we mark a socket > - * as accounted, but the accounting functions are not patched in > - * yet, we'll lose accounting. > - * > - * We never race with the readers in sock_update_memcg(), > - * because when this value change, the code to process it is not > - * patched in yet. > - * > - * The activated bit is used to guarantee that no two writers > - * will do the update in the same memcg. Without that, we can't > - * properly shutdown the static key. 
> - */ > - if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) > - static_key_slow_inc(&memcg_socket_limit_enabled); > - set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); > - } > - > - return 0; > -} > +#include > +#include > +#include > > enum { > RES_USAGE, > @@ -124,11 +27,17 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, > switch (of_cft(of)->private) { > case RES_LIMIT: > /* see memcontrol.c */ > + if (memcg == root_mem_cgroup) { > + ret = -EINVAL; > + break; > + } > ret = page_counter_memparse(buf, "-1", &nr_pages); > if (ret) > break; > mutex_lock(&tcp_limit_mutex); > - ret = tcp_update_limit(memcg, nr_pages); > + ret = page_counter_limit(&memcg->skmem, nr_pages); > + if (!ret) > + static_branch_enable(&mem_cgroup_sockets); > mutex_unlock(&tcp_limit_mutex); > break; > default: > @@ -141,32 +50,28 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, > static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(css); > - struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg); > u64 val; > > switch (cft->private) { > case RES_LIMIT: > - if (!cg_proto) > - return PAGE_COUNTER_MAX; > - val = cg_proto->memory_allocated.limit; > + val = memcg->skmem.limit; > val *= PAGE_SIZE; > break; > case RES_USAGE: > - if (!cg_proto) > + if (memcg == root_mem_cgroup) > val = atomic_long_read(&tcp_memory_allocated); > else > - val = page_counter_read(&cg_proto->memory_allocated); > + val = page_counter_read(&memcg->skmem); > val *= PAGE_SIZE; > break; > case RES_FAILCNT: > - if (!cg_proto) > - return 0; > - val = cg_proto->memory_allocated.failcnt; > + val = memcg->skmem.failcnt; > break; > case RES_MAX_USAGE: > - if (!cg_proto) > - return 0; > - val = cg_proto->memory_allocated.watermark; > + if (memcg == root_mem_cgroup) > + val = 0; > + else > + val = memcg->skmem.watermark; > val *= PAGE_SIZE; > break; > default: > @@ -178,20 +83,14 @@ static u64 
tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) > static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of, > char *buf, size_t nbytes, loff_t off) > { > - struct mem_cgroup *memcg; > - struct cg_proto *cg_proto; > - > - memcg = mem_cgroup_from_css(of_css(of)); > - cg_proto = tcp_prot.proto_cgroup(memcg); > - if (!cg_proto) > - return nbytes; > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > switch (of_cft(of)->private) { > case RES_MAX_USAGE: > - page_counter_reset_watermark(&cg_proto->memory_allocated); > + page_counter_reset_watermark(&memcg->skmem); > break; > case RES_FAILCNT: > - cg_proto->memory_allocated.failcnt = 0; > + memcg->skmem.failcnt = 0; > break; > } > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 19adedb..b496fc9 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -2819,13 +2819,15 @@ begin_fwd: > */ > void sk_forced_mem_schedule(struct sock *sk, int size) > { > - int amt, status; > + int amt; > > if (size <= sk->sk_forward_alloc) > return; > amt = sk_mem_pages(size); > sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; > - sk_memory_allocated_add(sk, amt, &status); > + sk_memory_allocated_add(sk, amt); > + if (mem_cgroup_do_sockets() && sk->sk_memcg) > + mem_cgroup_charge_skmem(sk->sk_memcg, amt); > } > > /* Send a FIN. The caller locks the socket for us. 
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c > index f495d18..cf19e65 100644 > --- a/net/ipv6/tcp_ipv6.c > +++ b/net/ipv6/tcp_ipv6.c > @@ -1862,9 +1862,6 @@ struct proto tcpv6_prot = { > .compat_setsockopt = compat_tcp_setsockopt, > .compat_getsockopt = compat_tcp_getsockopt, > #endif > -#ifdef CONFIG_MEMCG_KMEM > - .proto_cgroup = tcp_proto_cgroup, > -#endif > .clear_sk = tcp_v6_clear_sk, > }; > > -- > 2.6.1 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752620AbbJWNUB (ORCPT ); Fri, 23 Oct 2015 09:20:01 -0400 Received: from mail-wi0-f169.google.com ([209.85.212.169]:33613 "EHLO mail-wi0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751307AbbJWNT7 (ORCPT ); Fri, 23 Oct 2015 09:19:59 -0400 Date: Fri, 23 Oct 2015 15:19:56 +0200 From: Michal Hocko To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151023131956.GA15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:33, Johannes Weiner wrote: > Socket memory can be a significant share of overall memory consumed by > common workloads. In order to provide reasonable resource isolation > out-of-the-box in the unified hierarchy, this type of memory needs to > be accounted and tracked per default in the memory controller. 
What about users who do not want to pay an additional overhead for the accounting? How can they disable it? > Signed-off-by: Johannes Weiner [...] > @@ -5453,10 +5470,9 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) > commit_charge(newpage, memcg, true); > } > > -/* Writing them here to avoid exposing memcg's inner layout */ > -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > +#ifdef CONFIG_INET > > -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets); > +DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets); AFAIU this means that the jump label is enabled by default. Is this intended when you enable it explicitly where needed? > > void sock_update_memcg(struct sock *sk) > { -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751436AbbJWN01 (ORCPT ); Fri, 23 Oct 2015 09:26:27 -0400 Received: from mail-wi0-f181.google.com ([209.85.212.181]:36149 "EHLO mail-wi0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750827AbbJWN0Z (ORCPT ); Fri, 23 Oct 2015 09:26:25 -0400 Date: Fri, 23 Oct 2015 15:26:22 +0200 From: Michal Hocko To: Johannes Weiner Cc: "David S. Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 6/8] mm: vmscan: simplify memcg vs. 
global shrinker invocation Message-ID: <20151023132622.GB15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-7-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-7-git-send-email-hannes@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:34, Johannes Weiner wrote: > Letting shrink_slab() handle the root_mem_cgroup, and implicitly the > !CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers > unconditionally from within the memcg iteration loop. > > Signed-off-by: Johannes Weiner Acked-by: Michal Hocko > --- > include/linux/memcontrol.h | 2 ++ > mm/vmscan.c | 31 ++++++++++++++++--------------- > 2 files changed, 18 insertions(+), 15 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 6f1e0f8..d66ae18 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -482,6 +482,8 @@ void mem_cgroup_split_huge_fixup(struct page *head); > #else /* CONFIG_MEMCG */ > struct mem_cgroup; > > +#define root_mem_cgroup NULL > + > static inline void mem_cgroup_events(struct mem_cgroup *memcg, > enum mem_cgroup_events_index idx, > unsigned int nr) > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 9b52ecf..ecc2125 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > struct shrinker *shrinker; > unsigned long freed = 0; > > + /* Global shrinker mode */ > + if (memcg == root_mem_cgroup) > + memcg = NULL; > + > if (memcg && !memcg_kmem_is_active(memcg)) > return 0; > > @@ -2417,11 +2421,22 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > shrink_lruvec(lruvec, swappiness, sc, &lru_pages); > zone_lru_pages += lru_pages; > > - if (memcg &&
is_classzone) > + /* > + * Shrink the slab caches in the same proportion that > + * the eligible LRU pages were scanned. > + */ > + if (is_classzone) { > shrink_slab(sc->gfp_mask, zone_to_nid(zone), > memcg, sc->nr_scanned - scanned, > lru_pages); > > + if (reclaim_state) { > + sc->nr_reclaimed += > + reclaim_state->reclaimed_slab; > + reclaim_state->reclaimed_slab = 0; > + } > + } > + > /* > * Direct reclaim and kswapd have to scan all memory > * cgroups to fulfill the overall scan target for the > @@ -2439,20 +2454,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > } > } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > > - /* > - * Shrink the slab caches in the same proportion that > - * the eligible LRU pages were scanned. > - */ > - if (global_reclaim(sc) && is_classzone) > - shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL, > - sc->nr_scanned - nr_scanned, > - zone_lru_pages); > - > - if (reclaim_state) { > - sc->nr_reclaimed += reclaim_state->reclaimed_slab; > - reclaim_state->reclaimed_slab = 0; > - } > - > vmpressure(sc->gfp_mask, sc->target_mem_cgroup, > sc->nr_scanned - nr_scanned, > sc->nr_reclaimed - nr_reclaimed); > -- > 2.6.1 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752711AbbJWNnQ (ORCPT ); Fri, 23 Oct 2015 09:43:16 -0400 Received: from relay.parallels.com ([195.214.232.42]:36910 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751344AbbJWNnO (ORCPT ); Fri, 23 Oct 2015 09:43:14 -0400 Date: Fri, 23 Oct 2015 16:42:56 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Message-ID: <20151023134256.GS18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> <20151022184612.GN18351@esperanza> <20151022190943.GA20871@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151022190943.GA20871@cmpxchg.org> X-ClientProxiedBy: US-EXCH2.sw.swsoft.com (10.255.249.46) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 03:09:43PM -0400, Johannes Weiner wrote: > On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote: > > On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > > > The tcp memory controller has extensive provisions for future memory > > > accounting interfaces that won't materialize after all. Cut the code > > > base down to what's actually used, now and in the likely future. > > > > > > - There won't be any different protocol counters in the future, so a > > > direct sock->sk_memcg linkage is enough. This eliminates a lot of > > > callback maze and boilerplate code, and restores most of the socket > > > allocation code to pre-tcp_memcontrol state. > > > > > > - There won't be a tcp control soft limit, so integrating the memcg > > > > In fact, the code is ready for the "soft" limit (I mean min, pressure, > > max tuple), it just lacks a knob. > > Yeah, but that's not going to materialize if the entire interface for > dedicated tcp throttling is considered obsolete. Maybe it shouldn't be. My current understanding is that per-memcg tcp window control is necessary, because: - We need to be able to protect a containerized workload from its growing network buffers.
Using vmpressure notifications for that does not look reassuring to me. - We need a way to limit network buffers of a particular container, otherwise it can fill the system-wide window throttling other containers, which is unfair. > > > > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > > > if (!sk->sk_prot->memory_pressure) > > > return false; > > > > > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > > > - return !!sk->sk_cgrp->memory_pressure; > > > - > > > > AFAIU, now we won't shrink the window on hitting the limit, i.e. this > > patch subtly changes the behavior of the existing knobs, potentially > > breaking them. > > Hm, but there is no grace period in which something meaningful could > happen with the window shrinking, is there? Any buffer allocation is > still going to fail hard. AFAIU when we hit the limit, we not only throttle the socket which allocates, but also try to release space reserved by other sockets. After your patch we won't. This looks unfair to me. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752930AbbJWNuD (ORCPT ); Fri, 23 Oct 2015 09:50:03 -0400 Received: from mail-wi0-f176.google.com ([209.85.212.176]:33138 "EHLO mail-wi0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752684AbbJWNuA (ORCPT ); Fri, 23 Oct 2015 09:50:00 -0400 Date: Fri, 23 Oct 2015 15:49:57 +0200 From: Michal Hocko To: Johannes Weiner Cc: "David S. 
Miller" , Andrew Morton , Vladimir Davydov , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Message-ID: <20151023134957.GC15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 22-10-15 00:21:35, Johannes Weiner wrote: > The vmpressure metric is based on reclaim efficiency, which in turn is > an attribute of the LRU. However, vmpressure events are currently > reported at the source of pressure rather than at the reclaim level. > > Switch the reporting to the reclaim level to allow finer-grained > analysis of which memcg is having trouble reclaiming its pages. I can see how this can be useful. > As far as memory.pressure_level interface semantics go, events are > escalated up the hierarchy until a listener is found, so this won't > affect existing users that listen at higher levels. This is true but the parent will not see cumulative events anymore. One memcg might be fighting and barely reclaiming anything so it would report high pressure while another would be doing just fine. The parent will just see conflicting events in a short time period and cannot match them to the source memcg. This sounds really confusing. Even more confusing than the current semantics, which allow the same behavior under certain configurations. I dunno, have to think about it some more. Maybe we need to rethink how the pressure is signaled. If we want the breakdown of the particular memcgs then we should be able to identify them for this to be useful. [...]
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752890AbbJWNng (ORCPT ); Fri, 23 Oct 2015 09:43:36 -0400 Received: from shards.monkeyblade.net ([149.20.54.216]:35276 "EHLO shards.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752705AbbJWNnc (ORCPT ); Fri, 23 Oct 2015 09:43:32 -0400 Date: Fri, 23 Oct 2015 06:59:57 -0700 (PDT) Message-Id: <20151023.065957.1690815054807881760.davem@davemloft.net> To: mhocko@kernel.org Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy From: David Miller In-Reply-To: <20151023131956.GA15375@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> X-Mailer: Mew version 6.4 on Emacs 23.4 / Mule 6.0 (HANACHIRUSATO) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.5.12 (shards.monkeyblade.net [149.20.54.216]); Fri, 23 Oct 2015 06:43:32 -0700 (PDT) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Michal Hocko Date: Fri, 23 Oct 2015 15:19:56 +0200 > On Thu 22-10-15 00:21:33, Johannes Weiner wrote: >> Socket memory can be a significant share of overall memory consumed by >> common workloads. In order to provide reasonable resource isolation >> out-of-the-box in the unified hierarchy, this type of memory needs to >> be accounted and tracked per default in the memory controller. > > What about users who do not want to pay an additional overhead for the > accounting? How can they disable it? 
Yeah, this really cannot pass. This extra overhead will be seen by 99.9999% of users, since entities (especially distributions) just flip on all of these config options by default. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752834AbbJZQ4h (ORCPT ); Mon, 26 Oct 2015 12:56:37 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:40044 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751533AbbJZQ4c (ORCPT ); Mon, 26 Oct 2015 12:56:32 -0400 Date: Mon, 26 Oct 2015 12:56:19 -0400 From: Johannes Weiner To: David Miller Cc: mhocko@kernel.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151026165619.GB2214@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151023.065957.1690815054807881760.davem@davemloft.net> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 23, 2015 at 06:59:57AM -0700, David Miller wrote: > From: Michal Hocko > Date: Fri, 23 Oct 2015 15:19:56 +0200 > > > On Thu 22-10-15 00:21:33, Johannes Weiner wrote: > >> Socket memory can be a significant share of overall memory consumed by > >> common workloads. In order to provide reasonable resource isolation > >> out-of-the-box in the unified hierarchy, this type of memory needs to > >> be accounted and tracked per default in the memory controller. 
> > > > What about users who do not want to pay an additional overhead for the > > accounting? How can they disable it? > > Yeah, this really cannot pass. > > This extra overhead will be seen by 99.9999% of users, since entities > (especially distributions) just flip on all of these config options by > default. Okay, there are several layers to this issue. If you boot a machine with a CONFIG_MEMCG distribution kernel and don't create any cgroups, I agree there shouldn't be any overhead. I already sent a patch to generally remove memory accounting on the system or root level. I can easily update this patch here to not have any socket buffer accounting overhead for systems that don't actively use cgroups. Would you be okay with a branch on sk->sk_memcg in the network accounting path? I'd leave that NULL on the system level then. Then there is of course the case when you create cgroups for process organization but don't care about memory accounting. Systemd comes to mind. Or even if you create cgroups to track other resources like CPU but don't care about memory. The unified hierarchy no longer enables controllers on new cgroups per default, so unless you create a cgroup and specifically tell it to account and track memory, you won't have the socket memory accounting overhead, either. Then there is the third case, where you create a control group to specifically manage and limit the memory consumption of a workload. In that scenario, a major memory consumer like socket buffers, which can easily grow until OOM, should definitely be included in the tracking in order to properly contain both untrusted (possibly malicious) and trusted (possibly buggy) workloads. This is not a hole we can reasonably leave unpatched for general purpose resource management. Now you could argue that there might exist specialized workloads that need to account anonymous pages and page cache, but not socket memory buffers. Or any other combination of pick-and-choose consumers.
But honestly, nowadays all our paths are lockless, and the counting is an atomic-add-return with a per-cpu batch cache. I don't think there is a compelling case for an elaborate interface to make individual memory consumers configurable inside the memory controller. So in summary, would you be okay with this patch if networking only called into the memory controller when you explicitly create a cgroup AND tell it to track the memory footprint of the workload in it? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752656AbbJZRW2 (ORCPT ); Mon, 26 Oct 2015 13:22:28 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:40062 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751617AbbJZRW0 (ORCPT ); Mon, 26 Oct 2015 13:22:26 -0400 Date: Mon, 26 Oct 2015 13:22:16 -0400 From: Johannes Weiner To: Vladimir Davydov Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151026172216.GC2214@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151022184509.GM18351@esperanza> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote: > Hi Johannes, > > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: > ... > > Patch #5 adds accounting and tracking of socket memory to the unified > > hierarchy memory controller, as described above. It uses the existing > > per-cpu charge caches and triggers high limit reclaim asynchronously.
> > > > Patch #8 uses the vmpressure extension to equalize pressure between > > the pages tracked natively by the VM and socket buffer pages. As the > > pool is shared, it makes sense that while natively tracked pages are > > under duress the network transmit windows are also not increased. > > First of all, I've no experience in networking, so I'm likely to be > mistaken. Nevertheless I beg to disagree that this patch set is a step > in the right direction. Here goes why. > > I admit that your idea to get rid of explicit tcp window control knobs > and size it dynamically basing on memory pressure instead does sound > tempting, but I don't think it'd always work. The problem is that in > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only > stop growing them. Now suppose a system hasn't experienced memory > pressure for a while. If we don't have explicit tcp window limit, tcp > buffers on such a system might have eaten almost all available memory > (because of network load/problems). If a user workload that needs a > significant amount of memory is started suddenly then, the network code > will receive a notification and surely stop growing buffers, but all > those buffers accumulated won't disappear instantly. As a result, the > workload might be unable to find enough free memory and have no choice > but invoke OOM killer. This looks unexpected from the user POV. I'm not getting rid of those knobs, I'm just reusing the old socket accounting infrastructure in an attempt to make the memory accounting feature useful to more people in cgroups v2 (unified hierarchy). We can always come back to think about per-cgroup tcp window limits in the unified hierarchy, my patches don't get in the way of this. I'm not removing the knobs in cgroups v1 and I'm not preventing them in v2. 
But regardless of tcp window control, we need to account socket memory in the main memory accounting pool where pressure is shared (to the best of our abilities) between all accounted memory consumers. From an interface standpoint alone, I don't think it's reasonable to ask users per default to limit different consumers on a case by case basis. I certainly have no problem with fine-tuning for scenarios you describe above, but with memory.current, memory.high, memory.max we are providing a generic interface to account and contain memory consumption of workloads. This has to include all major memory consumers to make semantical sense. But also, there are people right now for whom the socket buffers cause system OOM, but the existing memcg's hard tcp window limit that exists absolutely wrecks network performance for them. It's not usable the way it is. It'd be much better to have the socket buffers exert pressure on the shared pool, and then propagate the overall pressure back to individual consumers with reclaim, shrinkers, vmpressure etc. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754370AbbJ0Inq (ORCPT ); Tue, 27 Oct 2015 04:43:46 -0400 Received: from relay.parallels.com ([195.214.232.42]:58066 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751684AbbJ0Ink (ORCPT ); Tue, 27 Oct 2015 04:43:40 -0400 Date: Tue, 27 Oct 2015 11:43:21 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S.
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151027084320.GF13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151026172216.GC2214@cmpxchg.org> X-ClientProxiedBy: US-EXCH2.sw.swsoft.com (10.255.249.46) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote: > On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote: > > Hi Johannes, > > > > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: > > ... > > > Patch #5 adds accounting and tracking of socket memory to the unified > > > hierarchy memory controller, as described above. It uses the existing > > > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > > > > > Patch #8 uses the vmpressure extension to equalize pressure between > > > the pages tracked natively by the VM and socket buffer pages. As the > > > pool is shared, it makes sense that while natively tracked pages are > > > under duress the network transmit windows are also not increased. > > > > First of all, I've no experience in networking, so I'm likely to be > > mistaken. Nevertheless I beg to disagree that this patch set is a step > > in the right direction. Here goes why. > > > > I admit that your idea to get rid of explicit tcp window control knobs > > and size it dynamically basing on memory pressure instead does sound > > tempting, but I don't think it'd always work. The problem is that in > > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only > > stop growing them. 
Now suppose a system hasn't experienced memory > > pressure for a while. If we don't have explicit tcp window limit, tcp > > buffers on such a system might have eaten almost all available memory > > (because of network load/problems). If a user workload that needs a > > significant amount of memory is started suddenly then, the network code > > will receive a notification and surely stop growing buffers, but all > > those buffers accumulated won't disappear instantly. As a result, the > > workload might be unable to find enough free memory and have no choice > > but invoke OOM killer. This looks unexpected from the user POV. > > I'm not getting rid of those knobs, I'm just reusing the old socket > accounting infrastructure in an attempt to make the memory accounting > feature useful to more people in cgroups v2 (unified hierarchy). > My understanding is that in the meantime you effectively break the existing per memcg tcp window control logic. > We can always come back to think about per-cgroup tcp window limits in > the unified hierarchy, my patches don't get in the way of this. I'm > not removing the knobs in cgroups v1 and I'm not preventing them in v2. > > But regardless of tcp window control, we need to account socket memory > in the main memory accounting pool where pressure is shared (to the > best of our abilities) between all accounted memory consumers. > No objections to this point. However, I really don't like the idea to charge tcp window size to memory.current instead of charging individual pages consumed by the workload for storing socket buffers, because it is inconsistent with what we have now. Can't we charge individual skb pages as we do in case of other kmem allocations? > From an interface standpoint alone, I don't think it's reasonable to > ask users per default to limit different consumers on a case by case > basis. 
I certainly have no problem with fine-tuning for scenarios you > describe above, but with memory.current, memory.high, memory.max we > are providing a generic interface to account and contain memory > consumption of workloads. This has to include all major memory > consumers to make semantical sense. We can propose a reasonable default as we do in the global case. > > But also, there are people right now for whom the socket buffers cause > system OOM, but the existing memcg's hard tcp window limit that > exists absolutely wrecks network performance for them. It's not usable > the way it is. It'd be much better to have the socket buffers exert > pressure on the shared pool, and then propagate the overall pressure > back to individual consumers with reclaim, shrinkers, vmpressure etc. > This might or might not work. I'm not an expert to judge. But if you do this only for memcg leaving the global case as it is, networking people won't budge IMO. So could you please start such a major rework from the global case? Could you please try to deprecate the tcp window limits not only in the legacy memcg hierarchy, but also system-wide in order to attract attention of networking experts?
Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932418AbbJ0Nck (ORCPT ); Tue, 27 Oct 2015 09:32:40 -0400 Received: from shards.monkeyblade.net ([149.20.54.216]:38362 "EHLO shards.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932072AbbJ0Ncj (ORCPT ); Tue, 27 Oct 2015 09:32:39 -0400 Date: Tue, 27 Oct 2015 06:49:16 -0700 (PDT) Message-Id: <20151027.064916.312540587298733586.davem@davemloft.net> To: mhocko@kernel.org Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy From: David Miller In-Reply-To: <20151027122647.GG9891@dhcp22.suse.cz> References: <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> X-Mailer: Mew version 6.4 on Emacs 23.4 / Mule 6.0 (HANACHIRUSATO) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.5.12 (shards.monkeyblade.net [149.20.54.216]); Tue, 27 Oct 2015 06:32:38 -0700 (PDT) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Michal Hocko Date: Tue, 27 Oct 2015 13:26:47 +0100 > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > [...] >> Or any other combination of pick-and-choose consumers. But >> honestly, nowadays all our paths are lockless, and the counting is an >> atomic-add-return with a per-cpu batch cache. > > You are still hooking into hot paths and there are users who want to > squeeze every single cycle from the HW. Yeah, you're basically probably undoing a half year of work by another developer who was able to remove an atomic from these paths. 
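
To make the "atomic-add-return with a per-cpu batch cache" quoted above concrete, here is a minimal userspace sketch. It is not code from this series: every name in it (page_counter_sketch, charge_stock, try_charge_sketch, nr_slow_path) is hypothetical, the batch size of 32 pages is an assumed value, and a thread-local variable stands in for a per-cpu cache. It only shows the shape of the idea: most charges are served from a local stock without touching shared state, and the shared counter is updated once per batch. The slow path still issues an atomic, which is the cost being objected to above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CHARGE_BATCH 32L	/* pages pre-charged per refill (assumed value) */

/* Shared counter: the contended state we want to touch rarely. */
struct page_counter_sketch {
	atomic_long usage;
	long limit;
};

/* Local stock, tied to one counter at a time (stand-in for a per-cpu cache). */
struct charge_stock {
	struct page_counter_sketch *cached;
	long nr_pages;
};

static _Thread_local struct charge_stock stock;

/* Counts charge attempts on the shared counter, only to make the effect visible. */
static atomic_long nr_slow_path;

static bool try_charge_sketch(struct page_counter_sketch *pc, long nr_pages)
{
	long want = nr_pages > CHARGE_BATCH ? nr_pages : CHARGE_BATCH;

	assert(nr_pages > 0);

	/* Fast path: consume pre-charged pages, no atomic operation. */
	if (stock.cached == pc && stock.nr_pages >= nr_pages) {
		stock.nr_pages -= nr_pages;
		return true;
	}

	/* Slow path: one atomic add-return charges a whole batch. */
	for (;;) {
		long new_usage = atomic_fetch_add(&pc->usage, want) + want;

		atomic_fetch_add(&nr_slow_path, 1);
		if (new_usage <= pc->limit)
			break;
		atomic_fetch_sub(&pc->usage, want);	/* over the limit: undo */
		if (want == nr_pages)
			return false;
		want = nr_pages;	/* batch did not fit, retry exact amount */
	}

	/* Return stock held for a different counter before caching this one. */
	if (stock.cached != pc) {
		if (stock.cached)
			atomic_fetch_sub(&stock.cached->usage, stock.nr_pages);
		stock.cached = pc;
		stock.nr_pages = 0;
	}
	stock.nr_pages += want - nr_pages;
	return true;
}
```

With these assumptions, 64 single-page charges against one counter reach the shared counter twice rather than 64 times. Pages sitting in the stock are already counted in usage, so the counter can overstate real consumption by up to one batch per thread.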
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964953AbbJ0Plw (ORCPT ); Tue, 27 Oct 2015 11:41:52 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:40262 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932256AbbJ0Plu (ORCPT ); Tue, 27 Oct 2015 11:41:50 -0400 Date: Tue, 27 Oct 2015 11:41:38 -0400 From: Johannes Weiner To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151027154138.GA4665@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151027122647.GG9891@dhcp22.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote: > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > [...] > > Now you could argue that there might exist specialized workloads that > > need to account anonymous pages and page cache, but not socket memory > > buffers. > > Exactly, and there are loads doing this. Memcg groups are also created to > limit anon/page cache consumers to not affect the others running on > the system (basically in the root memcg context from memcg POV) which > don't care about tracking and they definitely do not want to pay for an > additional overhead. 
We should definitely be able to offer a global > disable knob for them. The same applies to kmem accounting in general. I don't see how you make such a clear distinction between, say, page cache and the dentry cache, and call one user memory and the other kernel memory. That just doesn't make sense to me. They're both kernel memory allocated on behalf of the user, the only difference being that one is tracked on the page level and the other on the slab level, and we started accounting one before the other. IMO that's an implementation detail and a historical artifact that should not be exposed to the user. And that's the thing I hate about the current opt-out knob. > > I don't think there is a compelling case for an elaborate interface > > to make individual memory consumers configurable inside the memory > > controller. > > I do not think we need an elaborate interface. We just want to have > a global boot time knob to overwrite the default behavior. This is > few lines of code and it should give the sufficient flexibility. Okay, then let's add this for the socket memory to start with. I'll have to think more about how to distinguish the slab-based consumers. Or maybe you have an idea. For now, something like this as a boot commandline? cgroup.memory=nosocket So again in summary, no default overhead until you create a cgroup to specifically track and account memory. And then, when you know what you are doing and have a specialized workload, you can disable socket memory as a specific consumer to remove that particular overhead while still being able to contain page cache, anon, kmem, whatever. Does that sound like reasonable userinterfacing to everyone? 
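
A minimal userspace sketch of the opt-out proposed above, combined with the NULL sk_memcg branch suggested earlier in the thread. Only the option string cgroup.memory=nosocket comes from this message; the parser, the structures, and the function names below are hypothetical stand-ins, and the real kernel plumbing for boot parameters is not shown. The assumption is that a socket is attached to a cgroup only when accounting is enabled, so the charge path reduces to a single pointer test when it is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Accounting is on by default; the boot option can only turn it off. */
static bool socket_accounting_disabled;

/*
 * Parse the value of a "cgroup.memory=" option given as a comma-separated
 * list, e.g. "nosocket". Unknown tokens are ignored in this sketch.
 */
static void parse_cgroup_memory_opt(const char *value)
{
	while (*value) {
		size_t len = strcspn(value, ",");

		if (len == strlen("nosocket") && !strncmp(value, "nosocket", len))
			socket_accounting_disabled = true;
		value += len;
		if (*value == ',')
			value++;
	}
}

/* Hypothetical stand-ins for the cgroup and socket structures. */
struct memcg_sketch {
	long sock_pages;
};

struct sock_sketch {
	struct memcg_sketch *sk_memcg;	/* NULL means "do not account" */
};

/* Decide once, at socket setup, whether this socket is accounted at all. */
static void sock_attach_memcg(struct sock_sketch *sk, struct memcg_sketch *memcg)
{
	sk->sk_memcg = socket_accounting_disabled ? NULL : memcg;
}

/* Returns true if the pages were accounted to a cgroup. */
static bool sock_charge_sketch(struct sock_sketch *sk, long nr_pages)
{
	assert(nr_pages > 0);

	if (!sk->sk_memcg)	/* the only cost when accounting is off */
		return false;
	sk->sk_memcg->sock_pages += nr_pages;
	return true;
}
```

In this sketch, passing a NULL cgroup to sock_attach_memcg() models the "no cgroup configured for memory" case, and calling parse_cgroup_memory_opt("nosocket") before any socket is attached models the boot-time opt-out. Both end in the same early return.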
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965032AbbJ0QQE (ORCPT ); Tue, 27 Oct 2015 12:16:04 -0400 Received: from mail-wi0-f170.google.com ([209.85.212.170]:38158 "EHLO mail-wi0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964826AbbJ0QP5 (ORCPT ); Tue, 27 Oct 2015 12:15:57 -0400 Date: Tue, 27 Oct 2015 17:15:54 +0100 From: Michal Hocko To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151027161554.GJ9891@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151027154138.GA4665@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 27-10-15 11:41:38, Johannes Weiner wrote: > On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote: > > On Mon 26-10-15 12:56:19, Johannes Weiner wrote: > > [...] > > > Now you could argue that there might exist specialized workloads that > > > need to account anonymous pages and page cache, but not socket memory > > > buffers. > > > > Exactly, and there are loads doing this. 
Memcg groups are also created to > > limit anon/page cache consumers to not affect the others running on > > the system (basically in the root memcg context from memcg POV) which > > don't care about tracking and they definitely do not want to pay for an > > additional overhead. We should definitely be able to offer a global > > disable knob for them. The same applies to kmem accounting in general. > > I don't see how you make such a clear distinction between, say, page > cache and the dentry cache, and call one user memory and the other > kernel memory. Because the kernel memory footprint would be so small that it simply doesn't change the picture at all. While the page cache or anonymous memory consumption might be so large it might be disruptive. I am talking about loads where good enough is better than "perfect" and ephemeral global memory pressure when kmem goes over expectations is better than a permanent cpu overhead. Whatever we do it will always be non-zero. Also kmem accounting will make the load more non-deterministic because many of the resources are shared between tasks in separate cgroups unless they are explicitly configured. E.g. [id]cache will be shared and first to touch gets charged so you would end up with more false sharing. Nevertheless, I do not want to shift the discussion from the topic. I just think that one-size-fits-all simply won't work. > That just doesn't make sense to me. They're both kernel > memory allocated on behalf of the user, the only difference being that > one is tracked on the page level and the other on the slab level, and > we started accounting one before the other. > > IMO that's an implementation detail and a historical artifact that > should not be exposed to the user. And that's the thing I hate about > the current opt-out knob. > > > > I don't think there is a compelling case for an elaborate interface > > > to make individual memory consumers configurable inside the memory > > > controller.
> > > > I do not think we need an elaborate interface. We just want to have > > a global boot time knob to overwrite the default behavior. This is > > few lines of code and it should give the sufficient flexibility. > > Okay, then let's add this for the socket memory to start with. I'll > have to think more about how to distinguish the slab-based consumers. > Or maybe you have an idea. Isn't that as simple as enabling the jump label during the initialization depending on the knob value? All the charging paths should be disabled by default already. > For now, something like this as a boot commandline? > > cgroup.memory=nosocket That would work for me. I would even see a place to have a CONFIG_MEMCG_TCP_KMEM_ENABLED config option for the default and [no]socket as a kernel parameter to override the configuration default. This would allow distributions to define their policy without enforcing it hard and those who compile the kernel to define their own policy. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754886AbbJ1A24 (ORCPT ); Tue, 27 Oct 2015 20:28:56 -0400 Received: from shards.monkeyblade.net ([149.20.54.216]:44030 "EHLO shards.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751116AbbJ1A2y (ORCPT ); Tue, 27 Oct 2015 20:28:54 -0400 Date: Tue, 27 Oct 2015 17:45:32 -0700 (PDT) Message-Id: <20151027.174532.469361008055673315.davem@davemloft.net> To: hannes@cmpxchg.org Cc: mhocko@kernel.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy From: David Miller In-Reply-To: <20151027164227.GB7749@cmpxchg.org> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> X-Mailer:
Mew version 6.4 on Emacs 23.4 / Mule 6.0 (HANACHIRUSATO) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.5.12 (shards.monkeyblade.net [149.20.54.216]); Tue, 27 Oct 2015 17:28:54 -0700 (PDT) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Johannes Weiner Date: Tue, 27 Oct 2015 09:42:27 -0700 > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: >> > For now, something like this as a boot commandline? >> > >> > cgroup.memory=nosocket >> >> That would work for me. > > Okay, then I'll go that route for the socket stuff. > > Dave is that cool with you? Depends upon the default. Until the user configures something explicitly into the memory controller, the networking bits should all evaluate to nothing. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965180AbbJ1IVH (ORCPT ); Wed, 28 Oct 2015 04:21:07 -0400 Received: from relay.parallels.com ([195.214.232.42]:33510 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932384AbbJ1IUW (ORCPT ); Wed, 28 Oct 2015 04:20:22 -0400 Date: Wed, 28 Oct 2015 11:20:03 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151028082003.GK13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151027155833.GB4665@cmpxchg.org> X-ClientProxiedBy: US-EXCH2.sw.swsoft.com (10.255.249.46) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 09:01:08AM -0700, Johannes Weiner wrote: ... > > > But regardless of tcp window control, we need to account socket memory > > > in the main memory accounting pool where pressure is shared (to the > > > best of our abilities) between all accounted memory consumers. > > > > > > > No objections to this point. However, I really don't like the idea to > > charge tcp window size to memory.current instead of charging individual > > pages consumed by the workload for storing socket buffers, because it is > > inconsistent with what we have now. Can't we charge individual skb pages > > as we do in case of other kmem allocations? > > Absolutely, both work for me. I chose that route because it's where > the networking code already tracks and accounts memory consumed, so it > seemed like a better site to hook into. > > But I understand your concerns. We want to track this stuff as close > to the memory allocators as possible. Exactly. > > > > But also, there are people right now for whom the socket buffers cause > > > system OOM, but the existing memcg's hard tcp window limitq that > > > exists absolutely wrecks network performance for them. It's not usable > > > the way it is. 
It'd be much better to have the socket buffers exert > > > pressure on the shared pool, and then propagate the overall pressure > > > back to individual consumers with reclaim, shrinkers, vmpressure etc. > > > > This might or might not work. I'm not an expert to judge. But if you do > > this only for memcg leaving the global case as it is, networking people > > won't budge IMO. So could you please start such a major rework from the > > global case? Could you please try to deprecate the tcp window limits not > > only in the legacy memcg hierarchy, but also system-wide in order to > > attract attention of networking experts? > > I'm definitely interested in addressing this globally as well. > > The idea behind this was to use the memcg part as a testbed. cgroup2 > is going to be new and people are prepared for hiccups when migrating > their applications to it; and they can roll back to cgroup1 and tcp > window limits at any time should they run into problems in production. Then you'd better not touch existing tcp limits at all, because they just work, and the logic behind them is very close to that of global tcp limits. I don't think one can simplify it somehow. Moreover, frankly I still have my reservations about this vmpressure propagation to skb you're proposing. It might work, but I doubt it will allow us to throw away explicit tcp limit, as I explained previously. So, even with your approach I think we can still need per memcg tcp limit *unless* you get rid of global tcp limit somehow. > > So this seemed like a good way to prove a new mechanism before rolling > it out to every single Linux setup, rather than switch everybody over > after the limited scope testing I can do as a developer on my own. 
> > Keep in mind that my patches are not committing anything in terms of > interface, so we retain all the freedom to fix and tune the way this > is implemented, including the freedom to re-add tcp window limits in > case the pressure balancing is not a comprehensive solution. > I really dislike this kind of proof. It looks like you're trying to push something you think is right covertly, w/o having a proper discussion with networking people and then say that it just works and hence should be done globally, but what if it won't? Revert it? We already have a lot of dubious stuff in memcg that should be reverted, so let's please try to avoid this kind of mistake in the future. Note, I say "w/o having a proper discussion with networking people", because I don't think they will really care *unless* you change the global logic, simply because most of them aren't very interested in memcg AFAICS. That effectively means you lose a chance to listen to networking experts, who could point you at design flaws and propose an improvement right away. Let's please not miss such an opportunity. You said that you'd seen this problem happen w/o cgroups, so you have a use case that might need fixing at the global level. IMO it shouldn't be difficult to prepare an RFC patch for the global case first and see what people think about it. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965127AbbJ1S6b (ORCPT ); Wed, 28 Oct 2015 14:58:31 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:40490 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964828AbbJ1S62 (ORCPT ); Wed, 28 Oct 2015 14:58:28 -0400 Date: Wed, 28 Oct 2015 11:58:10 -0700 From: Johannes Weiner To: Vladimir Davydov Cc: "David S.
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151028185810.GA31488@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151028082003.GK13221@esperanza> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote: > Then you'd better not touch existing tcp limits at all, because they > just work, and the logic behind them is very close to that of global tcp > limits. I don't think one can simplify it somehow. Uhm, no, there is a crapload of boilerplate code and complication that seems entirely unnecessary. The only thing missing from my patch seems to be the part where it enters memory pressure state when the limit is hit. I'm adding this for completeness, but I doubt it even matters. > Moreover, frankly I still have my reservations about this vmpressure > propagation to skb you're proposing. It might work, but I doubt it > will allow us to throw away explicit tcp limit, as I explained > previously. So, even with your approach I think we can still need > per memcg tcp limit *unless* you get rid of global tcp limit > somehow. Having the hard limit as a failsafe (or a minimum for other consumers) is one thing, and certainly something I'm open to for cgroupv2, should we have problems with load startup up after a socket memory landgrab. 
That being said, if the VM is struggling to reclaim pages, or is even swapping, it makes perfect sense to let the socket memory scheduler know it shouldn't continue to increase its footprint until the VM recovers. Regardless of any hard limitations/minimum guarantees. This is what my patch does and it seems pretty straight-forward to me. I don't really understand why this is so controversial. The *next* step would be to figure out whether we can actually *reclaim* memory in the network subsystem--shrink windows and steal buffers back--and that might even be an avenue to replace tcp window limits. But it's not necessary for *this* patch series to be useful. > > So this seemed like a good way to prove a new mechanism before rolling > > it out to every single Linux setup, rather than switch everybody over > > after the limited scope testing I can do as a developer on my own. > > > > Keep in mind that my patches are not committing anything in terms of > > interface, so we retain all the freedom to fix and tune the way this > > is implemented, including the freedom to re-add tcp window limits in > > case the pressure balancing is not a comprehensive solution. > > I really dislike this kind of proof. It looks like you're trying to > push something you think is right covertly, w/o having a proper > discussion with networking people and then say that it just works > and hence should be done globally, but what if it won't? Revert it? > We already have a lot of dubious stuff in memcg that should be > reverted, so let's please try to avoid this kind of mistakes in > future. Note, I say "w/o having a proper discussion with networking > people", because I don't think they will really care *unless* you > change the global logic, simply because most of them aren't very > interested in memcg AFAICS. Come on, Dave is the first To and netdev is CC'd. They might not care about memcg, but "pushing things covertly" is a bit of a stretch. 
> That effectively means you lose a chance to listen to networking > experts, who could point you at design flaws and propose an improvement > right away. Let's please not miss such an opportunity. You said that > you'd seen this problem happen w/o cgroups, so you have a use case that > might need fixing at the global level. IMO it shouldn't be difficult to > prepare an RFC patch for the global case first and see what people think > about it. No, the problem we are running into is when network memory is not tracked per cgroup. The lack of containment means that the socket memory consumption of individual cgroups can trigger system OOM. We tried using the per-memcg tcp limits, and that prevents the OOMs for sure, but it's horrendous for network performance. There is no "stop growing" phase, it just keeps going full throttle until it hits the wall hard. Now, we could probably try to replicate the global knobs and add a per-memcg soft limit. But you know better than anyone else how hard it is to estimate the overall workingset size of a workload, and the margins on containerized loads are razor-thin. Performance is much more sensitive to input errors, and oftentimes parameters must be adjusted continuously during the runtime of a workload. It'd be disastrous to rely on yet more static, error-prone user input here. What all this means to me is that fixing it on the cgroup level has higher priority. But it also means that once we figured it out under such a high-pressure environment, it's much easier to apply to the global case and potentially replace the soft limit there. This seems like a better approach to me than starting globally, only to realize that the solution is not workable for cgroups and we need yet something else.
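The contrast drawn above, between a hard per-memcg limit with no "stop growing" phase and a pressure signal that pauses growth, can be pictured with a toy model. This is only a sketch under loud assumptions: it is not kernel code and not what the patches implement; simulate(), hard_limit and pressure_at are hypothetical names, and real pressure comes and goes rather than sitting at a fixed threshold.

```python
def simulate(ticks, step, hard_limit=None, pressure_at=None):
    """Toy model of one socket buffer that wants `step` more pages per tick.

    hard_limit:  a charge that would exceed this many pages fails outright.
    pressure_at: once usage reaches this many pages, growth is paused
                 rather than failed (an assumed stand-in for "the VM is
                 under pressure, do not grow the window").
    Returns (final_pages, failed_allocations, paused_ticks).
    """
    pages = failed = paused = 0
    for _ in range(ticks):
        if pressure_at is not None and pages >= pressure_at:
            paused += 1      # pressure seen: hold the footprint, no failure
            continue
        if hard_limit is not None and pages + step > hard_limit:
            failed += 1      # full throttle into the wall: allocation fails
            continue
        pages += step
    return pages, failed, paused


if __name__ == "__main__":
    print(simulate(10, 4, hard_limit=16))                  # hard limit only
    print(simulate(10, 4, hard_limit=16, pressure_at=12))  # with pressure signal
```

With only the hard limit, the buffer in this model grows at full speed and every later request fails; with the pressure threshold set below the limit, it stops growing earlier and no request fails. Whether that carries over to real workloads is exactly what the thread is debating.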
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756757AbbJ2J2L (ORCPT ); Thu, 29 Oct 2015 05:28:11 -0400 Received: from relay.parallels.com ([195.214.232.42]:45410 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750927AbbJ2J2G (ORCPT ); Thu, 29 Oct 2015 05:28:06 -0400 Date: Thu, 29 Oct 2015 12:27:47 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151029092747.GR13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> <20151028185810.GA31488@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151028185810.GA31488@cmpxchg.org> X-ClientProxiedBy: US-EXCH2.sw.swsoft.com (10.255.249.46) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote: > On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote: > > Then you'd better not touch existing tcp limits at all, because they > > just work, and the logic behind them is very close to that of global tcp > > limits. I don't think one can simplify it somehow. > > Uhm, no, there is a crapload of boilerplate code and complication that > seems entirely unnecessary. The only thing missing from my patch seems > to be the part where it enters memory pressure state when the limit is > hit. I'm adding this for completeness, but I doubt it even matters. 
> > > Moreover, frankly I still have my reservations about this vmpressure > > propagation to skb you're proposing. It might work, but I doubt it > > will allow us to throw away explicit tcp limit, as I explained > > previously. So, even with your approach I think we can still need > > per memcg tcp limit *unless* you get rid of global tcp limit > > somehow. > > Having the hard limit as a failsafe (or a minimum for other consumers) > is one thing, and certainly something I'm open to for cgroupv2, should > we have problems with load startup up after a socket memory landgrab. > > That being said, if the VM is struggling to reclaim pages, or is even > swapping, it makes perfect sense to let the socket memory scheduler > know it shouldn't continue to increase its footprint until the VM > recovers. Regardless of any hard limitations/minimum guarantees. > > This is what my patch does and it seems pretty straight-forward to > me. I don't really understand why this is so controversial. I'm not arguing that the idea behind this patch set is necessarily bad. Quite the contrary, it does look interesting to me. I'm just saying that IMO it can't replace hard/soft limits. It probably could if it was possible to shrink buffers, but I don't think it's feasible, even theoretically. That's why I propose not to change the behavior of the existing per memcg tcp limit at all. And frankly I don't get why you are so keen on simplifying it. You say it's a "crapload of boilerplate code". Well, I don't see how it is - it just replicates global knobs and I don't see how it could be done in a better way. The code is hidden behind jump labels, so the overhead is zero if it isn't used. If you really dislike this code, we can isolate it under a separate config option. But all right, I don't rule out the possibility that the code could be simplified. If you do that w/o breaking it, that'll be OK to me, but I don't see why it should be related to this particular patch set. 
> > The *next* step would be to figure out whether we can actually > *reclaim* memory in the network subsystem--shrink windows and steal > buffers back--and that might even be an avenue to replace tcp window > limits. But it's not necessary for *this* patch series to be useful. Again, I don't think we can *reclaim* network memory, but you're right. > > > > So this seemed like a good way to prove a new mechanism before rolling > > > it out to every single Linux setup, rather than switch everybody over > > > after the limited scope testing I can do as a developer on my own. > > > > > > Keep in mind that my patches are not committing anything in terms of > > > interface, so we retain all the freedom to fix and tune the way this > > > is implemented, including the freedom to re-add tcp window limits in > > > case the pressure balancing is not a comprehensive solution. > > > > I really dislike this kind of proof. It looks like you're trying to > > push something you think is right covertly, w/o having a proper > > discussion with networking people and then say that it just works > > and hence should be done globally, but what if it won't? Revert it? > > We already have a lot of dubious stuff in memcg that should be > > reverted, so let's please try to avoid this kind of mistakes in > > future. Note, I say "w/o having a proper discussion with networking > > people", because I don't think they will really care *unless* you > > change the global logic, simply because most of them aren't very > > interested in memcg AFAICS. > > Come on, Dave is the first To and netdev is CC'd. They might not care > about memcg, but "pushing things covertly" is a bit of a stretch. Sorry if it sounded rude to you. 
I just look back at my experience patching slab internals to make kmem accountable, and AFAICS Christoph didn't really care about *what* I was doing, he only cared about the global case - if there was no performance degradation when kmemcg was disabled, he was usually fine with it, even if from the memcg pov it was crap. Anyway, I can't force you to patch the global case first or simultaneously with the memcg case, so let's just hope I'm being a bit overcautious. > > > That effectively means you lose a chance to listen to networking > > experts, who could point you at design flaws and propose an improvement > > right away. Let's please not miss such an opportunity. You said that > > you'd seen this problem happen w/o cgroups, so you have a use case that > > might need fixing at the global level. IMO it shouldn't be difficult to > > prepare an RFC patch for the global case first and see what people think > > about it. > > No, the problem we are running into is when network memory is not > tracked per cgroup. The lack of containment means that the socket > memory consumption of individual cgroups can trigger system OOM. > > We tried using the per-memcg tcp limits, and that prevents the OOMs > for sure, but it's horrendous for network performance. There is no > "stop growing" phase, it just keeps going full throttle until it hits > the wall hard. > > Now, we could probably try to replicate the global knobs and add a > per-memcg soft limit. But you know better than anyone else how hard it > is to estimate the overall workingset size of a workload, and the > margins on containerized loads are razor-thin. Performance is much > more sensitive to input errors, and oftentimes parameters must be > adjusted continuously during the runtime of a workload. It'd be > disastrous to rely on yet more static, error-prone user input here. Yeah, but the dynamic approach proposed in your patch set doesn't guarantee we won't hit OOM in memcg due to overgrown buffers.
It just reduces this possibility. Of course, memcg OOM is not nearly as disastrous as the global one, but still it usually means workload breakage. The static approach is error-prone for sure, but it has existed for years and worked satisfactorily AFAIK. > > What all this means to me is that fixing it on the cgroup level has > higher priority. But it also means that once we figured it out under > such a high-pressure environment, it's much easier to apply to the > global case and potentially replace the soft limit there. > > This seems like a better approach to me than starting globally, only > to realize that the solution is not workable for cgroups and we need > yet something else. > Are we in a rush? I think if you try your approach at the global level and fail, it's still good, because it will probably give us all a better understanding of the problem. If you successfully fix the global case, but then realize that it doesn't fit memcg, it's even better, because you actually fixed a problem. If you patch both global and memcg cases, it's perfect. But of course, that's my understanding and I may be mistaken. Let's hope you're right.
Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757113AbbJ2PZu (ORCPT ); Thu, 29 Oct 2015 11:25:50 -0400 Received: from mail-wm0-f53.google.com ([74.125.82.53]:34289 "EHLO mail-wm0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751586AbbJ2PZs (ORCPT ); Thu, 29 Oct 2015 11:25:48 -0400 Date: Thu, 29 Oct 2015 16:25:46 +0100 From: Michal Hocko To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151029152546.GG23598@dhcp22.suse.cz> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151027164227.GB7749@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 27-10-15 09:42:27, Johannes Weiner wrote: > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: > > On Tue 27-10-15 11:41:38, Johannes Weiner wrote: [...] > Or it could be exactly the other way around when you have a workload > that is heavy on filesystem metadata. I don't see why any scenario > would be more important than the other. Yes I definitely agree. No scenario is more important. 
We can only come up with a default that makes more sense for the majority and allow the minority to override. That was what I wanted to say basically. > I'm not saying that distinguishing between consumers is wrong, just > that "user memory vs kernel memory" is a false classification. Why do > you call page cache user memory but dentry cache kernel memory? It > doesn't make any sense. We are not talking about dcache vs. page cache alone here, though. We are talking about _all_ slab allocations vs. only user accessed memory. The slab consumption is directly under kernel control. A great pile of this logic is completely hidden from userspace. While user can estimate the user memory it is hard (if possible at all) to do that for the kernel memory footprint - not even mentioning this is variable and dependent on the particular kernel version. > > Also kmem accounting will make the load more non-deterministic because > > many of the resources are shared between tasks in separate cgroups > > unless they are explicitly configured. E.g. [id]cache will be shared > > and first to touch gets charged so you would end up with more false > > sharing. > > Exactly like page cache. This differentiation isn't based on reality. Yes, false sharing is an existing and long term problem already. I just wanted to point out that the false sharing would be an even bigger problem because some kernel tracked resources are shared more naturally than file sharing. > > > IMO that's an implementation detail and a historical artifact that > > > should not be exposed to the user. And that's the thing I hate about > > > the current opt-out knob. > > You carefully skipped over this part. We can ignore it for socket > memory but it's something we need to figure out when it comes to slab > accounting and tracking. I am sorry, I didn't mean to skip this part, I thought it would be clear from the previous text. I think kmem accounting falls into the same category.
Have a sane default and a global boottime knob to override it for those that think differently - for whatever reason they might have. [...] > Having page cache accounting built in while presenting dentry+inode > cache as a configurable extension is completely random and doesn't > make sense. They are both first class memory consumers. They're not > separate categories. One isn't more "core" than the other. Again we are talking about all slab allocations not just the dcache. > > > For now, something like this as a boot commandline? > > > > > > cgroup.memory=nosocket > > > > That would work for me. > > Okay, then I'll go that route for the socket stuff. Thanks! -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757698AbbJ2QK1 (ORCPT ); Thu, 29 Oct 2015 12:10:27 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:40642 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757293AbbJ2QKZ (ORCPT ); Thu, 29 Oct 2015 12:10:25 -0400 Date: Thu, 29 Oct 2015 09:10:09 -0700 From: Johannes Weiner To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151029161009.GA9160@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-6-git-send-email-hannes@cmpxchg.org> <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: 
inline In-Reply-To: <20151029152546.GG23598@dhcp22.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote: > On Tue 27-10-15 09:42:27, Johannes Weiner wrote: > > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote: > > > On Tue 27-10-15 11:41:38, Johannes Weiner wrote: > > > > IMO that's an implementation detail and a historical artifact that > > > > should not be exposed to the user. And that's the thing I hate about > > > > the current opt-out knob. > > > > You carefully skipped over this part. We can ignore it for socket > > memory but it's something we need to figure out when it comes to slab > > accounting and tracking. > > I am sorry, I didn't mean to skip this part, I though it would be clear > from the previous text. I think kmem accounting falls into the same > category. Have a sane default and a global boottime knob to override it > for those that think differently - for whatever reason they might have. Yes, that makes sense to me. Like cgroup.memory=nosocket, would you think it makes sense to include slab in the default for functional/semantical completeness and provide a cgroup.memory=noslab for powerusers? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965669AbbJ2Rwq (ORCPT ); Thu, 29 Oct 2015 13:52:46 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:40660 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757394AbbJ2Rwo (ORCPT ); Thu, 29 Oct 2015 13:52:44 -0400 Date: Thu, 29 Oct 2015 10:52:28 -0700 From: Johannes Weiner To: Vladimir Davydov Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151029175228.GB9160@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> <20151028185810.GA31488@cmpxchg.org> <20151029092747.GR13221@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151029092747.GR13221@esperanza> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 12:27:47PM +0300, Vladimir Davydov wrote: > On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote: > > Having the hard limit as a failsafe (or a minimum for other consumers) > > is one thing, and certainly something I'm open to for cgroupv2, should > > we have problems with load startup up after a socket memory landgrab. > > > > That being said, if the VM is struggling to reclaim pages, or is even > > swapping, it makes perfect sense to let the socket memory scheduler > > know it shouldn't continue to increase its footprint until the VM > > recovers. Regardless of any hard limitations/minimum guarantees. > > > > This is what my patch does and it seems pretty straight-forward to > > me. I don't really understand why this is so controversial. > > I'm not arguing that the idea behind this patch set is necessarily bad. > Quite the contrary, it does look interesting to me. I'm just saying that > IMO it can't replace hard/soft limits. It probably could if it was > possible to shrink buffers, but I don't think it's feasible, even > theoretically. 
That's why I propose not to change the behavior of the > existing per memcg tcp limit at all. And frankly I don't get why you are > so keen on simplifying it. You say it's a "crapload of boilerplate > code". Well, I don't see how it is - it just replicates global knobs and > I don't see how it could be done in a better way. The code is hidden > behind jump labels, so the overhead is zero if it isn't used. If you > really dislike this code, we can isolate it under a separate config > option. But all right, I don't rule out the possibility that the code > could be simplified. If you do that w/o breaking it, that'll be OK to > me, but I don't see why it should be related to this particular patch > set. Okay, I see your concern. I'm not trying to change the behavior, just the implementation, because it's too complex for the functionality it actually provides. And the reason it's part of this patch set is because I'm using the same code to hook into the memory accounting, so it makes sense to refactor this stuff in the same go. There is also a niceness factor of not adding more memcg callbacks to the networking subsystem when there is an option to consolidate them. Now, you mentioned that you'd rather see the socket buffers accounted at the allocator level, but I looked at the different allocation paths and network protocols and I'm not convinced that this makes sense. We don't want to be in the hotpath of every single packet when a lot of them are small, short-lived management blips that don't involve user space to let the kernel dispose of them. __sk_mem_schedule() on the other hand is already wired up to exactly those consumers we are interested in for memory isolation: those with bigger chunks of data attached to them and those that have exploding receive queues when userspace fails to read(). UDP and TCP. I mean, there is a reason why the global memory limits apply to only those types of packets in the first place: everything else is noise. 
I agree that it's appealing to account at the allocator level and set page->mem_cgroup etc. but in this case we'd pay extra to capture a lot of noise, and I don't want to pay that just for aesthetics. In this case it's better to track ownership on the socket level and only count packets that can accumulate a significant amount of memory consumed. > > We tried using the per-memcg tcp limits, and that prevents the OOMs > > for sure, but it's horrendous for network performance. There is no > > "stop growing" phase, it just keeps going full throttle until it hits > > the wall hard. > > > > Now, we could probably try to replicate the global knobs and add a > > per-memcg soft limit. But you know better than anyone else how hard it > > is to estimate the overall workingset size of a workload, and the > > margins on containerized loads are razor-thin. Performance is much > > more sensitive to input errors, and often times parameters must be > > adjusted continuously during the runtime of a workload. It'd be > > disasterous to rely on yet more static, error-prone user input here. > > Yeah, but the dynamic approach proposed in your patch set doesn't > guarantee we won't hit OOM in memcg due to overgrown buffers. It just > reduces this possibility. Of course, memcg OOM is far not as disastrous > as the global one, but still it usually means the workload breakage. Right now, the entire machine breaks. Confining it to a faulty memcg, as well as reducing the likelihood of that OOM in many cases seems like a good move in the right direction, no? And how likely are memcg OOMs because of this anyway? There is of course a scenario imaginable where the packets pile up, followed by some *other* part of the workload, the one that doesn't read() and process packets, trying to expand--which then doesn't work and goes OOM. But that seems like a complete corner case. 
In the vast majority of cases, the application will be in full operation and just fail to read() fast enough--because the network bandwidth is enormous compared to the container's size, or because it shares the CPU with thousands of other workloads and there is scheduling latency. This would be the perfect point to rein in the transmit window... > The static approach is error-prone for sure, but it has existed for > years and worked satisfactorily AFAIK. ...but that point is not a fixed amount of memory consumed. It depends on the workload and the random interactions it's having with thousands of other containers on that same machine. The point of containers is to maximize utilization of your hardware and systematically eliminate slack in the system. But it's exactly that slack on dedicated bare-metal machines that allowed us to take a wild guess at the settings and then tune them based on observing a handful of workloads. This approach is not going to work anymore when we pack the machine to capacity and still expect every single container out of thousands to perform well. We need that automation. The static setting working okay on the global level is also why I'm not interested in starting to experiment with it. There is no reason to change it. It's much more likely that any attempt to change it will be shot down, not because of the approach chosen, but because there is no problem to solve there.
I doubt we can get networking people to care about containers by screwing with things that work for them ;-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753853AbbKBOrw (ORCPT ); Mon, 2 Nov 2015 09:47:52 -0500 Received: from relay.parallels.com ([195.214.232.42]:57632 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751024AbbKBOrt (ORCPT ); Mon, 2 Nov 2015 09:47:49 -0500 Date: Mon, 2 Nov 2015 17:47:29 +0300 From: Vladimir Davydov To: Johannes Weiner CC: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Message-ID: <20151102144729.GA17424@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> <20151028185810.GA31488@cmpxchg.org> <20151029092747.GR13221@esperanza> <20151029175228.GB9160@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151029175228.GB9160@cmpxchg.org> X-ClientProxiedBy: US-EXCH.sw.swsoft.com (10.255.249.47) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 10:52:28AM -0700, Johannes Weiner wrote: ... > Now, you mentioned that you'd rather see the socket buffers accounted > at the allocator level, but I looked at the different allocation paths > and network protocols and I'm not convinced that this makes sense. We > don't want to be in the hotpath of every single packet when a lot of > them are small, short-lived management blips that don't involve user > space to let the kernel dispose of them. 
> > __sk_mem_schedule() on the other hand is already wired up to exactly > those consumers we are interested in for memory isolation: those with > bigger chunks of data attached to them and those that have exploding > receive queues when userspace fails to read(). UDP and TCP. > > I mean, there is a reason why the global memory limits apply to only > those types of packets in the first place: everything else is noise. > > I agree that it's appealing to account at the allocator level and set > page->mem_cgroup etc. but in this case we'd pay extra to capture a lot > of noise, and I don't want to pay that just for aesthetics. In this > case it's better to track ownership on the socket level and only count > packets that can accumulate a significant amount of memory consumed. Sigh, you seem to be right. Moreover, I can't even think of a neat way to account skb pages to memcg, because rcv skbs are generated in device drivers, where we don't know which socket/memcg it will go to. We could recharge individual pages when skb gets to the network or transport layer, but it would result in unjustified overhead. > > > > We tried using the per-memcg tcp limits, and that prevents the OOMs > > > for sure, but it's horrendous for network performance. There is no > > > "stop growing" phase, it just keeps going full throttle until it hits > > > the wall hard. > > > > > > Now, we could probably try to replicate the global knobs and add a > > > per-memcg soft limit. But you know better than anyone else how hard it > > > is to estimate the overall workingset size of a workload, and the > > > margins on containerized loads are razor-thin. Performance is much > > > more sensitive to input errors, and often times parameters must be > > > adjusted continuously during the runtime of a workload. It'd be > > > disasterous to rely on yet more static, error-prone user input here. 
> > > > Yeah, but the dynamic approach proposed in your patch set doesn't > > guarantee we won't hit OOM in memcg due to overgrown buffers. It just > > reduces this possibility. Of course, memcg OOM is far not as disastrous > > as the global one, but still it usually means the workload breakage. > > Right now, the entire machine breaks. Confining it to a faulty memcg, > as well as reducing the likelihood of that OOM in many cases seems > like a good move in the right direction, no? It seems. However, memcg OOM is also bad, we should strive to avoid it if we can. > > And how likely are memcg OOMs because of this anyway? There is of Frankly, I've no idea. Your arguments below sound reassuring though. > course a scenario imaginable where the packets pile up, followed by > some *other* part of the workload, the one that doesn't read() and > process packets, trying to expand--which then doesn't work and goes > OOM. But that seems like a complete corner case. In the vast majority > of cases, the application will be in full operation and just fail to > read() fast enough--because the network bandwidth is enormous compared > to the container's size, or because it shares the CPU with thousands > of other workloads and there is scheduling latency. > > This would be the perfect point to reign in the transmit window... > > > The static approach is error-prone for sure, but it has existed for > > years and worked satisfactory AFAIK. > > ...but that point is not a fixed amount of memory consumed. It depends > on the workload and the random interactions it's having with thousands > of other containers on that same machine. > > The point of containers is to maximize utilization of your hardware > and systematically eliminate slack in the system. But it's exactly > that slack on dedicated bare-metal machines that allowed us to take a > wild guess at the settings and then tune them based on observing a > handful of workloads. 
This approach is not going to work anymore when > we pack the machine to capacity and still expect every single > container out of thousands to perform well. We need that automation. But we do use a static approach when setting memory limits, no? memory.{low,high,max} - they are all static. I understand it's appealing to have just one knob - memory size - as in the case of virtual machines, but it doesn't seem to work with containers. You added memory.low and memory.high knobs. VMs don't have anything like that. How is one supposed to set them? Depends on the workload, I guess. Also, there is the pids cgroup for limiting the number of pids that can be used by a cgroup, because a pid turns out to be a resource in the case of containers. Maybe the TCP window should be considered a separate resource too, as it is now, and shouldn't go to memcg? I'm just wondering... > > The static setting working okay on the global level is also why I'm > not interested in starting to experiment with it. There is no reason > to change it. It's much more likely that any attempt to change it will > be shot down, not because of the approach chosen, but because there is > no problem to solve there. I doubt we can get networking people to > care about containers by screwing with things that work for them ;-) Fair enough. 
Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965599AbbKDTux (ORCPT ); Wed, 4 Nov 2015 14:50:53 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:41534 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965374AbbKDTuu (ORCPT ); Wed, 4 Nov 2015 14:50:50 -0500 Date: Wed, 4 Nov 2015 14:50:37 -0500 From: Johannes Weiner To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151104195037.GA6872@cmpxchg.org> References: <20151023131956.GA15375@dhcp22.suse.cz> <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151104104239.GG29607@dhcp22.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 04, 2015 at 11:42:40AM +0100, Michal Hocko wrote: > On Thu 29-10-15 09:10:09, Johannes Weiner wrote: > > On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote: > > > On Tue 27-10-15 09:42:27, Johannes Weiner wrote: > [...] > > > > You carefully skipped over this part. We can ignore it for socket > > > > memory but it's something we need to figure out when it comes to slab > > > > accounting and tracking. 
> > > > > > I am sorry, I didn't mean to skip this part, I though it would be clear > > > from the previous text. I think kmem accounting falls into the same > > > category. Have a sane default and a global boottime knob to override it > > > for those that think differently - for whatever reason they might have. > > > > Yes, that makes sense to me. > > > > Like cgroup.memory=nosocket, would you think it makes sense to include > > slab in the default for functional/semantical completeness and provide > > a cgroup.memory=noslab for powerusers? > > I am still not sure whether the kmem accounting is stable enough to be > enabled by default. If for nothing else the allocation failures, which > are not allowed for the global case and easily triggered by the hard > limit, might be a big problem. My last attempts to allow GFP_NOFS to > fail made me quite skeptical. I still believe this is something which > will be solved in the long term but the current state might be still too > fragile. So I would rather be conservative and have the kmem accounting > disabled by default with a config option and boot parameter to override. > If somebody is confident that the desired load is stable then the config > can be enabled easily. I agree with your assessment of the current kmem code state, but I think your conclusion is completely backwards here. The interface will be set in stone forever, whereas any stability issues will be transient and will have to be addressed in a finite amount of time anyway. It doesn't make sense to design an interface based on temporary quality of implementation. Only one of those two can ever be changed. Because it goes without saying that once the cgroupv2 interface is released, and people use it in production, there is no way we can then *add* dentry cache, inode cache, and others to memory.current. That would be an unacceptable change in interface behavior. 
On the other hand, people will be prepared for hiccups in the early stages of cgroupv2 release, and we're providing cgroup.memory=noslab to let them workaround severe problems in production until we fix it without forcing them to fully revert to cgroupv1. So if we agree that there are no fundamental architectural concerns with slab accounting, i.e. nothing that can't be addressed in the implementation, we have to make the call now. And I maintain that not accounting dentry cache and inode cache is a gaping hole in memory isolation, so it should be included by default. (The rest of the slabs is arguable, but IMO the risk of missing something important is higher than the cost of including them.) As far as your allocation failure concerns go, I think the kmem code is currently not behaving as Glauber originally intended, which is to force charge if reclaim and OOM killing weren't able to make enough space. See this recently rewritten section of the kmem charge path: - /* - * try_charge() chose to bypass to root due to OOM kill or - * fatal signal. Since our only options are to either fail - * the allocation or charge it to this cgroup, do it as a - * temporary condition. But we can't fail. From a kmem/slab - * perspective, the cache has already been selected, by - * mem_cgroup_kmem_get_cache(), so it is too late to change - * our minds. - * - * This condition will only trigger if the task entered - * memcg_charge_kmem in a sane state, but was OOM-killed - * during try_charge() above. Tasks that were already dying - * when the allocation triggers should have been already - * directed to the root cgroup in memcontrol.h - */ - page_counter_charge(&memcg->memory, nr_pages); - if (do_swap_account) - page_counter_charge(&memcg->memsw, nr_pages); It could be that this never properly worked as it was tied to the -EINTR bypass trick, but the idea was these charges never fail. And this makes sense. 
If the allocator semantics are such that we never fail these page allocations for slab, and the callsites rely on that, surely we should not fail them in the memory controller, either. And it makes a lot more sense to account them in excess of the limit than pretend they don't exist. We might not be able to completely fullfill the containment part of the memory controller (although these slab charges will still create significant pressure before that), but at least we don't fail the accounting part on top of it. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161611AbbKEOkK (ORCPT ); Thu, 5 Nov 2015 09:40:10 -0500 Received: from mail-wm0-f54.google.com ([74.125.82.54]:36089 "EHLO mail-wm0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1161320AbbKEOkG (ORCPT ); Thu, 5 Nov 2015 09:40:06 -0500 Date: Thu, 5 Nov 2015 15:40:02 +0100 From: Michal Hocko To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151105144002.GB15111@dhcp22.suse.cz> References: <20151023.065957.1690815054807881760.davem@davemloft.net> <20151026165619.GB2214@cmpxchg.org> <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151104195037.GA6872@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 
04-11-15 14:50:37, Johannes Weiner wrote: [...] > Because it goes without saying that once the cgroupv2 interface is > released, and people use it in production, there is no way we can then > *add* dentry cache, inode cache, and others to memory.current. That > would be an unacceptable change in interface behavior. They would still have to _enable_ the config option _explicitly_. make oldconfig wouldn't change it silently for them. I do not think it is an unacceptable change of behavior if the config is changed explicitly. > On the other > hand, people will be prepared for hiccups in the early stages of > cgroupv2 release, and we're providing cgroup.memory=noslab to let them > workaround severe problems in production until we fix it without > forcing them to fully revert to cgroupv1. This would be true if they moved on to the new cgroup API intentionally. The reality is more complicated though. AFAIK systemd is waiting for cgroup2 already and privileged services enable all available resource controllers by default as I've learned just recently. If we know that the interface is not stable enough then we are basically forcing _most_ users to use the kernel boot parameter if we stay with the current kmem semantic. More on that below. > So if we agree that there are no fundamental architectural concerns > with slab accounting, i.e. nothing that can't be addressed in the > implementation, we have to make the call now. We are on the same page here. > And I maintain that not accounting dentry cache and inode cache is a > gaping hole in memory isolation, so it should be included by default. > (The rest of the slabs is arguable, but IMO the risk of missing > something important is higher than the cost of including them.) More on that below. > As far as your allocation failure concerns go, I think the kmem code > is currently not behaving as Glauber originally intended, which is to > force charge if reclaim and OOM killing weren't able to make enough > space. 
See this recently rewritten section of the kmem charge path: > > - /* > - * try_charge() chose to bypass to root due to OOM kill or > - * fatal signal. Since our only options are to either fail > - * the allocation or charge it to this cgroup, do it as a > - * temporary condition. But we can't fail. From a kmem/slab > - * perspective, the cache has already been selected, by > - * mem_cgroup_kmem_get_cache(), so it is too late to change > - * our minds. > - * > - * This condition will only trigger if the task entered > - * memcg_charge_kmem in a sane state, but was OOM-killed > - * during try_charge() above. Tasks that were already dying > - * when the allocation triggers should have been already > - * directed to the root cgroup in memcontrol.h > - */ > - page_counter_charge(&memcg->memory, nr_pages); > - if (do_swap_account) > - page_counter_charge(&memcg->memsw, nr_pages); > > It could be that this never properly worked as it was tied to the > -EINTR bypass trick, but the idea was these charges never fail. I have always understood this path as a corner case when the task is an OOM victim or exiting. So this would be only a temporary condition which cannot cause a complete runaway. > And this makes sense. If the allocator semantics are such that we > never fail these page allocations for slab, and the callsites rely on > that, surely we should not fail them in the memory controller, either. Then we can only bypass them or loop inside the charge code forever like we do in the page allocator. The latter is really fragile, and it would be even more so in the restricted environment, as we have learned with the memcg OOM killer in the past. > And it makes a lot more sense to account them in excess of the limit > than pretend they don't exist. 
We might not be able to completely > fullfill the containment part of the memory controller (although these > slab charges will still create significant pressure before that), but > at least we don't fail the accounting part on top of it. Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any load could simply run away via kernel allocations. What is even worse, we might not even trigger the memcg OOM killer before we hit the global OOM. So the whole containment goes straight to hell. I can see four options here: 1) enable kmem by default with the current semantic which we know can BUG_ON (at least btrfs is known to hit this) or lead to other issues. 2) enable kmem by default and change the semantic for cgroup2 to allow runaway charges above the hard limit which would defeat the whole purpose of the containment for cgroup2. This can be a temporary workaround until we can afford kmem failures. This has a big risk that we will end up with this permanently because there is a strong pressure that GFP_KERNEL allocations should never fail. Yet this is the most common type of request. Or do we change the consistency with the global case at some point? 3) keep only some (safe) cache types enabled by default with the current failing semantic and require an explicit enabling for the complete kmem accounting. [di]cache code paths should be quite robust to handle allocation failures. 4) disable kmem by default and change the config default later to signal the accounting is safe as far as we are aware and let people enable the functionality on that basis. We would keep the current failing semantic. To me 4) sounds like the safest option because it still keeps the functionality available to those who can benefit from it in v1 already while we are not exposing a potentially buggy behavior to the majority (many of them even unintentionally). Moreover, we can still change the default later on an explicit basis. 
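The two charge semantics weighed in the options above (fail the charge at the hard limit, as in 1, 3 and 4, versus account in excess of the limit, as in 2) can be sketched in user-space C. This is only an illustration under stated assumptions: struct counter and the counter_*() helpers are hypothetical names, not the kernel's page_counter API, and real charging also involves reclaim and the OOM killer, which are left out here.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a per-cgroup page counter. */
struct counter {
	unsigned long usage;	/* pages currently charged */
	unsigned long limit;	/* hard limit, in pages */
};

/* Failing semantic: refuse any charge that would exceed the limit. */
static bool counter_try_charge(struct counter *c, unsigned long nr_pages)
{
	/* Written to avoid unsigned overflow in usage + nr_pages. */
	if (nr_pages > c->limit || c->usage > c->limit - nr_pages)
		return false;
	c->usage += nr_pages;
	return true;
}

/* Forced semantic: always account, even in excess of the limit. */
static void counter_force_charge(struct counter *c, unsigned long nr_pages)
{
	c->usage += nr_pages;
}

/* Pages by which usage exceeds the limit; 0 while within it. */
static unsigned long counter_excess(const struct counter *c)
{
	return c->usage > c->limit ? c->usage - c->limit : 0;
}
```

With a limit of 10 pages and 8 already charged, counter_try_charge() accepts 2 more pages and then refuses a further single page, while counter_force_charge() of 3 pages leaves the counter 3 pages over the limit. The excess stays visible in the accounting, but nothing has contained it, which is the trade-off the options above are arguing about.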
3) sounds like the second best option but I am not really sure whether we can do that very easily without bringing up a lot of unmaintainable mess. 2) sounds like the third best approach but I am afraid it would render the basic use cases unusable for a very long time and kill any interest in cgroup2 for even longer (cargo cults are really hard to get rid of). 1) sounds like a land mine approach which would force many/most users to simply keep using the boot option and force us to re-evaluate the default hard way. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1162154AbbKEQ2K (ORCPT ); Thu, 5 Nov 2015 11:28:10 -0500 Received: from mail-wi0-f174.google.com ([209.85.212.174]:36920 "EHLO mail-wi0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031593AbbKEQ2H (ORCPT ); Thu, 5 Nov 2015 11:28:07 -0500 Date: Thu, 5 Nov 2015 17:28:03 +0100 From: Michal Hocko To: David Miller Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151105162803.GD15111@dhcp22.suse.cz> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105.111609.1695015438589063316.davem@davemloft.net> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 05-11-15 11:16:09, David S. Miller wrote: > From: Michal Hocko > Date: Thu, 5 Nov 2015 15:40:02 +0100 > > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: > > [...] 
> >> Because it goes without saying that once the cgroupv2 interface is > >> released, and people use it in production, there is no way we can then > >> *add* dentry cache, inode cache, and others to memory.current. That > >> would be an unacceptable change in interface behavior. > > > > They would still have to _enable_ the config option _explicitly_. make > > oldconfig wouldn't change it silently for them. I do not think > > it is an unacceptable change of behavior if the config is changed > > explicitly. > > Every user is going to get this config option when they update their > distibution kernel or whatever. > > Then they will all wonder why their networking performance went down. > > This is why I do not want the networking accounting bits on by default > even if the kconfig option is enabled. They must be off by default > and guarded by a static branch so the cost is exactly zero. Yes, that part is clear and Johannes made it clear that the kmem tcp part is disabled by default. Or are you considering also all the slab usage by the networking code as well? 
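The "off by default and guarded by a static branch" requirement quoted above can be sketched roughly as follows. This is a user-space approximation only: a plain bool stands in for a kernel static key, which would patch the check out of the instruction stream rather than test a variable at runtime, and all names here (sock_accounting_enabled, account_socket_pages(), and so on) are made up for illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for a static key. In the kernel the disabled
 * case would cost a patched-out branch, not a memory load and compare.
 */
static bool sock_accounting_enabled;	/* false: off by default */

/* Hypothetical running total of charged socket pages. */
static unsigned long sock_charged_pages;

/* Called once a user explicitly opts in to socket accounting. */
static void sock_accounting_enable(void)
{
	sock_accounting_enabled = true;
}

/* Hot path: does no accounting work at all unless enabled. */
static void account_socket_pages(unsigned long nr_pages)
{
	if (!sock_accounting_enabled)
		return;
	sock_charged_pages += nr_pages;
}
```

Until sock_accounting_enable() is called, account_socket_pages() leaves the total untouched, so a system that never configures the feature sees no change in behavior.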
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1162324AbbKEQaT (ORCPT ); Thu, 5 Nov 2015 11:30:19 -0500 Received: from shards.monkeyblade.net ([149.20.54.216]:36431 "EHLO shards.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031593AbbKEQaQ (ORCPT ); Thu, 5 Nov 2015 11:30:16 -0500 Date: Thu, 05 Nov 2015 11:30:12 -0500 (EST) Message-Id: <20151105.113012.433525933573324396.davem@davemloft.net> To: mhocko@kernel.org Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy From: David Miller In-Reply-To: <20151105162803.GD15111@dhcp22.suse.cz> References: <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> <20151105162803.GD15111@dhcp22.suse.cz> X-Mailer: Mew version 6.6 on Emacs 24.5 / Mule 6.0 (HANACHIRUSATO) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.5.12 (shards.monkeyblade.net [149.20.54.216]); Thu, 05 Nov 2015 08:30:16 -0800 (PST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Michal Hocko Date: Thu, 5 Nov 2015 17:28:03 +0100 > Yes, that part is clear and Johannes made it clear that the kmem tcp > part is disabled by default. Or are you considering also all the slab > usage by the networking code as well? I'm still thinking about the implications of that aspect, and will comment when I have something coherent to say about it. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756953AbbKEWdE (ORCPT ); Thu, 5 Nov 2015 17:33:04 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:41992 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756312AbbKEWdC (ORCPT ); Thu, 5 Nov 2015 17:33:02 -0500 Date: Thu, 5 Nov 2015 17:32:51 -0500 From: Johannes Weiner To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151105223251.GA4427@cmpxchg.org> References: <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105.111609.1695015438589063316.davem@davemloft.net> <20151105162803.GD15111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105162803.GD15111@dhcp22.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote: > On Thu 05-11-15 11:16:09, David S. Miller wrote: > > From: Michal Hocko > > Date: Thu, 5 Nov 2015 15:40:02 +0100 > > > > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: > > > [...] > > >> Because it goes without saying that once the cgroupv2 interface is > > >> released, and people use it in production, there is no way we can then > > >> *add* dentry cache, inode cache, and others to memory.current. That > > >> would be an unacceptable change in interface behavior. > > > > > > They would still have to _enable_ the config option _explicitly_. make > > > oldconfig wouldn't change it silently for them. 
I do not think > > > it is an unacceptable change of behavior if the config is changed > > > explicitly. > > > > Every user is going to get this config option when they update their > > distibution kernel or whatever. > > > > Then they will all wonder why their networking performance went down. > > > > This is why I do not want the networking accounting bits on by default > > even if the kconfig option is enabled. They must be off by default > > and guarded by a static branch so the cost is exactly zero. > > Yes, that part is clear and Johannes made it clear that the kmem tcp > part is disabled by default. Or are you considering also all the slab > usage by the networking code as well? Michal, there shouldn't be any tracking or accounting going on per default when you boot into a fresh system. I removed all accounting and statistics on the system level in cgroupv2, so distribution kernels can compile-time enable a single, feature-complete CONFIG_MEMCG that provides a full memory controller while at the same time puts no overhead on users that don't benefit from mem control at all and just want to use the machine bare-metal. This is completely doable. My new series does it for skmem, but I also want to retrofit the code to eliminate that current overhead for page cache, anonymous memory, slab memory and so forth. This is the only sane way to make the memory controller powerful and generally useful without having to make unreasonable compromises with memory consumers. We shouldn't even be *having* the discussion about whether we should sacrifice the quality of our interface in order to compromise with a class of users that doesn't care about any of this in the first place. So let's eliminate the cost for non-users, but make the memory controller feature-complete and useful--with reasonable cost, implementation, and interface--for our actual userbase. Paying the necessary cost for a functionality you actually want is not the problem. 
Paying for something that doesn't benefit you is. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756917AbbKEWwL (ORCPT ); Thu, 5 Nov 2015 17:52:11 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:42004 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752962AbbKEWwI (ORCPT ); Thu, 5 Nov 2015 17:52:08 -0500 Date: Thu, 5 Nov 2015 17:52:00 -0500 From: Johannes Weiner To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151105225200.GA5432@cmpxchg.org> References: <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105205522.GA1067@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > This would be true if they moved on to the new cgroup API intentionally. > > The reality is more complicated though. AFAIK sysmted is waiting for > > cgroup2 already and privileged services enable all available resource > > controllers by default as I've learned just recently. > > Have you filed a report with them? 
I don't think they should turn them > on unless users explicitely configure resource control for the unit. Okay, verified with systemd people that they're not planning on enabling resource control per default. Inflammatory half-truths, man. This is not constructive. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1033016AbbKFJGX (ORCPT ); Fri, 6 Nov 2015 04:06:23 -0500 Received: from relay.parallels.com ([195.214.232.42]:47279 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030705AbbKFJGQ (ORCPT ); Fri, 6 Nov 2015 04:06:16 -0500 Date: Fri, 6 Nov 2015 12:05:55 +0300 From: Vladimir Davydov To: Johannes Weiner CC: Michal Hocko , David Miller , , , , , , Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106090555.GK29259@esperanza> References: <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20151105205522.GA1067@cmpxchg.org> X-ClientProxiedBy: US-EXCH.sw.swsoft.com (10.255.249.47) To MSK-EXCH1.sw.swsoft.com (10.67.48.55) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: ... > > 3) keep only some (safe) cache types enabled by default with the current > > failing semantic and require an explicit enabling for the complete > > kmem accounting. 
[di]cache code paths should be quite robust to > > handle allocation failures. > > Vladimir, what would be your opinion on this? I'm all for this option. Actually, I've been thinking about this since I introduced the __GFP_NOACCOUNT flag. Not because of the failing semantics, since we can always let kmem allocations breach the limit. This shouldn't be critical, because I don't think it's possible to issue a series of kmem allocations w/o a single user page allocation, which would reclaim/kill the excess. The point is there are allocations that are shared system-wide and therefore shouldn't go to any memcg. Most obvious examples are: mempool users and radix_tree/idr preloads. Accounting them to memcg is likely to result in noticeable memory overhead as memory cgroups are created/destroyed, because they pin dead memory cgroups with all their kmem caches, which aren't tiny. Another funny example is objects destroyed lazily for performance reasons, e.g. vmap_area. Such objects are usually very small, so delaying destruction of a bunch of them will normally go unnoticed. However, if kmemcg is used the effective memory consumption caused by such objects can be multiplied by many times due to dangling kmem caches. We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the problem is they are tricky to identify, because they are scattered all over the kernel source tree. E.g. Dave Chinner mentioned that XFS internals do a lot of allocations that are shared among all XFS filesystems and therefore should not be accounted (BTW that's why list_lru's used by XFS are not marked as memcg-aware). There must be more out there. Besides, kernel developers don't usually even know about kmemcg (they just write the code for their subsys, so why should they?) so they won't care thinking about using __GFP_NOACCOUNT, and hence new falsely-accounted allocations are likely to appear. 
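To make the difference concrete, here is a minimal userspace sketch of the opt-out scheme described above next to an opt-in alternative. It is only an illustration under stated assumptions: the DEMO_* flag bits, enum demo_policy and demo_should_charge() are hypothetical names invented for this example, and they are not the kernel's gfp flags or memcg code.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch only; not the kernel's gfp values. */
#define DEMO_NOACCOUNT 0x1u /* opt-out marker, used by the black-list policy */
#define DEMO_ACCOUNT   0x2u /* opt-in marker, used by the white-list policy */

/* Hypothetical policy selector: which of the two schemes is in effect. */
enum demo_policy { DEMO_BLACKLIST, DEMO_WHITELIST };

/* Decide whether an allocation would be charged to a memory cgroup. */
bool demo_should_charge(enum demo_policy policy, unsigned int flags)
{
	if (policy == DEMO_BLACKLIST)
		return !(flags & DEMO_NOACCOUNT); /* charged unless opted out */
	return (flags & DEMO_ACCOUNT) != 0;       /* charged only if opted in */
}
```

In this sketch an allocation with no annotation at all (flags == 0) is charged under the black-list policy and left uncharged under the white-list policy, which is the case that matters for code written by people who have never heard of kmemcg.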
That said, by switching from a black-list (__GFP_NOACCOUNT) to a white-list (__GFP_ACCOUNT) kmem accounting policy we would make the system more predictable and robust IMO. OTOH what would we lose? Security? Well, containers aren't secure IMHO. In fact, I doubt they will ever be (as secure as VMs). Anyway, if a runaway allocation is reported, it should be trivial to fix by adding __GFP_ACCOUNT where appropriate. If there are no objections, I'll prepare a patch switching to the white-list approach. Let's start with obvious things like fs_struct, mm_struct, task_struct, signal_struct, dentry, inode, which can be easily allocated from user space. This should cover 90% of all allocations that should be accounted AFAICS. The rest will be added later if necessary. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161186AbbKFK5a (ORCPT ); Fri, 6 Nov 2015 05:57:30 -0500 Received: from mail-wi0-f178.google.com ([209.85.212.178]:34285 "EHLO mail-wi0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1033190AbbKFK51 (ORCPT ); Fri, 6 Nov 2015 05:57:27 -0500 Date: Fri, 6 Nov 2015 11:57:24 +0100 From: Michal Hocko To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106105724.GG4390@dhcp22.suse.cz> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> MIME-Version: 1.0 Content-Type: 
text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105225200.GA5432@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 05-11-15 17:52:00, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > > This would be true if they moved on to the new cgroup API intentionally. > > > The reality is more complicated though. AFAIK sysmted is waiting for > > > cgroup2 already and privileged services enable all available resource > > > controllers by default as I've learned just recently. > > > > Have you filed a report with them? I don't think they should turn them > > on unless users explicitely configure resource control for the unit. > > Okay, verified with systemd people that they're not planning on > enabling resource control per default. > > Inflammatory half-truths, man. This is not constructive. What about Delegate=yes feature then? We have just been burnt by this quite heavily. 
AFAIU nspawn@.service and nspawn@.service have this enabled by default http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161428AbbKFNVI (ORCPT ); Fri, 6 Nov 2015 08:21:08 -0500 Received: from mail-wi0-f175.google.com ([209.85.212.175]:35180 "EHLO mail-wi0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1033190AbbKFNVE (ORCPT ); Fri, 6 Nov 2015 08:21:04 -0500 Date: Fri, 6 Nov 2015 14:21:02 +0100 From: Michal Hocko To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106132102.GJ4390@dhcp22.suse.cz> References: <20151027122647.GG9891@dhcp22.suse.cz> <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105205522.GA1067@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 05-11-15 15:55:22, Johannes Weiner wrote: > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote: [...] > > This would be true if they moved on to the new cgroup API intentionally. > > The reality is more complicated though. 
AFAIK sysmted is waiting for > > cgroup2 already and privileged services enable all available resource > > controllers by default as I've learned just recently. > > Have you filed a report with them? I don't think they should turn them > on unless users explicitely configure resource control for the unit. We have just been bitten by this (aka Delegate=yes for some basic services) and our systemd people are supposed to bring this up upstream. I've mentioned that in other email where you accuse me from spreading a FUD. > But what I said still holds: critical production machines don't just > get rolling updates and "accidentally" switch to all this new > code. And those that do take the plunge have the cmdline options. That is exactly my point why I do not think re-evaluating the default config option is a problem at all. The default wouldn't matter for existing users. Those who care can have all the functionality they need right away - be it kmem enabled or disabled. > > > And it makes a lot more sense to account them in excess of the limit > > > than pretend they don't exist. We might not be able to completely > > > fullfill the containment part of the memory controller (although these > > > slab charges will still create significant pressure before that), but > > > at least we don't fail the accounting part on top of it. > > > > Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any > > load could simply runaway via kernel allocations. What is even worse we > > might even not trigger memcg OOM killer before we hit the global OOM. So > > the whole containment goes straight to hell. > > > > I can see four options here: > > 1) enable kmem by default with the current semantic which we know can > > BUG_ON (at least btrfs is known to hit this) or lead to other issues. > > Can you point me to that report? git grep "BUG_ON.*ENOMEM" -- fs/btrfs just to give you a picture. Not all of them are kmalloc and others are not annotated by ENOMEM comment. 
This came out as a result of my last attempt to allow GFP_NOFS to fail (http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko%40kernel.org) > That's not "semantics", that's a bug! Whether or not a feature is > enabled by default, it can not be allowed to crash the kernel. Yes those are bugs and have to be fixed. Not an easy task but nothing which couldn't be solved. It just takes some time. They are not very likely right now because they are reduced to corner cases. But they are more visible with the current kmem accounting semantic. So either we change the semantic or wait until this gets fixed if the accounting should be on by default. > Presenting this as a choice is a bit of a strawman argument. > > > 2) enable kmem by default and change the semantic for cgroup2 to allow > > runaway charges above the hard limit which would defeat the whole > > purpose of the containment for cgroup2. This can be a temporary > > workaround until we can afford kmem failures. This has a big risk > > that we will end up with this permanently because there is a strong > > pressure that GFP_KERNEL allocations should never fail. Yet this is > > the most common type of request. Or do we change the consistency with > > the global case at some point? > > As per 1) we *have* to fail containment eventually if not doing so > means crashes and lockups. That's not a choice of semantics. > > But that doesn't mean we have to give up *immediately* and allow > unrestrained "runaway charges"--again, more of a strawman than a > choice. We can still throttle the allocator and apply significant > pressure on the memory pool, culminating in OOM kills eventually. > > Once we run out of available containment tools, however, we *have* to > follow the semantics of the page and slab allocator and succeed the > request. We can not just return -ENOMEM if that causes kernel bugs. > > That's the only thing we can do right now. 
> > In fact, it's likely going to be the best we will ever be able to do > when it comes to kernel memory accounting. Linus made it clear where > he stands on failing kernel allocations, so all we can do is continue > to improve our containment tools and then give up on containment when > they're exhausted and force the charge past the limit. OK, then we need all the additional measures to keep the hard limit excess bound. > > 3) keep only some (safe) cache types enabled by default with the current > > failing semantic and require an explicit enabling for the complete > > kmem accounting. [di]cache code paths should be quite robust to > > handle allocation failures. > > Vladimir, what would be your opinion on this? > > > 4) disable kmem by default and change the config default later to signal > > the accounting is safe as far as we are aware and let people enable > > the functionality on those basis. We would keep the current failing > > semantic. > > > > To me 4) sounds like the safest option because it still keeps the > > functionality available to those who can benefit from it in v1 already > > while we are not exposing a potentially buggy behavior to the majority > > (many of them even unintentionally). Moreover we still allow to change > > the default later on an explicit basis. > > I'm not interested in fragmenting the interface forever out of caution > because there might be a bug in the implementation right now. As I > said we have to fix any instability in the features we provide whether > they are turned on by default or not. I don't see how this is relevant > to the interface discussion. > > Also, there is no way we can later fundamentally change the semantics > of memory.current, so it would have to remain configurable forever, > forcing people forever to select multiple options in order to piece > together a single logical kernel feature. > > This is really not an option, either. Why not? 
I can clearly see people who would really want to have this disabled and doing that by config is much easier than providing a command line parameter. A config option doesn't add any more maintenance burden than the boot time parameter. > If there are show-stopping bugs in the implementation, I'd rather hold > off the release of the unified hierarchy than commit to a half-assed > interface right out of the gate. If you are willing to postpone releasing cgroup2 until this gets resolved - one way or another - then I have no objections. My impression was that Tejun wanted to release it sooner rather than later. As this mere discussion shows we are not even sure what the kmem failure behavior should be. > The point of v2 is sane interfaces. And the sane interface to me is to use a single set of knobs regardless of memory type. We are currently only discussing what should be accounted by default. My understanding of what David said is that tcp kmem should be enabled only when explicitly opted in. Did I get this wrong? > So let's please focus on fixing any problems that slab accounting may > have, rather than designing complex config options and transition > procedures whose sole purpose is to defer dealing with our issues. > > Please? 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1033408AbbKFN3p (ORCPT ); Fri, 6 Nov 2015 08:29:45 -0500 Received: from mail-wm0-f44.google.com ([74.125.82.44]:37677 "EHLO mail-wm0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932883AbbKFN3n (ORCPT ); Fri, 6 Nov 2015 08:29:43 -0500 Date: Fri, 6 Nov 2015 14:29:40 +0100 From: Michal Hocko To: Vladimir Davydov Cc: Johannes Weiner , David Miller , akpm@linux-foundation.org, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106132940.GK4390@dhcp22.suse.cz> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151106090555.GK29259@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151106090555.GK29259@esperanza> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 06-11-15 12:05:55, Vladimir Davydov wrote: [...] > If there are no objections, I'll prepare a patch switching to the > white-list approach. Let's start from obvious things like fs_struct, > mm_struct, task_struct, signal_struct, dentry, inode, which can be > easily allocated from user space. pipe buffers, kernel stacks and who knows what more. > This should cover 90% of all > allocations that should be accounted AFAICS. The rest will be added > later if necessarily. 
The more I think about that the more I am convinced that is the only sane way forward. The only concern I would have is how we deal with the old interface in cgroup1. We do not want to break existing deployments which might depend on the current behavior. I doubt they do but... -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757554AbbKFQUI (ORCPT ); Fri, 6 Nov 2015 11:20:08 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:42110 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751644AbbKFQUG (ORCPT ); Fri, 6 Nov 2015 11:20:06 -0500 Date: Fri, 6 Nov 2015 11:19:53 -0500 From: Johannes Weiner To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106161953.GA7813@cmpxchg.org> References: <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151106105724.GG4390@dhcp22.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote: > On Thu 05-11-15 17:52:00, Johannes Weiner wrote: > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > > 
> This would be true if they moved on to the new cgroup API intentionally. > > > > The reality is more complicated though. AFAIK sysmted is waiting for > > > > cgroup2 already and privileged services enable all available resource > > > > controllers by default as I've learned just recently. > > > > > > Have you filed a report with them? I don't think they should turn them > > > on unless users explicitely configure resource control for the unit. > > > > Okay, verified with systemd people that they're not planning on > > enabling resource control per default. > > > > Inflammatory half-truths, man. This is not constructive. > > What about Delegate=yes feature then? We have just been burnt by this > quite heavily. AFAIU nspawn@.service and nspawn@.service have this > enabled by default > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html That's when you launch a *container* and want it to be able to use nested resource control. We're talking about actual container users here. It's not turning on resource control for all "privileged services", which is what we were worried about here. Can you at least admit that when you yourself link to the refuting evidence? And if you've been "burnt quite heavily" by this, where is your bug report to stop other users from getting "burnt quite heavily" as well? All I read here is vague inflammatory language to spread FUD. You might think sending these emails is helpful, but it really isn't. Not only is it not contributing code, insights, or solutions, you're now actively sabotaging someone else's effort to build something. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757546AbbKFQgJ (ORCPT ); Fri, 6 Nov 2015 11:36:09 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:42122 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751615AbbKFQgG (ORCPT ); Fri, 6 Nov 2015 11:36:06 -0500 Date: Fri, 6 Nov 2015 11:35:55 -0500 From: Johannes Weiner To: Vladimir Davydov Cc: Michal Hocko , David Miller , akpm@linux-foundation.org, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106163555.GB7813@cmpxchg.org> References: <20151027154138.GA4665@cmpxchg.org> <20151027161554.GJ9891@dhcp22.suse.cz> <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151106090555.GK29259@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151106090555.GK29259@esperanza> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 06, 2015 at 12:05:55PM +0300, Vladimir Davydov wrote: > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > ... > > > 3) keep only some (safe) cache types enabled by default with the current > > > failing semantic and require an explicit enabling for the complete > > > kmem accounting. [di]cache code paths should be quite robust to > > > handle allocation failures. > > > > Vladimir, what would be your opinion on this? > > I'm all for this option. 
Actually, I've been thinking about this since I > introduced the __GFP_NOACCOUNT flag. Not because of the failing > semantics, since we can always let kmem allocations breach the limit. > This shouldn't be critical, because I don't think it's possible to issue > a series of kmem allocations w/o a single user page allocation, which > would reclaim/kill the excess. > > The point is there are allocations that are shared system-wide and > therefore shouldn't go to any memcg. Most obvious examples are: mempool > users and radix_tree/idr preloads. Accounting them to memcg is likely to > result in noticeable memory overhead as memory cgroups are > created/destroyed, because they pin dead memory cgroups with all their > kmem caches, which aren't tiny. > > Another funny example is objects destroyed lazily for performance > reasons, e.g. vmap_area. Such objects are usually very small, so > delaying destruction of a bunch of them will normally go unnoticed. > However, if kmemcg is used the effective memory consumption caused by > such objects can be multiplied by many times due to dangling kmem > caches. > > We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the > problem is they are tricky to identify, because they are scattered all > over the kernel source tree. E.g. Dave Chinner mentioned that XFS > internals do a lot of allocations that are shared among all XFS > filesystems and therefore should not be accounted (BTW that's why > list_lru's used by XFS are not marked as memcg-aware). There must be > more out there. Besides, kernel developers don't usually even know about > kmemcg (they just write the code for their subsys, so why should they?) > so they won't care thinking about using __GFP_NOACCOUNT, and hence new > falsely-accounted allocations are likely to appear. > > That said, by switching from black-list (__GFP_NOACCOUNT) to white-list > (__GFP_ACCOUNT) kmem accounting policy we would make the system more > predictable and robust IMO. 
OTOH what would we lose? Security? Well, > containers aren't secure IMHO. In fact, I doubt they will ever be (as > secure as VMs). Anyway, if a runaway allocation is reported, it should > be trivial to fix by adding __GFP_ACCOUNT where appropriate. I wholeheartedly agree with all of this. > If there are no objections, I'll prepare a patch switching to the > white-list approach. Let's start from obvious things like fs_struct, > mm_struct, task_struct, signal_struct, dentry, inode, which can be > easily allocated from user space. This should cover 90% of all > allocations that should be accounted AFAICS. The rest will be added > later if necessarily. Awesome, I'm looking forward to that patch! From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161525AbbKFQrD (ORCPT ); Fri, 6 Nov 2015 11:47:03 -0500 Received: from mail-wm0-f53.google.com ([74.125.82.53]:38652 "EHLO mail-wm0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751365AbbKFQrA (ORCPT ); Fri, 6 Nov 2015 11:47:00 -0500 Date: Fri, 6 Nov 2015 17:46:57 +0100 From: Michal Hocko To: Johannes Weiner Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106164657.GL4390@dhcp22.suse.cz> References: <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<20151106161953.GA7813@cmpxchg.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 06-11-15 11:19:53, Johannes Weiner wrote: > On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote: > > On Thu 05-11-15 17:52:00, Johannes Weiner wrote: > > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > > > > This would be true if they moved on to the new cgroup API intentionally. > > > > > The reality is more complicated though. AFAIK systemd is waiting for > > > > > cgroup2 already and privileged services enable all available resource > > > > > controllers by default as I've learned just recently. > > > > > > > > Have you filed a report with them? I don't think they should turn them > > > > on unless users explicitly configure resource control for the unit. > > > > > > Okay, verified with systemd people that they're not planning on > > > enabling resource control per default. > > > > > > Inflammatory half-truths, man. This is not constructive. > > > > What about Delegate=yes feature then? We have just been burnt by this > > quite heavily. AFAIU nspawn@.service and nspawn@.service have this > > enabled by default > > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html > > That's when you launch a *container* and want it to be able to use > nested resource control. Oops, copy&paste error here. The second one was user@.service. So it is not only about containers AFAIU but all user defined sessions. > We're talking about actual container users here. It's not turning on > resource control for all "privileged services", which is what we were > worried about here. Can you at least admit that when you yourself link > to the refuting evidence? My bad, that was a misunderstanding of the changelog. 
> And if you've been "burnt quite heavily" by this, where is your bug > report to stop other users from getting "burnt quite heavily" as well? The bug report is still internal because it is tracking an unreleased product. We have ended up reverting the Delegate feature. Our systemd developers are supposed to bring this up with the upstream. The basic problem was that the Delegate feature has been backported to our systemd package without further consideration and that has invalidated a lot of performance testing because some resource controllers have measurable effects on those benchmarks. > All I read here is vague inflammatory language to spread FUD. I was merely pointing out that the memory controller might be enabled without _user_ actually even noticing because the controller wasn't enabled explicitly. I haven't blamed anybody for that. > You might think sending these emails is helpful, but it really > isn't. Not only is it not contributing code, insights, or solutions, > you're now actively sabotaging someone else's effort to build something. Come on! Are you even serious? 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161792AbbKFRpr (ORCPT ); Fri, 6 Nov 2015 12:45:47 -0500 Received: from gum.cmpxchg.org ([85.214.110.215]:42138 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932340AbbKFRpp (ORCPT ); Fri, 6 Nov 2015 12:45:45 -0500 Date: Fri, 6 Nov 2015 12:45:17 -0500 From: Johannes Weiner To: Michal Hocko Cc: David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151106174517.GA9315@cmpxchg.org> References: <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> <20151106164657.GL4390@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151106164657.GL4390@dhcp22.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 06, 2015 at 05:46:57PM +0100, Michal Hocko wrote: > The basic problem was that the Delegate feature has been backported to > our systemd package without further consideration and that has > invalidated a lot of performance testing because some resource > controllers have measurable effects on those benchmarks. You're talking about a userspace bug. No amount of fragmenting and layering and opt-in in the kernel's runtime configuration space is going to help you if you screw up and enable it all by accident. 
> > All I read here is vague inflammatory language to spread FUD. > > I was merely pointing out that memory controller might be enabled without > _user_ actually even noticing because the controller wasn't enabled > explicitly. I haven't blamed anybody for that. Why does that have anything to do with how we design our interface? We can't do more than present a sane interface in good faith and lobby userspace projects if we think they misuse it. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757951AbbKGDpp (ORCPT ); Fri, 6 Nov 2015 22:45:45 -0500 Received: from shards.monkeyblade.net ([149.20.54.216]:53010 "EHLO shards.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751116AbbKGDpn (ORCPT ); Fri, 6 Nov 2015 22:45:43 -0500 Date: Fri, 06 Nov 2015 22:45:41 -0500 (EST) Message-Id: <20151106.224541.1640743718816725953.davem@davemloft.net> To: mhocko@kernel.org Cc: hannes@cmpxchg.org, akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy From: David Miller In-Reply-To: <20151106164657.GL4390@dhcp22.suse.cz> References: <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> <20151106164657.GL4390@dhcp22.suse.cz> X-Mailer: Mew version 6.7 on Emacs 24.5 / Mule 6.0 (HANACHIRUSATO) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.5.12 (shards.monkeyblade.net [149.20.54.216]); Fri, 06 Nov 2015 19:45:43 -0800 (PST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Michal Hocko Date: Fri, 6 Nov 2015 17:46:57 +0100 > On Fri 06-11-15 11:19:53, Johannes Weiner wrote: >> You might think sending these 
emails is helpful, but it really >> isn't. Not only is it not contributing code, insights, or solutions, >> you're now actively sabotaging someone else's effort to build something. > > Come on! Are you even serious? He is, and I agree %100 with him FWIW. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932212AbbKLSlp (ORCPT ); Thu, 12 Nov 2015 13:41:45 -0500 Received: from outbound-smtp08.blacknight.com ([46.22.139.13]:39704 "EHLO outbound-smtp08.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753013AbbKLSlo (ORCPT ); Thu, 12 Nov 2015 13:41:44 -0500 Date: Thu, 12 Nov 2015 18:36:20 +0000 From: Mel Gorman To: Johannes Weiner Cc: Michal Hocko , David Miller , akpm@linux-foundation.org, vdavydov@virtuozzo.com, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Message-ID: <20151112183620.GC14880@techsingularity.net> References: <20151027164227.GB7749@cmpxchg.org> <20151029152546.GG23598@dhcp22.suse.cz> <20151029161009.GA9160@cmpxchg.org> <20151104104239.GG29607@dhcp22.suse.cz> <20151104195037.GA6872@cmpxchg.org> <20151105144002.GB15111@dhcp22.suse.cz> <20151105205522.GA1067@cmpxchg.org> <20151105225200.GA5432@cmpxchg.org> <20151106105724.GG4390@dhcp22.suse.cz> <20151106161953.GA7813@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20151106161953.GA7813@cmpxchg.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 06, 2015 at 11:19:53AM -0500, Johannes Weiner wrote: > On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote: > > On Thu 05-11-15 17:52:00, Johannes Weiner wrote: > > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote: > > > > 
On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote: > > > > > This would be true if they moved on to the new cgroup API intentionally. > > > > > The reality is more complicated though. AFAIK systemd is waiting for > > > > > cgroup2 already and privileged services enable all available resource > > > > > controllers by default as I've learned just recently. > > > > > > > > Have you filed a report with them? I don't think they should turn them > > > > on unless users explicitly configure resource control for the unit. > > > > > > Okay, verified with systemd people that they're not planning on > > > enabling resource control per default. > > > > > > Inflammatory half-truths, man. This is not constructive. > > > > What about Delegate=yes feature then? We have just been burnt by this > > quite heavily. AFAIU nspawn@.service and nspawn@.service have this > > enabled by default > > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html > > That's when you launch a *container* and want it to be able to use > nested resource control. > > We're talking about actual container users here. It's not turning on > resource control for all "privileged services", which is what we were > worried about here. Can you at least admit that when you yourself link > to the refuting evidence? > > And if you've been "burnt quite heavily" by this, where is your bug > report to stop other users from getting "burnt quite heavily" as well? > I didn't read this thread in detail but the lack of public information on problems with cgroup controllers is partially my fault so I'd like to correct that. https://bugzilla.suse.com/show_bug.cgi?id=954765 This bug documents some of the impact that was incurred due to ssh sessions being resource controlled by default. It talks primarily about pipetest being impacted by cpu,cpuacct. It was also found in the recent past that dbench4 was impacted because the blkio controller was enabled.
That bug is not public but basically dbench4 regressed 80% as the journal thread was in a different cgroup than dbench4. dbench4 would stall for 8ms in case more IO was issued before the journal thread could issue any IO. The openSUSE bug 954765 is not affected by blkio because it's disabled by a distribution-specific patch. Mike Galbraith adds some additional information on why activating the cpu controller can have an impact on semantics even if the overhead was zero. It may be the case that it's an oversight by the systemd developers and the intent was only to affect containers. My experience was that everything was affected. It also may be the case that this is an openSUSE-specific problem due to how the maintainers packaged systemd. I don't actually know and hopefully the bug will be able to determine if upstream is really affected or not. There is also a link to this bug on the upstream project so there is some chance they are aware: https://github.com/systemd/systemd/issues/1715 Bottom line, there is legitimate confusion over whether cgroup controllers are going to be enabled by default or not in the future. If they are enabled by default, there is a non-zero cost to that and a change in semantics that people may or may not be surprised by. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Thu, 22 Oct 2015 21:45:10 +0300 Message-ID: <20151022184509.GM18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-Id: netdev.vger.kernel.org Hi Johannes, On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: ... > Patch #5 adds accounting and tracking of socket memory to the unified > hierarchy memory controller, as described above. It uses the existing > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > Patch #8 uses the vmpressure extension to equalize pressure between > the pages tracked natively by the VM and socket buffer pages. As the > pool is shared, it makes sense that while natively tracked pages are > under duress the network transmit windows are also not increased. First of all, I've no experience in networking, so I'm likely to be mistaken. Nevertheless I beg to disagree that this patch set is a step in the right direction. Here goes why. I admit that your idea to get rid of explicit tcp window control knobs and size it dynamically basing on memory pressure instead does sound tempting, but I don't think it'd always work. The problem is that in contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only stop growing them. Now suppose a system hasn't experienced memory pressure for a while. If we don't have explicit tcp window limit, tcp buffers on such a system might have eaten almost all available memory (because of network load/problems). If a user workload that needs a significant amount of memory is started suddenly then, the network code will receive a notification and surely stop growing buffers, but all those buffers accumulated won't disappear instantly. As a result, the workload might be unable to find enough free memory and have no choice but invoke OOM killer. This looks unexpected from the user POV. 
That said, I think we do need per memcg tcp window control similar to what we have system-wide. In other words, Glauber's work makes sense to me. You might want to point me at my RFC patch where I proposed to revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my mind since then. Now I think I was mistaken, luckily I was stopped. However, I may be mistaken again :-) Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Date: Thu, 22 Oct 2015 21:46:12 +0300 Message-ID: <20151022184612.GN18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-Id: netdev.vger.kernel.org On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > The tcp memory controller has extensive provisions for future memory > accounting interfaces that won't materialize after all. Cut the code > base down to what's actually used, now and in the likely future. > > - There won't be any different protocol counters in the future, so a > direct sock->sk_memcg linkage is enough. This eliminates a lot of > callback maze and boilerplate code, and restores most of the socket > allocation code to pre-tcp_memcontrol state. > > - There won't be a tcp control soft limit, so integrating the memcg In fact, the code is ready for the "soft" limit (I mean min, pressure, max tuple), it just lacks a knob.
> code into the global skmem limiting scheme complicates things > unnecessarily. Replace all that with simple and clear charge and > uncharge calls--hidden behind a jump label--to account skb memory. > > - The previous jump label code was an elaborate state machine that > tracked the number of cgroups with an active socket limit in order > to enable the skmem tracking and accounting code only when actively > necessary. But this is overengineered: it was meant to protect the > people who never use this feature in the first place. Simply enable > the branches once when the first limit is set until the next reboot. > ... > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > if (!sk->sk_prot->memory_pressure) > return false; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - AFAIU, now we won't shrink the window on hitting the limit, i.e. this patch subtly changes the behavior of the existing knobs, potentially breaking them. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Date: Thu, 22 Oct 2015 21:48:53 +0300 Message-ID: <20151022184852.GP18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-8-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <1445487696-21545-8-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-Id: netdev.vger.kernel.org On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote: ... > @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > } > } > > + vmpressure(sc->gfp_mask, memcg, > + sc->nr_scanned - scanned, > + sc->nr_reclaimed - reclaimed); > + > /* > * Direct reclaim and kswapd have to scan all memory > * cgroups to fulfill the overall scan target for the > @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > } > } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > > - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, > - sc->nr_scanned - nr_scanned, > - sc->nr_reclaimed - nr_reclaimed); > - > if (sc->nr_reclaimed - nr_reclaimed) > reclaimable = true; > I may be mistaken, but AFAIU this patch subtly changes the behavior of vmpressure visible from the userspace: w/o this patch a userspace process will only receive a notification for a memory cgroup only if *this* memory cgroup calls reclaimer; with this patch userspace notification will be issued even if reclaimer is invoked by any cgroup up the hierarchy. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Date: Fri, 23 Oct 2015 16:42:56 +0300 Message-ID: <20151023134256.GS18351@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <1445487696-21545-4-git-send-email-hannes@cmpxchg.org> <20151022184612.GN18351@esperanza> <20151022190943.GA20871@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <20151022190943.GA20871@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-Id: netdev.vger.kernel.org On Thu, Oct 22, 2015 at 03:09:43PM -0400, Johannes Weiner wrote: > On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote: > > On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > > > The tcp memory controller has extensive provisions for future memory > > > accounting interfaces that won't materialize after all. Cut the code > > > base down to what's actually used, now and in the likely future. > > > > > > - There won't be any different protocol counters in the future, so a > > > direct sock->sk_memcg linkage is enough. This eliminates a lot of > > > callback maze and boilerplate code, and restores most of the socket > > > allocation code to pre-tcp_memcontrol state. > > > > > > - There won't be a tcp control soft limit, so integrating the memcg > > > > In fact, the code is ready for the "soft" limit (I mean min, pressure, > > max tuple), it just lacks a knob. > > Yeah, but that's not going to materialize if the entire interface for > dedicated tcp throttling is considered obsolete. May be, it shouldn't be. My current understanding is that per memcg tcp window control is necessary, because: - We need to be able to protect a containerized workload from its growing network buffers. Using vmpressure notifications for that does not look reassuring to me. - We need a way to limit network buffers of a particular container, otherwise it can fill the system-wide window throttling other containers, which is unfair. 
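The "min, pressure, max" tuple referred to in the quoted text above can be sketched roughly as follows. This is a toy user-space model, assuming the semantics documented for the global tcp_mem sysctl (pressure mode is entered above the middle value, left again once usage drops to the low value, and allocations are refused above the high value). The names toy_tcp_mem and toy_tcp_mem_account are hypothetical and are not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical toy model of a (min, pressure, max) limit tuple. */
struct toy_tcp_mem {
	long min;            /* at or below this: no pressure */
	long pressure;       /* above this: enter pressure mode */
	long max;            /* above this: refuse the allocation */
	bool under_pressure; /* sticky until usage falls back to min */
};

/*
 * Update the pressure state for the given usage (in pages) and
 * return whether an allocation resulting in that usage is allowed.
 */
static bool toy_tcp_mem_account(struct toy_tcp_mem *t, long usage)
{
	if (usage <= t->min) {
		t->under_pressure = false;
		return true;
	}
	if (usage > t->pressure)
		t->under_pressure = true;
	/* Between min and pressure the previous state is kept (hysteresis). */
	return usage <= t->max;
}
```

The point of the middle threshold is that consumers get a "stop growing" phase before the hard limit is reached, which a single hard limit does not provide.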
> > > > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) > > > if (!sk->sk_prot->memory_pressure) > > > return false; > > > > > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > > > - return !!sk->sk_cgrp->memory_pressure; > > > - > > > > AFAIU, now we won't shrink the window on hitting the limit, i.e. this > > patch subtly changes the behavior of the existing knobs, potentially > > breaking them. > > Hm, but there is no grace period in which something meaningful could > happen with the window shrinking, is there? Any buffer allocation is > still going to fail hard. AFAIU when we hit the limit, we not only throttle the socket which allocates, but also try to release space reserved by other sockets. After your patch we won't. This looks unfair to me. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Mon, 26 Oct 2015 13:22:16 -0400 Message-ID: <20151026172216.GC2214@cmpxchg.org> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Vladimir Davydov Return-path: Content-Disposition: inline In-Reply-To: <20151022184509.GM18351@esperanza> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-Id: netdev.vger.kernel.org On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote: > Hi Johannes, > > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: > ... > > Patch #5 adds accounting and tracking of socket memory to the unified > > hierarchy memory controller, as described above. It uses the existing > > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > > > Patch #8 uses the vmpressure extension to equalize pressure between > > the pages tracked natively by the VM and socket buffer pages. As the > > pool is shared, it makes sense that while natively tracked pages are > > under duress the network transmit windows are also not increased. > > First of all, I've no experience in networking, so I'm likely to be > mistaken. Nevertheless I beg to disagree that this patch set is a step > in the right direction. Here goes why. > > I admit that your idea to get rid of explicit tcp window control knobs > and size it dynamically basing on memory pressure instead does sound > tempting, but I don't think it'd always work. The problem is that in > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only > stop growing them. Now suppose a system hasn't experienced memory > pressure for a while. If we don't have explicit tcp window limit, tcp > buffers on such a system might have eaten almost all available memory > (because of network load/problems). 
If a user workload that needs a > significant amount of memory is started suddenly then, the network code > will receive a notification and surely stop growing buffers, but all > those buffers accumulated won't disappear instantly. As a result, the > workload might be unable to find enough free memory and have no choice > but invoke OOM killer. This looks unexpected from the user POV. I'm not getting rid of those knobs, I'm just reusing the old socket accounting infrastructure in an attempt to make the memory accounting feature useful to more people in cgroups v2 (unified hierarchy). We can always come back to think about per-cgroup tcp window limits in the unified hierarchy, my patches don't get in the way of this. I'm not removing the knobs in cgroups v1 and I'm not preventing them in v2. But regardless of tcp window control, we need to account socket memory in the main memory accounting pool where pressure is shared (to the best of our abilities) between all accounted memory consumers. From an interface standpoint alone, I don't think it's reasonable to ask users per default to limit different consumers on a case by case basis. I certainly have no problem with fine-tuning for scenarios you describe above, but with memory.current, memory.high, memory.max we are providing a generic interface to account and contain memory consumption of workloads. This has to include all major memory consumers to make semantic sense. But also, there are people right now for whom the socket buffers cause system OOM, but the existing memcg hard tcp window limit absolutely wrecks network performance for them. It's not usable the way it is. It'd be much better to have the socket buffers exert pressure on the shared pool, and then propagate the overall pressure back to individual consumers with reclaim, shrinkers, vmpressure etc.
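To illustrate the shared-pool idea described above, here is a minimal user-space sketch. It is not code from this series and does not use the kernel's page_counter or memcg interfaces; every name in it (toy_pool, toy_charge, toy_uncharge, toy_under_pressure) is hypothetical. It assumes a single pool with a "high" pressure threshold and a "max" hard limit, loosely modelled on memory.high and memory.max.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; not the kernel's page_counter or memcg API. */
enum toy_kind { TOY_ANON, TOY_CACHE, TOY_SOCK, TOY_NR_KINDS };

struct toy_pool {
	size_t usage[TOY_NR_KINDS]; /* pages charged per consumer type */
	size_t high;                /* pressure threshold, in pages */
	size_t max;                 /* hard limit, in pages */
};

/* Total pages charged to the pool by all consumer types. */
static size_t toy_total(const struct toy_pool *p)
{
	size_t sum = 0;

	for (int i = 0; i < TOY_NR_KINDS; i++)
		sum += p->usage[i];
	return sum;
}

/* Charge nr pages; fails only if the shared limit would be exceeded. */
static bool toy_charge(struct toy_pool *p, enum toy_kind kind, size_t nr)
{
	if (toy_total(p) + nr > p->max)
		return false;
	p->usage[kind] += nr;
	return true;
}

/* Return nr pages to the pool (clamped to what was charged). */
static void toy_uncharge(struct toy_pool *p, enum toy_kind kind, size_t nr)
{
	p->usage[kind] -= nr < p->usage[kind] ? nr : p->usage[kind];
}

/* Pressure is a property of the whole pool, not of one consumer. */
static bool toy_under_pressure(const struct toy_pool *p)
{
	return toy_total(p) > p->high;
}
```

In this model socket pages fail to charge only when the group as a whole is out of memory, and an idle socket share is available to anon and cache pages, which is the contrast with a separate socket counter.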
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Tue, 27 Oct 2015 11:43:21 +0300 Message-ID: <20151027084320.GF13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <20151026172216.GC2214@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-Id: netdev.vger.kernel.org On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote: > On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote: > > Hi Johannes, > > > > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: > > ... > > > Patch #5 adds accounting and tracking of socket memory to the unified > > > hierarchy memory controller, as described above. It uses the existing > > > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > > > > > Patch #8 uses the vmpressure extension to equalize pressure between > > > the pages tracked natively by the VM and socket buffer pages. As the > > > pool is shared, it makes sense that while natively tracked pages are > > > under duress the network transmit windows are also not increased. > > > > First of all, I've no experience in networking, so I'm likely to be > > mistaken. Nevertheless I beg to disagree that this patch set is a step > > in the right direction. Here goes why. > > > > I admit that your idea to get rid of explicit tcp window control knobs > > and size it dynamically basing on memory pressure instead does sound > > tempting, but I don't think it'd always work. The problem is that in > > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only > > stop growing them. 
Now suppose a system hasn't experienced memory > > pressure for a while. If we don't have explicit tcp window limit, tcp > > buffers on such a system might have eaten almost all available memory > > (because of network load/problems). If a user workload that needs a > > significant amount of memory is started suddenly then, the network code > > will receive a notification and surely stop growing buffers, but all > > those buffers accumulated won't disappear instantly. As a result, the > > workload might be unable to find enough free memory and have no choice > > but invoke OOM killer. This looks unexpected from the user POV. > > I'm not getting rid of those knobs, I'm just reusing the old socket > accounting infrastructure in an attempt to make the memory accounting > feature useful to more people in cgroups v2 (unified hierarchy). > My understanding is that in the meantime you effectively break the existing per memcg tcp window control logic. > We can always come back to think about per-cgroup tcp window limits in > the unified hierarchy, my patches don't get in the way of this. I'm > not removing the knobs in cgroups v1 and I'm not preventing them in v2. > > But regardless of tcp window control, we need to account socket memory > in the main memory accounting pool where pressure is shared (to the > best of our abilities) between all accounted memory consumers. > No objections to this point. However, I really don't like the idea to charge tcp window size to memory.current instead of charging individual pages consumed by the workload for storing socket buffers, because it is inconsistent with what we have now. Can't we charge individual skb pages as we do in case of other kmem allocations? > From an interface standpoint alone, I don't think it's reasonable to > ask users per default to limit different consumers on a case by case > basis. 
I certainly have no problem with finetuning for scenarios you > describe above, but with memory.current, memory.high, memory.max we > are providing a generic interface to account and contain memory > consumption of workloads. This has to include all major memory > consumers to make semantical sense. We can propose a reasonable default as we do in the global case. > > But also, there are people right now for whom the socket buffers cause > system OOM, but the existing memcg's hard tcp window limitq that > exists absolutely wrecks network performance for them. It's not usable > the way it is. It'd be much better to have the socket buffers exert > pressure on the shared pool, and then propagate the overall pressure > back to individual consumers with reclaim, shrinkers, vmpressure etc. > This might or might not work. I'm not an expert to judge. But if you do this only for memcg leaving the global case as it is, networking people won't budge IMO. So could you please start such a major rework from the global case? Could you please try to deprecate the tcp window limits not only in the legacy memcg hierarchy, but also system-wide in order to attract attention of networking experts? Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Wed, 28 Oct 2015 11:20:03 +0300 Message-ID: <20151028082003.GK13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: "David S. 
Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <20151027155833.GB4665@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-Id: netdev.vger.kernel.org On Tue, Oct 27, 2015 at 09:01:08AM -0700, Johannes Weiner wrote: ... > > > But regardless of tcp window control, we need to account socket memory > > > in the main memory accounting pool where pressure is shared (to the > > > best of our abilities) between all accounted memory consumers. > > > > > > > No objections to this point. However, I really don't like the idea to > > charge tcp window size to memory.current instead of charging individual > > pages consumed by the workload for storing socket buffers, because it is > > inconsistent with what we have now. Can't we charge individual skb pages > > as we do in case of other kmem allocations? > > Absolutely, both work for me. I chose that route because it's where > the networking code already tracks and accounts memory consumed, so it > seemed like a better site to hook into. > > But I understand your concerns. We want to track this stuff as close > to the memory allocators as possible. Exactly. > > > > But also, there are people right now for whom the socket buffers cause > > > system OOM, but the existing memcg's hard tcp window limitq that > > > exists absolutely wrecks network performance for them. It's not usable > > > the way it is. It'd be much better to have the socket buffers exert > > > pressure on the shared pool, and then propagate the overall pressure > > > back to individual consumers with reclaim, shrinkers, vmpressure etc. > > > > This might or might not work. I'm not an expert to judge. But if you do > > this only for memcg leaving the global case as it is, networking people > > won't budge IMO. So could you please start such a major rework from the > > global case? 
Could you please try to deprecate the tcp window limits not > > only in the legacy memcg hierarchy, but also system-wide in order to > > attract attention of networking experts? > > I'm definitely interested in addressing this globally as well. > > The idea behind this was to use the memcg part as a testbed. cgroup2 > is going to be new and people are prepared for hiccups when migrating > their applications to it; and they can roll back to cgroup1 and tcp > window limits at any time should they run into problems in production. Then you'd better not touch existing tcp limits at all, because they just work, and the logic behind them is very close to that of global tcp limits. I don't think one can simplify it somehow. Moreover, frankly I still have my reservations about this vmpressure propagation to skb you're proposing. It might work, but I doubt it will allow us to throw away explicit tcp limit, as I explained previously. So, even with your approach I think we can still need per memcg tcp limit *unless* you get rid of global tcp limit somehow. > > So this seemed like a good way to prove a new mechanism before rolling > it out to every single Linux setup, rather than switch everybody over > after the limited scope testing I can do as a developer on my own. > > Keep in mind that my patches are not committing anything in terms of > interface, so we retain all the freedom to fix and tune the way this > is implemented, including the freedom to re-add tcp window limits in > case the pressure balancing is not a comprehensive solution. > I really dislike this kind of proof. It looks like you're trying to push something you think is right covertly, w/o having a proper discussion with networking people and then say that it just works and hence should be done globally, but what if it won't? Revert it? We already have a lot of dubious stuff in memcg that should be reverted, so let's please try to avoid this kind of mistakes in future. 
Note, I say "w/o having a proper discussion with networking people", because I don't think they will really care *unless* you change the global logic, simply because most of them aren't very interested in memcg AFAICS. That effectively means you lose a chance to listen to networking experts, who could point you at design flaws and propose an improvement right away. Let's please not miss such an opportunity. You said that you'd seen this problem happen w/o cgroups, so you have a use case that might need fixing at the global level. IMO it shouldn't be difficult to prepare an RFC patch for the global case first and see what people think about it. Thanks, Vladimir From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vladimir Davydov Subject: Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Date: Thu, 29 Oct 2015 12:27:47 +0300 Message-ID: <20151029092747.GR13221@esperanza> References: <1445487696-21545-1-git-send-email-hannes@cmpxchg.org> <20151022184509.GM18351@esperanza> <20151026172216.GC2214@cmpxchg.org> <20151027084320.GF13221@esperanza> <20151027155833.GB4665@cmpxchg.org> <20151028082003.GK13221@esperanza> <20151028185810.GA31488@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: "David S. Miller" , Andrew Morton , Michal Hocko , Tejun Heo , , , , To: Johannes Weiner Return-path: Content-Disposition: inline In-Reply-To: <20151028185810.GA31488@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-Id: netdev.vger.kernel.org On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote: > On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote: > > Then you'd better not touch existing tcp limits at all, because they > > just work, and the logic behind them is very close to that of global tcp > > limits. 
I don't think one can simplify it somehow. > > Uhm, no, there is a crapload of boilerplate code and complication that > seems entirely unnecessary. The only thing missing from my patch seems > to be the part where it enters memory pressure state when the limit is > hit. I'm adding this for completeness, but I doubt it even matters. > > > Moreover, frankly I still have my reservations about this vmpressure > > propagation to skb you're proposing. It might work, but I doubt it > > will allow us to throw away explicit tcp limit, as I explained > > previously. So, even with your approach I think we can still need > > per memcg tcp limit *unless* you get rid of global tcp limit > > somehow. > > Having the hard limit as a failsafe (or a minimum for other consumers) > is one thing, and certainly something I'm open to for cgroupv2, should > we have problems with load startup up after a socket memory landgrab. > > That being said, if the VM is struggling to reclaim pages, or is even > swapping, it makes perfect sense to let the socket memory scheduler > know it shouldn't continue to increase its footprint until the VM > recovers. Regardless of any hard limitations/minimum guarantees. > > This is what my patch does and it seems pretty straight-forward to > me. I don't really understand why this is so controversial. I'm not arguing that the idea behind this patch set is necessarily bad. Quite the contrary, it does look interesting to me. I'm just saying that IMO it can't replace hard/soft limits. It probably could if it was possible to shrink buffers, but I don't think it's feasible, even theoretically. That's why I propose not to change the behavior of the existing per memcg tcp limit at all. And frankly I don't get why you are so keen on simplifying it. You say it's a "crapload of boilerplate code". Well, I don't see how it is - it just replicates global knobs and I don't see how it could be done in a better way. 
The code is hidden behind jump labels, so the overhead is zero if it
isn't used. If you really dislike this code, we can isolate it under a
separate config option. But all right, I don't rule out the possibility
that the code could be simplified. If you do that w/o breaking it,
that'll be OK with me, but I don't see why it should be related to this
particular patch set.

>
> The *next* step would be to figure out whether we can actually
> *reclaim* memory in the network subsystem--shrink windows and steal
> buffers back--and that might even be an avenue to replace tcp window
> limits. But it's not necessary for *this* patch series to be useful.

Again, I don't think we can *reclaim* network memory, but you're right.

>
> > > So this seemed like a good way to prove a new mechanism before rolling
> > > it out to every single Linux setup, rather than switch everybody over
> > > after the limited scope testing I can do as a developer on my own.
> > >
> > > Keep in mind that my patches are not committing anything in terms of
> > > interface, so we retain all the freedom to fix and tune the way this
> > > is implemented, including the freedom to re-add tcp window limits in
> > > case the pressure balancing is not a comprehensive solution.
> >
> > I really dislike this kind of proof. It looks like you're trying to
> > push something you think is right covertly, w/o having a proper
> > discussion with networking people and then say that it just works
> > and hence should be done globally, but what if it won't? Revert it?
> > We already have a lot of dubious stuff in memcg that should be
> > reverted, so let's please try to avoid this kind of mistake in the
> > future. Note, I say "w/o having a proper discussion with networking
> > people", because I don't think they will really care *unless* you
> > change the global logic, simply because most of them aren't very
> > interested in memcg AFAICS.
>
> Come on, Dave is the first To and netdev is CC'd.
> They might not care about memcg, but "pushing things covertly" is a
> bit of a stretch.

Sorry if it sounded rude to you. I just look back at my experience
patching slab internals to make kmem accountable, and AFAICS Christoph
didn't really care about *what* I was doing, he only cared about the
global case - if there was no performance degradation when kmemcg was
disabled, he was usually fine with it, even if from the memcg pov it was
crap.

Anyway, I can't force you to patch the global case first or
simultaneously with the memcg case, so let's just hope I'm a bit
overcautious.

>
> > That effectively means you lose a chance to listen to networking
> > experts, who could point you at design flaws and propose an improvement
> > right away. Let's please not miss such an opportunity. You said that
> > you'd seen this problem happen w/o cgroups, so you have a use case that
> > might need fixing at the global level. IMO it shouldn't be difficult to
> > prepare an RFC patch for the global case first and see what people think
> > about it.
>
> No, the problem we are running into is when network memory is not
> tracked per cgroup. The lack of containment means that the socket
> memory consumption of individual cgroups can trigger system OOM.
>
> We tried using the per-memcg tcp limits, and that prevents the OOMs
> for sure, but it's horrendous for network performance. There is no
> "stop growing" phase, it just keeps going full throttle until it hits
> the wall hard.
>
> Now, we could probably try to replicate the global knobs and add a
> per-memcg soft limit. But you know better than anyone else how hard it
> is to estimate the overall workingset size of a workload, and the
> margins on containerized loads are razor-thin. Performance is much
> more sensitive to input errors, and often times parameters must be
> adjusted continuously during the runtime of a workload. It'd be
> disastrous to rely on yet more static, error-prone user input here.
Yeah, but the dynamic approach proposed in your patch set doesn't
guarantee we won't hit OOM in memcg due to overgrown buffers. It just
reduces this possibility. Of course, memcg OOM is not nearly as
disastrous as a global one, but still it usually means workload
breakage.

The static approach is error-prone for sure, but it has existed for
years and worked satisfactorily AFAIK.

>
> What all this means to me is that fixing it on the cgroup level has
> higher priority. But it also means that once we figured it out under
> such a high-pressure environment, it's much easier to apply to the
> global case and potentially replace the soft limit there.
>
> This seems like a better approach to me than starting globally, only
> to realize that the solution is not workable for cgroups and we need
> yet something else.
>

Are we in a rush? I think if you try your approach at the global level
and fail, it's still good, because it will probably give us all a better
understanding of the problem. If you successfully fix the global case,
but then realize that it doesn't fit memcg, it's even better, because
you actually fixed a problem. If you patch both global and memcg cases,
it's perfect.

But of course, that's my understanding and I may be mistaken. Let's hope
you're right.

Thanks,
Vladimir
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladimir Davydov
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
Date: Fri, 6 Nov 2015 12:05:55 +0300
Message-ID: <20151106090555.GK29259@esperanza>
References: <20151027122647.GG9891@dhcp22.suse.cz>
 <20151027154138.GA4665@cmpxchg.org>
 <20151027161554.GJ9891@dhcp22.suse.cz>
 <20151027164227.GB7749@cmpxchg.org>
 <20151029152546.GG23598@dhcp22.suse.cz>
 <20151029161009.GA9160@cmpxchg.org>
 <20151104104239.GG29607@dhcp22.suse.cz>
 <20151104195037.GA6872@cmpxchg.org>
 <20151105144002.GB15111@dhcp22.suse.cz>
 <20151105205522.GA1067@cmpxchg.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Cc: Michal Hocko , David Miller , , , , , ,
To: Johannes Weiner
Return-path:
Content-Disposition: inline
In-Reply-To: <20151105205522.GA1067-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: netdev.vger.kernel.org

On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
...
> > 3) keep only some (safe) cache types enabled by default with the current
> > failing semantic and require an explicit enabling for the complete
> > kmem accounting. [di]cache code paths should be quite robust to
> > handle allocation failures.
>
> Vladimir, what would be your opinion on this?

I'm all for this option. Actually, I've been thinking about this since I
introduced the __GFP_NOACCOUNT flag. Not because of the failing
semantics, since we can always let kmem allocations breach the limit.
This shouldn't be critical, because I don't think it's possible to issue
a series of kmem allocations w/o a single user page allocation, which
would reclaim/kill the excess.

The point is there are allocations that are shared system-wide and
therefore shouldn't go to any memcg. The most obvious examples are
mempool users and radix_tree/idr preloads.
Accounting them to memcg is likely to result in noticeable memory
overhead as memory cgroups are created/destroyed, because they pin dead
memory cgroups with all their kmem caches, which aren't tiny.

Another funny example is objects destroyed lazily for performance
reasons, e.g. vmap_area. Such objects are usually very small, so
delaying destruction of a bunch of them will normally go unnoticed.
However, if kmemcg is used, the effective memory consumption caused by
such objects can be multiplied many times due to dangling kmem caches.

We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
problem is they are tricky to identify, because they are scattered all
over the kernel source tree. E.g. Dave Chinner mentioned that XFS
internals do a lot of allocations that are shared among all XFS
filesystems and therefore should not be accounted (BTW that's why
list_lru's used by XFS are not marked as memcg-aware). There must be
more out there. Besides, kernel developers don't usually even know about
kmemcg (they just write the code for their subsys, so why should they?)
so they won't bother to think about using __GFP_NOACCOUNT, and hence new
falsely-accounted allocations are likely to appear.

That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
(__GFP_ACCOUNT) kmem accounting policy we would make the system more
predictable and robust IMO. OTOH what would we lose? Security? Well,
containers aren't secure IMHO. In fact, I doubt they will ever be (as
secure as VMs). Anyway, if a runaway allocation is reported, it should
be trivial to fix by adding __GFP_ACCOUNT where appropriate.

If there are no objections, I'll prepare a patch switching to the
white-list approach. Let's start with obvious things like fs_struct,
mm_struct, task_struct, signal_struct, dentry, inode, which can be
easily allocated from user space. This should cover 90% of all
allocations that should be accounted AFAICS. The rest will be added
later if necessary.

Thanks,
Vladimir
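
[ Editorial sketch, not part of the original mail: the difference between
the black-list and white-list accounting policies discussed above can be
illustrated with a small userspace model. Everything here is an
assumption made for illustration. The SKETCH_* bit values and the
helper names are hypothetical and do not match any kernel header; only
the flag names __GFP_NOACCOUNT and __GFP_ACCOUNT come from the thread. ]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. These bit values are hypothetical stand-ins
 * for the __GFP_NOACCOUNT and __GFP_ACCOUNT flags named in the thread.
 */
#define SKETCH_GFP_NOACCOUNT 0x1u /* black-list: opt out of accounting */
#define SKETCH_GFP_ACCOUNT   0x2u /* white-list: opt in to accounting */

/* Black-list policy: every allocation is charged unless it opts out. */
bool charged_blacklist(unsigned int flags)
{
	return !(flags & SKETCH_GFP_NOACCOUNT);
}

/* White-list policy: nothing is charged unless it opts in. */
bool charged_whitelist(unsigned int flags)
{
	return (flags & SKETCH_GFP_ACCOUNT) != 0;
}

/* Walk through the cases the mail is concerned with. */
void sketch_selfcheck(void)
{
	/* A shared, system-wide allocation whose author never heard of kmemcg. */
	unsigned int unannotated = 0;
	/* An allocation that user space can trigger easily, explicitly marked. */
	unsigned int user_driven = SKETCH_GFP_ACCOUNT;

	/* Under the black-list, the forgotten annotation is falsely charged. */
	assert(charged_blacklist(unannotated));
	/* Under the white-list, the same allocation stays unaccounted. */
	assert(!charged_whitelist(unannotated));
	/* Explicitly marked allocations are charged under the white-list. */
	assert(charged_whitelist(user_driven));
}
```

In this model the default for an unannotated allocation flips from
"charged" to "not charged", which is the property the mail argues makes
the white-list more predictable: a missing annotation then shows up as an
unaccounted allocation rather than as a dead cgroup pinned by a shared
object.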