From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: [patch 0/2] mm: memcg reclaim integration followups
Date: Tue, 10 Jan 2012 16:02:50 +0100
Message-ID: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
Return-path:
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Andrew Morton
Cc: Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

Hi,

here are two patches based on memcg-aware global reclaim, which I
dropped from the initial series to focus on the exclusive-lru changes.

The first one is per-memcg reclaim statistics. For now, they include
only pages scanned and pages reclaimed, separately for direct reclaim
and kswapd, as well as separately for internal pressure or reclaim due
to parental memcgs.

The second one is integrating soft limit reclaim into the now
memcg-aware global reclaim path. It kills a lot of code and performs
better as far as I have tested it. Furthermore, Ying is working on
turning soft limits into guarantees, as discussed in Prague, and this
patch is also in preparation for that.

Sorry for the odd point in time to submit this, I guess this will mean
3.4 at the earliest. But the soft limit removal is a bit heavy weight
so it's probably easier conflict-wise to have it at the bottom of the
-mm stack.

 Documentation/cgroups/memory.txt |    4 +
 include/linux/memcontrol.h       |   28 ++-
 mm/memcontrol.c                  |  482 +++++++++--------------------------------
 mm/vmscan.c                      |   87 ++------
 4 files changed, 144 insertions(+), 457 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: [patch 1/2] mm: memcg: per-memcg reclaim statistics
Date: Tue, 10 Jan 2012 16:02:51 +0100
Message-ID: <1326207772-16762-2-git-send-email-hannes@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
Return-path:
In-Reply-To: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Andrew Morton
Cc: Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

With the single per-zone LRU gone and global reclaim scanning
individual memcgs, it's straightforward to collect meaningful and
accurate per-memcg reclaim statistics.

This adds the following items to memory.stat:

pgreclaim
pgscan

  Number of pages reclaimed/scanned from that memcg due to its own
  hard limit (or physical limit in case of the root memcg) by the
  allocating task.

kswapd_pgreclaim
kswapd_pgscan

  Reclaim activity from kswapd due to the memcg's own limit. Only
  applicable to the root memcg for now since kswapd is only triggered
  by physical limits, but kswapd-style reclaim based on memcg hard
  limits is being developed.

hierarchy_pgreclaim
hierarchy_pgscan
hierarchy_kswapd_pgreclaim
hierarchy_kswapd_pgscan

  Reclaim activity due to limitations in one of the memcg's parents.
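
The way the eight counters above map onto array slots can be modelled in
a few lines of userspace C. This is only a sketch of the indexing scheme,
not kernel code: struct memcg_model, the EV_* names and account_reclaim()
are hypothetical stand-ins for struct mem_cgroup, the MEM_CGROUP_EVENTS_*
indices and mem_cgroup_account_reclaim(). It assumes a non-NULL root and
leaves out the per-cpu counters and preemption handling the kernel needs.
The offsets 2 (kswapd) and 4 (hierarchy) are the ones the patch defines.

```c
#include <assert.h>
#include <stdbool.h>

/* Offsets as defined by the patch: added to the base slot pair. */
enum { EV_KSWAPD = 2, EV_HIERARCHY = 4 };

/* Hypothetical model of the eight reclaim counters, in patch order. */
enum reclaim_event {
    EV_PGRECLAIM,
    EV_PGSCAN,
    EV_KSWAPD_PGRECLAIM,
    EV_KSWAPD_PGSCAN,
    EV_HIERARCHY_PGRECLAIM,
    EV_HIERARCHY_PGSCAN,
    EV_HIERARCHY_KSWAPD_PGRECLAIM,
    EV_HIERARCHY_KSWAPD_PGSCAN,
    EV_NSTATS,
};

/* Hypothetical stand-in for struct mem_cgroup: counters only. */
struct memcg_model {
    const char *name;
    unsigned long events[EV_NSTATS];
};

/*
 * Charge one reclaim pass to @memcg.  @root is the group whose limit
 * triggered reclaim; if it differs from @memcg, the pressure came from
 * a parent and the hierarchy_* slots are used.
 */
static void account_reclaim(const struct memcg_model *root,
                            struct memcg_model *memcg,
                            unsigned long nr_reclaimed,
                            unsigned long nr_scanned,
                            bool kswapd)
{
    unsigned int offset = 0;

    if (kswapd)
        offset += EV_KSWAPD;
    if (root != memcg)
        offset += EV_HIERARCHY;

    /* The largest offset (6) still lands inside the counter array. */
    assert(EV_PGSCAN + offset < EV_NSTATS);

    memcg->events[EV_PGRECLAIM + offset] += nr_reclaimed;
    memcg->events[EV_PGSCAN + offset] += nr_scanned;
}
```

In this model, a kswapd pass triggered at the root that reclaims 3 of 10
scanned pages from a child group shows up in the child's
hierarchy_kswapd_pgreclaim and hierarchy_kswapd_pgscan slots, while
reclaim caused by the child's own limit from the allocating task shows up
in pgreclaim and pgscan.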
Signed-off-by: Johannes Weiner --- Documentation/cgroups/memory.txt | 4 ++ include/linux/memcontrol.h | 10 +++++ mm/memcontrol.c | 84 +++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 7 +++ 4 files changed, 103 insertions(+), 2 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index cc0ebc5..eb9e982 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -389,6 +389,10 @@ mapped_file - # of bytes of mapped file (includes tmpfs/shmem) pgpgin - # of pages paged in (equivalent to # of charging events). pgpgout - # of pages paged out (equivalent to # of uncharging events). swap - # of bytes of swap usage +pgreclaim - # of pages reclaimed due to this memcg's limit +pgscan - # of pages scanned due to this memcg's limit +kswapd_* - # reclaim activity by background daemon due to this memcg's limit +hierarchy_* - # reclaim activity due to pressure from parental memcg inactive_anon - # of bytes of anonymous memory and swap cache memory on LRU list. 
active_anon - # of bytes of anonymous and swap cache memory on active diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index bd3b102..6c1d69e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -121,6 +121,8 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone); struct zone_reclaim_stat* mem_cgroup_get_reclaim_stat_from_page(struct page *page); +void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *, + unsigned long, unsigned long, bool); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); extern void mem_cgroup_replace_page_cache(struct page *oldpage, @@ -347,6 +349,14 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page) return NULL; } +static inline void mem_cgroup_account_reclaim(struct mem_cgroup *root, + struct mem_cgroup *memcg, + unsigned long nr_reclaimed, + unsigned long nr_scanned, + bool kswapd) +{ +} + static inline void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8e2a80d..170dff4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_NSTATS, }; +#define MEM_CGROUP_EVENTS_KSWAPD 2 +#define MEM_CGROUP_EVENTS_HIERARCHY 4 + enum mem_cgroup_events_index { MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */ MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */ MEM_CGROUP_EVENTS_COUNT, /* # of pages paged in/out */ MEM_CGROUP_EVENTS_PGFAULT, /* # of page-faults */ MEM_CGROUP_EVENTS_PGMAJFAULT, /* # of major page-faults */ + MEM_CGROUP_EVENTS_PGRECLAIM, + MEM_CGROUP_EVENTS_PGSCAN, + MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM, + MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, + MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM, + MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN, + MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM, + MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN, MEM_CGROUP_EVENTS_NSTATS, }; /* @@ -889,6 
+900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) return (memcg == root_mem_cgroup); } +/** + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics + * @root: memcg that triggered reclaim + * @memcg: memcg that is actually being scanned + * @nr_reclaimed: number of pages reclaimed from @memcg + * @nr_scanned: number of pages scanned from @memcg + * @kswapd: whether reclaiming task is kswapd or allocator itself + */ +void mem_cgroup_account_reclaim(struct mem_cgroup *root, + struct mem_cgroup *memcg, + unsigned long nr_reclaimed, + unsigned long nr_scanned, + bool kswapd) +{ + unsigned int offset = 0; + + if (!root) + root = root_mem_cgroup; + + if (kswapd) + offset += MEM_CGROUP_EVENTS_KSWAPD; + if (root != memcg) + offset += MEM_CGROUP_EVENTS_HIERARCHY; + + preempt_disable(); + __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGRECLAIM + offset], + nr_reclaimed); + __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGSCAN + offset], + nr_scanned); + preempt_enable(); +} + void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx) { struct mem_cgroup *memcg; @@ -1662,6 +1705,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; while (1) { + unsigned long nr_reclaimed; + victim = mem_cgroup_iter(root_memcg, victim, &reclaim); if (!victim) { loop++; @@ -1687,8 +1732,11 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, } if (!mem_cgroup_reclaimable(victim, false)) continue; - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, - zone, &nr_scanned); + nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false, + zone, &nr_scanned); + mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed, + nr_scanned, current_is_kswapd()); + total += nr_reclaimed; *total_scanned += nr_scanned; if (!res_counter_soft_limit_excess(&root_memcg->res)) break; @@ -4023,6 +4071,14 @@ enum { 
MCS_SWAP, MCS_PGFAULT, MCS_PGMAJFAULT, + MCS_PGRECLAIM, + MCS_PGSCAN, + MCS_KSWAPD_PGRECLAIM, + MCS_KSWAPD_PGSCAN, + MCS_HIERARCHY_PGRECLAIM, + MCS_HIERARCHY_PGSCAN, + MCS_HIERARCHY_KSWAPD_PGRECLAIM, + MCS_HIERARCHY_KSWAPD_PGSCAN, MCS_INACTIVE_ANON, MCS_ACTIVE_ANON, MCS_INACTIVE_FILE, @@ -4047,6 +4103,14 @@ struct { {"swap", "total_swap"}, {"pgfault", "total_pgfault"}, {"pgmajfault", "total_pgmajfault"}, + {"pgreclaim", "total_pgreclaim"}, + {"pgscan", "total_pgscan"}, + {"kswapd_pgreclaim", "total_kswapd_pgreclaim"}, + {"kswapd_pgscan", "total_kswapd_pgscan"}, + {"hierarchy_pgreclaim", "total_hierarchy_pgreclaim"}, + {"hierarchy_pgscan", "total_hierarchy_pgscan"}, + {"hierarchy_kswapd_pgreclaim", "total_hierarchy_kswapd_pgreclaim"}, + {"hierarchy_kswapd_pgscan", "total_hierarchy_kswapd_pgscan"}, {"inactive_anon", "total_inactive_anon"}, {"active_anon", "total_active_anon"}, {"inactive_file", "total_inactive_file"}, @@ -4079,6 +4143,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *memcg, struct mcs_total_stat *s) s->stat[MCS_PGFAULT] += val; val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT); s->stat[MCS_PGMAJFAULT] += val; + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGRECLAIM); + s->stat[MCS_PGRECLAIM] += val; + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGSCAN); + s->stat[MCS_PGSCAN] += val; + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM); + s->stat[MCS_KSWAPD_PGRECLAIM] += val; + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN); + s->stat[MCS_KSWAPD_PGSCAN] += val; + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM); + s->stat[MCS_HIERARCHY_PGRECLAIM] += val; + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN); + s->stat[MCS_HIERARCHY_PGSCAN] += val; + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM); + s->stat[MCS_HIERARCHY_KSWAPD_PGRECLAIM] += val; + val = mem_cgroup_read_events(memcg, 
MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN); + s->stat[MCS_HIERARCHY_KSWAPD_PGSCAN] += val; /* per zone stat */ val = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON)); diff --git a/mm/vmscan.c b/mm/vmscan.c index c631234..e3fd8a7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2115,12 +2115,19 @@ static void shrink_zone(int priority, struct zone *zone, memcg = mem_cgroup_iter(root, NULL, &reclaim); do { + unsigned long nr_reclaimed = sc->nr_reclaimed; + unsigned long nr_scanned = sc->nr_scanned; struct mem_cgroup_zone mz = { .mem_cgroup = memcg, .zone = zone, }; shrink_mem_cgroup_zone(priority, &mz, sc); + + mem_cgroup_account_reclaim(root, memcg, + sc->nr_reclaimed - nr_reclaimed, + sc->nr_scanned - nr_scanned, + current_is_kswapd()); /* * Limit reclaim has historically picked one memcg and * scanned it with decreasing priority levels until -- 1.7.7.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Tue, 10 Jan 2012 16:02:52 +0100 Message-ID: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> Return-path: In-Reply-To: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Andrew Morton Cc: Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Right now, memcg soft limits are implemented by having a sorted tree of memcgs that are in excess of their limits.
Under global memory pressure, kswapd first reclaims from the biggest
excessor and then proceeds to do regular global reclaim. The result
of this is that pages are reclaimed from all memcgs, but more scanning
happens against those above their soft limit.

With global reclaim doing memcg-aware hierarchical reclaim by default,
this is a lot easier to implement: every time a memcg is reclaimed
from, scan more aggressively (per tradition with a priority of 0) if
it's above its soft limit. The end result is the same: everybody is
scanned, but soft limit excessors a bit more.

Advantages:

  o smoother reclaim: soft limit reclaim is currently a separate stage
    before global reclaim, whose result is not communicated down the
    line and so overreclaim of the groups in excess is very likely.
    After this patch, soft limit reclaim is fully integrated into
    regular reclaim and each memcg is considered exactly once per
    cycle.

  o true hierarchy support: soft limits are currently only considered
    when kswapd does global reclaim, but after this patch, targeted
    reclaim of a memcg will mind the soft limit settings of its child
    groups.

  o code size: soft limit reclaim currently requires a lot of code to
    maintain the per-node per-zone rb-trees to quickly find the
    biggest offender, dedicated paths for soft limit reclaim etc.,
    while this new implementation gets away without all that.

Test:

The test consists of two concurrent kernel build jobs in separate
source trees, the master and the slave. The two jobs get along nicely
on 600MB of available memory, so this is the zero overcommit control
case. When available memory is decreased, the overcommit is
compensated by decreasing the soft limit of the slave by the same
amount, in the hope that the slave takes the hit and the master stays
unaffected.
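
The rule from the changelog above (scan a memcg at priority 0 when it,
or one of its parents up to the reclaim root, exceeds its soft limit)
can be sketched in userspace C as follows. This is an illustration under
stated assumptions, not the kernel implementation: struct group,
over_softlimit() and effective_priority() are hypothetical names, a
plain usage/soft_limit pair replaces the res_counter, and the group
without a parent plays the role of root_mem_cgroup, which has no soft
limit. A NULL root stands for global reclaim.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg in a hierarchy. */
struct group {
    struct group *parent;      /* NULL for the top-level group */
    unsigned long usage;       /* pages currently charged */
    unsigned long soft_limit;  /* soft limit in pages */
};

/*
 * Walk from @g up towards @root (or to the top if @root is NULL) and
 * report whether any group on the way exceeds its soft limit.
 */
static bool over_softlimit(const struct group *root, const struct group *g)
{
    for (; g; g = g->parent) {
        /* The top-level group does not have a soft limit. */
        if (g->parent == NULL)
            break;
        if (g->usage > g->soft_limit)
            return true;
        /* Do not look above the group that triggered reclaim. */
        if (g == root)
            break;
    }
    return false;
}

/* Priority 0 means maximum scan pressure; higher values scan less. */
static int effective_priority(const struct group *root,
                              const struct group *g, int priority)
{
    assert(priority >= 0);
    return over_softlimit(root, g) ? 0 : priority;
}
```

With a parent group over its soft limit, a well-behaved child is still
scanned at priority 0 during global reclaim, because it contributes to
the parent's excess. The same child reclaimed with itself as the root
keeps the regular priority, since the walk stops at the root.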
                                       600M-0M-vanilla         600M-0M-patched
Master walltime (s)                  552.65 (  +0.00%)       552.38 (  -0.05%)
Master walltime (stddev)               1.25 (  +0.00%)         0.92 ( -14.66%)
Master major faults                  204.38 (  +0.00%)       205.38 (  +0.49%)
Master major faults (stddev)          27.16 (  +0.00%)        13.80 ( -47.43%)
Master reclaim                        31.88 (  +0.00%)        37.75 ( +17.87%)
Master reclaim (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
Master scan                           31.88 (  +0.00%)        37.75 ( +17.87%)
Master scan (stddev)                  34.01 (  +0.00%)        75.88 (+119.59%)
Master kswapd reclaim              33922.12 (  +0.00%)     33887.12 (  -0.10%)
Master kswapd reclaim (stddev)       969.08 (  +0.00%)       492.22 ( -49.16%)
Master kswapd scan                 34085.75 (  +0.00%)     33985.75 (  -0.29%)
Master kswapd scan (stddev)         1101.07 (  +0.00%)       563.33 ( -48.79%)
Slave walltime (s)                   552.68 (  +0.00%)       552.12 (  -0.10%)
Slave walltime (stddev)                0.79 (  +0.00%)         1.05 ( +14.76%)
Slave major faults                   212.50 (  +0.00%)       204.50 (  -3.75%)
Slave major faults (stddev)           26.90 (  +0.00%)        13.17 ( -49.20%)
Slave reclaim                         26.12 (  +0.00%)        35.00 ( +32.72%)
Slave reclaim (stddev)                29.42 (  +0.00%)        74.91 (+149.55%)
Slave scan                            31.38 (  +0.00%)        35.00 ( +11.20%)
Slave scan (stddev)                   33.31 (  +0.00%)        74.91 (+121.24%)
Slave kswapd reclaim               34259.00 (  +0.00%)     33469.88 (  -2.30%)
Slave kswapd reclaim (stddev)        925.15 (  +0.00%)       565.07 ( -38.88%)
Slave kswapd scan                  34354.62 (  +0.00%)     33555.75 (  -2.33%)
Slave kswapd scan (stddev)           969.62 (  +0.00%)       581.70 ( -39.97%)

In the control case, the differences in elapsed time, number of major
faults taken, and reclaim statistics are within the noise for both the
master and the slave job.
                                     600M-280M-vanilla       600M-280M-patched
Master walltime (s)                  595.13 (  +0.00%)       553.19 (  -7.04%)
Master walltime (stddev)               8.31 (  +0.00%)         2.57 ( -61.64%)
Master major faults                 3729.75 (  +0.00%)       783.25 ( -78.98%)
Master major faults (stddev)         258.79 (  +0.00%)       226.68 ( -12.36%)
Master reclaim                       705.00 (  +0.00%)        29.50 ( -95.68%)
Master reclaim (stddev)              232.87 (  +0.00%)        44.72 ( -80.45%)
Master scan                          714.88 (  +0.00%)        30.00 ( -95.67%)
Master scan (stddev)                 237.44 (  +0.00%)        45.39 ( -80.54%)
Master kswapd reclaim                114.75 (  +0.00%)        50.00 ( -55.94%)
Master kswapd reclaim (stddev)       128.51 (  +0.00%)         9.45 ( -91.93%)
Master kswapd scan                   115.75 (  +0.00%)        50.00 ( -56.32%)
Master kswapd scan (stddev)          130.31 (  +0.00%)         9.45 ( -92.04%)
Slave walltime (s)                   631.18 (  +0.00%)       577.68 (  -8.46%)
Slave walltime (stddev)                9.89 (  +0.00%)         3.63 ( -57.47%)
Slave major faults                 28401.75 (  +0.00%)     14656.75 ( -48.39%)
Slave major faults (stddev)         2629.97 (  +0.00%)      1911.81 ( -27.30%)
Slave reclaim                      65400.62 (  +0.00%)      1479.62 ( -97.74%)
Slave reclaim (stddev)             11623.02 (  +0.00%)      1482.13 ( -87.24%)
Slave scan                       9050047.88 (  +0.00%)     95968.25 ( -98.94%)
Slave scan (stddev)              1912786.94 (  +0.00%)     93390.71 ( -95.12%)
Slave kswapd reclaim              327894.50 (  +0.00%)    227099.88 ( -30.74%)
Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)     16113.14 ( -27.71%)
Slave kswapd scan               34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
Slave kswapd scan (stddev)       2523642.98 (  +0.00%)    156754.74 ( -93.79%)

Here, the available memory is limited to 320 MB, so the machine is
overcommitted by 280 MB. The soft limit of the master is 300 MB, that
of the slave merely 20 MB.

Looking at the slave job first, it is much better off with the patched
kernel: direct reclaim is almost gone and kswapd reclaim is decreased
by a third. The result is far fewer major faults taken, which in turn
lets the job finish quicker.

It would be a zero-sum game if the improvement happened at the cost of
the master, but looking at the numbers, even the master performs better
with the patched kernel.
In fact, the master job is almost unaffected on the patched kernel compared to the control case. This is an odd phenomenon, as the patch does not directly change how the master is reclaimed. An explanation for this is that the severe overreclaim of the slave in the unpatched kernel results in the master growing bigger than in the patched case. Combining the fact that memcgs are scanned according to their size with the increased refault rate of the overreclaimed slave triggering global reclaim more often means that overall pressure on the master job is higher in the unpatched kernel. At any rate, the patched kernel seems to do a much better job at both overall resource allocation under soft limit overcommit as well as the requested prioritization of the master job. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 18 +-- mm/memcontrol.c | 412 ++++---------------------------------------- mm/vmscan.c | 80 +-------- 3 files changed, 48 insertions(+), 462 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6c1d69e..72368b7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone); struct zone_reclaim_stat* mem_cgroup_get_reclaim_stat_from_page(struct page *page); +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *); void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *, unsigned long, unsigned long, bool); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, @@ -155,9 +156,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); u64 mem_cgroup_get_limit(struct mem_cgroup *memcg); void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); @@ 
-362,22 +360,20 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_inc_page_stat(struct page *page, - enum mem_cgroup_page_stat_item idx) +static inline bool +mem_cgroup_over_softlimit(struct mem_cgroup *root, struct mem_cgroup *memcg) { + return false; } -static inline void mem_cgroup_dec_page_stat(struct page *page, +static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { } -static inline -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) +static inline void mem_cgroup_dec_page_stat(struct page *page, + enum mem_cgroup_page_stat_item idx) { - return 0; } static inline diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 170dff4..d4f7ae5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -35,7 +35,6 @@ #include #include #include -#include #include #include #include @@ -118,12 +117,10 @@ enum mem_cgroup_events_index { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, - MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; #define THRESHOLDS_EVENTS_TARGET (128) -#define SOFTLIMIT_EVENTS_TARGET (1024) #define NUMAINFO_EVENTS_TARGET (1024) struct mem_cgroup_stat_cpu { @@ -149,12 +146,6 @@ struct mem_cgroup_per_zone { struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; struct zone_reclaim_stat reclaim_stat; - struct rb_node tree_node; /* RB tree node */ - unsigned long long usage_in_excess;/* Set to the value by which */ - /* the soft limit is exceeded*/ - bool on_tree; - struct mem_cgroup *mem; /* Back pointer, we cannot */ - /* use container_of */ }; /* Macro for accessing counter */ #define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) @@ -167,26 +158,6 @@ struct mem_cgroup_lru_info { struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES]; }; -/* - * Cgroups above their limits are maintained in a RB-Tree, independent of - * their hierarchy 
representation - */ - -struct mem_cgroup_tree_per_zone { - struct rb_root rb_root; - spinlock_t lock; -}; - -struct mem_cgroup_tree_per_node { - struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES]; -}; - -struct mem_cgroup_tree { - struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; -}; - -static struct mem_cgroup_tree soft_limit_tree __read_mostly; - struct mem_cgroup_threshold { struct eventfd_ctx *eventfd; u64 threshold; @@ -343,7 +314,6 @@ static bool move_file(void) * limit reclaim to prevent infinite loops, if they ever occur. */ #define MEM_CGROUP_MAX_RECLAIM_LOOPS (100) -#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2) enum charge_type { MEM_CGROUP_CHARGE_TYPE_CACHE = 0, @@ -398,164 +368,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page) return mem_cgroup_zoneinfo(memcg, nid, zid); } -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_node_zone(int nid, int zid) -{ - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_from_page(struct page *page) -{ - int nid = page_to_nid(page); - int zid = page_zonenum(page); - - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static void -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz, - unsigned long long new_usage_in_excess) -{ - struct rb_node **p = &mctz->rb_root.rb_node; - struct rb_node *parent = NULL; - struct mem_cgroup_per_zone *mz_node; - - if (mz->on_tree) - return; - - mz->usage_in_excess = new_usage_in_excess; - if (!mz->usage_in_excess) - return; - while (*p) { - parent = *p; - mz_node = rb_entry(parent, struct mem_cgroup_per_zone, - tree_node); - if (mz->usage_in_excess < mz_node->usage_in_excess) - p = &(*p)->rb_left; - /* - * We can't avoid mem cgroups that are over their soft - * limit by the same amount - */ - else if (mz->usage_in_excess >= 
mz_node->usage_in_excess) - p = &(*p)->rb_right; - } - rb_link_node(&mz->tree_node, parent, p); - rb_insert_color(&mz->tree_node, &mctz->rb_root); - mz->on_tree = true; -} - -static void -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - if (!mz->on_tree) - return; - rb_erase(&mz->tree_node, &mctz->rb_root); - mz->on_tree = false; -} - -static void -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - spin_lock(&mctz->lock); - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - spin_unlock(&mctz->lock); -} - - -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) -{ - unsigned long long excess; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - int nid = page_to_nid(page); - int zid = page_zonenum(page); - mctz = soft_limit_tree_from_page(page); - - /* - * Necessary to update all ancestors when hierarchy is used. - * because their event counter is not touched. - */ - for (; memcg; memcg = parent_mem_cgroup(memcg)) { - mz = mem_cgroup_zoneinfo(memcg, nid, zid); - excess = res_counter_soft_limit_excess(&memcg->res); - /* - * We have to update the tree if mz is on RB-tree or - * mem is over its softlimit. - */ - if (excess || mz->on_tree) { - spin_lock(&mctz->lock); - /* if on-tree, remove it */ - if (mz->on_tree) - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - /* - * Insert again. mz->usage_in_excess will be updated. - * If excess is 0, no tree ops. 
- */ - __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - } - } -} - -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) -{ - int node, zone; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - - for_each_node_state(node, N_POSSIBLE) { - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - mz = mem_cgroup_zoneinfo(memcg, node, zone); - mctz = soft_limit_tree_node_zone(node, zone); - mem_cgroup_remove_exceeded(memcg, mz, mctz); - } - } -} - -static struct mem_cgroup_per_zone * -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct rb_node *rightmost = NULL; - struct mem_cgroup_per_zone *mz; - -retry: - mz = NULL; - rightmost = rb_last(&mctz->rb_root); - if (!rightmost) - goto done; /* Nothing to reclaim from */ - - mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node); - /* - * Remove the node now but someone else can add it back, - * we will to add it back at the end of reclaim to its correct - * position in the tree. - */ - __mem_cgroup_remove_exceeded(mz->mem, mz, mctz); - if (!res_counter_soft_limit_excess(&mz->mem->res) || - !css_tryget(&mz->mem->css)) - goto retry; -done: - return mz; -} - -static struct mem_cgroup_per_zone * -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct mem_cgroup_per_zone *mz; - - spin_lock(&mctz->lock); - mz = __mem_cgroup_largest_soft_limit_node(mctz); - spin_unlock(&mctz->lock); - return mz; -} - /* * Implementation Note: reading percpu statistics for memcg. 
* @@ -696,9 +508,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; - case MEM_CGROUP_TARGET_SOFTLIMIT: - next = val + SOFTLIMIT_EVENTS_TARGET; - break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -718,13 +527,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) { preempt_disable(); - /* threshold event is triggered in finer grain than soft limit */ + /* threshold event is triggered in finer grain than numa info */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { - bool do_softlimit, do_numainfo; + bool do_numainfo; - do_softlimit = mem_cgroup_event_ratelimit(memcg, - MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -732,8 +539,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); - if (unlikely(do_softlimit)) - mem_cgroup_update_tree(memcg, page); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) return margin >> PAGE_SHIFT; } +/** + * mem_cgroup_over_softlimit + * @root: hierarchy root + * @memcg: child of @root to test + * + * Returns %true if @memcg exceeds its own soft limit or contributes + * to the soft limit excess of one of its parents up to and including + * @root. 
+ */ +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, + struct mem_cgroup *memcg) +{ + if (mem_cgroup_disabled()) + return false; + + if (!root) + root = root_mem_cgroup; + + for (; memcg; memcg = parent_mem_cgroup(memcg)) { + /* root_mem_cgroup does not have a soft limit */ + if (memcg == root_mem_cgroup) + break; + if (res_counter_soft_limit_excess(&memcg->res)) + return true; + if (memcg == root) + break; + } + return false; +} + int mem_cgroup_swappiness(struct mem_cgroup *memcg) { struct cgroup *cgrp = memcg->css.cgroup; @@ -1687,64 +1522,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) } #endif -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, - struct zone *zone, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - struct mem_cgroup *victim = NULL; - int total = 0; - int loop = 0; - unsigned long excess; - unsigned long nr_scanned; - struct mem_cgroup_reclaim_cookie reclaim = { - .zone = zone, - .priority = 0, - }; - - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; - - while (1) { - unsigned long nr_reclaimed; - - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); - if (!victim) { - loop++; - if (loop >= 2) { - /* - * If we have not been able to reclaim - * anything, it might because there are - * no reclaimable pages under this hierarchy - */ - if (!total) - break; - /* - * We want to do more targeted reclaim. 
- * excess >> 2 is not to excessive so as to - * reclaim too much, nor too less that we keep - * coming back to reclaim from this cgroup - */ - if (total >= (excess >> 2) || - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) - break; - } - continue; - } - if (!mem_cgroup_reclaimable(victim, false)) - continue; - nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false, - zone, &nr_scanned); - mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed, - nr_scanned, current_is_kswapd()); - total += nr_reclaimed; - *total_scanned += nr_scanned; - if (!res_counter_soft_limit_excess(&root_memcg->res)) - break; - } - mem_cgroup_iter_break(root_memcg, victim); - return total; -} - /* * Check OOM-Killer is already running under our hierarchy. * If someone is running, return false. @@ -2507,8 +2284,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, unlock_page_cgroup(pc); /* * "charge_statistics" updated event counter. Then, check it. - * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. - * if they exceeds softlimit. 
*/ memcg_check_events(memcg, page); } @@ -3578,98 +3353,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, return ret; } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - unsigned long nr_reclaimed = 0; - struct mem_cgroup_per_zone *mz, *next_mz = NULL; - unsigned long reclaimed; - int loop = 0; - struct mem_cgroup_tree_per_zone *mctz; - unsigned long long excess; - unsigned long nr_scanned; - - if (order > 0) - return 0; - - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); - /* - * This loop can run a while, specially if mem_cgroup's continuously - * keep exceeding their soft limit and putting the system under - * pressure - */ - do { - if (next_mz) - mz = next_mz; - else - mz = mem_cgroup_largest_soft_limit_node(mctz); - if (!mz) - break; - - nr_scanned = 0; - reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, - gfp_mask, &nr_scanned); - nr_reclaimed += reclaimed; - *total_scanned += nr_scanned; - spin_lock(&mctz->lock); - - /* - * If we failed to reclaim anything from this memory cgroup - * it is time to move on to the next cgroup - */ - next_mz = NULL; - if (!reclaimed) { - do { - /* - * Loop until we find yet another one. - * - * By the time we get the soft_limit lock - * again, someone might have aded the - * group back on the RB tree. Iterate to - * make sure we get a different mem. - * mem_cgroup_largest_soft_limit_node returns - * NULL if no other cgroup is present on - * the tree - */ - next_mz = - __mem_cgroup_largest_soft_limit_node(mctz); - if (next_mz == mz) - css_put(&next_mz->mem->css); - else /* next_mz == NULL or other memcg */ - break; - } while (1); - } - __mem_cgroup_remove_exceeded(mz->mem, mz, mctz); - excess = res_counter_soft_limit_excess(&mz->mem->res); - /* - * One school of thought says that we should not add - * back the node to the tree if reclaim returns 0. 
- * But our reclaim could return 0, simply because due - * to priority we are exposing a smaller subset of - * memory to reclaim from. Consider this as a longer - * term TODO. - */ - /* If excess == 0, no tree ops */ - __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess); - spin_unlock(&mctz->lock); - css_put(&mz->mem->css); - loop++; - /* - * Could not reclaim anything and there are no more - * mem cgroups to try or we seem to be looping without - * reclaiming anything. - */ - if (!nr_reclaimed && - (next_mz == NULL || - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) - break; - } while (!nr_reclaimed); - if (next_mz) - css_put(&next_mz->mem->css); - return nr_reclaimed; -} - /* * This routine traverse page_cgroup in given list and drop them all. * *And* this routine doesn't reclaim page itself, just removes page_cgroup. @@ -4816,9 +4499,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) mz = &pn->zoneinfo[zone]; for_each_lru(l) INIT_LIST_HEAD(&mz->lruvec.lists[l]); - mz->usage_in_excess = 0; - mz->on_tree = false; - mz->mem = memcg; } memcg->info.nodeinfo[node] = pn; return 0; @@ -4872,7 +4552,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) { int node; - mem_cgroup_remove_from_trees(memcg); free_css_id(&mem_cgroup_subsys, &memcg->css); for_each_node_state(node, N_POSSIBLE) @@ -4927,31 +4606,6 @@ static void __init enable_swap_cgroup(void) } #endif -static int mem_cgroup_soft_limit_tree_init(void) -{ - struct mem_cgroup_tree_per_node *rtpn; - struct mem_cgroup_tree_per_zone *rtpz; - int tmp, node, zone; - - for_each_node_state(node, N_POSSIBLE) { - tmp = node; - if (!node_state(node, N_NORMAL_MEMORY)) - tmp = -1; - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); - if (!rtpn) - return 1; - - soft_limit_tree.rb_tree_per_node[node] = rtpn; - - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - rtpz = &rtpn->rb_tree_per_zone[zone]; - rtpz->rb_root = RB_ROOT; - spin_lock_init(&rtpz->lock); - } - } - return 0; -} - 
static struct cgroup_subsys_state * __ref mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) { @@ -4973,8 +4627,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) enable_swap_cgroup(); parent = NULL; root_mem_cgroup = memcg; - if (mem_cgroup_soft_limit_tree_init()) - goto free_out; for_each_possible_cpu(cpu) { struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); diff --git a/mm/vmscan.c b/mm/vmscan.c index e3fd8a7..4279549 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, .mem_cgroup = memcg, .zone = zone, }; + int epriority = priority; + /* + * Put more pressure on hierarchies that exceed their + * soft limit, to push them back harder than their + * well-behaving siblings. + */ + if (mem_cgroup_over_softlimit(root, memcg)) + epriority = 0; - shrink_mem_cgroup_zone(priority, &mz, sc); + shrink_mem_cgroup_zone(epriority, &mz, sc); mem_cgroup_account_reclaim(root, memcg, sc->nr_reclaimed - nr_reclaimed, @@ -2171,8 +2179,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, { struct zoneref *z; struct zone *zone; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; bool should_abort_reclaim = false; for_each_zone_zonelist_nodemask(zone, z, zonelist, @@ -2205,19 +2211,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, continue; } } - /* - * This steals pages from memory cgroups over softlimit - * and returns the number of reclaimed pages and - * scanned pages. This works for global memory pressure - * and balancing, not for a memcg's limit. 
- */ - nr_soft_scanned = 0; - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - sc->order, sc->gfp_mask, - &nr_soft_scanned); - sc->nr_reclaimed += nr_soft_reclaimed; - sc->nr_scanned += nr_soft_scanned; - /* need some check for avoid more shrink_zone() */ } shrink_zone(priority, zone, sc); @@ -2393,48 +2386,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, } #ifdef CONFIG_CGROUP_MEM_RES_CTLR - -unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, - gfp_t gfp_mask, bool noswap, - struct zone *zone, - unsigned long *nr_scanned) -{ - struct scan_control sc = { - .nr_scanned = 0, - .nr_to_reclaim = SWAP_CLUSTER_MAX, - .may_writepage = !laptop_mode, - .may_unmap = 1, - .may_swap = !noswap, - .order = 0, - .target_mem_cgroup = memcg, - }; - struct mem_cgroup_zone mz = { - .mem_cgroup = memcg, - .zone = zone, - }; - - sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | - (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); - - trace_mm_vmscan_memcg_softlimit_reclaim_begin(0, - sc.may_writepage, - sc.gfp_mask); - - /* - * NOTE: Although we can get the priority field, using it - * here is not a good idea, since it limits the pages we can scan. - * if we don't reclaim here, the shrink_zone from balance_pgdat - * will pick up pages from other mem cgroup's as well. We hack - * the priority and make it zero. - */ - shrink_mem_cgroup_zone(0, &mz, &sc); - - trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); - - *nr_scanned = sc.nr_scanned; - return sc.nr_reclaimed; -} - unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, gfp_t gfp_mask, bool noswap) @@ -2609,8 +2560,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ unsigned long total_scanned; struct reclaim_state *reclaim_state = current->reclaim_state; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .may_unmap = 1, @@ -2701,17 +2650,6 @@ loop_again: continue; sc.nr_scanned = 0; - - nr_soft_scanned = 0; - /* - * Call soft limit reclaim before calling shrink_zone. - */ - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - order, sc.gfp_mask, - &nr_soft_scanned); - sc.nr_reclaimed += nr_soft_reclaimed; - total_scanned += nr_soft_scanned; - /* * We put equal pressure on every zone, unless * one zone has way too many pages free -- 1.7.7.5
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ying Han Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics Date: Tue, 10 Jan 2012 15:54:05 -0800 Message-ID: References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:x-system-of-record:content-type:content-transfer-encoding; bh=ONglsLOIx4YSU92h4dXlArZ5AwsBKwXH6nIMNCiIMXU=; b=onfuOhkxhtQp6qDGa4J9+Xohc1CscadvfjNkfYiAv+gE9vcw+yCkBHWYh716rugxVV 9JcBJkD6kQ239hCgGUPC9SQo0id7ZK9uHnen3l83EXemrI4CdHWcDglp0CKPpYdNqeJM 7qGGU6zVJtDDvRk3UDGxKHtlxFmEXzLM/Wqv8= In-Reply-To: <1326207772-16762-2-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: Johannes Weiner Cc: Andrew Morton ,
Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Thank you for the patch and the stats looks reasonable to me, few
questions as below:

On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> With the single per-zone LRU gone and global reclaim scanning
> individual memcgs, it's straight-forward to collect meaningful and
> accurate per-memcg reclaim statistics.
>
> This adds the following items to memory.stat:

Some of the previous discussions including patches have similar stats
in memory.vmscan_stat API, which collects all the per-memcg vmscan
stats. I would like to understand more why we add into memory.stat
instead, and do we have plan to keep extending memory.stat for those
vmstat like stats?

>
> pgreclaim

Not sure if we want to keep this more consistent to /proc/vmstat, then
it will be "pgsteal"?

> pgscan
>
>  Number of pages reclaimed/scanned from that memcg due to its own
>  hard limit (or physical limit in case of the root memcg) by the
>  allocating task.
>
> kswapd_pgreclaim
> kswapd_pgscan

we have "pgscan_kswapd_*" in vmstat, so maybe ?
"pgsteal_kswapd"
"pgscan_kswapd"

>
>  Reclaim activity from kswapd due to the memcg's own limit.  Only
>  applicable to the root memcg for now since kswapd is only triggered
>  by physical limits, but kswapd-style reclaim based on memcg hard
>  limits is being developped.
>
> hierarchy_pgreclaim
> hierarchy_pgscan
> hierarchy_kswapd_pgreclaim
> hierarchy_kswapd_pgscan

"pgsteal_hierarchy"
"pgsteal_kswapd_hierarchy"
..

No strong option on the naming, but try to make it more consistent to
existing API.

>
>  Reclaim activity due to limitations in one of the memcg's parents.
>
> Signed-off-by: Johannes Weiner
> ---
>  Documentation/cgroups/memory.txt |    4 ++
>  include/linux/memcontrol.h       |   10 +++++
>  mm/memcontrol.c                  |   84 +++++++++++++++++++++++++++++++++++++-
>  mm/vmscan.c                      |    7 +++
>  4 files changed, 103 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index cc0ebc5..eb9e982 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -389,6 +389,10 @@ mapped_file        - # of bytes of mapped file (includes tmpfs/shmem)
>  pgpgin         - # of pages paged in (equivalent to # of charging events).
>  pgpgout                - # of pages paged out (equivalent to # of uncharging events).
>  swap           - # of bytes of swap usage
> +pgreclaim      - # of pages reclaimed due to this memcg's limit
> +pgscan         - # of pages scanned due to this memcg's limit
> +kswapd_*       - # reclaim activity by background daemon due to this memcg's limit
> +hierarchy_*    - # reclaim activity due to pressure from parental memcg
>  inactive_anon  - # of bytes of anonymous memory and swap cache memory on
>                 LRU list.
>  active_anon    - # of bytes of anonymous and swap cache memory on active
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bd3b102..6c1d69e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -121,6 +121,8 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>                                                       struct zone *zone);
>  struct zone_reclaim_stat*
>  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> +void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *,
> +                               unsigned long, unsigned long, bool);
>  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>                                         struct task_struct *p);
>  extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> @@ -347,6 +349,14 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>         return NULL;
>  }
>
> +static inline void mem_cgroup_account_reclaim(struct mem_cgroup *root,
> +                                             struct mem_cgroup *memcg,
> +                                             unsigned long nr_reclaimed,
> +                                             unsigned long nr_scanned,
> +                                             bool kswapd)
> +{
> +}
> +
>  static inline void
>  mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8e2a80d..170dff4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
>         MEM_CGROUP_STAT_NSTATS,
>  };
>
> +#define MEM_CGROUP_EVENTS_KSWAPD 2
> +#define MEM_CGROUP_EVENTS_HIERARCHY 4
> +
>  enum mem_cgroup_events_index {
>         MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
>         MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
>         MEM_CGROUP_EVENTS_COUNT,        /* # of pages paged in/out */
>         MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */
>         MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */
> +       MEM_CGROUP_EVENTS_PGRECLAIM,
> +       MEM_CGROUP_EVENTS_PGSCAN,
> +       MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
> +       MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
> +       MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
> +       MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
> +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
> +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,

missing comment here?
>         MEM_CGROUP_EVENTS_NSTATS,
>  };
>  /*
> @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>         return (memcg == root_mem_cgroup);
>  }
>
> +/**
> + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
> + * @root: memcg that triggered reclaim
> + * @memcg: memcg that is actually being scanned
> + * @nr_reclaimed: number of pages reclaimed from @memcg
> + * @nr_scanned: number of pages scanned from @memcg
> + * @kswapd: whether reclaiming task is kswapd or allocator itself
> + */
> +void mem_cgroup_account_reclaim(struct mem_cgroup *root,
> +                               struct mem_cgroup *memcg,
> +                               unsigned long nr_reclaimed,
> +                               unsigned long nr_scanned,
> +                               bool kswapd)
> +{
> +       unsigned int offset = 0;
> +
> +       if (!root)
> +               root = root_mem_cgroup;
> +
> +       if (kswapd)
> +               offset += MEM_CGROUP_EVENTS_KSWAPD;
> +       if (root != memcg)
> +               offset += MEM_CGROUP_EVENTS_HIERARCHY;

Just to be clear, here root cgroup has hierarchy_* stats always 0 ?
Also, we might want to consider renaming the root here, something like
target? The root is confusing with root_mem_cgroup.
--Ying

> +
> +       preempt_disable();
> +       __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGRECLAIM + offset],
> +                      nr_reclaimed);
> +       __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGSCAN + offset],
> +                      nr_scanned);
> +       preempt_enable();
> +}
> +
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>  {
>         struct mem_cgroup *memcg;
> @@ -1662,6 +1705,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>         excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
>
>         while (1) {
> +               unsigned long nr_reclaimed;
> +
>                 victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
>                 if (!victim) {
>                         loop++;
> @@ -1687,8 +1732,11 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>                 }
>                 if (!mem_cgroup_reclaimable(victim, false))
>                         continue;
> -               total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
> -                                                    zone, &nr_scanned);
> +               nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
> +                                                          zone, &nr_scanned);
> +               mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed,
> +                                          nr_scanned, current_is_kswapd());
> +               total += nr_reclaimed;
>                 *total_scanned += nr_scanned;
>                 if (!res_counter_soft_limit_excess(&root_memcg->res))
>                         break;
> @@ -4023,6 +4071,14 @@ enum {
>         MCS_SWAP,
>         MCS_PGFAULT,
>         MCS_PGMAJFAULT,
> +       MCS_PGRECLAIM,
> +       MCS_PGSCAN,
> +       MCS_KSWAPD_PGRECLAIM,
> +       MCS_KSWAPD_PGSCAN,
> +       MCS_HIERARCHY_PGRECLAIM,
> +       MCS_HIERARCHY_PGSCAN,
> +       MCS_HIERARCHY_KSWAPD_PGRECLAIM,
> +       MCS_HIERARCHY_KSWAPD_PGSCAN,
>         MCS_INACTIVE_ANON,
>         MCS_ACTIVE_ANON,
>         MCS_INACTIVE_FILE,
> @@ -4047,6 +4103,14 @@ struct {
>         {"swap", "total_swap"},
>         {"pgfault", "total_pgfault"},
>         {"pgmajfault", "total_pgmajfault"},
> +       {"pgreclaim", "total_pgreclaim"},
> +       {"pgscan", "total_pgscan"},
> +       {"kswapd_pgreclaim", "total_kswapd_pgreclaim"},
> +       {"kswapd_pgscan", "total_kswapd_pgscan"},
> +       {"hierarchy_pgreclaim", "total_hierarchy_pgreclaim"},
> +       {"hierarchy_pgscan", "total_hierarchy_pgscan"},
> +       {"hierarchy_kswapd_pgreclaim", "total_hierarchy_kswapd_pgreclaim"},
> +       {"hierarchy_kswapd_pgscan", "total_hierarchy_kswapd_pgscan"},
>         {"inactive_anon", "total_inactive_anon"},
>         {"active_anon", "total_active_anon"},
>         {"inactive_file", "total_inactive_file"},
> @@ -4079,6 +4143,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *memcg, struct mcs_total_stat *s)
>         s->stat[MCS_PGFAULT] += val;
>         val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT);
>         s->stat[MCS_PGMAJFAULT] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGRECLAIM);
> +       s->stat[MCS_PGRECLAIM] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGSCAN);
> +       s->stat[MCS_PGSCAN] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM);
> +       s->stat[MCS_KSWAPD_PGRECLAIM] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN);
> +       s->stat[MCS_KSWAPD_PGSCAN] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM);
> +       s->stat[MCS_HIERARCHY_PGRECLAIM] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN);
> +       s->stat[MCS_HIERARCHY_PGSCAN] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM);
> +       s->stat[MCS_HIERARCHY_KSWAPD_PGRECLAIM] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN);
> +       s->stat[MCS_HIERARCHY_KSWAPD_PGSCAN] += val;
>
>         /* per zone stat */
>         val = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON));
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c631234..e3fd8a7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2115,12 +2115,19 @@ static void shrink_zone(int priority, struct zone *zone,
>
>         memcg = mem_cgroup_iter(root, NULL, &reclaim);
>         do {
> +               unsigned long nr_reclaimed = sc->nr_reclaimed;
> +               unsigned long nr_scanned = sc->nr_scanned;
>                 struct mem_cgroup_zone mz = {
>                         .mem_cgroup = memcg,
>                         .zone = zone,
>                 };
>
>                 shrink_mem_cgroup_zone(priority, &mz, sc);
> +
> +               mem_cgroup_account_reclaim(root, memcg,
> +                                          sc->nr_reclaimed - nr_reclaimed,
> +                                          sc->nr_scanned - nr_scanned,
> +                                          current_is_kswapd());
>                 /*
>                  * Limit reclaim has historically picked one memcg and
>                  * scanned it with decreasing priority levels until
> --
> 1.7.7.5
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics
Date: Wed, 11 Jan 2012 01:30:20 +0100
Message-ID: <20120111003020.GD24386@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="iso-8859-1"
To: Ying Han
Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote:
> Thank you for the patch and the stats looks reasonable to me, few
> questions as below:
>
> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> > With the single per-zone LRU gone and global reclaim scanning
> > individual memcgs, it's straight-forward to collect meaningful and
> > accurate per-memcg reclaim statistics.
> >
> > This adds the following items to memory.stat:
>
> Some of the previous discussions including patches have similar stats
> in memory.vmscan_stat API, which collects all the per-memcg vmscan
> stats. I would like to understand more why we add into memory.stat
> instead, and do we have plan to keep extending memory.stat for those
> vmstat like stats?

I think they were put into an extra file in particular to be able to
write to this file to reset the statistics.  But in my opinion, it's
trivial to calculate a delta from before and after running a workload,
so I didn't really like adding kernel code for that.

Did you have another reason for a separate file in mind?

> > pgreclaim
>
> Not sure if we want to keep this more consistent to /proc/vmstat, then
> it will be "pgsteal"?

The problem with that was that we didn't like to call pages stolen
when they were reclaimed from within the cgroup, so we had pgfree for
inner reclaim and pgsteal for outer reclaim, respectively.

I found it cleaner to just go with pgreclaim, it's unambiguous and
straight-forward.  Outer reclaim is designated by the hierarchy_
prefix.

> > pgscan
> >
> >  Number of pages reclaimed/scanned from that memcg due to its own
> >  hard limit (or physical limit in case of the root memcg) by the
> >  allocating task.
> >
> > kswapd_pgreclaim
> > kswapd_pgscan
>
> we have "pgscan_kswapd_*" in vmstat, so maybe ?
> "pgsteal_kswapd"
> "pgscan_kswapd"
>
> >  Reclaim activity from kswapd due to the memcg's own limit.  Only
> >  applicable to the root memcg for now since kswapd is only triggered
> >  by physical limits, but kswapd-style reclaim based on memcg hard
> >  limits is being developped.
> >
> > hierarchy_pgreclaim
> > hierarchy_pgscan
> > hierarchy_kswapd_pgreclaim
> > hierarchy_kswapd_pgscan
>
> "pgsteal_hierarchy"
> "pgsteal_kswapd_hierarchy"
> ..
>
> No strong option on the naming, but try to make it more consistent to
> existing API.

I swear I tried, but the existing naming is pretty screwed up :(

For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
scan rates of direct reclaim vs. kswapd reclaim.  To get the total
number of pages reclaimed, you sum them up.

On the other hand, pgsteal_* does not differentiate between direct
reclaim and kswapd, so to get direct reclaim numbers, you add up the
pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
which is in turn not available at zone granularity.

> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4

These two function as namespaces, that's why I put hierarchy_ and
kswapd_ at the beginning of the names.

Given that we have kswapd_steal, would you be okay with doing it like
this?  I mean, at least my naming conforms to ONE of the standards in
/proc/vmstat, right? ;-)

> > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
> >         MEM_CGROUP_STAT_NSTATS,
> >  };
> >
> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
> > +
> >  enum mem_cgroup_events_index {
> >         MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
> >         MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
> >         MEM_CGROUP_EVENTS_COUNT,        /* # of pages paged in/out */
> >         MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */
> >         MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */
> > +       MEM_CGROUP_EVENTS_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_PGSCAN,
> > +       MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,
>
> missing comment here?

As if the lines weren't long enough already ;-) I'll add some.

> >         MEM_CGROUP_EVENTS_NSTATS,
> >  };
> >  /*
> > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> >         return (memcg == root_mem_cgroup);
> >  }
> >
> > +/**
> > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
> > + * @root: memcg that triggered reclaim
> > + * @memcg: memcg that is actually being scanned
> > + * @nr_reclaimed: number of pages reclaimed from @memcg
> > + * @nr_scanned: number of pages scanned from @memcg
> > + * @kswapd: whether reclaiming task is kswapd or allocator itself
> > + */
> > +void mem_cgroup_account_reclaim(struct mem_cgroup *root,
> > +                               struct mem_cgroup *memcg,
> > +                               unsigned long nr_reclaimed,
> > +                               unsigned long nr_scanned,
> > +                               bool kswapd)
> > +{
> > +       unsigned int offset = 0;
> > +
> > +       if (!root)
> > +               root = root_mem_cgroup;
> > +
> > +       if (kswapd)
> > +               offset += MEM_CGROUP_EVENTS_KSWAPD;
> > +       if (root != memcg)
> > +               offset += MEM_CGROUP_EVENTS_HIERARCHY;
>
> Just to be clear, here root cgroup has hierarchy_* stats always 0 ?

That's correct, there can't be any hierarchical pressure on the
topmost parent.

> Also, we might want to consider renaming the root here, something like
> target? The root is confusing with root_mem_cgroup.

It's the same naming scheme I used for the iterator functions
(mem_cgroup_iter() and friends), so if we change it, I'd like to
change it consistently.

Having target and memcg as parameters is even more confusing and
non-descriptive, IMO.
Other places use mem_over_limit, which is a bit better, but quite
long.

Any other ideas for great names for parameters that designate a
hierarchy root and a memcg in that hierarchy?

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ying Han
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Date: Wed, 11 Jan 2012 13:42:31 -0800
Message-ID:
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:x-system-of-record:content-type:content-transfer-encoding; bh=4K+LcF4B35gHp4ksLXR/xch0zTZIw3OeXAykzWjsThI=; b=USr7kdy4/c1LG6nb1f+gZ9cpF/QnmNKt6VPFUknVFyS2fCjv+MInTwCjpUbe9m5Vlg 9+X92xUDEPqoH7Ns0Jyv+zj/Dp4OzSAJ5Ma1hRoU5GSJy7AtUQgL6AQ4iKKmVqQE5YAu JsNTQCMoYg0QWAsxC7yL4OFOnAua8O27OA0vI=
In-Reply-To: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="iso-8859-1"
To: Johannes Weiner
Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits.  Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim.  The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
>
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit.  With the same end result of scanning
> everybody, but soft limit excessors a bit more.
>
> Advantages:
>
>  o smoother reclaim: soft limit reclaim is a separate stage before
>    global reclaim, whose result is not communicated down the line and
>    so overreclaim of the groups in excess is very likely.  After this
>    patch, soft limit reclaim is fully integrated into regular reclaim
>    and each memcg is considered exactly once per cycle.
>
>  o true hierarchy support: soft limits are only considered when
>    kswapd does global reclaim, but after this patch, targetted
>    reclaim of a memcg will mind the soft limit settings of its child
>    groups.

Why do we add soft limit reclaim to target reclaim?

Based on the discussions, my understanding is that the soft limit only
takes effect while the whole machine is under memory contention. We
don't want to add extra pressure on a cgroup if there is free memory on
the system, even if the cgroup is above its limit.

>
>  o code size: soft limit reclaim requires a lot of code to maintain
>    the per-node per-zone rb-trees to quickly find the biggest
>    offender, dedicated paths for soft limit reclaim etc. while this
>    new implementation gets away without all that.
>
> Test:
>
> The test consists of two concurrent kernel build jobs in separate
> source trees, the master and the slave.  The two jobs get along nicely
> on 600MB of available memory, so this is the zero overcommit control
> case.  When available memory is decreased, the overcommit is
> compensated by decreasing the soft limit of the slave by the same
> amount, in the hope that the slave takes the hit and the master stays
> unaffected.
>
>                                  600M-0M-vanilla         600M-0M-patched
> Master walltime (s)                  552.65 (  +0.00%)       552.38 (  -0.05%)
> Master walltime (stddev)               1.25 (  +0.00%)         0.92 ( -14.66%)
> Master major faults                  204.38 (  +0.00%)       205.38 (  +0.49%)
> Master major faults (stddev)          27.16 (  +0.00%)        13.80 ( -47.43%)
> Master reclaim                        31.88 (  +0.00%)        37.75 ( +17.87%)
> Master reclaim (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> Master scan                           31.88 (  +0.00%)        37.75 ( +17.87%)
> Master scan (stddev)                  34.01 (  +0.00%)        75.88 (+119.59%)
> Master kswapd reclaim              33922.12 (  +0.00%)     33887.12 (  -0.10%)
> Master kswapd reclaim (stddev)       969.08 (  +0.00%)       492.22 ( -49.16%)
> Master kswapd scan                 34085.75 (  +0.00%)     33985.75 (  -0.29%)
> Master kswapd scan (stddev)         1101.07 (  +0.00%)       563.33 ( -48.79%)
> Slave walltime (s)                   552.68 (  +0.00%)       552.12 (  -0.10%)
> Slave walltime (stddev)                0.79 (  +0.00%)         1.05 ( +14.76%)
> Slave major faults                   212.50 (  +0.00%)       204.50 (  -3.75%)
> Slave major faults (stddev)           26.90 (  +0.00%)        13.17 ( -49.20%)
> Slave reclaim                         26.12 (  +0.00%)        35.00 ( +32.72%)
> Slave reclaim (stddev)                29.42 (  +0.00%)        74.91 (+149.55%)
> Slave scan                            31.38 (  +0.00%)        35.00 ( +11.20%)
> Slave scan (stddev)                   33.31 (  +0.00%)        74.91 (+121.24%)
> Slave kswapd reclaim               34259.00 (  +0.00%)     33469.88 (  -2.30%)
> Slave kswapd reclaim (stddev)        925.15 (  +0.00%)       565.07 ( -38.88%)
> Slave kswapd scan                  34354.62 (  +0.00%)     33555.75 (  -2.33%)
> Slave kswapd scan (stddev)           969.62 (  +0.00%)       581.70 ( -39.97%)
>
> In the control case, the differences in elapsed time, number of major
> faults taken, and reclaim statistics are within the noise for both the
> master and the slave job.

What's the soft limit setting in the control case?

I assume it is the default RESOURCE_MAX. In that case both master and
slave get equal pressure before and after the patch, and no difference
in the stats should be observed.

>                                  600M-280M-vanilla       600M-280M-patched
> Master walltime (s)                  595.13 (  +0.00%)       553.19 (  -7.04%)
> Master walltime (stddev)               8.31 (  +0.00%)         2.57 ( -61.64%)
> Master major faults                 3729.75 (  +0.00%)       783.25 ( -78.98%)
> Master major faults (stddev)         258.79 (  +0.00%)       226.68 ( -12.36%)
> Master reclaim                       705.00 (  +0.00%)        29.50 ( -95.68%)
> Master reclaim (stddev)              232.87 (  +0.00%)        44.72 ( -80.45%)
> Master scan                          714.88 (  +0.00%)        30.00 ( -95.67%)
> Master scan (stddev)                 237.44 (  +0.00%)        45.39 ( -80.54%)
> Master kswapd reclaim                114.75 (  +0.00%)        50.00 ( -55.94%)
> Master kswapd reclaim (stddev)       128.51 (  +0.00%)         9.45 ( -91.93%)
> Master kswapd scan                   115.75 (  +0.00%)        50.00 ( -56.32%)
> Master kswapd scan (stddev)          130.31 (  +0.00%)         9.45 ( -92.04%)
> Slave walltime (s)                   631.18 (  +0.00%)       577.68 (  -8.46%)
> Slave walltime (stddev)                9.89 (  +0.00%)         3.63 ( -57.47%)
> Slave major faults                 28401.75 (  +0.00%)     14656.75 ( -48.39%)
> Slave major faults (stddev)         2629.97 (  +0.00%)      1911.81 ( -27.30%)
> Slave reclaim                      65400.62 (  +0.00%)      1479.62 ( -97.74%)
> Slave reclaim (stddev)             11623.02 (  +0.00%)      1482.13 ( -87.24%)
> Slave scan                       9050047.88 (  +0.00%)     95968.25 ( -98.94%)
> Slave scan (stddev)              1912786.94 (  +0.00%)     93390.71 ( -95.12%)
> Slave kswapd reclaim              327894.50 (  +0.00%)    227099.88 ( -30.74%)
> Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)     16113.14 ( -27.71%)
> Slave kswapd scan               34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
> Slave kswapd scan (stddev)       2523642.98 (  +0.00%)    156754.74 ( -93.79%)
>
> Here, the available memory is limited to 320 MB, the machine is
> overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> of the slave merely 20 MB.
>
> Looking at the slave job first, it is much better off with the patched
> kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> a third.
> The result is far fewer major faults taken, which in turn
> lets the job finish quicker.

What is the hard limit setting here? Does "direct reclaim" refer to
per-memcg direct reclaim or to global direct reclaim?

>
> It would be a zero-sum game if the improvement happened at the cost of
> the master but looking at the numbers, even the master performs better
> with the patched kernel.  In fact, the master job is almost unaffected
> on the patched kernel compared to the control case.

That makes sense, since in this example the master job is less affected
by the patch than the slave job.

In the control case, if both master and slave have a soft limit of
RESOURCE_MAX, they are under equal memory pressure (priority =
DEF_PRIORITY). In the second example, only the pressure on the slave is
increased, by scanning it with priority = 0, while the master is still
scanned with pretty much the same priority = DEF_PRIORITY.

So I would expect to see more reclaim activity in the slave on the
patched kernel compared to the control case. That seems to match the
test results.

>
> This is an odd phenomenon, as the patch does not directly change how
> the master is reclaimed.  An explanation for this is that the severe
> overreclaim of the slave in the unpatched kernel results in the master
> growing bigger than in the patched case.  Combining the fact that
> memcgs are scanned according to their size with the increased refault
> rate of the overreclaimed slave triggering global reclaim more often
> means that overall pressure on the master job is higher in the
> unpatched kernel.

We can check the master's memory.usage_in_bytes while the job is running.

On the other hand, I don't see why we would expect the master to be
reclaimed less in the control case. On the unpatched kernel, the master
is reclaimed under global pressure each time anyway, since we ignore the
return value of soft limit reclaim.
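
To put rough numbers on the priority argument above, here is a minimal user-space sketch. It assumes that the number of pages scanned per LRU list is the list size shifted right by the reclaim priority, and that DEF_PRIORITY is 12; both match my reading of mm/vmscan.c and include/linux/mmzone.h of that era, but treat them as assumptions. The helper names scan_target() and effective_priority() are made up for illustration and are not kernel functions.

```c
#include <assert.h>

/* Assumed value of DEF_PRIORITY (include/linux/mmzone.h). */
#define DEF_PRIORITY 12

/*
 * Hypothetical helper: number of pages reclaim would look at on an LRU
 * list of lru_pages pages when scanning at the given priority.
 */
static unsigned long scan_target(unsigned long lru_pages, int priority)
{
	return lru_pages >> priority;
}

/*
 * Hypothetical helper: the policy described in the quoted changelog.
 * A memcg above its soft limit is scanned at priority 0, everybody
 * else at the priority of the current reclaim cycle.
 */
static int effective_priority(int over_softlimit, int priority)
{
	return over_softlimit ? 0 : priority;
}
```

Under these assumptions, an LRU list of 1 << 20 pages is scanned 256 pages at a time at DEF_PRIORITY, while a soft limit excessor has the whole list exposed to reclaim in one pass.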
>
> At any rate, the patched kernel seems to do a much better job at both
> overall resource allocation under soft limit overcommit as well as the
> requested prioritization of the master job.
>
> Signed-off-by: Johannes Weiner
> ---
>  include/linux/memcontrol.h |   18 +--
>  mm/memcontrol.c            |  412 ++++----------------------------------------
>  mm/vmscan.c                |   80 +--------
>  3 files changed, 48 insertions(+), 462 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c1d69e..72368b7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>  						      struct zone *zone);
>  struct zone_reclaim_stat*
>  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *);

Maybe something like "mem_cgroup_over_soft_limit()"?

>  void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *,
>  				unsigned long, unsigned long, bool);
>  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> @@ -155,9 +156,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  	mem_cgroup_update_page_stat(page, idx, -1);
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -						gfp_t gfp_mask,
> -						unsigned long *total_scanned);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *memcg);
>
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
> @@ -362,22 +360,20 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_inc_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +static inline bool
> +mem_cgroup_over_softlimit(struct mem_cgroup *root, struct mem_cgroup *memcg)
>  {
> +	return false;
>  }
>
> -static inline void mem_cgroup_dec_page_stat(struct page *page,
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>  					    enum mem_cgroup_page_stat_item idx)
>  {
>  }
>
> -static inline
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -					    gfp_t gfp_mask,
> -					    unsigned long *total_scanned)
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> +					    enum mem_cgroup_page_stat_item idx)
>  {
> -	return 0;
>  }
>
>  static inline
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 170dff4..d4f7ae5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -35,7 +35,6 @@
>  #include
>  #include
>  #include
> -#include
>  #include
>  #include
>  #include
> @@ -118,12 +117,10 @@ enum mem_cgroup_events_index {
>   */
>  enum mem_cgroup_events_target {
>  	MEM_CGROUP_TARGET_THRESH,
> -	MEM_CGROUP_TARGET_SOFTLIMIT,
>  	MEM_CGROUP_TARGET_NUMAINFO,
>  	MEM_CGROUP_NTARGETS,
>  };
>  #define THRESHOLDS_EVENTS_TARGET (128)
> -#define SOFTLIMIT_EVENTS_TARGET (1024)
>  #define NUMAINFO_EVENTS_TARGET (1024)
>
>  struct mem_cgroup_stat_cpu {
> @@ -149,12 +146,6 @@ struct mem_cgroup_per_zone {
>  	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
>
>  	struct zone_reclaim_stat reclaim_stat;
> -	struct rb_node		tree_node;	/* RB tree node */
> -	unsigned long long	usage_in_excess;/* Set to the value by which */
> -						/* the soft limit is exceeded*/
> -	bool			on_tree;
> -	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
> -						/* use container_of	   */
>  };
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
> @@ -167,26 +158,6 @@ struct mem_cgroup_lru_info {
>  	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
>  };
>
> -/*
> - * Cgroups above their limits are maintained in a RB-Tree, independent of
> - * their hierarchy representation
> - */
> -
> -struct mem_cgroup_tree_per_zone {
> -	struct rb_root rb_root;
> -	spinlock_t lock;
> -};
> -
> -struct mem_cgroup_tree_per_node {
> -	struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
> -};
> -
> -struct mem_cgroup_tree {
> -	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
> -};
> -
> -static struct mem_cgroup_tree soft_limit_tree __read_mostly;
> -
>  struct mem_cgroup_threshold {
>  	struct eventfd_ctx *eventfd;
>  	u64 threshold;
> @@ -343,7 +314,6 @@ static bool move_file(void)
>   * limit reclaim to prevent infinite loops, if they ever occur.
>   */
>  #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(100)
> -#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)

You might need to remove the comment above as well.

>
>  enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> @@ -398,164 +368,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page)
>  	return mem_cgroup_zoneinfo(memcg, nid, zid);
>  }
>
> -static struct mem_cgroup_tree_per_zone *
> -soft_limit_tree_node_zone(int nid, int zid)
> -{
> -	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> -}
> -
> -static struct mem_cgroup_tree_per_zone *
> -soft_limit_tree_from_page(struct page *page)
> -{
> -	int nid = page_to_nid(page);
> -	int zid = page_zonenum(page);
> -
> -	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> -}
> -
> -static void
> -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg,
> -				struct mem_cgroup_per_zone *mz,
> -				struct mem_cgroup_tree_per_zone *mctz,
> -				unsigned long long new_usage_in_excess)
> -{
> -	struct rb_node **p = &mctz->rb_root.rb_node;
> -	struct rb_node *parent = NULL;
> -	struct mem_cgroup_per_zone *mz_node;
> -
> -	if (mz->on_tree)
> -		return;
> -
> -	mz->usage_in_excess = new_usage_in_excess;
> -	if (!mz->usage_in_excess)
> -		return;
> -	while (*p) {
> -		parent = *p;
> -		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
> -					tree_node);
> -		if (mz->usage_in_excess < mz_node->usage_in_excess)
> -			p = &(*p)->rb_left;
> -		/*
> -		 * We can't avoid mem cgroups that are over their soft
> -		 * limit by the same amount
> -		 */
> -		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
> -			p = &(*p)->rb_right;
> -	}
> -	rb_link_node(&mz->tree_node, parent, p);
> -	rb_insert_color(&mz->tree_node, &mctz->rb_root);
> -	mz->on_tree = true;
> -}
> -
> -static void
> -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
> -				struct mem_cgroup_per_zone *mz,
> -				struct mem_cgroup_tree_per_zone *mctz)
> -{
> -	if (!mz->on_tree)
> -		return;
> -	rb_erase(&mz->tree_node, &mctz->rb_root);
> -	mz->on_tree = false;
> -}
> -
> -static void
> -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
> -				struct mem_cgroup_per_zone *mz,
> -				struct mem_cgroup_tree_per_zone *mctz)
> -{
> -	spin_lock(&mctz->lock);
> -	__mem_cgroup_remove_exceeded(memcg, mz, mctz);
> -	spin_unlock(&mctz->lock);
> -}
> -
> -
> -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
> -{
> -	unsigned long long excess;
> -	struct mem_cgroup_per_zone *mz;
> -	struct mem_cgroup_tree_per_zone *mctz;
> -	int nid = page_to_nid(page);
> -	int zid = page_zonenum(page);
> -	mctz = soft_limit_tree_from_page(page);
> -
> -	/*
> -	 * Necessary to update all ancestors when hierarchy is used.
> -	 * because their event counter is not touched.
> -	 */
> -	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> -		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> -		excess = res_counter_soft_limit_excess(&memcg->res);
> -		/*
> -		 * We have to update the tree if mz is on RB-tree or
> -		 * mem is over its softlimit.
> -		 */
> -		if (excess || mz->on_tree) {
> -			spin_lock(&mctz->lock);
> -			/* if on-tree, remove it */
> -			if (mz->on_tree)
> -				__mem_cgroup_remove_exceeded(memcg, mz, mctz);
> -			/*
> -			 * Insert again. mz->usage_in_excess will be updated.
> -			 * If excess is 0, no tree ops.
> -			 */
> -			__mem_cgroup_insert_exceeded(memcg, mz, mctz, excess);
> -			spin_unlock(&mctz->lock);
> -		}
> -	}
> -}
> -
> -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
> -{
> -	int node, zone;
> -	struct mem_cgroup_per_zone *mz;
> -	struct mem_cgroup_tree_per_zone *mctz;
> -
> -	for_each_node_state(node, N_POSSIBLE) {
> -		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> -			mz = mem_cgroup_zoneinfo(memcg, node, zone);
> -			mctz = soft_limit_tree_node_zone(node, zone);
> -			mem_cgroup_remove_exceeded(memcg, mz, mctz);
> -		}
> -	}
> -}
> -
> -static struct mem_cgroup_per_zone *
> -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> -{
> -	struct rb_node *rightmost = NULL;
> -	struct mem_cgroup_per_zone *mz;
> -
> -retry:
> -	mz = NULL;
> -	rightmost = rb_last(&mctz->rb_root);
> -	if (!rightmost)
> -		goto done;		/* Nothing to reclaim from */
> -
> -	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
> -	/*
> -	 * Remove the node now but someone else can add it back,
> -	 * we will to add it back at the end of reclaim to its correct
> -	 * position in the tree.
> -	 */
> -	__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
> -	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
> -		!css_tryget(&mz->mem->css))
> -		goto retry;
> -done:
> -	return mz;
> -}
> -
> -static struct mem_cgroup_per_zone *
> -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> -{
> -	struct mem_cgroup_per_zone *mz;
> -
> -	spin_lock(&mctz->lock);
> -	mz = __mem_cgroup_largest_soft_limit_node(mctz);
> -	spin_unlock(&mctz->lock);
> -	return mz;
> -}
> -
>  /*
>   * Implementation Note: reading percpu statistics for memcg.
>   *
> @@ -696,9 +508,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>  		case MEM_CGROUP_TARGET_THRESH:
>  			next = val + THRESHOLDS_EVENTS_TARGET;
>  			break;
> -		case MEM_CGROUP_TARGET_SOFTLIMIT:
> -			next = val + SOFTLIMIT_EVENTS_TARGET;
> -			break;
>  		case MEM_CGROUP_TARGET_NUMAINFO:
>  			next = val + NUMAINFO_EVENTS_TARGET;
>  			break;
> @@ -718,13 +527,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>  static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>  {
>  	preempt_disable();
> -	/* threshold event is triggered in finer grain than soft limit */
> +	/* threshold event is triggered in finer grain than numa info */
>  	if (unlikely(mem_cgroup_event_ratelimit(memcg,
>  						MEM_CGROUP_TARGET_THRESH))) {
> -		bool do_softlimit, do_numainfo;
> +		bool do_numainfo;
>
> -		do_softlimit = mem_cgroup_event_ratelimit(memcg,
> -						MEM_CGROUP_TARGET_SOFTLIMIT);
>  #if MAX_NUMNODES > 1
>  		do_numainfo = mem_cgroup_event_ratelimit(memcg,
>  						MEM_CGROUP_TARGET_NUMAINFO);
> @@ -732,8 +539,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>  		preempt_enable();
>
>  		mem_cgroup_threshold(memcg);
> -		if (unlikely(do_softlimit))
> -			mem_cgroup_update_tree(memcg, page);
>  #if MAX_NUMNODES > 1
>  		if (unlikely(do_numainfo))
>  			atomic_inc(&memcg->numainfo_events);
> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  	return margin >> PAGE_SHIFT;
>  }
>
> +/**
> + * mem_cgroup_over_softlimit
> + * @root: hierarchy root
> + * @memcg: child of @root to test
> + *
> + * Returns %true if @memcg exceeds its own soft limit or contributes
> + * to the soft limit excess of one of its parents up to and including
> + * @root.
> + */
> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> +			       struct mem_cgroup *memcg)
> +{
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	if (!root)
> +		root = root_mem_cgroup;
> +
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		/* root_mem_cgroup does not have a soft limit */
> +		if (memcg == root_mem_cgroup)
> +			break;
> +		if (res_counter_soft_limit_excess(&memcg->res))
> +			return true;
> +		if (memcg == root)
> +			break;
> +	}

Here it adds pressure on a cgroup if one of its parents exceeds its
soft limit, even though the cgroup itself is under its soft limit. That
changes my understanding of soft limits, and it might introduce
regressions in our existing use cases.

Here is an example:

Machine capacity is 32G and we over-commit by 8G.

root
  -> A (hard limit 20G, soft limit 15G, usage 16G)
       -> A1 (soft limit 5G, usage 4G)
       -> A2 (soft limit 10G, usage 12G)
  -> B (hard limit 20G, soft limit 10G, usage 16G)

Under global reclaim, we don't want to add pressure on A1 even though
its parent A exceeds its soft limit. If we assume that the soft limit is
set to match each cgroup's working set size (hot memory), this would
introduce a regression for A1.

In my existing implementation, I check the cgroup's soft limit
standalone, without looking at its ancestors.

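
The difference between the two checks can be tried out on the example hierarchy with a small user-space model. This is only a sketch: struct toy_memcg and the toy_* functions are hypothetical stand-ins for struct mem_cgroup, res_counter_soft_limit_excess() and parent_mem_cgroup(), sizes are in GB, and the "standalone" variant is my reading of the behaviour described above, not code from either implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup; sizes in GB for readability. */
struct toy_memcg {
	const char *name;
	struct toy_memcg *parent;	/* NULL for the top-level root */
	unsigned long soft_limit;
	unsigned long usage;
};

static bool toy_soft_limit_excess(const struct toy_memcg *memcg)
{
	return memcg->usage > memcg->soft_limit;
}

/* Hierarchical check, following the loop in the quoted patch. */
static bool toy_over_softlimit_hier(const struct toy_memcg *root,
				    const struct toy_memcg *memcg)
{
	for (; memcg; memcg = memcg->parent) {
		/* The top-level root does not have a soft limit. */
		if (!memcg->parent)
			break;
		if (toy_soft_limit_excess(memcg))
			return true;
		if (memcg == root)
			break;
	}
	return false;
}

/* Standalone check: only the cgroup's own soft limit counts. */
static bool toy_over_softlimit_standalone(const struct toy_memcg *memcg)
{
	return memcg->parent && toy_soft_limit_excess(memcg);
}

/* The hierarchy from the example above. */
static struct toy_memcg toy_root = { .name = "root", .usage = 32 };
static struct toy_memcg toy_a  = { .name = "A",  .parent = &toy_root,
				   .soft_limit = 15, .usage = 16 };
static struct toy_memcg toy_a1 = { .name = "A1", .parent = &toy_a,
				   .soft_limit = 5,  .usage = 4 };
static struct toy_memcg toy_a2 = { .name = "A2", .parent = &toy_a,
				   .soft_limit = 10, .usage = 12 };
static struct toy_memcg toy_b  = { .name = "B",  .parent = &toy_root,
				   .soft_limit = 10, .usage = 16 };
```

In this model A2 and B are over their soft limit under both checks. A1 is the only group where they disagree: the hierarchical walk reports it as over the limit because its parent A is in excess, while the standalone check does not.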
> +	return false;
> +}
> +
>  int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
>  	struct cgroup *cgrp = memcg->css.cgroup;
> @@ -1687,64 +1522,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
>  }
>  #endif
>
> -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> -				   struct zone *zone,
> -				   gfp_t gfp_mask,
> -				   unsigned long *total_scanned)
> -{
> -	struct mem_cgroup *victim = NULL;
> -	int total = 0;
> -	int loop = 0;
> -	unsigned long excess;
> -	unsigned long nr_scanned;
> -	struct mem_cgroup_reclaim_cookie reclaim = {
> -		.zone = zone,
> -		.priority = 0,
> -	};
> -
> -	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
> -
> -	while (1) {
> -		unsigned long nr_reclaimed;
> -
> -		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
> -		if (!victim) {
> -			loop++;
> -			if (loop >= 2) {
> -				/*
> -				 * If we have not been able to reclaim
> -				 * anything, it might because there are
> -				 * no reclaimable pages under this hierarchy
> -				 */
> -				if (!total)
> -					break;
> -				/*
> -				 * We want to do more targeted reclaim.
> -				 * excess >> 2 is not to excessive so as to
> -				 * reclaim too much, nor too less that we keep
> -				 * coming back to reclaim from this cgroup
> -				 */
> -				if (total >= (excess >> 2) ||
> -					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> -					break;
> -			}
> -			continue;
> -		}
> -		if (!mem_cgroup_reclaimable(victim, false))
> -			continue;
> -		nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
> -							   zone, &nr_scanned);
> -		mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed,
> -					   nr_scanned, current_is_kswapd());
> -		total += nr_reclaimed;
> -		*total_scanned += nr_scanned;
> -		if (!res_counter_soft_limit_excess(&root_memcg->res))
> -			break;
> -	}
> -	mem_cgroup_iter_break(root_memcg, victim);
> -	return total;
> -}
> -
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
> @@ -2507,8 +2284,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>  	unlock_page_cgroup(pc);
>  	/*
>  	 * "charge_statistics" updated event counter. Then, check it.
> -	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> -	 * if they exceeds softlimit.
>  	 */
>  	memcg_check_events(memcg, page);
>  }
> @@ -3578,98 +3353,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  	return ret;
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -					    gfp_t gfp_mask,
> -					    unsigned long *total_scanned)
> -{
> -	unsigned long nr_reclaimed = 0;
> -	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
> -	unsigned long reclaimed;
> -	int loop = 0;
> -	struct mem_cgroup_tree_per_zone *mctz;
> -	unsigned long long excess;
> -	unsigned long nr_scanned;
> -
> -	if (order > 0)
> -		return 0;
> -
> -	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
> -	/*
> -	 * This loop can run a while, specially if mem_cgroup's continuously
> -	 * keep exceeding their soft limit and putting the system under
> -	 * pressure
> -	 */
> -	do {
> -		if (next_mz)
> -			mz = next_mz;
> -		else
> -			mz = mem_cgroup_largest_soft_limit_node(mctz);
> -		if (!mz)
> -			break;
> -
> -		nr_scanned = 0;
> -		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone,
> -						    gfp_mask, &nr_scanned);
> -		nr_reclaimed += reclaimed;
> -		*total_scanned += nr_scanned;
> -		spin_lock(&mctz->lock);
> -
> -		/*
> -		 * If we failed to reclaim anything from this memory cgroup
> -		 * it is time to move on to the next cgroup
> -		 */
> -		next_mz = NULL;
> -		if (!reclaimed) {
> -			do {
> -				/*
> -				 * Loop until we find yet another one.
> -				 *
> -				 * By the time we get the soft_limit lock
> -				 * again, someone might have aded the
> -				 * group back on the RB tree. Iterate to
> -				 * make sure we get a different mem.
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* mem_cg= roup_largest_soft_limit_node returns > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* NULL i= f no other cgroup is present on > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* the tr= ee > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*/ > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 next_mz =3D > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 __mem_cgrou= p_largest_soft_limit_node(mctz); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (next_mz= =3D=3D mz) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 css_put(&next_mz->mem->css); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 else /* nex= t_mz =3D=3D NULL or other memcg */ > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 break; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 } while (1); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 } > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 __mem_cgroup_remove_exceeded(mz->mem, mz, m= ctz); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 excess =3D res_counter_soft_limit_excess(&m= z->mem->res); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* One school of thought says that we sho= uld not add > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* back the node to the tree if reclaim r= eturns 0. > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* But our reclaim could return 0, simply= because due > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* to priority we are exposing a smaller = subset of > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* memory to reclaim from. Consider this = as a longer > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* term TODO. 
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*/ > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* If excess =3D=3D 0, no tree ops */ > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 __mem_cgroup_insert_exceeded(mz->mem, mz, m= ctz, excess); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 spin_unlock(&mctz->lock); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 css_put(&mz->mem->css); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 loop++; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* Could not reclaim anything and there a= re no more > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* mem cgroups to try or we seem to be lo= oping without > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* reclaiming anything. > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*/ > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (!nr_reclaimed && > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 (next_mz =3D=3D NULL || > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 loop > MEM_CGROUP_MAX_SOFT_= LIMIT_RECLAIM_LOOPS)) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break; > - =A0 =A0 =A0 } while (!nr_reclaimed); > - =A0 =A0 =A0 if (next_mz) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 css_put(&next_mz->mem->css); > - =A0 =A0 =A0 return nr_reclaimed; > -} > - > =A0/* > =A0* This routine traverse page_cgroup in given list and drop them all. > =A0* *And* this routine doesn't reclaim page itself, just removes page_cg= roup. 
> @@ -4816,9 +4499,6 @@ static int alloc_mem_cgroup_per_zone_info(struct me= m_cgroup *memcg, int node) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0mz =3D &pn->zoneinfo[zone]; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0for_each_lru(l) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0INIT_LIST_HEAD(&mz->lruvec= .lists[l]); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 mz->usage_in_excess =3D 0; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 mz->on_tree =3D false; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 mz->mem =3D memcg; > =A0 =A0 =A0 =A0} > =A0 =A0 =A0 =A0memcg->info.nodeinfo[node] =3D pn; > =A0 =A0 =A0 =A0return 0; > @@ -4872,7 +4552,6 @@ static void __mem_cgroup_free(struct mem_cgroup *me= mcg) > =A0{ > =A0 =A0 =A0 =A0int node; > > - =A0 =A0 =A0 mem_cgroup_remove_from_trees(memcg); > =A0 =A0 =A0 =A0free_css_id(&mem_cgroup_subsys, &memcg->css); > > =A0 =A0 =A0 =A0for_each_node_state(node, N_POSSIBLE) > @@ -4927,31 +4606,6 @@ static void __init enable_swap_cgroup(void) > =A0} > =A0#endif > > -static int mem_cgroup_soft_limit_tree_init(void) > -{ > - =A0 =A0 =A0 struct mem_cgroup_tree_per_node *rtpn; > - =A0 =A0 =A0 struct mem_cgroup_tree_per_zone *rtpz; > - =A0 =A0 =A0 int tmp, node, zone; > - > - =A0 =A0 =A0 for_each_node_state(node, N_POSSIBLE) { > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 tmp =3D node; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (!node_state(node, N_NORMAL_MEMORY)) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tmp =3D -1; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 rtpn =3D kzalloc_node(sizeof(*rtpn), GFP_KE= RNEL, tmp); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (!rtpn) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 return 1; > - > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 soft_limit_tree.rb_tree_per_node[node] =3D = rtpn; > - > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 for (zone =3D 0; zone < MAX_NR_ZONES; zone+= +) { > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 rtpz =3D &rtpn->rb_tree_per= _zone[zone]; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 rtpz->rb_root =3D RB_ROOT; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 spin_lock_init(&rtpz->lock)= ; > - 
=A0 =A0 =A0 =A0 =A0 =A0 =A0 } > - =A0 =A0 =A0 } > - =A0 =A0 =A0 return 0; > -} > - > =A0static struct cgroup_subsys_state * __ref > =A0mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > =A0{ > @@ -4973,8 +4627,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct = cgroup *cont) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0enable_swap_cgroup(); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0parent =3D NULL; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0root_mem_cgroup =3D memcg; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (mem_cgroup_soft_limit_tree_init()) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto free_out; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0for_each_possible_cpu(cpu) { > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0struct memcg_stock_pcp *st= ock =3D > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0&per_cpu(memcg_stock, cpu); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index e3fd8a7..4279549 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone = *zone, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0.mem_cgroup =3D memcg, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0.zone =3D zone, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0}; > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 int epriority =3D priority; > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* Put more pressure on hierarchies that = exceed their > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* soft limit, to push them back harder t= han their > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* well-behaving siblings. 
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*/ > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (mem_cgroup_over_softlimit(root, memcg)) > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 epriority =3D 0; > > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 shrink_mem_cgroup_zone(priority, &mz, sc); > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 shrink_mem_cgroup_zone(epriority, &mz, sc); > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0mem_cgroup_account_reclaim(root, memcg, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 sc->nr_reclaimed - nr_reclaimed, > @@ -2171,8 +2179,6 @@ static bool shrink_zones(int priority, struct zonel= ist *zonelist, > =A0{ > =A0 =A0 =A0 =A0struct zoneref *z; > =A0 =A0 =A0 =A0struct zone *zone; > - =A0 =A0 =A0 unsigned long nr_soft_reclaimed; > - =A0 =A0 =A0 unsigned long nr_soft_scanned; > =A0 =A0 =A0 =A0bool should_abort_reclaim =3D false; > > =A0 =A0 =A0 =A0for_each_zone_zonelist_nodemask(zone, z, zonelist, > @@ -2205,19 +2211,6 @@ static bool shrink_zones(int priority, struct zone= list *zonelist, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0continue; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0} > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0} > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* This steals pages from= memory cgroups over softlimit > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* and returns the number= of reclaimed pages and > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* scanned pages. This wo= rks for global memory pressure > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* and balancing, not for= a memcg's limit. 
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*/ > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 nr_soft_scanned =3D 0; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 nr_soft_reclaimed =3D mem_c= group_soft_limit_reclaim(zone, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 sc->order, sc->gfp_mask, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 &nr_soft_scanned); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 sc->nr_reclaimed +=3D nr_so= ft_reclaimed; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 sc->nr_scanned +=3D nr_soft= _scanned; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* need some check for avoi= d more shrink_zone() */ > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0} > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0shrink_zone(priority, zone, sc); > @@ -2393,48 +2386,6 @@ unsigned long try_to_free_pages(struct zonelist *z= onelist, int order, > =A0} > > =A0#ifdef CONFIG_CGROUP_MEM_RES_CTLR > - > -unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 gfp_t gfp_mask, bool noswap, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 struct zone *zone, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 unsigned long *nr_scanned) > -{ > - =A0 =A0 =A0 struct scan_control sc =3D { > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .nr_scanned =3D 0, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .nr_to_reclaim =3D SWAP_CLUSTER_MAX, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .may_writepage =3D !laptop_mode, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .may_unmap =3D 1, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .may_swap =3D !noswap, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .order =3D 0, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .target_mem_cgroup =3D memcg, > - =A0 =A0 =A0 }; > - =A0 =A0 =A0 struct mem_cgroup_zone mz =3D { > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .mem_cgroup 
=3D memcg, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 .zone =3D zone, > - =A0 =A0 =A0 }; > - > - =A0 =A0 =A0 sc.gfp_mask =3D (gfp_mask & GFP_RECLAIM_MASK) | > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 (GFP_HIGHUSER_MOVABLE & ~GF= P_RECLAIM_MASK); > - > - =A0 =A0 =A0 trace_mm_vmscan_memcg_softlimit_reclaim_begin(0, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 sc.may_writepage, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 sc.gfp_mask); > - > - =A0 =A0 =A0 /* > - =A0 =A0 =A0 =A0* NOTE: Although we can get the priority field, using it > - =A0 =A0 =A0 =A0* here is not a good idea, since it limits the pages we = can scan. > - =A0 =A0 =A0 =A0* if we don't reclaim here, the shrink_zone from balance= _pgdat > - =A0 =A0 =A0 =A0* will pick up pages from other mem cgroup's as well. We= hack > - =A0 =A0 =A0 =A0* the priority and make it zero. > - =A0 =A0 =A0 =A0*/ > - =A0 =A0 =A0 shrink_mem_cgroup_zone(0, &mz, &sc); > - > - =A0 =A0 =A0 trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed= ); > - > - =A0 =A0 =A0 *nr_scanned =3D sc.nr_scanned; > - =A0 =A0 =A0 return sc.nr_reclaimed; > -} > - > =A0unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 gfp_t gfp_mask, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 bool noswap) > @@ -2609,8 +2560,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat= , int order, > =A0 =A0 =A0 =A0int end_zone =3D 0; =A0 =A0 =A0 /* Inclusive. 
=A00 =3D ZON= E_DMA */ > =A0 =A0 =A0 =A0unsigned long total_scanned; > =A0 =A0 =A0 =A0struct reclaim_state *reclaim_state =3D current->reclaim_s= tate; > - =A0 =A0 =A0 unsigned long nr_soft_reclaimed; > - =A0 =A0 =A0 unsigned long nr_soft_scanned; > =A0 =A0 =A0 =A0struct scan_control sc =3D { > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0.gfp_mask =3D GFP_KERNEL, > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0.may_unmap =3D 1, > @@ -2701,17 +2650,6 @@ loop_again: > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0continue; > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0sc.nr_scanned =3D 0; > - > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 nr_soft_scanned =3D 0; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* Call soft limit reclai= m before calling shrink_zone. > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*/ > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 nr_soft_reclaimed =3D mem_c= group_soft_limit_reclaim(zone, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 order, sc.gfp_mask, > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 &nr_soft_scanned); > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 sc.nr_reclaimed +=3D nr_sof= t_reclaimed; > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 total_scanned +=3D nr_soft_= scanned; > - > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/* > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 * We put equal pressure o= n every zone, unless > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 * one zone has way too ma= ny pages free > -- > 1.7.7.5 > --Ying -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ying Han
Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics
Date: Wed, 11 Jan 2012 14:33:59 -0800
Message-ID:
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
 <1326207772-16762-2-git-send-email-hannes@cmpxchg.org>
 <20120111003020.GD24386@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Return-path:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma;
 h=mime-version:in-reply-to:references:date:message-id:subject:from:to
 :cc:x-system-of-record:content-type:content-transfer-encoding;
 bh=KBEmbxZnlwSVIcEKYV0xMRyJTqukuLn9IXZjZPiswJ0=;
 b=FZelVT5Djgok8Hr556zfOePJUMyjqpLTZNsAhCLGc0vBUmtwicjAgscBS3tusN+V4J
 EITbQkrFDfClwI4SjAGGWXFKxip4u5iS9blDKN5F/2ujcmaWZQg34bmUsYFRFIZ+C9Vd
 pLTlucNMRNFliFZv9NizS2vgRJZjfJUU8zFHw=
In-Reply-To: <20120111003020.GD24386@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="iso-8859-1"
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
 cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Tue, Jan 10, 2012 at 4:30 PM, Johannes Weiner wrote:
> On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote:
>> Thank you for the patch and the stats looks reasonable to me, few
>> questions as below:
>>
>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
>> > With the single per-zone LRU gone and global reclaim scanning
>> > individual memcgs, it's straight-forward to collect meaningful and
>> > accurate per-memcg reclaim statistics.
>> >
>> > This adds the following items to memory.stat:
>>
>> Some of the previous discussions including patches have similar stats
>> in memory.vmscan_stat API, which collects all the per-memcg vmscan
>> stats. I would like to understand more why we add into memory.stat
>> instead, and do we have plan to keep extending memory.stat for those
>> vmstat like stats?
>
> I think they were put into an extra file in particular to be able to
> write to this file to reset the statistics.  But in my opinion, it's
> trivial to calculate a delta from before and after running a workload,
> so I didn't really like adding kernel code for that.
>
> Did you have another reason for a separate file in mind?

Another reason I had them in a separate file is that it is easier to
extend. I don't know if we plan to have something like memory.vmstat,
or to just keep adding stuff to memory.stat. In general, I wanted to
keep memory.stat at a reasonable size, including only the basic
statistics. In my existing vmscan_stat patch, I have breakdowns of the
reclaim stats into file/anon, which would make memory.stat even
larger.

>> > pgreclaim
>>
>> Not sure if we want to keep this more consistent to /proc/vmstat, then
>> it will be "pgsteal"?
>
> The problem with that was that we didn't like to call pages stolen
> when they were reclaimed from within the cgroup, so we had pgfree for
> inner reclaim and pgsteal for outer reclaim, respectively.
>
> I found it cleaner to just go with pgreclaim, it's unambiguous and
> straight-forward.  Outer reclaim is designated by the hierarchy_
> prefix.
>
>> > pgscan
>> >
>> >  Number of pages reclaimed/scanned from that memcg due to its own
>> >  hard limit (or physical limit in case of the root memcg) by the
>> >  allocating task.
>> >
>> > kswapd_pgreclaim
>> > kswapd_pgscan
>>
>> we have "pgscan_kswapd_*" in vmstat, so maybe ?
>> "pgsteal_kswapd"
>> "pgscan_kswapd"
>>
>> >  Reclaim activity from kswapd due to the memcg's own limit.  Only
>> >  applicable to the root memcg for now since kswapd is only triggered
>> >  by physical limits, but kswapd-style reclaim based on memcg hard
>> >  limits is being developped.
>> >
>> > hierarchy_pgreclaim
>> > hierarchy_pgscan
>> > hierarchy_kswapd_pgreclaim
>> > hierarchy_kswapd_pgscan
>>
>> "pgsteal_hierarchy"
>> "pgsteal_kswapd_hierarchy"
>> ..
>>
>> No strong option on the naming, but try to make it more consistent to
>> existing API.
>
> I swear I tried, but the existing naming is pretty screwed up :(
>
> For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
> scan rates of direct reclaim vs. kswapd reclaim.  To get the total
> number of pages reclaimed, you sum them up.
>
> On the other hand, pgsteal_* does not differentiate between direct
> reclaim and kswapd, so to get direct reclaim numbers, you add up the
> pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
> which is in turn not available at zone granularity.

Agreed, and that always confuses me.

>
>> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
>> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
>
> These two function as namespaces, that's why I put hierarchy_ and
> kswapd_ at the beginning of the names.
>
> Given that we have kswapd_steal, would you be okay with doing it like
> this?  I mean, at least my naming conforms to ONE of the standards in
> /proc/vmstat, right? ;-)

I don't have much of a problem with the existing naming scheme, as
long as we document it well and make it less confusing.

>
>> > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
>> >  	MEM_CGROUP_STAT_NSTATS,
>> >  };
>> >
>> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
>> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
>> > +
>> >  enum mem_cgroup_events_index {
>> >  	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
>> >  	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
>> >  	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
>> >  	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
>> >  	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
>> > +	MEM_CGROUP_EVENTS_PGRECLAIM,
>> > +	MEM_CGROUP_EVENTS_PGSCAN,
>> > +	MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
>> > +	MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
>> > +	MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
>> > +	MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
>> > +	MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
>> > +	MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,
>>
>> missing comment here?
>
> As if the lines weren't long enough already ;-) I'll add some.

Thanks.

>
>> >  	MEM_CGROUP_EVENTS_NSTATS,
>> >  };
>> >  /*
>> > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>> >  	return (memcg == root_mem_cgroup);
>> >  }
>> >
>> > +/**
>> > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
>> > + * @root: memcg that triggered reclaim
>> > + * @memcg: memcg that is actually being scanned
>> > + * @nr_reclaimed: number of pages reclaimed from @memcg
>> > + * @nr_scanned: number of pages scanned from @memcg
>> > + * @kswapd: whether reclaiming task is kswapd or allocator itself
>> > + */
>> > +void mem_cgroup_account_reclaim(struct mem_cgroup *root,
>> > +				struct mem_cgroup *memcg,
>> > +				unsigned long nr_reclaimed,
>> > +				unsigned long nr_scanned,
>> > +				bool kswapd)
>> > +{
>> > +	unsigned int offset = 0;
>> > +
>> > +	if (!root)
>> > +		root = root_mem_cgroup;
>> > +
>> > +	if (kswapd)
>> > +		offset += MEM_CGROUP_EVENTS_KSWAPD;
>> > +	if (root != memcg)
>> > +		offset += MEM_CGROUP_EVENTS_HIERARCHY;
>>
>> Just to be clear, here root cgroup has hierarchy_* stats always 0 ?
>
> That's correct, there can't be any hierarchical pressure on the
> topmost parent.

Thank you for clarifying.

>
>> Also, we might want to consider renaming the root here, something like
>> target? The root is confusing with root_mem_cgroup.
>
> It's the same naming scheme I used for the iterator functions
> (mem_cgroup_iter() and friends), so if we change it, I'd like to
> change it consistently.

That sounds good, and the change is separate from this effort.
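(Illustrative sketch, not part of the original message.) The quoted hunk stops after the offset is computed, so the following is only a guess at how the two namespace offsets pick one of the eight new counters. It assumes the counters are indexed as base plus offset, which the order of the quoted enum suggests. `struct group`, `root_group`, `account_reclaim()` and the `EV_*` names are hypothetical stand-ins for the kernel's types and are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Namespace offsets, with the values from the quoted patch. */
#define EVENTS_KSWAPD		2
#define EVENTS_HIERARCHY	4

/* Only the reclaim counters; the order mirrors the quoted enum. */
enum reclaim_event {
	EV_PGRECLAIM,
	EV_PGSCAN,
	EV_KSWAPD_PGRECLAIM,
	EV_KSWAPD_PGSCAN,
	EV_HIERARCHY_PGRECLAIM,
	EV_HIERARCHY_PGSCAN,
	EV_HIERARCHY_KSWAPD_PGRECLAIM,
	EV_HIERARCHY_KSWAPD_PGSCAN,
	EV_NSTATS,
};

/* Hypothetical stand-in for struct mem_cgroup and its event counters. */
struct group {
	unsigned long events[EV_NSTATS];
};

/* Hypothetical stand-in for root_mem_cgroup. */
static struct group root_group;

static void account_reclaim(struct group *root, struct group *memcg,
			    unsigned long nr_reclaimed,
			    unsigned long nr_scanned, bool kswapd)
{
	unsigned int offset = 0;

	if (!root)
		root = &root_group;	/* no target memcg: global reclaim */
	if (kswapd)
		offset += EVENTS_KSWAPD;
	if (root != memcg)
		offset += EVENTS_HIERARCHY;

	/* Assumption: base index plus offset selects the counter. */
	memcg->events[EV_PGRECLAIM + offset] += nr_reclaimed;
	memcg->events[EV_PGSCAN + offset] += nr_scanned;
}
```

Under that assumption, a group that is reclaimed because of its own limit never has the hierarchy offset applied, which matches the answer above that the root memcg's hierarchy_* counters stay at 0.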
>
> Having target and memcg as parameters is even more confusing and
> non-descriptive, IMO.
>
> Other places use mem_over_limit, which is a bit better, but quite
> long.
>
> Any other ideas for great names for parameters that designate a
> hierarchy root and a memcg in that hierarchy?

I don't have a better name than "target", which matches the naming
in scan_control as well. Or in this case, we can avoid passing both
target and memcg by doing something like:

+static inline void mem_cgroup_account_reclaim(
+				struct mem_cgroup *memcg,
+				unsigned long nr_reclaimed,
+				unsigned long nr_scanned,
+				bool kswapd,
+				bool hierarchy)
+{
+}
+

+	mem_cgroup_account_reclaim(victim, nr_reclaimed,
+				nr_scanned, current_is_kswapd(),
+				target != victim);

Then we would need to do something about the root_mem_cgroup case
before that call. Just a thought.

--Ying

From mboxrd@z Thu Jan 1 00:00:00 1970
From: KAMEZAWA Hiroyuki
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Date: Thu, 12 Jan 2012 10:54:27 +0900
Message-ID: <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
 <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1326207772-16762-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Balbir Singh, Ying Han,
 cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
 linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, 10 Jan 2012 16:02:52 +0100
Johannes Weiner wrote:

> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits.  Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim.  The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
>
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit.  With the same end result of scanning
> everybody, but soft limit excessors a bit more.
>
> Advantages:
>
>  o smoother reclaim: soft limit reclaim is a separate stage before
>    global reclaim, whose result is not communicated down the line and
>    so overreclaim of the groups in excess is very likely.  After this
>    patch, soft limit reclaim is fully integrated into regular reclaim
>    and each memcg is considered exactly once per cycle.
>
>  o true hierarchy support: soft limits are only considered when
>    kswapd does global reclaim, but after this patch, targetted
>    reclaim of a memcg will mind the soft limit settings of its child
>    groups.
>
>  o code size: soft limit reclaim requires a lot of code to maintain
>    the per-node per-zone rb-trees to quickly find the biggest
>    offender, dedicated paths for soft limit reclaim etc. while this
>    new implementation gets away without all that.
>
> Test:
>
> The test consists of two concurrent kernel build jobs in separate
> source trees, the master and the slave.  The two jobs get along nicely
> on 600MB of available memory, so this is the zero overcommit control
> case.  When available memory is decreased, the overcommit is
> compensated by decreasing the soft limit of the slave by the same
> amount, in the hope that the slave takes the hit and the master stays
> unaffected.
>
>                                       600M-0M-vanilla        600M-0M-patched
> Master walltime (s)                 552.65 (  +0.00%)      552.38 (  -0.05%)
> Master walltime (stddev)              1.25 (  +0.00%)        0.92 ( -14.66%)
> Master major faults                 204.38 (  +0.00%)      205.38 (  +0.49%)
> Master major faults (stddev)         27.16 (  +0.00%)       13.80 ( -47.43%)
> Master reclaim                       31.88 (  +0.00%)       37.75 ( +17.87%)
> Master reclaim (stddev)              34.01 (  +0.00%)       75.88 (+119.59%)
> Master scan                          31.88 (  +0.00%)       37.75 ( +17.87%)
> Master scan (stddev)                 34.01 (  +0.00%)       75.88 (+119.59%)
> Master kswapd reclaim             33922.12 (  +0.00%)    33887.12 (  -0.10%)
> Master kswapd reclaim (stddev)      969.08 (  +0.00%)      492.22 ( -49.16%)
> Master kswapd scan                34085.75 (  +0.00%)    33985.75 (  -0.29%)
> Master kswapd scan (stddev)        1101.07 (  +0.00%)      563.33 ( -48.79%)
> Slave walltime (s)                  552.68 (  +0.00%)      552.12 (  -0.10%)
> Slave walltime (stddev)               0.79 (  +0.00%)        1.05 ( +14.76%)
> Slave major faults                  212.50 (  +0.00%)      204.50 (  -3.75%)
> Slave major faults (stddev)          26.90 (  +0.00%)       13.17 ( -49.20%)
> Slave reclaim                        26.12 (  +0.00%)       35.00 ( +32.72%)
> Slave reclaim (stddev)               29.42 (  +0.00%)       74.91 (+149.55%)
> Slave scan                           31.38 (  +0.00%)       35.00 ( +11.20%)
> Slave scan (stddev)                  33.31 (  +0.00%)       74.91 (+121.24%)
> Slave kswapd reclaim              34259.00 (  +0.00%)    33469.88 (  -2.30%)
> Slave kswapd reclaim (stddev)       925.15 (  +0.00%)      565.07 ( -38.88%)
> Slave kswapd scan                 34354.62 (  +0.00%)    33555.75 (  -2.33%)
> Slave kswapd scan (stddev)          969.62 (  +0.00%)      581.70 ( -39.97%)
>
> In the control case, the differences in elapsed time, number of major
> faults taken, and reclaim statistics are within the noise for both the
> master and the slave job.
>
>                                     600M-280M-vanilla      600M-280M-patched
> Master walltime (s)                 595.13 (  +0.00%)      553.19 (  -7.04%)
> Master walltime (stddev)              8.31 (  +0.00%)        2.57 ( -61.64%)
> Master major faults                3729.75 (  +0.00%)      783.25 ( -78.98%)
> Master major faults (stddev)        258.79 (  +0.00%)      226.68 ( -12.36%)
> Master reclaim                      705.00 (  +0.00%)       29.50 ( -95.68%)
> Master reclaim (stddev)             232.87 (  +0.00%)       44.72 ( -80.45%)
> Master scan                         714.88 (  +0.00%)       30.00 ( -95.67%)
> Master scan (stddev)                237.44 (  +0.00%)       45.39 ( -80.54%)
> Master kswapd reclaim               114.75 (  +0.00%)       50.00 ( -55.94%)
> Master kswapd reclaim (stddev)      128.51 (  +0.00%)        9.45 ( -91.93%)
> Master kswapd scan                  115.75 (  +0.00%)       50.00 ( -56.32%)
> Master kswapd scan (stddev)         130.31 (  +0.00%)        9.45 ( -92.04%)
> Slave walltime (s)                  631.18 (  +0.00%)      577.68 (  -8.46%)
> Slave walltime (stddev)               9.89 (  +0.00%)        3.63 ( -57.47%)
> Slave major faults                28401.75 (  +0.00%)    14656.75 ( -48.39%)
> Slave major faults (stddev)        2629.97 (  +0.00%)     1911.81 ( -27.30%)
> Slave reclaim                     65400.62 (  +0.00%)     1479.62 ( -97.74%)
> Slave reclaim (stddev)            11623.02 (  +0.00%)     1482.13 ( -87.24%)
> Slave scan                      9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> Slave scan (stddev)             1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> Slave kswapd reclaim             327894.50 (  +0.00%)   227099.88 ( -30.74%)
> Slave kswapd reclaim (stddev)     22289.43 (  +0.00%)    16113.14 ( -27.71%)
> Slave kswapd scan              34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> Slave kswapd scan (stddev)      2523642.98 (  +0.00%)   156754.74 ( -93.79%)
>
> Here, the available memory is limited to 320 MB, the machine is
> overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> of the slave merely 20 MB.
>
> Looking at the slave job first, it is much better off with the patched
> kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> a third.  The result is much fewer major faults taken, which in turn
> lets the job finish quicker.
>
> It would be a zero-sum game if the improvement happened at the cost of
> the master but looking at the numbers, even the master performs better
> with the patched kernel.  In fact, the master job is almost unaffected
> on the patched kernel compared to the control case.
>
> This is an odd phenomenon, as the patch does not directly change how
> the master is reclaimed.  An explanation for this is that the severe
> overreclaim of the slave in the unpatched kernel results in the master
> growing bigger than in the patched case.  Combining the fact that
> memcgs are scanned according to their size with the increased refault
> rate of the overreclaimed slave triggering global reclaim more often
> means that overall pressure on the master job is higher in the
> unpatched kernel.
>
> At any rate, the patched kernel seems to do a much better job at both
> overall resource allocation under soft limit overcommit as well as the
> requested prioritization of the master job.
>
> Signed-off-by: Johannes Weiner

Thank you for your work. The results look attractive and the code is
much simpler.

My small concerns are:

1. This approach may increase the latency of direct reclaim because of
   priority=0.

2. In the case where a NUMA-spread/interleave application runs in its
   own container, pages on a node may be paged out again and again
   because of priority=0 if some other application runs on that node.
   It seems difficult to use soft limits with NUMA-aware applications.
   Do you have suggestions?
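
(Illustrative sketch, not part of the original message.) Concern 1 can be made concrete with a simplified model of how the reclaim priority scales the amount of scanning. It assumes that the scan target for an LRU list is roughly the list size shifted right by the priority, and that a reclaim cycle starts at DEF_PRIORITY (12 in kernels of that era). The real calculation in mm/vmscan.c also applies anon/file balancing, which is left out here. `effective_priority()` and `scan_target()` are hypothetical helper names; only the "priority becomes 0 when over the soft limit" rule comes from the quoted shrink_zone() hunk.

```c
#include <stdbool.h>

/* Assumed default starting priority of a reclaim cycle. */
#define DEF_PRIORITY 12

/* Priority used for one memcg: forced to 0 if it is over its soft limit. */
static int effective_priority(int priority, bool over_soft_limit)
{
	return over_soft_limit ? 0 : priority;
}

/* Simplified scan target: the LRU size shifted right by the priority. */
static unsigned long scan_target(unsigned long lru_pages, int priority)
{
	return lru_pages >> priority;
}
```

In this model, a list of 1048576 pages gets a scan target of 256 pages at DEF_PRIORITY, but the full 1048576 pages once the group is over its soft limit. A direct reclaimer that hits such a group would do that work in its own context, which is where the latency concern comes from.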
Thanks,
-Kame

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Date: Thu, 12 Jan 2012 09:59:04 +0100
Message-ID: <20120112085904.GG24386@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="iso-8859-1"
To: Ying Han
Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> > Right now, memcg soft limits are implemented by having a sorted tree
> > of memcgs that are in excess of their limits.  Under global memory
> > pressure, kswapd first reclaims from the biggest excessor and then
> > proceeds to do regular global reclaim.  The result of this is that
> > pages are reclaimed from all memcgs, but more scanning happens against
> > those above their soft limit.
> >
> > With global reclaim doing memcg-aware hierarchical reclaim by default,
> > this is a lot easier to implement: everytime a memcg is reclaimed
> > from, scan more aggressively (per tradition with a priority of 0) if
> > it's above its soft limit.  With the same end result of scanning
> > everybody, but soft limit excessors a bit more.
> >
> > Advantages:
> >
> >  o smoother reclaim: soft limit reclaim is a separate stage before
> >    global reclaim, whose result is not communicated down the line and
> >    so overreclaim of the groups in excess is very likely.  After this
> >    patch, soft limit reclaim is fully integrated into regular reclaim
> >    and each memcg is considered exactly once per cycle.
> >
> >  o true hierarchy support: soft limits are only considered when
> >    kswapd does global reclaim, but after this patch, targetted
> >    reclaim of a memcg will mind the soft limit settings of its child
> >    groups.
>
> Why we add soft limit reclaim into target reclaim?

        -> A hard limit 10G, usage 10G
           -> A1 soft limit 8G, usage 5G
           -> A2 soft limit 2G, usage 5G

When A hits its hard limit, A2 will experience more pressure than A1.

Soft limits are already applied hierarchically: the memcg that is
picked from the tree is reclaimed hierarchically.  What I wanted to
add is the soft limit also being /triggerable/ from non-global
hierarchy levels.

> Based on the discussions, my understanding is that the soft limit only
> take effect while the whole machine is under memory contention. We
> don't want to add extra pressure on a cgroup if there is free memory
> on the system even the cgroup is above its limit.

If a hierarchy is under pressure, we will reclaim that hierarchy.  We
allow groups to be prioritized under global pressure, why not allow it
for local pressure as well?

I am not quite sure what you are objecting to.

> >  o code size: soft limit reclaim requires a lot of code to maintain
> >    the per-node per-zone rb-trees to quickly find the biggest
> >    offender, dedicated paths for soft limit reclaim etc. while this
> >    new implementation gets away without all that.
> >
> > Test:
> >
> > The test consists of two concurrent kernel build jobs in separate
> > source trees, the master and the slave.  The two jobs get along nicely
> > on 600MB of available memory, so this is the zero overcommit control
> > case.
> > When available memory is decreased, the overcommit is
> > compensated by decreasing the soft limit of the slave by the same
> > amount, in the hope that the slave takes the hit and the master stays
> > unaffected.
> >
> >                                        600M-0M-vanilla         600M-0M-patched
> > Master walltime (s)                  552.65 (  +0.00%)       552.38 (  -0.05%)
> > Master walltime (stddev)               1.25 (  +0.00%)         0.92 ( -14.66%)
> > Master major faults                  204.38 (  +0.00%)       205.38 (  +0.49%)
> > Master major faults (stddev)          27.16 (  +0.00%)        13.80 ( -47.43%)
> > Master reclaim                        31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master reclaim (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> > Master scan                           31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master scan (stddev)                  34.01 (  +0.00%)        75.88 (+119.59%)
> > Master kswapd reclaim              33922.12 (  +0.00%)     33887.12 (  -0.10%)
> > Master kswapd reclaim (stddev)       969.08 (  +0.00%)       492.22 ( -49.16%)
> > Master kswapd scan                 34085.75 (  +0.00%)     33985.75 (  -0.29%)
> > Master kswapd scan (stddev)         1101.07 (  +0.00%)       563.33 ( -48.79%)
> > Slave walltime (s)                   552.68 (  +0.00%)       552.12 (  -0.10%)
> > Slave walltime (stddev)                0.79 (  +0.00%)         1.05 ( +14.76%)
> > Slave major faults                   212.50 (  +0.00%)       204.50 (  -3.75%)
> > Slave major faults (stddev)           26.90 (  +0.00%)        13.17 ( -49.20%)
> > Slave reclaim                         26.12 (  +0.00%)        35.00 ( +32.72%)
> > Slave reclaim (stddev)                29.42 (  +0.00%)        74.91 (+149.55%)
> > Slave scan                            31.38 (  +0.00%)        35.00 ( +11.20%)
> > Slave scan (stddev)                   33.31 (  +0.00%)        74.91 (+121.24%)
> > Slave kswapd reclaim               34259.00 (  +0.00%)     33469.88 (  -2.30%)
> > Slave kswapd reclaim (stddev)        925.15 (  +0.00%)       565.07 ( -38.88%)
> > Slave kswapd scan                  34354.62 (  +0.00%)     33555.75 (  -2.33%)
> > Slave kswapd scan (stddev)           969.62 (  +0.00%)       581.70 ( -39.97%)
> >
> > In the control case, the differences in elapsed time, number of major
> > faults taken, and reclaim statistics are within the noise for both the
> > master and the slave job.
>
> What's the soft limit setting in the controlled case?

300MB for both jobs.

> I assume it is the default RESOURCE_MAX. So both Master and Slave get
> equal pressure before/after the patch, and no differences on the stats
> should be observed.

Yes.  The control case demonstrates that both jobs can fit
comfortably, don't compete for space and that in general the patch
does not have unexpected negative impact (after all, it modifies
codepaths that were invoked regularly outside of reclaim).
> >                                      600M-280M-vanilla       600M-280M-patched
> > Master walltime (s)                  595.13 (  +0.00%)       553.19 (  -7.04%)
> > Master walltime (stddev)               8.31 (  +0.00%)         2.57 ( -61.64%)
> > Master major faults                 3729.75 (  +0.00%)       783.25 ( -78.98%)
> > Master major faults (stddev)         258.79 (  +0.00%)       226.68 ( -12.36%)
> > Master reclaim                       705.00 (  +0.00%)        29.50 ( -95.68%)
> > Master reclaim (stddev)              232.87 (  +0.00%)        44.72 ( -80.45%)
> > Master scan                          714.88 (  +0.00%)        30.00 ( -95.67%)
> > Master scan (stddev)                 237.44 (  +0.00%)        45.39 ( -80.54%)
> > Master kswapd reclaim                114.75 (  +0.00%)        50.00 ( -55.94%)
> > Master kswapd reclaim (stddev)       128.51 (  +0.00%)         9.45 ( -91.93%)
> > Master kswapd scan                   115.75 (  +0.00%)        50.00 ( -56.32%)
> > Master kswapd scan (stddev)          130.31 (  +0.00%)         9.45 ( -92.04%)
> > Slave walltime (s)                   631.18 (  +0.00%)       577.68 (  -8.46%)
> > Slave walltime (stddev)                9.89 (  +0.00%)         3.63 ( -57.47%)
> > Slave major faults                 28401.75 (  +0.00%)     14656.75 ( -48.39%)
> > Slave major faults (stddev)         2629.97 (  +0.00%)      1911.81 ( -27.30%)
> > Slave reclaim                      65400.62 (  +0.00%)      1479.62 ( -97.74%)
> > Slave reclaim (stddev)             11623.02 (  +0.00%)      1482.13 ( -87.24%)
> > Slave scan                       9050047.88 (  +0.00%)     95968.25 ( -98.94%)
> > Slave scan (stddev)              1912786.94 (  +0.00%)     93390.71 ( -95.12%)
> > Slave kswapd reclaim              327894.50 (  +0.00%)    227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)     16113.14 ( -27.71%)
> > Slave kswapd scan               34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)    156754.74 ( -93.79%)
> >
> > Here, the available memory is limited to 320 MB, the machine is
> > overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> > of the slave merely 20 MB.
> >
> > Looking at the slave job first, it is much better off with the patched
> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> > a third.  The result is much fewer major faults taken, which in turn
> > lets the job finish quicker.
>
> What's the setting of the hard limit here? Is the direct reclaim
> referring to per-memcg directly reclaim or global one.

The machine's memory is limited to 600M, the hard limits are unset.
All reclaim is a result of global memory pressure.

With the patched kernel, I could have used a dedicated parent cgroup
and let master and slave run in children of this group, the soft
limits would be taken into account just the same.  But this does not
work on the unpatched kernel, as soft limits are only recognized on
the global level there.

> > It would be a zero-sum game if the improvement happened at the cost of
> > the master but looking at the numbers, even the master performs better
> > with the patched kernel.  In fact, the master job is almost unaffected
> > on the patched kernel compared to the control case.
>
> It makes sense since the master job get less affected by the patch
> than the slave job under the example. Under the control case, if both
> master and slave have RESOURCE_MAX soft limit setting, they are under
> equal memory pressure(priority = DEF_PRIORITY) . On the second
> example, only the slave pressure being increased by priority = 0, and
> the Master got scanned with same priority = DEF_PRIORITY pretty much.
>
> So I would expect to see more reclaim activities happens in slave on
> the patched kernel compared to the control case. It seems match the
> testing result.

Uhm,

> > Slave reclaim                      65400.62 (  +0.00%)      1479.62 ( -97.74%)
> > Slave reclaim (stddev)             11623.02 (  +0.00%)      1482.13 ( -87.24%)
> > Slave scan                       9050047.88 (  +0.00%)     95968.25 ( -98.94%)
> > Slave scan (stddev)              1912786.94 (  +0.00%)     93390.71 ( -95.12%)
> > Slave kswapd reclaim              327894.50 (  +0.00%)    227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)     16113.14 ( -27.71%)
> > Slave kswapd scan               34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)    156754.74 ( -93.79%)

Direct reclaim _shrunk_ by 98%, kswapd reclaim by 31%.

> > This is an odd phenomenon, as the patch does not directly change how
> > the master is reclaimed.  An explanation for this is that the severe
> > overreclaim of the slave in the unpatched kernel results in the master
> > growing bigger than in the patched case.  Combining the fact that
> > memcgs are scanned according to their size with the increased refault
> > rate of the overreclaimed slave triggering global reclaim more often
> > means that overall pressure on the master job is higher in the
> > unpatched kernel.
>
> We can check the Master memory.usage_in_bytes while the job is running.

Yep, the plots of cache/rss over time confirmed exactly this.  The
unpatched kernel shows higher spikes in the size of the master job
followed by deeper pits when reclaim kicked in.  The patched kernel is
much smoother in that regard.

> On the other hand, I don't see why we expect the Master being less
> reclaimed in the controlled case? On the unpatched kernel, the Master
> is being reclaimed under global pressure each time anyway since we
> ignore the return value of softlimit.

I didn't expect that, I expected both jobs to perform equally in the
control case.  And in the pressurized case, the master being
unaffected and the slave taking the hit.  The patched kernel does
this, the unpatched one does not.

> > @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> >                                                       struct zone *zone);
> >  struct zone_reclaim_stat*
> >  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *);
>
> Maybe something like "mem_cgroup_over_soft_limit()" ?

Probably more consistent, yeah.  Will do.

> > @@ -343,7 +314,6 @@ static bool move_file(void)
> >   * limit reclaim to prevent infinite loops, if they ever occur.
> >   */
> >  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100)
> > -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
>
> You might need to remove the comment above as well.

Oops, will fix.
> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >         return margin >> PAGE_SHIFT;
> >  }
> >
> > +/**
> > + * mem_cgroup_over_softlimit
> > + * @root: hierarchy root
> > + * @memcg: child of @root to test
> > + *
> > + * Returns %true if @memcg exceeds its own soft limit or contributes
> > + * to the soft limit excess of one of its parents up to and including
> > + * @root.
> > + */
> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> > +                              struct mem_cgroup *memcg)
> > +{
> > +       if (mem_cgroup_disabled())
> > +               return false;
> > +
> > +       if (!root)
> > +               root = root_mem_cgroup;
> > +
> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > +               /* root_mem_cgroup does not have a soft limit */
> > +               if (memcg == root_mem_cgroup)
> > +                       break;
> > +               if (res_counter_soft_limit_excess(&memcg->res))
> > +                       return true;
> > +               if (memcg == root)
> > +                       break;
> > +       }
>
> Here it adds pressure on a cgroup if one of its parents exceeds soft
> limit, although the cgroup itself is under soft limit. It does change
> my understanding of soft limit, and might introduce regression of our
> existing use cases.
>
> Here is an example:
>
> Machine capacity 32G and we over-commit by 8G.
>
> root
> -> A (hard limit 20G, soft limit 15G, usage 16G)
>    -> A1 (soft limit 5G, usage 4G)
>    -> A2 (soft limit 10G, usage 12G)
> -> B (hard limit 20G, soft limit 10G, usage 16G)
>
> under global reclaim, we don't want to add pressure on A1 although its
> parent A exceeds its soft limit. Assume that if we set the soft limit
> corresponding to each cgroup's working set size (hot memory), and it
> will introduce regression to A1 in that case.
>
> In my existing implementation, i am checking the cgroup's soft limit
> standalone w/o looking its ancestors.

Why do you set the soft limit of A in the first place if you don't
want it to be enforced?

This is not really new behaviour, soft limit reclaim has always been
operating hierarchically on the biggest excessor.  In your case, the
excess of A is smaller than the excess of A2 and so that weird "only
pick the biggest excessor" behaviour hides it, but consider this:

        -> A soft 30G, usage 39G
           -> A1 soft 5G, usage 4G
           -> A2 soft 10G, usage 15G
           -> A3 soft 15G, usage 20G

Upstream would pick A from the soft limit tree and reclaim its
children with priority 0, including A1.

On the other hand, if you don't consider ancestral soft limits, you
break perfectly reasonable setups like these

        -> A soft 10G, usage 20G
           -> A1 usage 10G
           -> A2 usage 10G
        -> B soft 10G, usage 11G

where upstream would pick A and reclaim it recursively, but your
version would only apply higher pressure to B.

If you would just not set the soft limit of A in your case:

        -> A (hard limit 20G, usage 16G)
           -> A1 (soft limit 5G, usage 4G)
           -> A2 (soft limit 10G, usage 12G)
        -> B (hard limit 20G, soft limit 10G, usage 16G)

only A2 and B would experience higher pressure upon global pressure.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics
Date: Thu, 12 Jan 2012 10:17:21 +0100
Message-ID: <20120112091721.GH24386@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> <20120111003020.GD24386@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="iso-8859-1"
To: Ying Han
Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Wed, Jan 11, 2012 at 02:33:59PM -0800, Ying Han wrote:
> On Tue, Jan 10, 2012 at 4:30 PM, Johannes Weiner wrote:
> > On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote:
> >> Thank you for the patch and the stats looks reasonable to me, few
> >> questions as below:
> >>
> >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> >> > With the single per-zone LRU gone and global reclaim scanning
> >> > individual memcgs, it's straight-forward to collect meaningful and
> >> > accurate per-memcg reclaim statistics.
> >> >
> >> > This adds the following items to memory.stat:
> >>
> >> Some of the previous discussions including patches have similar stats
> >> in memory.vmscan_stat API, which collects all the per-memcg vmscan
> >> stats. I would like to understand more why we add into memory.stat
> >> instead, and do we have plan to keep extending memory.stat for those
> >> vmstat like stats?
> >
> > I think they were put into an extra file in particular to be able to
> > write to this file to reset the statistics.  But in my opinion, it's
> > trivial to calculate a delta from before and after running a workload,
> > so I didn't really like adding kernel code for that.
> >
> > Did you have another reason for a separate file in mind?
>
> Another reason I had them in separate file is easier to extend. I
> don't know if we have plan to have something like memory.vmstat, or
> just keep adding stuff into memory.stat. In general, I wanted to keep
> the memory.stat being reasonable size including only the basic
> statistics. In my existing vmscan_stat path, i have breakdowns of
> reclaim stats into file/anon which will make the memory.stat even
> larger.

Do you think it's a problem of presentation, where we want to allow
admins to figure out the memcg parameters at a glance when looking at
memory.stat but be able to debug malfunction by looking at the more
extensive vmstat file?

> >> >  Reclaim activity from kswapd due to the memcg's own limit.  Only
> >> >  applicable to the root memcg for now since kswapd is only triggered
> >> >  by physical limits, but kswapd-style reclaim based on memcg hard
> >> >  limits is being developped.
> >> >
> >> > hierarchy_pgreclaim
> >> > hierarchy_pgscan
> >> > hierarchy_kswapd_pgreclaim
> >> > hierarchy_kswapd_pgscan
> >>
> >> "pgsteal_hierarchy"
> >> "pgsteal_kswapd_hierarchy"
> >> ..
> >>
> >> No strong option on the naming, but try to make it more consistent to
> >> existing API.
> >
> > I swear I tried, but the existing naming is pretty screwed up :(
> >
> > For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
> > scan rates of direct reclaim vs. kswapd reclaim.  To get the total
> > number of pages reclaimed, you sum them up.
> >
> > On the other hand, pgsteal_* does not differentiate between direct
> > reclaim and kswapd, so to get direct reclaim numbers, you add up the
> > pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
> > which is in turn not available at zone granularity.
>
> agree and that always confuses me.

I just have scripts that present it as 'Direct page reclaimed' and
'Kswapd page reclaimed' when evaluating data so I don't have to
remember anymore :-)

But I think the wish for consistency is a bit misguided when we end up
with something like pgpgin that means something completely different
in memcg than it does on the global level.

Likewise, I don't want to use pgsteal_* and pgsteal_kswapd_* because
of their similarity to /proc/vmstat while the numbers represent
something different.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Date: Fri, 13 Jan 2012 13:04:06 +0100
Message-ID: <20120113120406.GC17060@tiehlicka.suse.cz>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <1326207772-16762-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Johannes Weiner
Cc: Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits.  Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim.
> The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
>
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit.  With the same end result of scanning
> everybody, but soft limit excessors a bit more.
>
> Advantages:
>
>  o smoother reclaim: soft limit reclaim is a separate stage before
>    global reclaim, whose result is not communicated down the line and
>    so overreclaim of the groups in excess is very likely.  After this
>    patch, soft limit reclaim is fully integrated into regular reclaim
>    and each memcg is considered exactly once per cycle.
>
>  o true hierarchy support: soft limits are only considered when
>    kswapd does global reclaim, but after this patch, targetted
>    reclaim of a memcg will mind the soft limit settings of its child
>    groups.

Yes, it makes sense. At first I was thinking that the soft limit should
be considered only under global memory pressure (at least the
documentation says so) but now it makes sense. We can push harder on
groups that are over their soft limit because they told us they could
sacrifice something... Anyway, the documentation needs an update as
well.

But we have to be a little bit careful here. I am still quite confused
about how we should handle hierarchies vs. subtrees. See below.

>
>  o code size: soft limit reclaim requires a lot of code to maintain
>    the per-node per-zone rb-trees to quickly find the biggest
>    offender, dedicated paths for soft limit reclaim etc. while this
>    new implementation gets away without all that.

On my i386 PAE setup (with the swap extension enabled):

Before
   text    data     bss     dec     hex filename
 310086   29970   35372  375428   5ba84 mm/built-in.o

After
size mm/built-in.o
   text    data     bss     dec     hex filename
 309048   30030   35372  374450   5b6b2 mm/built-in.o

I would expect a bigger difference but still good.

> Test:

Will look into the results later.
[...]
> Signed-off-by: Johannes Weiner
> ---
>  include/linux/memcontrol.h |   18 +--
>  mm/memcontrol.c            |  412 ++++----------------------------------------
>  mm/vmscan.c                |   80 +--------
>  3 files changed, 48 insertions(+), 462 deletions(-)

Really nice to see.

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 170dff4..d4f7ae5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>          return margin >> PAGE_SHIFT;
>  }
>
> +/**
> + * mem_cgroup_over_softlimit
> + * @root: hierarchy root
> + * @memcg: child of @root to test
> + *
> + * Returns %true if @memcg exceeds its own soft limit or contributes
> + * to the soft limit excess of one of its parents up to and including
> + * @root.
> + */
> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> +                               struct mem_cgroup *memcg)
> +{
> +        if (mem_cgroup_disabled())
> +                return false;
> +
> +        if (!root)
> +                root = root_mem_cgroup;
> +
> +        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +                /* root_mem_cgroup does not have a soft limit */
> +                if (memcg == root_mem_cgroup)
> +                        break;
> +                if (res_counter_soft_limit_excess(&memcg->res))
> +                        return true;
> +                if (memcg == root)
> +                        break;
> +        }
> +        return false;
> +}

Well, this might be a little bit tricky. We do not check whether memcg
and root are in a hierarchy (in terms of use_hierarchy) relation.

If we are under global reclaim then we iterate over all memcgs and so
there is no guarantee that there is a hierarchical relation between the
given memcg and its parent. On the other hand, if we are doing memcg
reclaim then we have this guarantee.

Why should we punish a group (subtree) which is perfectly under its soft
limit just because some other subtree contributes to the common parent's
usage and pushes it over its limit?
Should we check memcg->use_hierarchy here?

Does it even make sense to set up a soft limit on a parent group without
hierarchies?
Well, I have to admit that hierarchies give me a headache.

> +
>  int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
>          struct cgroup *cgrp = memcg->css.cgroup;
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e3fd8a7..4279549 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>                          .mem_cgroup = memcg,
>                          .zone = zone,
>                  };
> +                int epriority = priority;
> +                /*
> +                 * Put more pressure on hierarchies that exceed their
> +                 * soft limit, to push them back harder than their
> +                 * well-behaving siblings.
> +                 */
> +                if (mem_cgroup_over_softlimit(root, memcg))
> +                        epriority = 0;

This sounds too aggressive to me. Shouldn't we just double the pressure
or something like that?
Previously we always had nr_to_reclaim == SWAP_CLUSTER_MAX when we did
memcg reclaim but this is not the case now. For kswapd we have
nr_to_reclaim == ULONG_MAX so we will not break out of the reclaim early
and we have to scan a lot.
Direct reclaim (shrink or hard limit) shouldn't be affected here.

>
> -                shrink_mem_cgroup_zone(priority, &mz, sc);
> +                shrink_mem_cgroup_zone(epriority, &mz, sc);
>
>                  mem_cgroup_account_reclaim(root, memcg,
>                                             sc->nr_reclaimed - nr_reclaimed,

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12 190 00 Praha 9 Czech Republic From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Fri, 13 Jan 2012 13:16:56 +0100 Message-ID: <20120113121645.GA1653@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: KAMEZAWA Hiroyuki Cc: Andrew Morton , Michal Hocko , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Thu, Jan 12, 2012 at 10:54:27AM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 10 Jan 2012 16:02:52 +0100 > Johannes Weiner wrote: > > > Right now, memcg soft limits are implemented by having a sorted tree > > of memcgs that are in excess of their limits. Under global memory > > pressure, kswapd first reclaims from the biggest excessor and then > > proceeds to do regular global reclaim. The result of this is that > > pages are reclaimed from all memcgs, but more scanning happens against > > those above their soft limit. > > > > With global reclaim doing memcg-aware hierarchical reclaim by default, > > this is a lot easier to implement: everytime a memcg is reclaimed > > from, scan more aggressively (per tradition with a priority of 0) if > > it's above its soft limit. With the same end result of scanning > > everybody, but soft limit excessors a bit more. > > > > Advantages: > > > > o smoother reclaim: soft limit reclaim is a separate stage before > > global reclaim, whose result is not communicated down the line and > > so overreclaim of the groups in excess is very likely. 
After this > > patch, soft limit reclaim is fully integrated into regular reclaim > > and each memcg is considered exactly once per cycle. > > > > o true hierarchy support: soft limits are only considered when > > kswapd does global reclaim, but after this patch, targetted > > reclaim of a memcg will mind the soft limit settings of its child > > groups. > > > > o code size: soft limit reclaim requires a lot of code to maintain > > the per-node per-zone rb-trees to quickly find the biggest > > offender, dedicated paths for soft limit reclaim etc. while this > > new implementation gets away without all that. > > > > Test: > > > > The test consists of two concurrent kernel build jobs in separate > > source trees, the master and the slave. The two jobs get along nicely > > on 600MB of available memory, so this is the zero overcommit control > > case. When available memory is decreased, the overcommit is > > compensated by decreasing the soft limit of the slave by the same > > amount, in the hope that the slave takes the hit and the master stays > > unaffected. 
> >
> >                                       600M-0M-vanilla         600M-0M-patched
> > Master walltime (s)                 552.65 (  +0.00%)       552.38 (  -0.05%)
> > Master walltime (stddev)              1.25 (  +0.00%)         0.92 ( -14.66%)
> > Master major faults                 204.38 (  +0.00%)       205.38 (  +0.49%)
> > Master major faults (stddev)         27.16 (  +0.00%)        13.80 ( -47.43%)
> > Master reclaim                       31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master reclaim (stddev)              34.01 (  +0.00%)        75.88 (+119.59%)
> > Master scan                          31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master scan (stddev)                 34.01 (  +0.00%)        75.88 (+119.59%)
> > Master kswapd reclaim             33922.12 (  +0.00%)     33887.12 (  -0.10%)
> > Master kswapd reclaim (stddev)      969.08 (  +0.00%)       492.22 ( -49.16%)
> > Master kswapd scan                34085.75 (  +0.00%)     33985.75 (  -0.29%)
> > Master kswapd scan (stddev)        1101.07 (  +0.00%)       563.33 ( -48.79%)
> > Slave walltime (s)                  552.68 (  +0.00%)       552.12 (  -0.10%)
> > Slave walltime (stddev)               0.79 (  +0.00%)         1.05 ( +14.76%)
> > Slave major faults                  212.50 (  +0.00%)       204.50 (  -3.75%)
> > Slave major faults (stddev)          26.90 (  +0.00%)        13.17 ( -49.20%)
> > Slave reclaim                        26.12 (  +0.00%)        35.00 ( +32.72%)
> > Slave reclaim (stddev)               29.42 (  +0.00%)        74.91 (+149.55%)
> > Slave scan                           31.38 (  +0.00%)        35.00 ( +11.20%)
> > Slave scan (stddev)                  33.31 (  +0.00%)        74.91 (+121.24%)
> > Slave kswapd reclaim              34259.00 (  +0.00%)     33469.88 (  -2.30%)
> > Slave kswapd reclaim (stddev)       925.15 (  +0.00%)       565.07 ( -38.88%)
> > Slave kswapd scan                 34354.62 (  +0.00%)     33555.75 (  -2.33%)
> > Slave kswapd scan (stddev)          969.62 (  +0.00%)       581.70 ( -39.97%)
> >
> > In the control case, the differences in elapsed time, number of major
> > faults taken, and reclaim statistics are within the noise for both the
> > master and the slave job.
> >
> >                                     600M-280M-vanilla       600M-280M-patched
> > Master walltime (s)                 595.13 (  +0.00%)       553.19 (  -7.04%)
> > Master walltime (stddev)              8.31 (  +0.00%)         2.57 ( -61.64%)
> > Master major faults                3729.75 (  +0.00%)       783.25 ( -78.98%)
> > Master major faults (stddev)        258.79 (  +0.00%)       226.68 ( -12.36%)
> > Master reclaim                      705.00 (  +0.00%)        29.50 ( -95.68%)
> > Master reclaim (stddev)             232.87 (  +0.00%)        44.72 ( -80.45%)
> > Master scan                         714.88 (  +0.00%)        30.00 ( -95.67%)
> > Master scan (stddev)                237.44 (  +0.00%)        45.39 ( -80.54%)
> > Master kswapd reclaim               114.75 (  +0.00%)        50.00 ( -55.94%)
> > Master kswapd reclaim (stddev)      128.51 (  +0.00%)         9.45 ( -91.93%)
> > Master kswapd scan                  115.75 (  +0.00%)        50.00 ( -56.32%)
> > Master kswapd scan (stddev)         130.31 (  +0.00%)         9.45 ( -92.04%)
> > Slave walltime (s)                  631.18 (  +0.00%)       577.68 (  -8.46%)
> > Slave walltime (stddev)               9.89 (  +0.00%)         3.63 ( -57.47%)
> > Slave major faults                28401.75 (  +0.00%)     14656.75 ( -48.39%)
> > Slave major faults (stddev)        2629.97 (  +0.00%)      1911.81 ( -27.30%)
> > Slave reclaim                     65400.62 (  +0.00%)      1479.62 ( -97.74%)
> > Slave reclaim (stddev)            11623.02 (  +0.00%)      1482.13 ( -87.24%)
> > Slave scan                      9050047.88 (  +0.00%)     95968.25 ( -98.94%)
> > Slave scan (stddev)             1912786.94 (  +0.00%)     93390.71 ( -95.12%)
> > Slave kswapd reclaim             327894.50 (  +0.00%)    227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)     22289.43 (  +0.00%)     16113.14 ( -27.71%)
> > Slave kswapd scan              34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)      2523642.98 (  +0.00%)    156754.74 ( -93.79%)
> >
> > Here, the available memory is limited to 320 MB, the machine is
> > overcommitted by 280 MB. The soft limit of the master is 300 MB, that
> > of the slave merely 20 MB.
> >
> > Looking at the slave job first, it is much better off with the patched
> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> > a third. The result is much fewer major faults taken, which in turn
> > lets the job finish quicker.
> > > > It would be a zero-sum game if the improvement happened at the cost of > > the master but looking at the numbers, even the master performs better > > with the patched kernel. In fact, the master job is almost unaffected > > on the patched kernel compared to the control case. > > > > This is an odd phenomenon, as the patch does not directly change how > > the master is reclaimed. An explanation for this is that the severe > > overreclaim of the slave in the unpatched kernel results in the master > > growing bigger than in the patched case. Combining the fact that > > memcgs are scanned according to their size with the increased refault > > rate of the overreclaimed slave triggering global reclaim more often > > means that overall pressure on the master job is higher in the > > unpatched kernel. > > > > At any rate, the patched kernel seems to do a much better job at both > > overall resource allocation under soft limit overcommit as well as the > > requested prioritization of the master job. > > > > Signed-off-by: Johannes Weiner > > Thank you for your work and the result seems atractive and code is much > simpler. My small concerns are.. > > 1. This approach may increase latency of direct-reclaim because of priority=0. I think strictly speaking yes, but note that with kswapd being less likely to get stuck in hammering on one group, the need for allocators to enter direct reclaim itself is reduced. However, if this really becomes a problem in real world loads, the fix is pretty easy: just ignore the soft limit for direct reclaim. We can still consider it from hard limit reclaim and kswapd. > 2. In a case numa-spread/interleave application run in its own container, > pages on a node may paged-out again and again becasue of priority=0 > if some other application runs in the node. > It seems difficult to use soft-limit with numa-aware applications. > Do you have suggestions ? 
This is a question about soft limits in general rather than about this particular patch, right? And if I understand correctly, the problem you are referring to is this: an application and parts of a soft-limited container share a node, the soft limit setting means that the container's pages on that node are reclaimed harder. At that point, the container's share on that node becomes tiny, but since the soft limit is oblivious to nodes, the expansion of the other application pushes the soft-limited container off that node completely as long as the container stays above its soft limit with the usage on other nodes. What would you think about having node-local soft limits that take the node size into account? local_soft_limit = soft_limit * node_size / memcg_size The soft limit can be exceeded globally, but the container is no longer pushed off a node on which it's only occupying a small share of memory. Putting it into proportion of the memcg size, not overall memory size has the following advantages: 1. if the container is sitting on only one of several available nodes without exceeding the limit globally, the memcg will not be reclaimed harder just because it has a relatively large share of the node. 2. if the soft limit excess is ridiculously high, the local soft limits will be pushed down, so the tolerance for smaller shares on nodes goes down in proportion to the global soft limit excess. Example: 4G soft limit * 2G node / 4G container = 2G node-local limit The container is globally within its soft limit, so the local limit is at least the size of the node. It's never reclaimed harder compared to other applications on the node. 4G soft limit * 2G node / 5G container = ~1.6G node-local limit Here, it will experience more pressure initially, but it will level off when the shrinking usage and the thereby increasing node-local soft limit meet. From that point on, the container and the competing application will be treated equally during reclaim. 
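To make the arithmetic of the proposal concrete, here is a minimal sketch of the node-local soft limit formula. It is not kernel code: the function names, the MiB units (chosen so the intermediate product stays within 64 bits for realistic sizes), and the comparison helper are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function.
 * Implements: local_soft_limit = soft_limit * node_size / memcg_size
 * All sizes are in MiB; the division truncates toward zero.
 */
uint64_t local_soft_limit_mb(uint64_t soft_limit_mb,
			     uint64_t node_size_mb,
			     uint64_t memcg_size_mb)
{
	/* An empty memcg has nothing to reclaim; callers must skip it. */
	assert(memcg_size_mb > 0);
	return soft_limit_mb * node_size_mb / memcg_size_mb;
}

/*
 * Assumed policy for illustration: the memcg is reclaimed harder on a
 * node only while its usage on that node exceeds the node-local limit.
 */
bool over_local_soft_limit(uint64_t usage_on_node_mb,
			   uint64_t soft_limit_mb,
			   uint64_t node_size_mb,
			   uint64_t memcg_size_mb)
{
	return usage_on_node_mb >
	       local_soft_limit_mb(soft_limit_mb, node_size_mb, memcg_size_mb);
}
```

With a 4096 MiB soft limit and a 2048 MiB node, a 4096 MiB container gets a local limit of 2048 MiB (the whole node), a 5120 MiB container gets 1638 MiB (the "~1.6G" above), and a 16384 MiB container gets 512 MiB.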
Finally, if the container is 16G in size, i.e. 300% in excess, the per-node tolerance is at 512M node-local soft limit, which IMO strikes a good balance between zero tolerance and still applying some stress to the hugely oversized container when other applications (with virtually unlimited soft limits) want to run on the same node. What do you think?
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Fri, 13 Jan 2012 16:50:01 +0100 Message-ID: <20120113155001.GB1653@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20120113120406.GC17060@tiehlicka.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: > > Right now, memcg soft limits are implemented by having a sorted tree > > of memcgs that are in excess of their limits. Under global memory > > pressure, kswapd first reclaims from the biggest excessor and then > > proceeds to do regular global reclaim. The result of this is that > > pages are reclaimed from all memcgs, but more scanning happens against > > those above their soft limit.
> > > > With global reclaim doing memcg-aware hierarchical reclaim by default, > > this is a lot easier to implement: everytime a memcg is reclaimed > > from, scan more aggressively (per tradition with a priority of 0) if > > it's above its soft limit. With the same end result of scanning > > everybody, but soft limit excessors a bit more. > > > > Advantages: > > > > o smoother reclaim: soft limit reclaim is a separate stage before > > global reclaim, whose result is not communicated down the line and > > so overreclaim of the groups in excess is very likely. After this > > patch, soft limit reclaim is fully integrated into regular reclaim > > and each memcg is considered exactly once per cycle. > > > > o true hierarchy support: soft limits are only considered when > > kswapd does global reclaim, but after this patch, targetted > > reclaim of a memcg will mind the soft limit settings of its child > > groups. > > Yes it makes sense. At first I was thinking that soft limit should be > considered only under global mem. pressure (at least documentation says > so) but now it makes sense. > We can push on over-soft limit groups more because they told us they > could sacrifice something... Anyway documentation needs an update as > well. You are right, I'll look into it. > But we have to be little bit careful here. I am still quite confuses how > we should handle hierarchies vs. subtrees. See bellow. > > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > return margin >> PAGE_SHIFT; > > } > > > > +/** > > + * mem_cgroup_over_softlimit > > + * @root: hierarchy root > > + * @memcg: child of @root to test > > + * > > + * Returns %true if @memcg exceeds its own soft limit or contributes > > + * to the soft limit excess of one of its parents up to and including > > + * @root. 
> > + */ > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > > + struct mem_cgroup *memcg) > > +{ > > + if (mem_cgroup_disabled()) > > + return false; > > + > > + if (!root) > > + root = root_mem_cgroup; > > + > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { > > + /* root_mem_cgroup does not have a soft limit */ > > + if (memcg == root_mem_cgroup) > > + break; > > + if (res_counter_soft_limit_excess(&memcg->res)) > > + return true; > > + if (memcg == root) > > + break; > > + } > > + return false; > > +} > > Well, this might be little bit tricky. We do not check whether memcg and > root are in a hierarchy (in terms of use_hierarchy) relation. > > If we are under global reclaim then we iterate over all memcgs and so > there is no guarantee that there is a hierarchical relation between the > given memcg and its parent. While, on the other hand, if we are doing > memcg reclaim then we have this guarantee. > > Why should we punish a group (subtree) which is perfectly under its soft > limit just because some other subtree contributes to the common parent's > usage and makes it over its limit? > Should we check memcg->use_hierarchy here? We do, actually. parent_mem_cgroup() checks the res_counter parent, which is only set when ->use_hierarchy is also set. The loop should never walk upwards outside of a hierarchy. And yes, if you have this: A / \ B C and configured a soft limit for A, you asked for both B and C to be responsible when this limit is exceeded, that's not new behaviour. > Does it even makes sense to setup soft limit on a parent group without > hierarchies? > Well I have to admit that hierarchies makes me headache. There is no parent without a hierarchy. 
It is insofar pretty confusing that you can actually create a directory hierarchy that does not reflect a memcg hierarchy: # pwd /sys/fs/cgroup/memory/foo/bar # cat memory.usage_in_bytes 450560 # cat ../memory.usage_in_bytes 0 there is no accounting/limiting/whatever parent-child relationship between foo and bar. > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, > > .mem_cgroup = memcg, > > .zone = zone, > > }; > > + int epriority = priority; > > + /* > > + * Put more pressure on hierarchies that exceed their > > + * soft limit, to push them back harder than their > > + * well-behaving siblings. > > + */ > > + if (mem_cgroup_over_softlimit(root, memcg)) > > + epriority = 0; > > This sounds too aggressive to me. Shouldn't we just double the pressure > or something like that? That's the historical value. When I tried priority - 1, it was not aggressive enough. > Previously we always had nr_to_reclaim == SWAP_CLUSTER_MAX when we did > memcg reclaim but this is not the case now. For the kswapd we have > nr_to_reclaim == ULONG_MAX so we will not break out of the reclaim early > and we have to scan a lot. > Direct reclaim (shrink or hard limit) shouldn't be affected here. It took me a while: we had SWAP_CLUSTER_MAX in _soft limit reclaim_, which means that even with priority 0 we would bail after reclaiming SWAP_CLUSTER_MAX from each lru of a zone. But it's now happening with kswapd's own scan_control, so the overreclaim protection is gone. That is indeed a change in behaviour I haven't noticed, good catch! I will look into it. 
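The priority trade-off discussed above can be illustrated with a small model. It assumes the vmscan convention of this kernel generation, in which the scan target for an LRU list is its size shifted right by the priority, with DEF_PRIORITY being 12 and SWAP_CLUSTER_MAX being 32. The function names and the simplified bail-out check are hypothetical and only model the behaviour described in this thread, not the actual shrink_zone() code.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY     12	/* assumed default scan priority */
#define SWAP_CLUSTER_MAX 32UL	/* assumed reclaim batch size */

/* Pages to scan from one LRU list; the whole list at priority 0. */
unsigned long scan_target(unsigned long lru_size, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return lru_size >> priority;
}

/* Effective priority as in the quoted shrink_zone() hunk. */
int effective_priority(int priority, bool over_soft_limit)
{
	return over_soft_limit ? 0 : priority;
}

/*
 * Simplified bail-out model: reclaim stops once the goal is met.
 * With nr_to_reclaim == SWAP_CLUSTER_MAX this triggers quickly; with
 * kswapd's ULONG_MAX it effectively never does.
 */
bool reclaim_goal_met(unsigned long nr_reclaimed,
		      unsigned long nr_to_reclaim)
{
	return nr_reclaimed >= nr_to_reclaim;
}
```

For an LRU list of 1,000,000 pages, the model yields 244 pages at DEF_PRIORITY, 488 pages at priority - 1 (the "double the pressure" variant), and all 1,000,000 pages at priority 0. Under this model, the missing SWAP_CLUSTER_MAX bail-out is what turns priority 0 into a full scan when kswapd's own scan_control is used.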
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Fri, 13 Jan 2012 17:34:23 +0100 Message-ID: <20120113163423.GG17060@tiehlicka.suse.cz> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20120113155001.GB1653-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri 13-01-12 16:50:01, Johannes Weiner wrote: > On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: > > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: [...] > > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > > > + struct mem_cgroup *memcg) > > > +{ > > > + if (mem_cgroup_disabled()) > > > + return false; > > > + > > > + if (!root) > > > + root = root_mem_cgroup; > > > + > > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { > > > + /* root_mem_cgroup does not have a soft limit */ > > > + if (memcg == root_mem_cgroup) > > > + break; > > > + if (res_counter_soft_limit_excess(&memcg->res)) > > > + return true; > > > + if (memcg == root) > > > + break; > > > + } > > > + return false; > > > +} > > > > Well, this might be little bit tricky. We do not check whether memcg and > > root are in a hierarchy (in terms of use_hierarchy) relation. 
> > > > If we are under global reclaim then we iterate over all memcgs and so > > there is no guarantee that there is a hierarchical relation between the > > given memcg and its parent. While, on the other hand, if we are doing > > memcg reclaim then we have this guarantee. > > > > Why should we punish a group (subtree) which is perfectly under its soft > > limit just because some other subtree contributes to the common parent's > > usage and makes it over its limit? > > Should we check memcg->use_hierarchy here? > > We do, actually. parent_mem_cgroup() checks the res_counter parent, > which is only set when ->use_hierarchy is also set. Of course I am blind.. We do not set up the res_counter parent for the !use_hierarchy case. Sorry for the noise... Now it makes much better sense. I was wondering how !use_hierarchy could ever work, this should be a signal that I am overlooking something terribly. [...] > > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, > > > .mem_cgroup = memcg, > > > .zone = zone, > > > }; > > > + int epriority = priority; > > > + /* > > > + * Put more pressure on hierarchies that exceed their > > > + * soft limit, to push them back harder than their > > > + * well-behaving siblings. > > > + */ > > > + if (mem_cgroup_over_softlimit(root, memcg)) > > > + epriority = 0; > > > > This sounds too aggressive to me. Shouldn't we just double the pressure > > or something like that? > > That's the historical value. When I tried priority - 1, it was not > aggressive enough. Probably because we want to reclaim too much. Maybe we should reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until a certain priority level, as Ying suggested in her patchset. -- Michal Hocko SUSE Labs SUSE LINUX s.r.o.
Lihovarska 1060/12 190 00 Praha 9 Czech Republic
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ying Han Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Fri, 13 Jan 2012 13:31:16 -0800 Message-ID: References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:x-system-of-record:content-type:content-transfer-encoding; bh=/AFmawpk7Ftgbs1t3hdLl/rs4EuQA8OPDFGGWsJvKdA=; b=a+mWXsFWRRPQOg+x6x5w2Uvc0SgofbfzBkG/A7FxUSJstWNz72G14hqPz8xp5nq9nU WzqIPTM4slrqoFQ+34XWTIGihCkxLlqRZ442lE++Eb5gvXA+0eCMtYmS7h2TI/PPxtyd pE5kiiZyudvrjViPckqlgOZw1Y1RUHadulbFo= In-Reply-To: <20120112085904.GG24386-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote: > On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: >> > Right now, memcg soft limits are implemented by having a sorted tree >> > of memcgs that are in excess of their limits. Under global memory >> > pressure, kswapd first reclaims from the biggest excessor and then >> > proceeds to do regular global reclaim. The result of this is that >> > pages are reclaimed from all memcgs, but more scanning happens against >> > those above their soft limit.
>> > >> > With global reclaim doing memcg-aware hierarchical reclaim by default, >> > this is a lot easier to implement: everytime a memcg is reclaimed >> > from, scan more aggressively (per tradition with a priority of 0) if >> > it's above its soft limit. With the same end result of scanning >> > everybody, but soft limit excessors a bit more. >> > >> > Advantages: >> > >> > o smoother reclaim: soft limit reclaim is a separate stage before >> > global reclaim, whose result is not communicated down the line and >> > so overreclaim of the groups in excess is very likely. After this >> > patch, soft limit reclaim is fully integrated into regular reclaim >> > and each memcg is considered exactly once per cycle. >> > >> > o true hierarchy support: soft limits are only considered when >> > kswapd does global reclaim, but after this patch, targetted >> > reclaim of a memcg will mind the soft limit settings of its child >> > groups. >> >> Why we add soft limit reclaim into target reclaim?
>
>        -> A hard limit 10G, usage 10G
>           -> A1 soft limit 8G, usage 5G
>           -> A2 soft limit 2G, usage 5G
>
> When A hits its hard limit, A2 will experience more pressure than A1. > > Soft limits are already applied hierarchically: the memcg that is > picked from the tree is reclaimed hierarchically. What I wanted to > add is the soft limit also being /triggerable/ from non-global > hierarchy levels. > >> Based on the discussions, my understanding is that the soft limit only >> take effect while the whole machine is under memory contention. We >> don't want to add extra pressure on a cgroup if there is free memory >> on the system even the cgroup is above its limit. > > If a hierarchy is under pressure, we will reclaim that hierarchy. We > allow groups to be prioritized under global pressure, why not allow it > for local pressure as well?
> > I am not quite sure what you are objecting to. > >> > o code size: soft limit reclaim requires a lot of code to maintain >> > the per-node per-zone rb-trees to quickly find the biggest >> > offender, dedicated paths for soft limit reclaim etc. while this >> > new implementation gets away without all that. >> > >> > Test: >> > >> > The test consists of two concurrent kernel build jobs in separate >> > source trees, the master and the slave. The two jobs get along nicely >> > on 600MB of available memory, so this is the zero overcommit control >> > case. When available memory is decreased, the overcommit is >> > compensated by decreasing the soft limit of the slave by the same >> > amount, in the hope that the slave takes the hit and the master stays >> > unaffected.
>> >
>> >                                       600M-0M-vanilla         600M-0M-patched
>> > Master walltime (s)                 552.65 (  +0.00%)       552.38 (  -0.05%)
>> > Master walltime (stddev)              1.25 (  +0.00%)         0.92 ( -14.66%)
>> > Master major faults                 204.38 (  +0.00%)       205.38 (  +0.49%)
>> > Master major faults (stddev)         27.16 (  +0.00%)        13.80 ( -47.43%)
>> > Master reclaim                       31.88 (  +0.00%)        37.75 ( +17.87%)
>> > Master reclaim (stddev)              34.01 (  +0.00%)        75.88 (+119.59%)
>> > Master scan                          31.88 (  +0.00%)        37.75 ( +17.87%)
>> > Master scan (stddev)                 34.01 (  +0.00%)        75.88 (+119.59%)
>> > Master kswapd reclaim             33922.12 (  +0.00%)     33887.12 (  -0.10%)
>> > Master kswapd reclaim (stddev)      969.08 (  +0.00%)       492.22 ( -49.16%)
>> > Master kswapd scan                34085.75 (  +0.00%)     33985.75 (  -0.29%)
>> > Master kswapd scan (stddev)        1101.07 (  +0.00%)       563.33 ( -48.79%)
>> > Slave walltime (s)                  552.68 (  +0.00%)       552.12 (  -0.10%)
>> > Slave walltime (stddev)               0.79 (  +0.00%)         1.05 ( +14.76%)
>> > Slave major faults                  212.50 (  +0.00%)       204.50 (  -3.75%)
>> > Slave major faults (stddev)          26.90 (  +0.00%)        13.17 ( -49.20%)
>> > Slave reclaim                        26.12 (  +0.00%)        35.00 ( +32.72%)
>> > Slave reclaim (stddev)               29.42 (  +0.00%)        74.91 (+149.55%)
>> > Slave scan                           31.38 (  +0.00%)        35.00 ( +11.20%)
>> > Slave scan (stddev)                  33.31 (  +0.00%)        74.91 (+121.24%)
>> > Slave kswapd reclaim              34259.00 (  +0.00%)     33469.88 (  -2.30%)
>> > Slave kswapd reclaim (stddev)       925.15 (  +0.00%)       565.07 ( -38.88%)
>> > Slave kswapd scan                 34354.62 (  +0.00%)     33555.75 (  -2.33%)
>> > Slave kswapd scan (stddev)          969.62 (  +0.00%)       581.70 ( -39.97%)
>> >
>> > In the control case, the differences in elapsed time, number of major >> > faults taken, and reclaim statistics are within the noise for both the >> > master and the slave job. >> >> What's the soft limit setting in the controlled case? > > 300MB for both jobs. > >> I assume it is the default RESOURCE_MAX. So both Master and Slave get >> equal pressure before/after the patch, and no differences on the stats >> should be observed. > > Yes.
The control case demonstrates that both jobs can fit > comfortably, don't compete for space and that in general the patch > does not have unexpected negative impact (after all, it modifies > codepaths that were invoked regularly outside of reclaim). >
>> >                                     600M-280M-vanilla       600M-280M-patched
>> > Master walltime (s)                 595.13 (  +0.00%)       553.19 (  -7.04%)
>> > Master walltime (stddev)              8.31 (  +0.00%)         2.57 ( -61.64%)
>> > Master major faults                3729.75 (  +0.00%)       783.25 ( -78.98%)
>> > Master major faults (stddev)        258.79 (  +0.00%)       226.68 ( -12.36%)
>> > Master reclaim                      705.00 (  +0.00%)        29.50 ( -95.68%)
>> > Master reclaim (stddev)             232.87 (  +0.00%)        44.72 ( -80.45%)
>> > Master scan                         714.88 (  +0.00%)        30.00 ( -95.67%)
>> > Master scan (stddev)                237.44 (  +0.00%)        45.39 ( -80.54%)
>> > Master kswapd reclaim               114.75 (  +0.00%)        50.00 ( -55.94%)
>> > Master kswapd reclaim (stddev)      128.51 (  +0.00%)         9.45 ( -91.93%)
>> > Master kswapd scan                  115.75 (  +0.00%)        50.00 ( -56.32%)
>> > Master kswapd scan (stddev)         130.31 (  +0.00%)         9.45 ( -92.04%)
>> > Slave walltime (s)                  631.18 (  +0.00%)       577.68 (  -8.46%)
>> > Slave walltime (stddev)               9.89 (  +0.00%)         3.63 ( -57.47%)
>> > Slave major faults                28401.75 (  +0.00%)     14656.75 ( -48.39%)
>> > Slave major faults (stddev)        2629.97 (  +0.00%)      1911.81 ( -27.30%)
>> > Slave reclaim                     65400.62 (  +0.00%)      1479.62 ( -97.74%)
>> > Slave reclaim (stddev)            11623.02 (  +0.00%)      1482.13 ( -87.24%)
>> > Slave scan                      9050047.88 (  +0.00%)     95968.25 ( -98.94%)
>> > Slave scan (stddev)             1912786.94 (  +0.00%)     93390.71 ( -95.12%)
>> > Slave kswapd reclaim             327894.50 (  +0.00%)    227099.88 ( -30.74%)
>> > Slave kswapd reclaim (stddev)     22289.43 (  +0.00%)     16113.14 ( -27.71%)
>> > Slave kswapd scan              34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
>> > Slave kswapd scan (stddev)      2523642.98 (  +0.00%)    156754.74 ( -93.79%)
>> >
>> > Here, the available memory is limited to 320 MB, the machine is >> > overcommitted by 280 MB. The soft limit of the master is 300 MB, that >> > of the slave merely 20 MB. >> > >> > Looking at the slave job first, it is much better off with the patched >> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by >> > a third. The result is much fewer major faults taken, which in turn >> > lets the job finish quicker. >> >> What's the setting of the hard limit here? Is the direct reclaim >> referring to per-memcg directly reclaim or global one. > > The machine's memory is limited to 600M, the hard limits are unset. > All reclaim is a result of global memory pressure. > > With the patched kernel, I could have used a dedicated parent cgroup > and let master and slave run in children of this group, the soft > limits would be taken into account just the same. But this does not > work on the unpatched kernel, as soft limits are only recognized on > the global level there.
> >> > It would be a zero-sum game if the improvement happened at the cost of >> > the master but looking at the numbers, even the master performs better >> > with the patched kernel. In fact, the master job is almost unaffected >> > on the patched kernel compared to the control case. >> >> It makes sense since the master job get less affected by the patch >> than the slave job under the example. Under the control case, if both >> master and slave have RESOURCE_MAX soft limit setting, they are under >> equal memory pressure(priority = DEF_PRIORITY) . On the second >> example, only the slave pressure being increased by priority = 0, and >> the Master got scanned with same priority = DEF_PRIORITY pretty much. >> >> So I would expect to see more reclaim activities happens in slave on >> the patched kernel compared to the control case. It seems match the >> testing result. > > Uhm, >
>> > Slave reclaim                     65400.62 (  +0.00%)      1479.62 ( -97.74%)
>> > Slave reclaim (stddev)            11623.02 (  +0.00%)      1482.13 ( -87.24%)
>> > Slave scan                      9050047.88 (  +0.00%)     95968.25 ( -98.94%)
>> > Slave scan (stddev)             1912786.94 (  +0.00%)     93390.71 ( -95.12%)
>> > Slave kswapd reclaim             327894.50 (  +0.00%)    227099.88 ( -30.74%)
>> > Slave kswapd reclaim (stddev)     22289.43 (  +0.00%)     16113.14 ( -27.71%)
>> > Slave kswapd scan              34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
>> > Slave kswapd scan (stddev)      2523642.98 (  +0.00%)    156754.74 ( -93.79%)
>
> Direct reclaim _shrunk_ by 98%, kswapd reclaim by 31%. > >> > This is an odd phenomenon, as the patch does not directly change how >> > the master is reclaimed.
An explanation for this is that the severe >> > overreclaim of the slave in the unpatched kernel results in the master >> > growing bigger than in the patched case. Combining the fact that >> > memcgs are scanned according to their size with the increased refault >> > rate of the overreclaimed slave triggering global reclaim more often >> > means that overall pressure on the master job is higher in the >> > unpatched kernel. >> >> We can check the Master memory.usage_in_bytes while the job is running. > > Yep, the plots of cache/rss over time confirmed exactly this. The > unpatched kernel shows higher spikes in the size of the master job > followed by deeper pits when reclaim kicked in. The patched kernel is > much smoother in that regard. > >> On the other hand, I don't see why we expect the Master being less >> reclaimed in the controlled case? On the unpatched kernel, the Master >> is being reclaimed under global pressure each time anyway since we >> ignore the return value of softlimit. > > I didn't expect that, I expected both jobs to perform equally in the > control case. And in the pressurized case, the master being > unaffected and the slave taking the hit. The patched kernel does > this, the unpatched one does not. > >> > @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, >> > struct zone *zone); >> > struct zone_reclaim_stat* >> > mem_cgroup_get_reclaim_stat_from_page(struct page *page); >> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *); >> >> Maybe something like "mem_cgroup_over_soft_limit()" ? > > Probably more consistent, yeah. Will do. > >> > @@ -343,7 +314,6 @@ static bool move_file(void) >> > * limit reclaim to prevent infinite loops, if they ever occur.
>> > =A0*/ >> > =A0#define =A0 =A0 =A0 =A0MEM_CGROUP_MAX_RECLAIM_LOOPS =A0 =A0 =A0= =A0 =A0 =A0(100) >> > -#define =A0 =A0 =A0 =A0MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2= ) >> >> You might need to remove the comment above as well. > > Oops, will fix. > >> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(stru= ct mem_cgroup *memcg) >> > =A0 =A0 =A0 =A0return margin >> PAGE_SHIFT; >> > =A0} >> > >> > +/** >> > + * mem_cgroup_over_softlimit >> > + * @root: hierarchy root >> > + * @memcg: child of @root to test >> > + * >> > + * Returns %true if @memcg exceeds its own soft limit or contribu= tes >> > + * to the soft limit excess of one of its parents up to and inclu= ding >> > + * @root. >> > + */ >> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0struc= t mem_cgroup *memcg) >> > +{ >> > + =A0 =A0 =A0 if (mem_cgroup_disabled()) >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return false; >> > + >> > + =A0 =A0 =A0 if (!root) >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 root =3D root_mem_cgroup; >> > + >> > + =A0 =A0 =A0 for (; memcg; memcg =3D parent_mem_cgroup(memcg)) { >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* root_mem_cgroup does not have a s= oft limit */ >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (memcg =3D=3D root_mem_cgroup) >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break; >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (res_counter_soft_limit_excess(&m= emcg->res)) >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 return true; >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (memcg =3D=3D root) >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break; >> > + =A0 =A0 =A0 } >> >> Here it adds pressure on a cgroup if one of its parents exceeds soft >> limit, although the cgroup itself is under soft limit. It does chang= e >> my understanding of soft limit, and might introduce regression of ou= r >> existing use cases. >> >> Here is an example: >> >> Machine capacity 32G and we over-commit by 8G. 
>> >> root >> =A0 -> A (hard limit 20G, soft limit 15G, usage 16G) >> =A0 =A0 =A0 =A0-> A1 (soft limit 5G, usage 4G) >> =A0 =A0 =A0 =A0-> A2 (soft limit 10G, usage 12G) >> =A0 -> B (hard limit 20G, soft limit 10G, usage 16G) >> >> under global reclaim, we don't want to add pressure on A1 although i= ts >> parent A exceeds its soft limit. Assume that if we set the soft limi= t >> corresponding to each cgroup's working set size (hot memory), and it >> will introduce regression to A1 in that case. >> >> In my existing implementation, i am checking the cgroup's soft limit >> standalone w/o looking its ancestors. > > Why do you set the soft limit of A in the first place if you don't > want it to be enforced? The soft limit should be enforced under certain condition, not always. The soft limit of A is set to be enforced when the parent of A and B is under memory pressure. For example: Machine capacity 32G and we over-commit by 8G root -> A (hard limit 20G, soft limit 12G, usage 20G) =A0 =A0 =A0 =A0-> A1 (soft limit 2G, usage 1G) =A0 =A0 =A0 =A0-> A2 (soft limit 10G, usage 19G) -> B (hard limit 20G, soft limit 10G, usage 0G) Now, A is under memory pressure since the total usage is hitting its hard limit. Then we start hierarchical reclaim under A, and each cgroup under A also takes consideration of soft limit. In this case, we should only set priority =3D 0 to A2 since it contributes to A's charge as well as exceeding its own soft limit. Why punishing A1 (set priority =3D 0) also which has usage under its soft limit ? I can imagine it will introduce regression to existing environment which the soft limit is set based on the working set size of the cgroup. To answer the question why we set soft limit to A, it is used to over-commit the host while sharing the resource with its sibling (B in this case). If the machine is under memory contention, we would like to push down memory to A or B depends on their usage and soft limit. 
--Ying > > This is not really new behaviour, soft limit reclaim has always been > operating hierarchically on the biggest excessor. =A0In your case, th= e > excess of A is smaller than the excess of A2 and so that weird "only > pick the biggest excessor" behaviour hides it, but consider this: > > =A0 =A0 =A0 =A0-> A soft 30G, usage 39G > =A0 =A0 =A0 =A0 =A0 -> A1 soft 5G, usage 4G > =A0 =A0 =A0 =A0 =A0 -> A2 soft 10G, usage 15G > =A0 =A0 =A0 =A0 =A0 -> A3 soft 15G, usage 20G > > Upstream would pick A from the soft limit tree and reclaim its > children with priority 0, including A1. > > On the other hand, if you don't consider ancestral soft limits, you > break perfectly reasonable setups like these > > =A0 =A0 =A0 =A0-> A soft 10G, usage 20G > =A0 =A0 =A0 =A0 =A0 -> A1 usage 10G > =A0 =A0 =A0 =A0 =A0 -> A2 usage 10G > =A0 =A0 =A0 =A0-> B soft 10G, usage 11G > > where upstream would pick A and reclaim it recursively, but your > version would only apply higher pressure to B. > > If you would just not set the soft limit of A in your case: > > =A0 =A0 =A0 =A0-> A (hard limit 20G, usage 16G) > =A0 =A0 =A0 =A0 =A0 -> A1 (soft limit 5G, usage 4G) > =A0 =A0 =A0 =A0 =A0 -> A2 (soft limit 10G, usage 12G) > =A0 =A0 =A0 =A0-> B (hard limit 20G, soft limit 10G, usage 16G) > > only A2 and B would experience higher pressure upon global pressure. 
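
The difference between the two policies argued over in this message can be
sketched in userspace C. This is only an illustrative model, not kernel
code: struct group, root_group, NO_SOFT_LIMIT and both function names are
hypothetical, sizes are plain GiB numbers, and an unset soft limit is
assumed to mean "unlimited". The hierarchy is the "A soft 10G, usage 20G"
setup quoted above, and the hierarchical walk is modelled on the loop in
the posted patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_SOFT_LIMIT ULLONG_MAX	/* assumption: unset soft limit is unlimited */

/* Hypothetical stand-in for struct mem_cgroup; sizes are in GiB. */
struct group {
	const char *name;
	struct group *parent;		/* NULL for the top-level root */
	unsigned long long soft_limit;
	unsigned long long usage;	/* includes the usage of all children */
};

/* The "A soft 10G, usage 20G" hierarchy from the quoted reply. */
static struct group root_group = { "root", NULL, NO_SOFT_LIMIT, 31 };
static struct group A  = { "A",  &root_group, 10, 20 };
static struct group A1 = { "A1", &A, NO_SOFT_LIMIT, 10 };
static struct group A2 = { "A2", &A, NO_SOFT_LIMIT, 10 };
static struct group B  = { "B",  &root_group, 10, 11 };

/* Standalone policy: only the group's own soft limit counts. */
static bool over_soft_limit_standalone(const struct group *g)
{
	return g->usage > g->soft_limit;
}

/*
 * Hierarchical policy, modelled on the loop in the posted patch: walk
 * from @g up to and including @root, skipping the top-level root.
 */
static bool over_soft_limit_hierarchical(const struct group *root,
					 const struct group *g)
{
	if (!root)
		root = &root_group;
	for (; g; g = g->parent) {
		if (g == &root_group)
			break;
		if (over_soft_limit_standalone(g))
			return true;
		if (g == root)
			break;
	}
	return false;
}

/* Expected outcome under global reclaim (root == NULL). */
static void check_example(void)
{
	/* Ancestor walk: A's excess reaches both of its children. */
	assert(over_soft_limit_hierarchical(NULL, &A1));
	assert(over_soft_limit_hierarchical(NULL, &A2));
	assert(over_soft_limit_hierarchical(NULL, &B));

	/* Standalone check: among the leaves, only B is flagged. */
	assert(!over_soft_limit_standalone(&A1));
	assert(!over_soft_limit_standalone(&A2));
	assert(over_soft_limit_standalone(&B));
}
```

Calling check_example() from a small main() exercises both policies. With
the standalone check, A's 10G excess never translates into extra pressure
on A1 or A2, which is the breakage the quoted reply describes.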
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ying Han Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Fri, 13 Jan 2012 13:45:30 -0800 Message-ID: References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> <20120113163423.GG17060@tiehlicka.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:x-system-of-record:content-type:content-transfer-encoding; bh=t4Z7HJpkDoiRrr5yS7q/iJ5wwepsDIBlKmBl1FqTfAU=; b=yZoGq89exe8cllfKn1mllg85y/UIubLpilsJ/FbBnBVe/lF/lwp/hSGQ7108F34Mn3 7sk9PM+Uwnm4oYGNW75eE4B7avEA4sifQWzCGF3SjSKF9Lgmymr4ftjP1Byf4YPJGLOa qfnYc3UTziZUY5M6TeibmXBtdZDt2bS7bP9JU= In-Reply-To: <20120113163423.GG17060-VqjxzfR4DlwKmadIfiO5sKVXKuFTiq87@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: Michal Hocko Cc: Johannes Weiner , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko wrote: > On Fri 13-01-12 16:50:01, Johannes Weiner wrote: >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: > [...] 
>> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> > > +                               struct mem_cgroup *memcg)
>> > > +{
>> > > +       if (mem_cgroup_disabled())
>> > > +               return false;
>> > > +
>> > > +       if (!root)
>> > > +               root = root_mem_cgroup;
>> > > +
>> > > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> > > +               /* root_mem_cgroup does not have a soft limit */
>> > > +               if (memcg == root_mem_cgroup)
>> > > +                       break;
>> > > +               if (res_counter_soft_limit_excess(&memcg->res))
>> > > +                       return true;
>> > > +               if (memcg == root)
>> > > +                       break;
>> > > +       }
>> > > +       return false;
>> > > +}
>> >
>> > Well, this might be little bit tricky. We do not check whether memcg and
>> > root are in a hierarchy (in terms of use_hierarchy) relation.
>> >
>> > If we are under global reclaim then we iterate over all memcgs and so
>> > there is no guarantee that there is a hierarchical relation between the
>> > given memcg and its parent. While, on the other hand, if we are doing
>> > memcg reclaim then we have this guarantee.
>> >
>> > Why should we punish a group (subtree) which is perfectly under its soft
>> > limit just because some other subtree contributes to the common parent's
>> > usage and makes it over its limit?
>> > Should we check memcg->use_hierarchy here?
>>
>> We do, actually.  parent_mem_cgroup() checks the res_counter parent,
>> which is only set when ->use_hierarchy is also set.
>
> Of course I am blind.. We do not setup res_counter parent for
> !use_hierarchy case. Sorry for noise...
> Now it makes much better sense. I was wondering how !use_hierarchy could
> ever work, this should be a signal that I am overlooking something
> terribly.
>
> [...]
>> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>> > >                         .mem_cgroup = memcg,
>> > >                         .zone = zone,
>> > >                 };
>> > > +               int epriority = priority;
>> > > +               /*
>> > > +                * Put more pressure on hierarchies that exceed their
>> > > +                * soft limit, to push them back harder than their
>> > > +                * well-behaving siblings.
>> > > +                */
>> > > +               if (mem_cgroup_over_softlimit(root, memcg))
>> > > +                       epriority = 0;
>> >
>> > This sounds too aggressive to me. Shouldn't we just double the pressure
>> > or something like that?
>>
>> That's the historical value.  When I tried priority - 1, it was not
>> aggressive enough.
>
> Probably because we want to reclaim too much. Maybe we should do
> reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain
> priority level as Ying suggested in her patchset.

I plan to post that change on top of this, and this patch set does the
basic stuff to allow us doing further improvement.

I still like the design to skip over_soft_limit cgroups until certain
priority. One way to set up the soft limit for each cgroup is to base
on its actual working set size, and we prefer to punish A first with
lots of page cache ( cold file pages above soft limit) than
reclaiming anon pages from B ( below soft limit ). Unless we can not
get enough pages reclaimed from A, we will start reclaiming from B.

This might not be the ideal solution, but should be a good start. Thoughts?

--Ying

> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Johannes Weiner
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Date: Fri, 13 Jan 2012 23:44:24 +0100
Message-ID: <20120113224424.GC1653@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="iso-8859-9"
To: Ying Han
Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
> On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote:
> > On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
> >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> >> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >> >        return margin >> PAGE_SHIFT;
> >> >  }
> >> >
> >> > +/**
> >> > + * mem_cgroup_over_softlimit
> >> > + * @root: hierarchy root
> >> > + * @memcg: child of @root to test
> >> > + *
> >> > + * Returns %true if @memcg exceeds its own soft limit or contributes
> >> > + * to the soft limit excess of one of its parents up to and including
> >> > + * @root.
> >> > + */
> >> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> >> > +                               struct mem_cgroup *memcg)
> >> > +{
> >> > +       if (mem_cgroup_disabled())
> >> > +               return false;
> >> > +
> >> > +       if (!root)
> >> > +               root = root_mem_cgroup;
> >> > +
> >> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >> > +               /* root_mem_cgroup does not have a soft limit */
> >> > +               if (memcg == root_mem_cgroup)
> >> > +                       break;
> >> > +               if (res_counter_soft_limit_excess(&memcg->res))
> >> > +                       return true;
> >> > +               if (memcg == root)
> >> > +                       break;
> >> > +       }
> >>
> >> Here it adds pressure on a cgroup if one of its parents exceeds soft
> >> limit, although the cgroup itself is under soft limit. It does change
> >> my understanding of soft limit, and might introduce regression of our
> >> existing use cases.
> >>
> >> Here is an example:
> >>
> >> Machine capacity 32G and we over-commit by 8G.
> >>
> >> root
> >>   -> A (hard limit 20G, soft limit 15G, usage 16G)
> >>        -> A1 (soft limit 5G, usage 4G)
> >>        -> A2 (soft limit 10G, usage 12G)
> >>   -> B (hard limit 20G, soft limit 10G, usage 16G)
> >>
> >> under global reclaim, we don't want to add pressure on A1 although its
> >> parent A exceeds its soft limit. Assume that if we set the soft limit
> >> corresponding to each cgroup's working set size (hot memory), and it
> >> will introduce regression to A1 in that case.
> >>
> >> In my existing implementation, i am checking the cgroup's soft limit
> >> standalone w/o looking its ancestors.
> >
> > Why do you set the soft limit of A in the first place if you don't
> > want it to be enforced?
>
> The soft limit should be enforced under certain condition, not always.
> The soft limit of A is set to be enforced when the parent of A and B
> is under memory pressure. For example:
>
> Machine capacity 32G and we over-commit by 8G
>
> root
> -> A (hard limit 20G, soft limit 12G, usage 20G)
>        -> A1 (soft limit 2G, usage 1G)
>        -> A2 (soft limit 10G, usage 19G)
> -> B (hard limit 20G, soft limit 10G, usage 0G)
>
> Now, A is under memory pressure since the total usage is hitting its
> hard limit. Then we start hierarchical reclaim under A, and each
> cgroup under A also takes consideration of soft limit. In this case,
> we should only set priority = 0 to A2 since it contributes to A's
> charge as well as exceeding its own soft limit. Why punishing A1 (set
> priority = 0) also which has usage under its soft limit ? I can
> imagine it will introduce regression to existing environment which the
> soft limit is set based on the working set size of the cgroup.
>
> To answer the question why we set soft limit to A, it is used to
> over-commit the host while sharing the resource with its sibling (B in
> this case). If the machine is under memory contention, we would like
> to push down memory to A or B depends on their usage and soft limit.

D'oh, I think the problem is just that we walk up the hierarchy one
too many when checking whether a group exceeds a soft limit.  The soft
limit is a signal to distribute pressure that comes from above, it's
meaningless and should indeed be ignored on the level the pressure
originates from.

Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
but not including root, wouldn't that do exactly what we both want?

Example:

1. If global memory is short, we reclaim with root=root_mem_cgroup.
   A1 and A2 get soft limit reclaimed because of A's soft limit
   excess, just like the current kernel would do.

2. If A hits its hard limit, we reclaim with root=A, so we only mind
   the soft limits of A1 and A2.  A1 is below its soft limit, all
   good.  A2 is above its soft limit, gets treated accordingly.  This
   is new behaviour, the current kernel would just reclaim them
   equally.

Code:

bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
                                struct mem_cgroup *memcg)
{
        if (mem_cgroup_disabled())
                return false;

        if (!root)
                root = root_mem_cgroup;

        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                if (memcg == root)
                        break;
                if (res_counter_soft_limit_excess(&memcg->res))
                        return true;
        }
        return false;
}

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sha
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Date: Tue, 17 Jan 2012 22:22:16 +0800
Message-ID: <4F158418.2090509@gmail.com>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=VwiBvxDjxbEcyGgsh8WInKsjVoB/cMmkgyGBmIC2lFw=; b=PyrVLbwYo1mimsr53qb1vefN60czHOBnE0FM9uhE6Lae6WzsDuLnHKbMp3RnOEaI5o bMBX/MawgsvqTXKlCChymMWRA4jcqbHjswoCgE8I2P74wHECOTNHG7L/1FNc3KzkbmeB niX9XZiKQhstHbOQnKpbgujvnZEX+e88U/cfs=
In-Reply-To: <20120113224424.GC1653-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Johannes Weiner
Cc: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 01/14/2012 06:44 AM, Johannes Weiner wrote: > On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote: >> On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote: >>> On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: >>>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: >>>>> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) >>>>> return margin>> PAGE_SHIFT; >>>>> } >>>>> >>>>> +/** >>>>> + * mem_cgroup_over_softlimit >>>>> + * @root: hierarchy root >>>>> + * @memcg: child of @root to test >>>>> + * >>>>> + * Returns %true if @memcg exceeds its own soft limit or contributes >>>>> + * to the soft limit excess of one of its parents up to and including >>>>> + * @root. >>>>> + */ >>>>> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >>>>> + struct mem_cgroup *memcg) >>>>> +{ >>>>> + if (mem_cgroup_disabled()) >>>>> + return false; >>>>> + >>>>> + if (!root) >>>>> + root = root_mem_cgroup; >>>>> + >>>>> + for (; memcg; memcg = parent_mem_cgroup(memcg)) { >>>>> + /* root_mem_cgroup does not have a soft limit */ >>>>> + if (memcg == root_mem_cgroup) >>>>> + break; >>>>> + if (res_counter_soft_limit_excess(&memcg->res)) >>>>> + return true; >>>>> + if (memcg == root) >>>>> + break; >>>>> + } >>>> Here it adds pressure on a cgroup if one of its parents exceeds soft >>>> limit, although the cgroup itself is under soft limit. It does change >>>> my understanding of soft limit, and might introduce regression of our >>>> existing use cases. >>>> >>>> Here is an example: >>>> >>>> Machine capacity 32G and we over-commit by 8G. >>>> >>>> root >>>> -> A (hard limit 20G, soft limit 15G, usage 16G) >>>> -> A1 (soft limit 5G, usage 4G) >>>> -> A2 (soft limit 10G, usage 12G) >>>> -> B (hard limit 20G, soft limit 10G, usage 16G) >>>> >>>> under global reclaim, we don't want to add pressure on A1 although its >>>> parent A exceeds its soft limit. 
Assume that if we set the soft limit >>>> corresponding to each cgroup's working set size (hot memory), and it >>>> will introduce regression to A1 in that case. >>>> >>>> In my existing implementation, i am checking the cgroup's soft limit >>>> standalone w/o looking its ancestors. >>> Why do you set the soft limit of A in the first place if you don't >>> want it to be enforced? >> The soft limit should be enforced under certain condition, not always. >> The soft limit of A is set to be enforced when the parent of A and B >> is under memory pressure. For example: >> >> Machine capacity 32G and we over-commit by 8G >> >> root >> -> A (hard limit 20G, soft limit 12G, usage 20G) >> -> A1 (soft limit 2G, usage 1G) >> -> A2 (soft limit 10G, usage 19G) >> -> B (hard limit 20G, soft limit 10G, usage 0G) >> >> Now, A is under memory pressure since the total usage is hitting its >> hard limit. Then we start hierarchical reclaim under A, and each >> cgroup under A also takes consideration of soft limit. In this case, >> we should only set priority = 0 to A2 since it contributes to A's >> charge as well as exceeding its own soft limit. Why punishing A1 (set >> priority = 0) also which has usage under its soft limit ? I can >> imagine it will introduce regression to existing environment which the >> soft limit is set based on the working set size of the cgroup. >> >> To answer the question why we set soft limit to A, it is used to >> over-commit the host while sharing the resource with its sibling (B in >> this case). If the machine is under memory contention, we would like >> to push down memory to A or B depends on their usage and soft limit. > D'oh, I think the problem is just that we walk up the hierarchy one > too many when checking whether a group exceeds a soft limit. The soft > limit is a signal to distribute pressure that comes from above, it's > meaningless and should indeed be ignored on the level the pressure > originates from. 
>
> Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
> but not including root, wouldn't that do exactly what we both want?
>
> Example:
>
> 1. If global memory is short, we reclaim with root=root_mem_cgroup.
>    A1 and A2 get soft limit reclaimed because of A's soft limit
>    excess, just like the current kernel would do.
>
> 2. If A hits its hard limit, we reclaim with root=A, so we only mind
>    the soft limits of A1 and A2.  A1 is below its soft limit, all
>    good.  A2 is above its soft limit, gets treated accordingly.  This
>    is new behaviour, the current kernel would just reclaim them
>    equally.
>
> Code:
>
> bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
>                                 struct mem_cgroup *memcg)
> {
>         if (mem_cgroup_disabled())
>                 return false;
>
>         if (!root)
>                 root = root_mem_cgroup;
>
>         for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>                 if (memcg == root)
>                         break;
>                 if (res_counter_soft_limit_excess(&memcg->res))
>                         return true;
>         }
>         return false;
> }

Hi Johannes,

I don't think it solve the root of the problem, example:

root
-> A (hard limit 20G, soft limit 12G, usage 20G)
   -> A1 ( soft limit 2G, usage 1G)
   -> A2 ( soft limit 10G, usage 19G)
          ->B1 (soft limit 5G, usage 4G)
          ->B2 (soft limit 5G, usage 15G)

Now A is hitting its hard limit and start hierarchical reclaim under A.
If we choose B1 to go through mem_cgroup_over_soft_limit, it will
return true because its parent A2 has a large usage and will lead to
priority=0 reclaiming. But in fact it should be B2 to be punished.

IMHO, it may checking the cgroup's soft limit standalone without
looking up its ancestors just as Ying said.
Thanks, Sha From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Tue, 17 Jan 2012 15:53:48 +0100 Message-ID: <20120117145348.GA3144@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <4F158418.2090509-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Sha Cc: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: > On 01/14/2012 06:44 AM, Johannes Weiner wrote: > >On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote: > >>On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote: > >>>On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: > >>>>On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > >>>>>@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > >>>>> return margin>> PAGE_SHIFT; > >>>>> } > >>>>> > >>>>>+/** > >>>>>+ * mem_cgroup_over_softlimit > >>>>>+ * @root: hierarchy root > >>>>>+ * @memcg: child of @root to test > >>>>>+ * > >>>>>+ * Returns %true if @memcg exceeds its own soft limit or contributes > >>>>>+ * to the soft limit excess of one of its parents up to and including > >>>>>+ * @root. 
> >>>>>+ */ > >>>>>+bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > >>>>>+ struct mem_cgroup *memcg) > >>>>>+{ > >>>>>+ if (mem_cgroup_disabled()) > >>>>>+ return false; > >>>>>+ > >>>>>+ if (!root) > >>>>>+ root = root_mem_cgroup; > >>>>>+ > >>>>>+ for (; memcg; memcg = parent_mem_cgroup(memcg)) { > >>>>>+ /* root_mem_cgroup does not have a soft limit */ > >>>>>+ if (memcg == root_mem_cgroup) > >>>>>+ break; > >>>>>+ if (res_counter_soft_limit_excess(&memcg->res)) > >>>>>+ return true; > >>>>>+ if (memcg == root) > >>>>>+ break; > >>>>>+ } > >>>>Here it adds pressure on a cgroup if one of its parents exceeds soft > >>>>limit, although the cgroup itself is under soft limit. It does change > >>>>my understanding of soft limit, and might introduce regression of our > >>>>existing use cases. > >>>> > >>>>Here is an example: > >>>> > >>>>Machine capacity 32G and we over-commit by 8G. > >>>> > >>>>root > >>>> -> A (hard limit 20G, soft limit 15G, usage 16G) > >>>> -> A1 (soft limit 5G, usage 4G) > >>>> -> A2 (soft limit 10G, usage 12G) > >>>> -> B (hard limit 20G, soft limit 10G, usage 16G) > >>>> > >>>>under global reclaim, we don't want to add pressure on A1 although its > >>>>parent A exceeds its soft limit. Assume that if we set the soft limit > >>>>corresponding to each cgroup's working set size (hot memory), and it > >>>>will introduce regression to A1 in that case. > >>>> > >>>>In my existing implementation, i am checking the cgroup's soft limit > >>>>standalone w/o looking its ancestors. > >>>Why do you set the soft limit of A in the first place if you don't > >>>want it to be enforced? > >>The soft limit should be enforced under certain condition, not always. > >>The soft limit of A is set to be enforced when the parent of A and B > >>is under memory pressure. 
For example: > >> > >>Machine capacity 32G and we over-commit by 8G > >> > >>root > >>-> A (hard limit 20G, soft limit 12G, usage 20G) > >> -> A1 (soft limit 2G, usage 1G) > >> -> A2 (soft limit 10G, usage 19G) > >>-> B (hard limit 20G, soft limit 10G, usage 0G) > >> > >>Now, A is under memory pressure since the total usage is hitting its > >>hard limit. Then we start hierarchical reclaim under A, and each > >>cgroup under A also takes consideration of soft limit. In this case, > >>we should only set priority = 0 to A2 since it contributes to A's > >>charge as well as exceeding its own soft limit. Why punishing A1 (set > >>priority = 0) also which has usage under its soft limit ? I can > >>imagine it will introduce regression to existing environment which the > >>soft limit is set based on the working set size of the cgroup. > >> > >>To answer the question why we set soft limit to A, it is used to > >>over-commit the host while sharing the resource with its sibling (B in > >>this case). If the machine is under memory contention, we would like > >>to push down memory to A or B depends on their usage and soft limit. > >D'oh, I think the problem is just that we walk up the hierarchy one > >too many when checking whether a group exceeds a soft limit. The soft > >limit is a signal to distribute pressure that comes from above, it's > >meaningless and should indeed be ignored on the level the pressure > >originates from. > > > >Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to > >but not including root, wouldn't that do exactly what we both want? > > > >Example: > > > >1. If global memory is short, we reclaim with root=root_mem_cgroup. > > A1 and A2 get soft limit reclaimed because of A's soft limit > > excess, just like the current kernel would do. > > > >2. If A hits its hard limit, we reclaim with root=A, so we only mind > > the soft limits of A1 and A2. A1 is below its soft limit, all > > good. A2 is above its soft limit, gets treated accordingly. 
This
> >   is new behaviour, the current kernel would just reclaim them
> >   equally.
> >
> >Code:
> >
> >bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
> >                                struct mem_cgroup *memcg)
> >{
> >        if (mem_cgroup_disabled())
> >                return false;
> >
> >        if (!root)
> >                root = root_mem_cgroup;
> >
> >        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >                if (memcg == root)
> >                        break;
> >                if (res_counter_soft_limit_excess(&memcg->res))
> >                        return true;
> >        }
> >        return false;
> >}
> Hi Johannes,
>
> I don't think it solve the root of the problem, example:
> root
> -> A (hard limit 20G, soft limit 12G, usage 20G)
>    -> A1 ( soft limit 2G, usage 1G)
>    -> A2 ( soft limit 10G, usage 19G)
>           ->B1 (soft limit 5G, usage 4G)
>           ->B2 (soft limit 5G, usage 15G)
>
> Now A is hitting its hard limit and start hierarchical reclaim under A.
> If we choose B1 to go through mem_cgroup_over_soft_limit, it will
> return true because its parent A2 has a large usage and will lead to
> priority=0 reclaiming. But in fact it should be B2 to be punished.

Because A2 is over its soft limit, the whole hierarchy below it should
be preferred over A1, so both B1 and B2 should be soft limit reclaimed
to be consistent with behaviour at the root level.

> IMHO, it may checking the cgroup's soft limit standalone without
> looking up its ancestors just as Ying said.

Again, this would be a regression as soft limits have been applied
hierarchically forever.
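
To make the outcome of this exchange concrete, here is a small userspace
sketch of the revised walk (ancestors up to, but not including, the
reclaim root) applied to Sha's hierarchy. It is an illustrative model
under stated assumptions, not kernel code: struct group and all function
names are hypothetical, sizes are plain GiB numbers, the root above A is
left out for brevity, and DEF_PRIORITY is given the kernel's value of 12.
effective_priority() mirrors the epriority choice from the shrink_zone()
hunk of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12	/* same value as the kernel's DEF_PRIORITY */

/* Hypothetical stand-in for struct mem_cgroup; sizes are in GiB. */
struct group {
	const char *name;
	struct group *parent;		/* NULL: parent not modelled */
	unsigned long long soft_limit;
	unsigned long long usage;	/* includes the usage of all children */
};

/* Sha's hierarchy; A has hit its 20G hard limit, so reclaim is rooted at A. */
static struct group A  = { "A",  NULL, 12, 20 };
static struct group A1 = { "A1", &A,   2,  1 };
static struct group A2 = { "A2", &A,  10, 19 };
static struct group B1 = { "B1", &A2,  5,  4 };
static struct group B2 = { "B2", &A2,  5, 15 };

static bool exceeds_own_soft_limit(const struct group *g)
{
	return g->usage > g->soft_limit;
}

/*
 * Revised walk from the proposal: check @g and its ancestors up to,
 * but not including, the reclaim root.
 */
static bool over_soft_limit(const struct group *root, const struct group *g)
{
	for (; g; g = g->parent) {
		if (g == root)
			break;
		if (exceeds_own_soft_limit(g))
			return true;
	}
	return false;
}

/* Mirrors the epriority choice in the shrink_zone() hunk of the patch. */
static int effective_priority(const struct group *root,
			      const struct group *g, int priority)
{
	return over_soft_limit(root, g) ? 0 : priority;
}

static void check_sha_example(void)
{
	/* A's own excess is ignored at the level the pressure comes from. */
	assert(!over_soft_limit(&A, &A));

	/* A1 is under its soft limit and keeps the default priority. */
	assert(effective_priority(&A, &A1, DEF_PRIORITY) == DEF_PRIORITY);

	/* A2 is in excess, so its whole subtree is scanned at priority 0. */
	assert(effective_priority(&A, &A2, DEF_PRIORITY) == 0);
	assert(effective_priority(&A, &B1, DEF_PRIORITY) == 0);
	assert(effective_priority(&A, &B2, DEF_PRIORITY) == 0);

	/* A standalone check would have flagged B2 but spared B1. */
	assert(!exceeds_own_soft_limit(&B1));
	assert(exceeds_own_soft_limit(&B2));
}
```

Calling check_sha_example() from a small main() runs the checks. B1 is
the contested case: it is below its own soft limit, yet the hierarchical
walk treats it like B2 because both contribute to the excess of A2.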
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ying Han Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Tue, 17 Jan 2012 12:25:31 -0800 Message-ID: References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:x-system-of-record:content-type:content-transfer-encoding; bh=ehd0yCOk4iUtnIibYW09aOC+zROO1ti1c5j2rCEItB8=; b=Pg5Ad4TzsJMH0+oVNVFJEdU8Q3CzwvukVqIS9piPzWLKdgmsO41xqTwwnE/WUpeZy2 40Ppb+cXEr7SDQofKzVOPRQFeWQlI2Vy8ZzypkTn7Oq52lx9wZX+dl9F3NQpIr/qykg4 +fikucbhmbe/KPYv5fDwhhi6CidZLx37xWers= In-Reply-To: <20120117145348.GA3144-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: Johannes Weiner Cc: Sha , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner w= rote: > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: >> On 01/14/2012 06:44 AM, Johannes Weiner wrote: >> >On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote: >> >>On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner =A0wrote: >> >>>On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: >> >>>>On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner =A0wrote: >> >>>>>@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(s= truct mem_cgroup *memcg) >> >>>>> =A0 =A0 =A0 =A0return margin>> =A0PAGE_SHIFT; >> >>>>> =A0} >> >>>>> >> >>>>>+/** >> 
>>>>>+ * mem_cgroup_over_softlimit >> >>>>>+ * @root: hierarchy root >> >>>>>+ * @memcg: child of @root to test >> >>>>>+ * >> >>>>>+ * Returns %true if @memcg exceeds its own soft limit or contributes >> >>>>>+ * to the soft limit excess of one of its parents up to and including >> >>>>>+ * @root. >> >>>>>+ */ >> >>>>>+bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >> >>>>>+ struct mem_cgroup *memcg) >> >>>>>+{ >> >>>>>+ if (mem_cgroup_disabled()) >> >>>>>+ return false; >> >>>>>+ >> >>>>>+ if (!root) >> >>>>>+ root = root_mem_cgroup; >> >>>>>+ >> >>>>>+ for (; memcg; memcg = parent_mem_cgroup(memcg)) { >> >>>>>+ /* root_mem_cgroup does not have a soft limit */ >> >>>>>+ if (memcg == root_mem_cgroup) >> >>>>>+ break; >> >>>>>+ if (res_counter_soft_limit_excess(&memcg->res)) >> >>>>>+ return true; >> >>>>>+ if (memcg == root) >> >>>>>+ break; >> >>>>>+ } >> >>>>Here it adds pressure on a cgroup if one of its parents exceeds soft >> >>>>limit, although the cgroup itself is under soft limit. It does change >> >>>>my understanding of soft limit, and might introduce regression of our >> >>>>existing use cases. >> >>>> >> >>>>Here is an example: >> >>>> >> >>>>Machine capacity 32G and we over-commit by 8G. 
>> >>>> >> >>>>root >> >>>> -> A (hard limit 20G, soft limit 15G, usage 16G) >> >>>> -> A1 (soft limit 5G, usage 4G) >> >>>> -> A2 (soft limit 10G, usage 12G) >> >>>> -> B (hard limit 20G, soft limit 10G, usage 16G) >> >>>> >> >>>>under global reclaim, we don't want to add pressure on A1 although its >> >>>>parent A exceeds its soft limit. Assume that if we set the soft limit >> >>>>corresponding to each cgroup's working set size (hot memory), and it >> >>>>will introduce regression to A1 in that case. >> >>>> >> >>>>In my existing implementation, i am checking the cgroup's soft limit >> >>>>standalone w/o looking its ancestors. >> >>>Why do you set the soft limit of A in the first place if you don't >> >>>want it to be enforced? >> >>The soft limit should be enforced under certain condition, not always. >> >>The soft limit of A is set to be enforced when the parent of A and B >> >>is under memory pressure. For example: >> >> >> >>Machine capacity 32G and we over-commit by 8G >> >> >> >>root >> >>-> A (hard limit 20G, soft limit 12G, usage 20G) >> >> -> A1 (soft limit 2G, usage 1G) >> >> -> A2 (soft limit 10G, usage 19G) >> >>-> B (hard limit 20G, soft limit 10G, usage 0G) >> >> >> >>Now, A is under memory pressure since the total usage is hitting its >> >>hard limit. Then we start hierarchical reclaim under A, and each >> >>cgroup under A also takes consideration of soft limit. In this case, >> >>we should only set priority = 0 to A2 since it contributes to A's >> >>charge as well as exceeding its own soft limit. Why punishing A1 (set >> >>priority = 0) also which has usage under its soft limit ? I can >> >>imagine it will introduce regression to existing environment which the >> >>soft limit is set based on the working set size of the cgroup. 
>> >> >> >>To answer the question why we set soft limit to A, it is used to >> >>over-commit the host while sharing the resource with its sibling (B in >> >>this case). If the machine is under memory contention, we would like >> >>to push down memory to A or B depends on their usage and soft limit. >> >D'oh, I think the problem is just that we walk up the hierarchy one >> >too many when checking whether a group exceeds a soft limit. The soft >> >limit is a signal to distribute pressure that comes from above, it's >> >meaningless and should indeed be ignored on the level the pressure >> >originates from. >> > >> >Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to >> >but not including root, wouldn't that do exactly what we both want? >> > >> >Example: >> > >> >1. If global memory is short, we reclaim with root=root_mem_cgroup. >> > A1 and A2 get soft limit reclaimed because of A's soft limit >> > excess, just like the current kernel would do. >> > >> >2. If A hits its hard limit, we reclaim with root=A, so we only mind >> > the soft limits of A1 and A2. A1 is below its soft limit, all >> > good. A2 is above its soft limit, gets treated accordingly. This >> > is new behaviour, the current kernel would just reclaim them >> > equally. 
>> > >> >Code: >> > >> >bool mem_cgroup_over_soft_limit(struct mem_cgroup *root, >> > struct mem_cgroup *memcg) >> >{ >> > if (mem_cgroup_disabled()) >> > return false; >> > >> > if (!root) >> > root = root_mem_cgroup; >> > >> > for (; memcg; memcg = parent_mem_cgroup(memcg)) { >> > if (memcg == root) >> > break; >> > if (res_counter_soft_limit_excess(&memcg->res)) >> > return true; >> > } >> > return false; >> >} >> Hi Johannes, >> >> I don't think it solve the root of the problem, example: >> root >> -> A (hard limit 20G, soft limit 12G, usage 20G) >> -> A1 ( soft limit 2G, usage 1G) >> -> A2 ( soft limit 10G, usage 19G) >> ->B1 (soft limit 5G, usage 4G) >> ->B2 (soft limit 5G, usage 15G) >> >> Now A is hitting its hard limit and start hierarchical reclaim under A. >> If we choose B1 to go through mem_cgroup_over_soft_limit, it will >> return true because its parent A2 has a large usage and will lead to >> priority=0 reclaiming. But in fact it should be B2 to be punished. > > Because A2 is over its soft limit, the whole hierarchy below it should > be preferred over A1, so both B1 and B2 should be soft limit reclaimed > to be consistent with behaviour at the root level. > >> IMHO, it may checking the cgroup's soft limit standalone without >> looking up its ancestors just as Ying said. > > Again, this would be a regression as soft limits have been applied > hierarchically forever. If we are comparing it to the current implementation, agree that the soft reclaim is applied hierarchically. 
In the example above, A2 will be picked for soft reclaim while A is hitting its hard limit, which in turn reclaims from B1 and B2 regardless of their soft limit settings. However, I haven't convinced myself this is how we are going to use the soft limit. The soft limit setting for each cgroup is a hint for applying pressure under memory contention. One way of setting the soft limit is based on the cgroup's working set size. Thus, we allow a cgroup to grow above its soft limit with cold page cache unless memory pressure comes from above. Under hierarchical reclaim, we will examine the soft limit and only apply extra pressure to the cgroups above their soft limit. Here is the same example: root -> A (hard limit 20G, soft limit 12G, usage 20G) -> A1 ( soft limit 2G, usage 1G) -> A2 ( soft limit 10G, usage 19G) ->B1 (soft limit 5G, usage 4G) ->B2 (soft limit 5G, usage 15G) If A is hitting its hard limit, we will reclaim from all the children under A hierarchically but only add extra pressure to the ones above their soft limits (A2, B2). Adding extra pressure to B1 will introduce a known regression based on customer expectation, since its 4G of usage is hot memory. I am not aware of how the existing soft reclaim is being used; I bet it is not used a lot. If we are making changes to the current implementation, we should also take the opportunity to think about the initial design as well. Thoughts? 
--Ying From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Tue, 17 Jan 2012 22:56:26 +0100 Message-ID: <20120117215626.GA2380@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Ying Han Cc: Sha , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, Jan 17, 2012 at 12:25:31PM -0800, Ying Han wrote: > On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner wrote: > > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: > >> IMHO, it may checking the cgroup's soft limit standalone without > >> looking up its ancestors just as Ying said. > > > > Again, this would be a regression as soft limits have been applied > > hierarchically forever. > > If we are comparing it to the current implementation, agree that the > soft reclaim is applied hierarchically. In the example above, A2 will > be picked for soft reclaim while A is hitting its hard limit, which in > turns reclaim from B1 and B2 regardless of their soft limit setting. > However, I haven't convinced myself this is how we are gonna use the > soft limit. Of course I'm comparing it to the current implementation, this is what I'm changing! > The soft limit setting for each cgroup is a hit for applying pressure > under memory contention. One way of setting the soft limit is based on > the cgroup's working set size. 
Thus, we allow cgroup to grow above its > soft limit with cold page cache unless there is a memory pressure > comes from above. Under the hierarchical reclaim, we will exam the > soft limit and only apply extra pressure to the ones above their soft > limit. Here the same example: > > root > -> A (hard limit 20G, soft limit 12G, usage 20G) > -> A1 ( soft limit 2G, usage 1G) > -> A2 ( soft limit 10G, usage 19G) > > ->B1 (soft limit 5G, usage 4G) > ->B2 (soft limit 5G, usage 15G) > > If A is hitting its hard limit, we will reclaim all the children under > A hierarchically but only adding extra pressure to the ones above > their soft limits (A2, B2). Adding extra pressure to B1 will introduce > known regression based on customer expectation since the 4G usage are > hot memory. I can only repeat myself: A has a soft limit set, so the customer expects global pressure to arise sooner or later. If that happens, A will be soft-limit reclaimed hierarchically in the _existing code_. That's how the soft limit currently works and I don't mean to change it _with this patch_. The customer has to expect that B1 can be reclaimed as a consequence of the soft limit in A or A2 today, so I don't know where this expectation of different behaviour should even come from. How can this be a regression?! > I am not aware of how the existing soft reclaim being used, i bet > there are not a lot. If we are making changes on the current > implementation, we should also take the opportunity to think about the > initial design as well. Thoughts? I agree that these semantics should be up for debate. And I think changing it to something like you have in mind is indeed a good idea; to not have soft limits apply hierarchically but instead follow down the whole chain and only soft limit reclaim those that are themselves above their soft limit. But it's an entirely different matter! This patch is supposed to do only two things: 1. 
refactor the soft limit implementation, staying as close as possible/practical to the current semantics and 2. fix the inconsistency that soft limits are ignored when pressure does not originate at the root_mem_cgroup. If that is too much change in semantics I can easily ditch 2., I just didn't see the use of maintaining an inconsistency that resulted purely from the limitations of the current implementation by re-adding more code and because I think that this would not be surprising behaviour. It would be as simple as adding an extra check in reclaim that only minds soft limits upon global pressure: if (global_reclaim(sc) && mem_cgroup_over_soft_limit(root, memcg)) /* resulting action */ and it would have nothing to do how soft limits are actually applied once triggered. I can include this in the next version, but it won't fix the problem you seem to be having with the _existing_ behaviour. I also don't think that my patch will get in the way of what you are planning to do: in fact, you already have code that easily turns mem_cgroup_over_soft_limit() into a non-hierarchical predicate. Even more will change when you invert the soft limits to become actual guarantees and skip reclaiming memcgs that are below their soft limits but I don't think this patch is in the way of doing that, either. I feel that these are all orthogonal changes. So if possible, could we take just one step at a time and leave hypothetical behaviour out of it unless the proposed changes clearly get in the way of where we agreed we want to go? If I misunderstood everything completely and you actually believe this patch will get in the way, could you tell me where and how? Thanks. 
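The extra check mentioned above could look roughly like the following. This is a hedged user-space sketch, not the patch itself: struct scan_control is reduced to the one field the example needs, global_reclaim() is modelled as "no target memcg", over_soft_limit() stands in for mem_cgroup_over_soft_limit(), and effective_priority() is a hypothetical name for "the resulting action", taken here to be the priority=0 scan discussed earlier in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; sizes are in GB. */
struct memcg {
	const struct memcg *parent;
	unsigned long soft_limit;
	unsigned long usage;
};

/* Reduced model of the reclaim context. */
struct scan_control {
	const struct memcg *target_mem_cgroup;	/* NULL means global reclaim */
};

/* Assumption: reclaim is global when no memcg limit triggered it. */
static bool global_reclaim(const struct scan_control *sc)
{
	return sc->target_mem_cgroup == NULL;
}

/* Hierarchical soft limit check, up to but not including root. */
static bool over_soft_limit(const struct memcg *root, const struct memcg *memcg)
{
	for (; memcg && memcg != root; memcg = memcg->parent)
		if (memcg->usage > memcg->soft_limit)
			return true;
	return false;
}

/*
 * Only mind soft limits under global pressure; hard limit reclaim keeps
 * the priority it was given.
 */
static int effective_priority(const struct scan_control *sc,
			      const struct memcg *root,
			      const struct memcg *memcg, int priority)
{
	if (global_reclaim(sc) && over_soft_limit(root, memcg))
		return 0;
	return priority;
}
```

Under this model a group in soft limit excess is scanned at priority 0 only when reclaim is global; when A hits its hard limit, the caller's priority is passed through unchanged, which matches the behaviour of the existing code.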
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ying Han Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Tue, 17 Jan 2012 15:39:04 -0800 Message-ID: References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> <20120117215626.GA2380@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:x-system-of-record:content-type:content-transfer-encoding; bh=Zk88vIlGJ4/xC41wF2RSEW8AJ+yNDLiMXZ3Vs7aSjRc=; b=SP811DT0pBW4Dg6aeniU6Jj1IsuUXz/c5GRO22Y4I7aAx+U1fO+15S+Tk8zYjH1Kg6 lktDK7Nc/tocdRyfR0wdjebx/i6M7KhphtxTpF7DI2uTWnAE9vlJNnXixaVWceLBdFNB ntgJsRd9lcNDkSQOemtPhw+MqPTne+2p4LDNE= In-Reply-To: <20120117215626.GA2380@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: Johannes Weiner Cc: Sha , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 1:56 PM, Johannes Weiner wrote: > On Tue, Jan 17, 2012 at 12:25:31PM -0800, Ying Han wrote: >> On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner wrote: >> > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: >> >> IMHO, it may checking the cgroup's soft limit standalone without >> >> looking up its ancestors just as Ying said. >> > >> > Again, this would be a regression as soft limits have been applied >> > hierarchically forever. >> >> If we are comparing it to the current implementation, agree that the >> soft reclaim is applied hierarchically. 
In the example above, A2 will >> be picked for soft reclaim while A is hitting its hard limit, which in >> turns reclaim from B1 and B2 regardless of their soft limit setting. >> However, I haven't convinced myself this is how we are gonna use the >> soft limit. > > Of course I'm comparing it to the current implementation, this is what > I'm changing! Thank you for clarifying it. Apparently I confused myself by comparing this patch with the one I had last time. >> The soft limit setting for each cgroup is a hit for applying pressure >> under memory contention. One way of setting the soft limit is based on >> the cgroup's working set size. Thus, we allow cgroup to grow above its >> soft limit with cold page cache unless there is a memory pressure >> comes from above. Under the hierarchical reclaim, we will exam the >> soft limit and only apply extra pressure to the ones above their soft >> limit. Here the same example: >> >> root >> -> A (hard limit 20G, soft limit 12G, usage 20G) >> -> A1 ( soft limit 2G, usage 1G) >> -> A2 ( soft limit 10G, usage 19G) >> >> ->B1 (soft limit 5G, usage 4G) >> ->B2 (soft limit 5G, usage 15G) >> >> If A is hitting its hard limit, we will reclaim all the children under >> A hierarchically but only adding extra pressure to the ones above >> their soft limits (A2, B2). Adding extra pressure to B1 will introduce >> known regression based on customer expectation since the 4G usage are >> hot memory. > > I can only repeat myself: A has a soft limit set, so the customer > expects global pressure to arise sooner or later. If that happens, A > will be soft-limit reclaimed hierarchically in the _existing code_. > That's how the soft limit currently works and I don't mean to change > it _with this patch_. 
The customer has to expect that B1 can be > reclaimed as a consequence of the soft limit in A or A2 today, so I > don't know where this expectation of different behaviour should even > come from. How can this be a regression?! Sorry for the confusion :( I wasn't comparing this patch to the current implementation; maybe I should have. If the goal of this patch set is to stay as close as possible to the current implementation, I don't have objections. > >> I am not aware of how the existing soft reclaim being used, i bet >> there are not a lot. If we are making changes on the current >> implementation, we should also take the opportunity to think about the >> initial design as well. Thoughts? > > I agree that these semantics should be up for debate. And I think > changing it to something like you have in mind is indeed a good idea; > to not have soft limits apply hierarchically but instead follow down > the whole chain and only soft limit reclaim those that are themselves > above their soft limit. But it's an entirely different matter! Thanks, agreed; that patch could come after this one. > > This patch is supposed to do only two things: 1. refactor the soft > limit implementation, staying as close as possible/practical to the > current semantics and 2. fix the inconsistency that soft limits are > ignored when pressure does not originate at the root_mem_cgroup. If > that is too much change in semantics I can easily ditch 2., It would be nice to split the two into separate patches. The second, which adds soft reclaim to per-memcg reclaim, is new functionality compared to the current implementation. I just > didn't see the use of maintaining an inconsistency that resulted > purely from the limitations of the current implementation by re-adding > more code and because I think that this would not be surprising > behaviour. Agreed. 
It would be as simple as adding an extra check in reclaim > that only minds soft limits upon global pressure: > > if (global_reclaim(sc) && mem_cgroup_over_soft_limit(root, memcg)) > /* resulting action */ > > and it would have nothing to do how soft limits are actually applied > once triggered. I can include this in the next version, but it won't > fix the problem you seem to be having with the _existing_ behaviour. No, it won't solve all the problems, but it comes close. It looks pretty much like what I have, except for the priority part. We can leave it to a follow-up patch to further improve soft limit reclaim. I have no strong opinion on whether to include the global_reclaim() check or not; however, it might bring your patch closer to the _existing_ implementation (which considers soft reclaim only under global reclaim). > I also don't think that my patch will get in the way of what you are > planning to do: in fact, you already have code that easily turns > mem_cgroup_over_soft_limit() into a non-hierarchical predicate. > > Even more will change when you invert the soft limits to become actual > guarantees and skip reclaiming memcgs that are below their soft limits > but I don't think this patch is in the way of doing that, either. > I feel that these are all orthogonal changes. So if possible, could > we take just one step at a time and leave hypothetical behaviour out > of it unless the proposed changes clearly get in the way of where we > agreed we want to go? > > If I misunderstood everything completely and you actually believe this > patch will get in the way, could you tell me where and how? The change by itself is easy to apply on top of yours. The hierarchical part took me some time to understand; it is now clear that it is meant to stay as close as possible to the _existing_ code. Feel free to post the updated patches whenever they are ready. Thanks --Ying > Thanks. 
From mboxrd@z Thu Jan 1 00:00:00 1970 From: KAMEZAWA Hiroyuki Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Wed, 18 Jan 2012 14:26:38 +0900 Message-ID: <20120118142638.11667d2c.kamezawa.hiroyu@jp.fujitsu.com> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com> <20120113121645.GA1653@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20120113121645.GA1653-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , Balbir Singh , Ying Han , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri, 13 Jan 2012 13:16:56 +0100 Johannes Weiner wrote: > On Thu, Jan 12, 2012 at 10:54:27AM +0900, KAMEZAWA Hiroyuki wrote: > > Thank you for your work and the result seems atractive and code is much > > simpler. My small concerns are.. > > > > 1. This approach may increase latency of direct-reclaim because of priority=0. > > I think strictly speaking yes, but note that with kswapd being less > likely to get stuck in hammering on one group, the need for allocators > to enter direct reclaim itself is reduced. > > However, if this really becomes a problem in real world loads, the fix > is pretty easy: just ignore the soft limit for direct reclaim. We can > still consider it from hard limit reclaim and kswapd. > > > 2. 
In a case numa-spread/interleave application run in its own container, > > pages on a node may paged-out again and again becasue of priority=0 > > if some other application runs in the node. > > It seems difficult to use soft-limit with numa-aware applications. > > Do you have suggestions ? > > This is a question about soft limits in general rather than about this > particular patch, right? > Partially, yes. My concern is related to "1". Assume an application is bound to some cpu/node and tries to allocate memory. If its memcg's usage is over the soft limit, this application will behave badly because newly allocated memory will soon become a reclaim target, again and again. > And if I understand correctly, the problem you are referring to is > this: an application and parts of a soft-limited container share a > node, the soft limit setting means that the container's pages on that > node are reclaimed harder. At that point, the container's share on > that node becomes tiny, but since the soft limit is oblivious to > nodes, the expansion of the other application pushes the soft-limited > container off that node completely as long as the container stays > above its soft limit with the usage on other nodes. > > What would you think about having node-local soft limits that take the > node size into account? > > local_soft_limit = soft_limit * node_size / memcg_size > > The soft limit can be exceeded globally, but the container is no > longer pushed off a node on which it's only occupying a small share of > memory. > Yes, I think this kind of care is required. What is the 'node_size' here? The size of the pgdat? The size of the per-node usage in the memcg? > Putting it into proportion of the memcg size, not overall memory size > has the following advantages: > > 1. if the container is sitting on only one of several available > nodes without exceeding the limit globally, the memcg will not be > reclaimed harder just because it has a relatively large share of the > node. > > 2. 
if the soft limit excess is ridiculously high, the local soft > limits will be pushed down, so the tolerance for smaller shares on > nodes goes down in proportion to the global soft limit excess. > > Example: > > 4G soft limit * 2G node / 4G container = 2G node-local limit > > The container is globally within its soft limit, so the local limit is > at least the size of the node. It's never reclaimed harder compared > to other applications on the node. > > 4G soft limit * 2G node / 5G container = ~1.6G node-local limit > > Here, it will experience more pressure initially, but it will level > off when the shrinking usage and the thereby increasing node-local > soft limit meet. From that point on, the container and the competing > application will be treated equally during reclaim. > > Finally, if the container is 16G in size, i.e. 300% in excess, the > per-node tolerance is at 512M node-local soft limit, which IMO strikes > a good balance between zero tolerance and still applying some stress > to the hugely oversized container when other applications (with > virtually unlimited soft limits) want to run on the same node. > > What do you think? I like the idea. Another idea is changing 'priority' based on per-node stats if not too complicated... 
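For reference, the three node-local examples quoted above can be checked with a small sketch. It is only an illustration of the proposed formula: local_soft_limit() is a hypothetical helper name, all sizes are passed in megabytes, the handling of an empty memcg is an assumption made here to avoid dividing by zero, and the meaning of node_size is left open in the thread.

```c
#include <limits.h>

/*
 * local_soft_limit = soft_limit * node_size / memcg_size
 *
 * All arguments are in MB. An empty memcg is treated as unconstrained,
 * which is an assumption of this sketch and not part of the proposal.
 */
static unsigned long long local_soft_limit(unsigned long long soft_limit,
					   unsigned long long node_size,
					   unsigned long long memcg_size)
{
	if (memcg_size == 0)
		return ULLONG_MAX;
	return soft_limit * node_size / memcg_size;
}
```

With a 4096M soft limit and a 2048M node, a 4096M container gets a 2048M node-local limit, a 5120M container gets 1638M (the "~1.6G" case, rounded down by integer division), and a 16384M container gets 512M.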
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Wed, 18 Jan 2012 15:17:25 +0800 Message-ID: References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=f46d041b4aa45187ca04b6c83d25 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=FglZDbKFBcuFMK4cLRb70j7o+bA7EiertUWk39zOaK0=; b=BT/A75Ed2tJxfp8oPowchJrQzm1EZZdcMy5MCkgTqZlN5EG7MnovimAR1tCYCIooie dFCKtLtW1tqu6R4SedGloSuJTMphcVVg0FudEpyULAFlUzi2790jeZMvDhetshM4qqwg DmXBDzmonY9lRFRSvpxQjGsJ9IxXJZvgrYxoY= In-Reply-To: <20120117145348.GA3144@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org --f46d041b4aa45187ca04b6c83d25 Content-Type: text/plain; charset=ISO-8859-1 ** On 01/17/2012 10:53 PM, Johannes Weiner wrote: On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: On 01/14/2012 06:44 AM, Johannes Weiner wrote: On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote: On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote: On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) return margin>> PAGE_SHIFT; } +/** + * mem_cgroup_over_softlimit + * @root: hierarchy root + * @memcg: child of @root to test + * + * Returns %true if @memcg exceeds its own soft limit or contributes + * to the soft limit excess of one of its parents up 
to and including + * @root. + */ +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, + struct mem_cgroup *memcg) +{ + if (mem_cgroup_disabled()) + return false; + + if (!root) + root = root_mem_cgroup; + + for (; memcg; memcg = parent_mem_cgroup(memcg)) { + /* root_mem_cgroup does not have a soft limit */ + if (memcg == root_mem_cgroup) + break; + if (res_counter_soft_limit_excess(&memcg->res)) + return true; + if (memcg == root) + break; + } Here it adds pressure on a cgroup if one of its parents exceeds soft limit, although the cgroup itself is under soft limit. It does change my understanding of soft limit, and might introduce regression of our existing use cases. Here is an example: Machine capacity 32G and we over-commit by 8G. root -> A (hard limit 20G, soft limit 15G, usage 16G) -> A1 (soft limit 5G, usage 4G) -> A2 (soft limit 10G, usage 12G) -> B (hard limit 20G, soft limit 10G, usage 16G) under global reclaim, we don't want to add pressure on A1 although its parent A exceeds its soft limit. Assume that if we set the soft limit corresponding to each cgroup's working set size (hot memory), and it will introduce regression to A1 in that case. In my existing implementation, i am checking the cgroup's soft limit standalone w/o looking its ancestors. Why do you set the soft limit of A in the first place if you don't want it to be enforced? The soft limit should be enforced under certain condition, not always. The soft limit of A is set to be enforced when the parent of A and B is under memory pressure. For example: Machine capacity 32G and we over-commit by 8G root -> A (hard limit 20G, soft limit 12G, usage 20G) -> A1 (soft limit 2G, usage 1G) -> A2 (soft limit 10G, usage 19G) -> B (hard limit 20G, soft limit 10G, usage 0G) Now, A is under memory pressure since the total usage is hitting its hard limit. Then we start hierarchical reclaim under A, and each cgroup under A also takes consideration of soft limit. 
In this case, we should only set priority = 0 to A2 since it contributes to A's charge as well as exceeding its own soft limit. Why punishing A1 (set priority = 0) also which has usage under its soft limit ? I can imagine it will introduce regression to existing environment which the soft limit is set based on the working set size of the cgroup. To answer the question why we set soft limit to A, it is used to over-commit the host while sharing the resource with its sibling (B in this case). If the machine is under memory contention, we would like to push down memory to A or B depends on their usage and soft limit. D'oh, I think the problem is just that we walk up the hierarchy one too many when checking whether a group exceeds a soft limit. The soft limit is a signal to distribute pressure that comes from above, it's meaningless and should indeed be ignored on the level the pressure originates from. Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to but not including root, wouldn't that do exactly what we both want? Example: 1. If global memory is short, we reclaim with root=root_mem_cgroup. A1 and A2 get soft limit reclaimed because of A's soft limit excess, just like the current kernel would do. 2. If A hits its hard limit, we reclaim with root=A, so we only mind the soft limits of A1 and A2. A1 is below its soft limit, all good. A2 is above its soft limit, gets treated accordingly. This is new behaviour, the current kernel would just reclaim them equally. 
Code: bool mem_cgroup_over_soft_limit(struct mem_cgroup *root, struct mem_cgroup *memcg) { if (mem_cgroup_disabled()) return false; if (!root) root = root_mem_cgroup; for (; memcg; memcg = parent_mem_cgroup(memcg)) { if (memcg == root) break; if (res_counter_soft_limit_excess(&memcg->res)) return true; } return false; } Hi Johannes, I don't think it solve the root of the problem, example: root -> A (hard limit 20G, soft limit 12G, usage 20G) -> A1 ( soft limit 2G, usage 1G) -> A2 ( soft limit 10G, usage 19G) ->B1 (soft limit 5G, usage 4G) ->B2 (soft limit 5G, usage 15G) Now A is hitting its hard limit and start hierarchical reclaim under A. If we choose B1 to go through mem_cgroup_over_soft_limit, it will return true because its parent A2 has a large usage and will lead to priority=0 reclaiming. But in fact it should be B2 to be punished. Because A2 is over its soft limit, the whole hierarchy below it should be preferred over A1, so both B1 and B2 should be soft limit reclaimed to be consistent with behaviour at the root level. Well, that is actually just the behavior I'm expecting. But as far as I understand, I can't find soft-limit-based hierarchical reclaim under the target cgroup (A2) either in the current implementation or after the patch. Both the current mem_cgroup_soft_reclaim and shrink_zone select the victim sub-cgroup via mem_cgroup_iter, which doesn't take the soft limit into consideration. Did I miss anything? Thanks, Sha --f46d041b4aa45187ca04b6c83d25 Content-Type: text/html; charset=ISO-8859-1
On 01/17/2012 10:53 PM, Johannes Weiner wrote:
On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote:
On 01/14/2012 06:44 AM, Johannes Weiner wrote:
On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
       return margin >> PAGE_SHIFT;
 }

+/**
+ * mem_cgroup_over_softlimit
+ * @root: hierarchy root
+ * @memcg: child of @root to test
+ *
+ * Returns %true if @memcg exceeds its own soft limit or contributes
+ * to the soft limit excess of one of its parents up to and including
+ * @root.
+ */
+bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
+                              struct mem_cgroup *memcg)
+{
+       if (mem_cgroup_disabled())
+               return false;
+
+       if (!root)
+               root = root_mem_cgroup;
+
+       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+               /* root_mem_cgroup does not have a soft limit */
+               if (memcg == root_mem_cgroup)
+                       break;
+               if (res_counter_soft_limit_excess(&memcg->res))
+                       return true;
+               if (memcg == root)
+                       break;
+       }
Here it adds pressure on a cgroup if one of its parents exceeds soft
limit, although the cgroup itself is under soft limit. It does change
my understanding of soft limit, and might introduce regression of our
existing use cases.

Here is an example:

Machine capacity 32G and we over-commit by 8G.

root
  ->  A (hard limit 20G, soft limit 15G, usage 16G)
       ->  A1 (soft limit 5G, usage 4G)
       ->  A2 (soft limit 10G, usage 12G)
  ->  B (hard limit 20G, soft limit 10G, usage 16G)

Under global reclaim, we don't want to add pressure on A1 although its
parent A exceeds its soft limit. Assuming we set each cgroup's soft
limit according to its working set size (hot memory), this will
introduce a regression for A1.

In my existing implementation, I am checking the cgroup's soft limit
standalone, without looking at its ancestors.
Why do you set the soft limit of A in the first place if you don't
want it to be enforced?
The soft limit should be enforced under certain conditions, not always.
The soft limit of A is set to be enforced when the parent of A and B
is under memory pressure. For example:

Machine capacity 32G and we over-commit by 8G

root
->  A (hard limit 20G, soft limit 12G, usage 20G)
       ->  A1 (soft limit 2G, usage 1G)
       ->  A2 (soft limit 10G, usage 19G)
->  B (hard limit 20G, soft limit 10G, usage 0G)

Now, A is under memory pressure since the total usage is hitting its
hard limit. Then we start hierarchical reclaim under A, and each
cgroup under A also takes its soft limit into consideration. In this
case, we should only set priority = 0 for A2 since it contributes to A's
charge as well as exceeding its own soft limit. Why also punish A1 (set
priority = 0), whose usage is under its soft limit? I can imagine it
will introduce a regression in existing environments where the soft
limit is set based on the working set size of the cgroup.

To answer the question of why we set a soft limit on A: it is used to
over-commit the host while sharing the resource with its sibling (B in
this case). If the machine is under memory contention, we would like
to push down memory to A or B depending on their usage and soft limit.
D'oh, I think the problem is just that we walk up the hierarchy one
level too many when checking whether a group exceeds a soft limit.  The
soft limit is a signal to distribute pressure that comes from above;
it's meaningless and should indeed be ignored on the level the pressure
originates from.

Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
but not including root, wouldn't that do exactly what we both want?

Example:

1. If global memory is short, we reclaim with root=root_mem_cgroup.
   A1 and A2 get soft limit reclaimed because of A's soft limit
   excess, just like the current kernel would do.

2. If A hits its hard limit, we reclaim with root=A, so we only mind
   the soft limits of A1 and A2.  A1 is below its soft limit, all
   good.  A2 is above its soft limit, gets treated accordingly.  This
   is new behaviour, the current kernel would just reclaim them
   equally.

Code:

bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
			        struct mem_cgroup *memcg)
{
	if (mem_cgroup_disabled())
		return false;

	if (!root)
		root = root_mem_cgroup;

	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		if (memcg == root)
			break;
		if (res_counter_soft_limit_excess(&memcg->res))
			return true;
	}
	return false;
}
Hi Johannes,

I don't think it solves the root of the problem, for example:
root
-> A (hard limit 20G, soft limit 12G, usage 20G)
    -> A1 ( soft limit 2G,   usage 1G)
    -> A2 ( soft limit 10G, usage 19G)
           ->B1 (soft limit 5G, usage 4G)
           ->B2 (soft limit 5G, usage 15G)

Now A is hitting its hard limit and starts hierarchical reclaim under A.
If we choose B1 to go through mem_cgroup_over_soft_limit, it will
return true because its parent A2 has a large usage, which will lead to
priority=0 reclaiming. But in fact it is B2 that should be punished.
Because A2 is over its soft limit, the whole hierarchy below it should
be preferred over A1, so both B1 and B2 should be soft limit reclaimed
to be consistent with behaviour at the root level.
Well, that is just the behavior that I'm expecting, actually. But as far
as I understand, I can't find the soft-limit-based hierarchical
reclaiming under the target cgroup (A2) in the current implementation
or after the patch. Both the current mem_cgroup_soft_reclaim and
shrink_zone select the victim sub-cgroup via mem_cgroup_iter, which
doesn't take the soft limit into consideration. Did I miss anything?

Thanks,
Sha
--f46d041b4aa45187ca04b6c83d25--

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Wed, 18 Jan 2012 10:25:09 +0100 Message-ID: <20120118092509.GI24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Sha Cc: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Wed, Jan 18, 2012 at 03:17:25PM +0800, Sha wrote:
> > > I don't think it solve the root of the problem, example:
> > > root
> > > -> A (hard limit 20G, soft limit 12G, usage 20G)
> > >     -> A1 ( soft limit 2G,   usage 1G)
> > >     -> A2 ( soft limit 10G, usage 19G)
> > >            ->B1 (soft limit 5G, usage 4G)
> > >            ->B2 (soft limit 5G, usage 15G)
> > >
> > > Now A is hitting its hard limit and start hierarchical reclaim under A.
> > > If we choose B1 to go through mem_cgroup_over_soft_limit, it will
> > > return true because its parent A2 has a large usage and will lead to
> > > priority=0 reclaiming. But in fact it should be B2 to be punished.
>
> > Because A2 is over its soft limit, the whole hierarchy below it should
> > be preferred over A1, so both B1 and B2 should be soft limit reclaimed
> > to be consistent with behaviour at the root level.
>
> Well it is just the behavior that I'm expecting actually. But with my
> humble comprehension, I can't catch the soft-limit-based hierarchical
> reclaiming under the target cgroup (A2) in the current implementation
> or after the patch. Both the current mem_cgroup_soft_reclaim or
> shrink_zone select victim sub-cgroup by mem_cgroup_iter, but it
> doesn't take soft limit into consideration, do I left anything ?

No, currently soft limits are ignored if pressure originates from
below root_mem_cgroup.

But iff soft limits are applied right now, they are applied
hierarchically, see mem_cgroup_soft_limit_reclaim().

In my opinion, the fact that soft limits are ignored when pressure is
triggered sub-root_mem_cgroup is an artifact of the per-zone tree, so
I allowed soft limits to be taken into account below root_mem_cgroup.

But IMO, this is something different from how soft limit reclaim is
applied once triggered: currently, soft limit reclaim applies to a
whole hierarchy, including all children.  And this I left unchanged.

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Wed, 18 Jan 2012 10:45:23 +0100 Message-ID: <20120118094523.GJ24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> <20120113163423.GG17060@tiehlicka.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: Ying Han Cc: Michal Hocko , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Fri, Jan 13, 2012 at 01:45:30PM -0800, Ying Han wrote:
> On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko wrote:
> > On Fri 13-01-12 16:50:01, Johannes Weiner wrote:
> >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
> >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
> > [...]
> >> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> >> > > +                              struct mem_cgroup *memcg)
> >> > > +{
> >> > > +       if (mem_cgroup_disabled())
> >> > > +               return false;
> >> > > +
> >> > > +       if (!root)
> >> > > +               root = root_mem_cgroup;
> >> > > +
> >> > > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >> > > +               /* root_mem_cgroup does not have a soft limit */
> >> > > +               if (memcg == root_mem_cgroup)
> >> > > +                       break;
> >> > > +               if (res_counter_soft_limit_excess(&memcg->res))
> >> > > +                       return true;
> >> > > +               if (memcg == root)
> >> > > +                       break;
> >> > > +       }
> >> > > +       return false;
> >> > > +}
> >> >
> >> > Well, this might be little bit tricky. We do not check whether memcg and
> >> > root are in a hierarchy (in terms of use_hierarchy) relation.
> >> >
> >> > If we are under global reclaim then we iterate over all memcgs and so
> >> > there is no guarantee that there is a hierarchical relation between the
> >> > given memcg and its parent. While, on the other hand, if we are doing
> >> > memcg reclaim then we have this guarantee.
> >> >
> >> > Why should we punish a group (subtree) which is perfectly under its soft
> >> > limit just because some other subtree contributes to the common parent's
> >> > usage and makes it over its limit?
> >> > Should we check memcg->use_hierarchy here?
> >>
> >> We do, actually.  parent_mem_cgroup() checks the res_counter parent,
> >> which is only set when ->use_hierarchy is also set.
> >
> > Of course I am blind.. We do not setup res_counter parent for
> > !use_hierarchy case. Sorry for noise...
> > Now it makes much better sense.
I was wondering how !use_hierarchy could
> > ever work, this should be a signal that I am overlooking something
> > terribly.
> >
> > [...]
> >> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
> >> > >                         .mem_cgroup = memcg,
> >> > >                         .zone = zone,
> >> > >                 };
> >> > > +               int epriority = priority;
> >> > > +               /*
> >> > > +                * Put more pressure on hierarchies that exceed their
> >> > > +                * soft limit, to push them back harder than their
> >> > > +                * well-behaving siblings.
> >> > > +                */
> >> > > +               if (mem_cgroup_over_softlimit(root, memcg))
> >> > > +                       epriority = 0;
> >> >
> >> > This sounds too aggressive to me. Shouldn't we just double the pressure
> >> > or something like that?
> >>
> >> That's the historical value.  When I tried priority - 1, it was not
> >> aggressive enough.
> >
> > Probably because we want to reclaim too much. Maybe we should do
> > reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain
> > priority level as Ying suggested in her patchset.
>
> I plan to post that change on top of this, and this patch set does the
> basic stuff to allow us doing further improvement.
>
> I still like the design to skip over_soft_limit cgroups until certain
> priority. One way to set up the soft limit for each cgroup is to base
> on its actual working set size, and we prefer to punish A first with
> lots of page cache ( cold file pages above soft limit) than reclaiming
> anon pages from B ( below soft limit ). Unless we can not get enough
> pages reclaimed from A, we will start reclaiming from B.
>
> This might not be the ideal solution, but should be a good start. Thoughts?

I don't like this design at all because unless you add weird code to
detect if soft limits apply to any memcgs on the reclaimed hierarchy
you may iterate over the same bunch of memcgs doing nothing for
several times.  For example in the default case of no softlimits set
anywhere and you repeatedly walk ALL memcgs in the system doing jack
until you reach your threshold priority level.  Elegant is something
else in my book.

Once we invert soft limits to mean guarantees and make the default
soft limit not infinity but zero, then we can ignore memcgs below
their soft limit for a few priority levels just fine because being
below the soft limit is the exception.  But I don't really want to
make this quite invasive behavioural change a requirement for a
refactoring patch if possible.

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Wed, 18 Jan 2012 19:25:27 +0800 Message-ID: <4F16AC27.1080906@gmail.com> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> <20120118092509.GI24386@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=xGmnopKKKj8PE4IIpaPJ270u1IDnICKPtdjJSYZBdNE=; b=ppwt/xKU5GG6B0Rg4NMABNP1bpRD82DTa3v+eNXANf6QS4nQKnWX6Fh7PBpHNQBfMn 4JgjIRSL+VL7qhERmjyye4xq0gvyxo/1O3/pCbDTynoUod12VuMvfFVtrS+9nAXNPD0I
 oglfI93qmQuWhp3dB7L7O0qHpEQdVCPcvk4QA= In-Reply-To: <20120118092509.GI24386@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Johannes Weiner Cc: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 01/18/2012 05:25 PM, Johannes Weiner wrote:
> On Wed, Jan 18, 2012 at 03:17:25PM +0800, Sha wrote:
>>>> I don't think it solve the root of the problem, example:
>>>> root
>>>> -> A (hard limit 20G, soft limit 12G, usage 20G)
>>>>     -> A1 ( soft limit 2G,   usage 1G)
>>>>     -> A2 ( soft limit 10G, usage 19G)
>>>>            ->B1 (soft limit 5G, usage 4G)
>>>>            ->B2 (soft limit 5G, usage 15G)
>>>>
>>>> Now A is hitting its hard limit and start hierarchical reclaim under A.
>>>> If we choose B1 to go through mem_cgroup_over_soft_limit, it will
>>>> return true because its parent A2 has a large usage and will lead to
>>>> priority=0 reclaiming. But in fact it should be B2 to be punished.
>>> Because A2 is over its soft limit, the whole hierarchy below it should
>>> be preferred over A1, so both B1 and B2 should be soft limit reclaimed
>>> to be consistent with behaviour at the root level.
>> Well it is just the behavior that I'm expecting actually. But with my
>> humble comprehension, I can't catch the soft-limit-based hierarchical
>> reclaiming under the target cgroup (A2) in the current implementation
>> or after the patch. Both the current mem_cgroup_soft_reclaim or
>> shrink_zone select victim sub-cgroup by mem_cgroup_iter, but it
>> doesn't take soft limit into consideration, do I left anything ?
> No, currently soft limits are ignored if pressure originates from
> below root_mem_cgroup.
>
> But iff soft limits are applied right now, they are applied
> hierarchically, see mem_cgroup_soft_limit_reclaim().

Er...
I'm even more confused: mem_cgroup_soft_limit_reclaim indeed
choses the biggest soft-limit excessor first, but in the succeeding reclaim
mem_cgroup_hierarchical_reclaim just selects a child cgroup by css_id
which has nothing to do with soft limit (see mem_cgroup_select_victim).
IMHO, it's not a genuine hierarchical reclaim.
I check this from the latest memcg-devel git tree (branch since-3.1)...

> In my opinion, the fact that soft limits are ignored when pressure is
> triggered sub-root_mem_cgroup is an artifact of the per-zone tree, so
> I allowed soft limits to be taken into account below root_mem_cgroup.
>
> But IMO, this is something different from how soft limit reclaim is
> applied once triggered: currently, soft limit reclaim applies to a
> whole hierarchy, including all children. And this I left unchanged.

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Wed, 18 Jan 2012 16:27:08 +0100 Message-ID: <20120118152708.GG31112@tiehlicka.suse.cz> References: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> <20120118092509.GI24386@cmpxchg.org> <4F16AC27.1080906@gmail.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <4F16AC27.1080906@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Sha Cc: Johannes Weiner , Ying Han , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Wed 18-01-12 19:25:27, Sha
wrote:
[...]
> Er... I'm even more confused: mem_cgroup_soft_limit_reclaim indeed
> choses the biggest soft-limit excessor first, but in the succeeding reclaim
> mem_cgroup_hierarchical_reclaim just selects a child cgroup by css_id

mem_cgroup_soft_limit_reclaim picks up the hierarchy root (most
excessing one) and mem_cgroup_hierarchical_reclaim reclaims from that
subtree). It doesn't care who exceeds the soft limit under that
hierarchy it just tries to push the root under its limit as much as it
can. This is what Johannes tried to explain in the other email in the
thred.

> which has nothing to do with soft limit (see mem_cgroup_select_victim).
> IMHO, it's not a genuine hierarchical reclaim.

It is hierarchical because it iterates over hierarchy it is not and
never was recursively soft-hierarchical...

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ying Han Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Wed, 18 Jan 2012 12:38:54 -0800 Message-ID: References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> <20120113163423.GG17060@tiehlicka.suse.cz> <20120118094523.GJ24386@cmpxchg.org> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:x-system-of-record:content-type:content-transfer-encoding; bh=EYm1VY1lN+gFVjbN0Qugeq8X9bCdj1KtZcPWwARlnZY=; b=qjZqgJLn0z1LX+0xkWR1YcAqSHTJdWoZjw88ni1A8wzA+rzuBBfVQR/D4gmRssyL15 RDNNTzVcQqdJagitxQIo/IJ4acNaeztbUz3NvDE9fZZSUuy96upwvq2Z0QLLCH9qPZTq Ur9jDjIDdZibrPm91lGs9Yi0b89KxywMwNnXM= In-Reply-To: <20120118094523.GJ24386@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="iso-8859-1" To: Johannes Weiner Cc: Michal Hocko , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Wed, Jan 18, 2012 at 1:45 AM, Johannes Weiner wrote:
> On Fri, Jan 13, 2012 at 01:45:30PM -0800, Ying Han wrote:
>> On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko wrote:
>> > On Fri 13-01-12 16:50:01, Johannes Weiner wrote:
>> >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
>> >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
>> > [...]
>> >> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> >> > > +                              struct mem_cgroup *memcg)
>> >> > > +{
>> >> > > +       if (mem_cgroup_disabled())
>> >> > > +               return false;
>> >> > > +
>> >> > > +       if (!root)
>> >> > > +               root = root_mem_cgroup;
>> >> > > +
>> >> > > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> >> > > +               /* root_mem_cgroup does not have a soft limit */
>> >> > > +               if (memcg == root_mem_cgroup)
>> >> > > +                       break;
>> >> > > +               if (res_counter_soft_limit_excess(&memcg->res))
>> >> > > +                       return true;
>> >> > > +               if (memcg == root)
>> >> > > +                       break;
>> >> > > +       }
>> >> > > +       return false;
>> >> > > +}
>> >> >
>> >> > Well, this might be little bit tricky. We do not check whether memcg and
>> >> > root are in a hierarchy (in terms of use_hierarchy) relation.
>> >> >
>> >> > If we are under global reclaim then we iterate over all memcgs and so
>> >> > there is no guarantee that there is a hierarchical relation between the
>> >> > given memcg and its parent. While, on the other hand, if we are doing
>> >> > memcg reclaim then we have this guarantee.
>> >> >
>> >> > Why should we punish a group (subtree) which is perfectly under its soft
>> >> > limit just because some other subtree contributes to the common parent's
>> >> > usage and makes it over its limit?
>> >> > Should we check memcg->use_hierarchy here?
>> >>
>> >> We do, actually.  parent_mem_cgroup() checks the res_counter parent,
>> >> which is only set when ->use_hierarchy is also set.
>> >
>> > Of course I am blind.. We do not setup res_counter parent for
>> > !use_hierarchy case. Sorry for noise...
>> > Now it makes much better sense.
I was wondering how !use_hierarchy could
>> > ever work, this should be a signal that I am overlooking something
>> > terribly.
>> >
>> > [...]
>> >> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>> >> > >                         .mem_cgroup = memcg,
>> >> > >                         .zone = zone,
>> >> > >                 };
>> >> > > +               int epriority = priority;
>> >> > > +               /*
>> >> > > +                * Put more pressure on hierarchies that exceed their
>> >> > > +                * soft limit, to push them back harder than their
>> >> > > +                * well-behaving siblings.
>> >> > > +                */
>> >> > > +               if (mem_cgroup_over_softlimit(root, memcg))
>> >> > > +                       epriority = 0;
>> >> >
>> >> > This sounds too aggressive to me. Shouldn't we just double the pressure
>> >> > or something like that?
>> >>
>> >> That's the historical value.  When I tried priority - 1, it was not
>> >> aggressive enough.
>> >
>> > Probably because we want to reclaim too much. Maybe we should do
>> > reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain
>> > priority level as Ying suggested in her patchset.
>>
>> I plan to post that change on top of this, and this patch set does the
>> basic stuff to allow us doing further improvement.
>>
>> I still like the design to skip over_soft_limit cgroups until certain
>> priority. One way to set up the soft limit for each cgroup is to base
>> on its actual working set size, and we prefer to punish A first with
>> lots of page cache ( cold file pages above soft limit) than reclaiming
>> anon pages from B ( below soft limit ). Unless we can not get enough
>> pages reclaimed from A, we will start reclaiming from B.
>>
>> This might not be the ideal solution, but should be a good start. Thoughts?
>
> I don't like this design at all because unless you add weird code to
> detect if soft limits apply to any memcgs on the reclaimed hierarchy
> you may iterate over the same bunch of memcgs doing nothing for
> several times.  For example in the default case of no softlimits set
> anywhere and you repeatedly walk ALL memcgs in the system doing jack
> until you reach your threshold priority level.  Elegant is something
> else in my book.

Agree that change isn't ready until the default soft limit is changed to "0".

> Once we invert soft limits to mean guarantees and make the default
> soft limit not infinity but zero, then we can ignore memcgs below
> their soft limit for a few priority levels just fine because being
> below the soft limit is the exception.  But I don't really want to
> make this quite invasive behavioural change a requirement for a
> refactoring patch if possible.

Sounds reasonable to me.

--Ying

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Date: Thu, 19 Jan 2012 14:38:16 +0800 Message-ID: <4F17BA58.2090403@gmail.com> References: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> <20120118092509.GI24386@cmpxchg.org> <4F16AC27.1080906@gmail.com> <20120118152708.GG31112@tiehlicka.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=YTF21S/pNc9lHWw37BLKbJ9zwwC7Bsj123z2CoUcq6U=; b=lu9lpDlN60x+QmktCAXMTQWQdSeHizoeiKLt+Ngxs/v1Y+Byo96uIJd/X9hyWMJr1u 0gQKsTsNvXBUfBgXSb6DMG1gTkQD3ETe0kATpDETCWST2pF/Ld1hDYHB1emACsHzGXmW 86XRy6J+haX5Jfja7aBUjJGAyQ/TdLB4HcUDE= In-Reply-To: <20120118152708.GG31112@tiehlicka.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Michal Hocko Cc: Johannes Weiner , Ying Han , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 01/18/2012 11:27 PM, Michal Hocko wrote:
> On Wed 18-01-12 19:25:27, Sha wrote:
> [...]
>> Er... I'm even more confused: mem_cgroup_soft_limit_reclaim indeed
>> choses the biggest soft-limit excessor first, but in the succeeding reclaim
>> mem_cgroup_hierarchical_reclaim just selects a child cgroup by css_id
> mem_cgroup_soft_limit_reclaim picks up the hierarchy root (most
> excessing one) and mem_cgroup_hierarchical_reclaim reclaims from that
> subtree).
It doesn't care who exceeds the soft limit under that
> hierarchy it just tries to push the root under its limit as much as it
> can. This is what Johannes tried to explain in the other email in the
> thred.

yeah, I finally twig what he meant... I'm not quite familiar with this part.
Thanks a lot for the explanation. :-)

Sha

>> which has nothing to do with soft limit (see mem_cgroup_select_victim).
>> IMHO, it's not a genuine hierarchical reclaim.
> It is hierarchical because it iterates over hierarchy it is not and
> never was recursively soft-hierarchical...
>

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx145.postini.com [74.125.245.145]) by kanga.kvack.org (Postfix) with SMTP id 57E086B005C for ; Tue, 10 Jan 2012 18:54:06 -0500 (EST) Received: by qcsd17 with SMTP id d17so85012qcs.14 for ; Tue, 10 Jan 2012 15:54:05 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> Date: Tue, 10 Jan 2012 15:54:05 -0800 Message-ID: Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics From: Ying Han Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

Thank you for the patch and the stats looks reasonable to me, few
questions as below:

On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> With the single per-zone LRU gone and global reclaim scanning
> individual memcgs, it's straight-forward to collect meaningful and
> accurate per-memcg reclaim statistics.
>
> This adds the following items to memory.stat:

Some of the previous discussions including patches have similar stats
in memory.vmscan_stat API, which collects all the per-memcg vmscan
stats. I would like to understand more why we add into memory.stat
instead, and do we have plan to keep extending memory.stat for those
vmstat like stats?

>
> pgreclaim

Not sure if we want to keep this more consistent to /proc/vmstat, then
it will be "pgsteal"?

> pgscan
>
>  Number of pages reclaimed/scanned from that memcg due to its own
>  hard limit (or physical limit in case of the root memcg) by the
>  allocating task.
>
> kswapd_pgreclaim
> kswapd_pgscan

we have "pgscan_kswapd_*" in vmstat, so maybe ?
"pgsteal_kswapd"
"pgscan_kswapd"

>
>  Reclaim activity from kswapd due to the memcg's own limit.  Only
>  applicable to the root memcg for now since kswapd is only triggered
>  by physical limits, but kswapd-style reclaim based on memcg hard
>  limits is being developped.
>
> hierarchy_pgreclaim
> hierarchy_pgscan
> hierarchy_kswapd_pgreclaim
> hierarchy_kswapd_pgscan

"pgsteal_hierarchy"
"pgsteal_kswapd_hierarchy"
..

No strong opinion on the naming, but try to make it more consistent to
existing API.

>
>  Reclaim activity due to limitations in one of the memcg's parents.
> > Signed-off-by: Johannes Weiner > --- > Documentation/cgroups/memory.txt | 4 ++ > include/linux/memcontrol.h | 10 +++++ > mm/memcontrol.c | 84 +++++++++++++++++++++++++++++++++++++- > mm/vmscan.c | 7 +++ > 4 files changed, 103 insertions(+), 2 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > index cc0ebc5..eb9e982 100644 > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -389,6 +389,10 @@ mapped_file - # of bytes of mapped file (includes tmpfs/shmem) > pgpgin - # of pages paged in (equivalent to # of charging events). > pgpgout - # of pages paged out (equivalent to # of uncharging events). > swap - # of bytes of swap usage > +pgreclaim - # of pages reclaimed due to this memcg's limit > +pgscan - # of pages scanned due to this memcg's limit > +kswapd_* - # reclaim activity by background daemon due to this memcg's limit > +hierarchy_* - # reclaim activity due to pressure from parental memcg > inactive_anon - # of bytes of anonymous memory and swap cache memory on > LRU list. 
> active_anon - # of bytes of anonymous and swap cache memory on active > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index bd3b102..6c1d69e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -121,6 +121,8 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > struct zone *zone); > struct zone_reclaim_stat* > mem_cgroup_get_reclaim_stat_from_page(struct page *page); > +void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *, > + unsigned long, unsigned long, bool); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > extern void mem_cgroup_replace_page_cache(struct page *oldpage, > @@ -347,6 +349,14 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page) > return NULL; > } > > +static inline void mem_cgroup_account_reclaim(struct mem_cgroup *root, > + struct mem_cgroup *memcg, > + unsigned long nr_reclaimed, > + unsigned long nr_scanned, > + bool kswapd) > +{ > +} > + > static inline void > mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) > { > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 8e2a80d..170dff4 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index { > 
MEM_CGROUP_STAT_NSTATS, > }; > > +#define MEM_CGROUP_EVENTS_KSWAPD 2 > +#define MEM_CGROUP_EVENTS_HIERARCHY 4 > + > enum mem_cgroup_events_index { > MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */ > MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */ > MEM_CGROUP_EVENTS_COUNT, /* # of pages paged in/out */ > MEM_CGROUP_EVENTS_PGFAULT, /* # of page-faults */ > MEM_CGROUP_EVENTS_PGMAJFAULT, /* # of major page-faults */ > + MEM_CGROUP_EVENTS_PGRECLAIM, > + MEM_CGROUP_EVENTS_PGSCAN, > + MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM, > + MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, > + MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM, > + MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN, > + MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM, > + MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN, missing comment here? 
> MEM_CGROUP_EVENTS_NSTATS, > }; > /* > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) > return (memcg == root_mem_cgroup); > } > > +/** > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics > + * @root: memcg that triggered reclaim > + * @memcg: memcg that is actually being scanned > + * @nr_reclaimed: number of pages reclaimed from @memcg > + * @nr_scanned: number of pages scanned from @memcg > + * @kswapd: whether reclaiming task is kswapd or allocator itself > + */ > +void mem_cgroup_account_reclaim(struct mem_cgroup *root, > + struct mem_cgroup *memcg, > + unsigned long nr_reclaimed, > + unsigned long nr_scanned, > + bool kswapd) > +{ > + unsigned int offset = 0; > + > + if (!root) > + root = root_mem_cgroup; > + > + if (kswapd) > + offset += MEM_CGROUP_EVENTS_KSWAPD; > + if (root != memcg) > + offset += MEM_CGROUP_EVENTS_HIERARCHY; Just to be clear, here root cgroup has hierarchy_* stats always 0 ? Also, we might want to consider renaming the root here, something like target? The root is confusing with root_mem_cgroup. 
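For readers following the offset arithmetic in the hunk quoted above, here is a minimal userspace sketch of how the two offsets select one of the eight new counters. Only the offset values (2 and 4) and the order of the counters are taken from the patch; the enum, the EV_ prefix and the helper names are made up for this illustration and do not exist in the kernel. The "hierarchical" flag stands in for the patch's "root != memcg" test.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as the two #defines in the quoted patch. */
#define EVENTS_KSWAPD    2
#define EVENTS_HIERARCHY 4

/*
 * Hypothetical stand-in for the eight reclaim entries the patch adds to
 * enum mem_cgroup_events_index. The order matters: each reclaim/scan pair
 * sits exactly one offset step after the previous pair.
 */
enum reclaim_event {
	EV_PGRECLAIM,
	EV_PGSCAN,
	EV_KSWAPD_PGRECLAIM,
	EV_KSWAPD_PGSCAN,
	EV_HIERARCHY_PGRECLAIM,
	EV_HIERARCHY_PGSCAN,
	EV_HIERARCHY_KSWAPD_PGRECLAIM,
	EV_HIERARCHY_KSWAPD_PGSCAN,
	EV_NSTATS,
};

/* Combine the two "namespace" offsets the way the patch does. */
static unsigned int reclaim_event_offset(bool kswapd, bool hierarchical)
{
	unsigned int offset = 0;

	if (kswapd)
		offset += EVENTS_KSWAPD;
	if (hierarchical)
		offset += EVENTS_HIERARCHY;
	assert(offset + EV_PGSCAN < EV_NSTATS);
	return offset;
}

/* Counter that receives the number of reclaimed pages. */
static enum reclaim_event reclaim_counter(bool kswapd, bool hierarchical)
{
	return EV_PGRECLAIM + reclaim_event_offset(kswapd, hierarchical);
}

/* Counter that receives the number of scanned pages. */
static enum reclaim_event scan_counter(bool kswapd, bool hierarchical)
{
	return EV_PGSCAN + reclaim_event_offset(kswapd, hierarchical);
}
```

In this model the hierarchy_* counters of a group can only move when the group that triggered reclaim differs from the group being scanned, which is why a group at the very top of the tree would never see them change.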
--Ying > + > + preempt_disable(); > + __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGRECLAIM + offset], > + nr_reclaimed); > + __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGSCAN + offset], > + nr_scanned); > + preempt_enable(); > +} > + > void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx) > { > struct mem_cgroup *memcg; > @@ -1662,6 +1705,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; > > while (1) { > + unsigned long nr_reclaimed; > + > victim = mem_cgroup_iter(root_memcg, victim, &reclaim); > if (!victim) { > loop++; > @@ -1687,8 +1732,11 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > } > if (!mem_cgroup_reclaimable(victim, false)) > continue; > - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > - zone, &nr_scanned); > + nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > + zone, &nr_scanned); > + mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed, > + nr_scanned, current_is_kswapd()); > + total 
+= nr_reclaimed; > *total_scanned += nr_scanned; > if (!res_counter_soft_limit_excess(&root_memcg->res)) > break; > @@ -4023,6 +4071,14 @@ enum { > MCS_SWAP, > MCS_PGFAULT, > MCS_PGMAJFAULT, > + MCS_PGRECLAIM, > + MCS_PGSCAN, > + MCS_KSWAPD_PGRECLAIM, > + MCS_KSWAPD_PGSCAN, > + MCS_HIERARCHY_PGRECLAIM, > + MCS_HIERARCHY_PGSCAN, > + MCS_HIERARCHY_KSWAPD_PGRECLAIM, > + MCS_HIERARCHY_KSWAPD_PGSCAN, > MCS_INACTIVE_ANON, > MCS_ACTIVE_ANON, > MCS_INACTIVE_FILE, > @@ -4047,6 +4103,14 @@ struct { > {"swap", "total_swap"}, > {"pgfault", "total_pgfault"}, > {"pgmajfault", "total_pgmajfault"}, > + {"pgreclaim", "total_pgreclaim"}, > + {"pgscan", "total_pgscan"}, > + {"kswapd_pgreclaim", "total_kswapd_pgreclaim"}, > + {"kswapd_pgscan", "total_kswapd_pgscan"}, > + {"hierarchy_pgreclaim", "total_hierarchy_pgreclaim"}, > + {"hierarchy_pgscan", "total_hierarchy_pgscan"}, > + {"hierarchy_kswapd_pgreclaim", "total_hierarchy_kswapd_pgreclaim"}, > + {"hierarchy_kswapd_pgscan", "total_hierarchy_kswapd_pgscan"}, > {"inactive_anon", "total_inactive_anon"}, > {"active_anon", "total_active_anon"}, > {"inactive_file", "total_inactive_file"}, > @@ -4079,6 +4143,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *memcg, struct mcs_total_stat *s) > s->stat[MCS_PGFAULT] += val; > val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT); > s->stat[MCS_PGMAJFAULT] += val; > + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGRECLAIM); > + s->stat[MCS_PGRECLAIM] += val; > + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGSCAN); > + s->stat[MCS_PGSCAN] += val; > + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM); > + s->stat[MCS_KSWAPD_PGRECLAIM] += val; > + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN); > + s->stat[MCS_KSWAPD_PGSCAN] += val; > + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM); > + s->stat[MCS_HIERARCHY_PGRECLAIM] += val; > + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN); > + s->stat[MCS_HIERARCHY_PGSCAN] += val; > + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM); > + s->stat[MCS_HIERARCHY_KSWAPD_PGRECLAIM] += val; > + val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN); > + s->stat[MCS_HIERARCHY_KSWAPD_PGSCAN] += val; > > /* per zone stat */ > val = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON)); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c631234..e3fd8a7 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2115,12 +2115,19 @@ static void shrink_zone(int priority, struct zone *zone, > > memcg = mem_cgroup_iter(root, NULL, &reclaim); > do { > + unsigned long nr_reclaimed = sc->nr_reclaimed; > + unsigned long nr_scanned = sc->nr_scanned; > struct mem_cgroup_zone mz = { > .mem_cgroup = memcg, > .zone = zone, > }; > > shrink_mem_cgroup_zone(priority, &mz, 
sc); > + > + mem_cgroup_account_reclaim(root, memcg, > + sc->nr_reclaimed - nr_reclaimed, > + sc->nr_scanned - nr_scanned, > + current_is_kswapd()); > /* > * Limit reclaim has historically picked one memcg and > * scanned it with decreasing priority levels until > -- > 1.7.7.5 > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id DF9086B005C for ; Tue, 10 Jan 2012 19:30:46 -0500 (EST) Date: Wed, 11 Jan 2012 01:30:20 +0100 From: Johannes Weiner Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics Message-ID: <20120111003020.GD24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Ying Han Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote: > Thank you for the patch and the stats looks reasonable to me, few > questions as below: > > On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > > With the 
single per-zone LRU gone and global reclaim scanning > > individual memcgs, it's straight-forward to collect meaningful and > > accurate per-memcg reclaim statistics. > > > > This adds the following items to memory.stat: > > Some of the previous discussions including patches have similar stats > in memory.vmscan_stat API, which collects all the per-memcg vmscan > stats. I would like to understand more why we add into memory.stat > instead, and do we have plan to keep extending memory.stat for those > vmstat like stats? I think they were put into an extra file in particular to be able to write to this file to reset the statistics. But in my opinion, it's trivial to calculate a delta from before and after running a workload, so I didn't really like adding kernel code for that. Did you have another reason for a separate file in mind? > > pgreclaim > > Not sure if we want to keep this more consistent to /proc/vmstat, then > it will be "pgsteal"? The problem with that was that we didn't like to call pages stolen when they were reclaimed from within the cgroup, so we had pgfree for inner reclaim and pgsteal for outer reclaim, respectively. I found it cleaner to just go with pgreclaim, it's unambiguous and straight-forward. Outer reclaim is designated by the hierarchy_ prefix. > > pgscan > > > > Number of pages reclaimed/scanned from that memcg due to its own > > hard limit (or physical limit in case of the root memcg) by the > > allocating task. > > > > kswapd_pgreclaim > > kswapd_pgscan > > we have "pgscan_kswapd_*" in vmstat, so maybe ? > "pgsteal_kswapd" > "pgscan_kswapd" > > > Reclaim activity from kswapd due to the memcg's own limit. Only > > applicable to the root memcg for now since kswapd is only triggered > > by physical limits, but kswapd-style reclaim based on memcg hard > > limits is being developped. 
> > > > hierarchy_pgreclaim > > hierarchy_pgscan > > hierarchy_kswapd_pgreclaim > > hierarchy_kswapd_pgscan > > "pgsteal_hierarchy" > "pgsteal_kswapd_hierarchy" > .. > > No strong option on the naming, but try to make it more consistent to > existing API. I swear I tried, but the existing naming is pretty screwed up :( For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare scan rates of direct reclaim vs. kswapd reclaim. To get the total number of pages reclaimed, you sum them up. On the other hand, pgsteal_* does not differentiate between direct reclaim and kswapd, so to get direct reclaim numbers, you add up the pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?), which is in turn not available at zone granularity. > > +#define MEM_CGROUP_EVENTS_KSWAPD 2 > > +#define MEM_CGROUP_EVENTS_HIERARCHY 4 These two function as namespaces, that's why I put hierarchy_ and kswapd_ at the beginning of the names. Given that we have kswapd_steal, would you be okay with doing it like this? I mean, at least my naming conforms to ONE of the standards in /proc/vmstat, right? ;-) > > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index { > > MEM_CGROUP_STAT_NSTATS, > > }; > > > > +#define MEM_CGROUP_EVENTS_KSWAPD 2 > > +#define MEM_CGROUP_EVENTS_HIERARCHY 4 > > + > > enum mem_cgroup_events_index { > > MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */ > > MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */ > > MEM_CGROUP_EVENTS_COUNT, /* # of pages paged in/out */ > > MEM_CGROUP_EVENTS_PGFAULT, /* # of page-faults */ > > MEM_CGROUP_EVENTS_PGMAJFAULT, /* # of major page-faults */ > > + MEM_CGROUP_EVENTS_PGRECLAIM, > > + MEM_CGROUP_EVENTS_PGSCAN, > > + MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM, > > + MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, > > + MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM, > > + MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN, > > + MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM, > > + MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN, > > missing comment here? 
As if the lines weren't long enough already ;-) I'll add some. > > MEM_CGROUP_EVENTS_NSTATS, > > }; > > /* > > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) > > return (memcg == root_mem_cgroup); > > } > > > > +/** > > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics > > + * @root: memcg that triggered reclaim > > + * @memcg: memcg that is actually being scanned > > + * @nr_reclaimed: number of pages reclaimed from @memcg > > + * @nr_scanned: number of pages scanned from @memcg > > + * @kswapd: whether reclaiming task is kswapd or allocator itself > > + */ > > +void mem_cgroup_account_reclaim(struct mem_cgroup *root, > > + struct mem_cgroup *memcg, > > + unsigned long nr_reclaimed, > > + unsigned long nr_scanned, > > + bool kswapd) > > +{ > > + unsigned int offset = 0; > > + > > + if (!root) > > + root = root_mem_cgroup; > > + > > + if (kswapd) > > + offset += MEM_CGROUP_EVENTS_KSWAPD; > > + if (root != memcg) > > + offset += MEM_CGROUP_EVENTS_HIERARCHY; > > Just to be clear, here root cgroup has hierarchy_* stats always 0 ? That's correct, there can't be any hierarchical pressure on the topmost parent. > Also, we might want to consider renaming the root here, something like > target? The root is confusing with root_mem_cgroup. It's the same naming scheme I used for the iterator functions (mem_cgroup_iter() and friends), so if we change it, I'd like to change it consistently. Having target and memcg as parameters is even more confusing and non-descriptive, IMO. Other places use mem_over_limit, which is a bit better, but quite long. Any other ideas for great names for parameters that designate a hierarchy root and a memcg in that hierarchy? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx202.postini.com [74.125.245.202]) by kanga.kvack.org (Postfix) with SMTP id 0B7226B004D for ; Wed, 11 Jan 2012 20:55:45 -0500 (EST) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id F34C53EE081 for ; Thu, 12 Jan 2012 10:55:43 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id D453945DEED for ; Thu, 12 Jan 2012 10:55:43 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id BEE7145DEEA for ; Thu, 12 Jan 2012 10:55:43 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id ADEAA1DB8038 for ; Thu, 12 Jan 2012 10:55:43 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.240.81.134]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 583971DB803F for ; Thu, 12 Jan 2012 10:55:43 +0900 (JST) Date: Thu, 12 Jan 2012 10:54:27 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-Id: <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, 10 Jan 2012 16:02:52 +0100 Johannes Weiner wrote: > Right now, memcg soft limits are implemented by having a sorted tree > of memcgs that 
are in excess of their limits. Under global memory > pressure, kswapd first reclaims from the biggest excessor and then > proceeds to do regular global reclaim. The result of this is that > pages are reclaimed from all memcgs, but more scanning happens against > those above their soft limit. > > With global reclaim doing memcg-aware hierarchical reclaim by default, > this is a lot easier to implement: everytime a memcg is reclaimed > from, scan more aggressively (per tradition with a priority of 0) if > it's above its soft limit. With the same end result of scanning > everybody, but soft limit excessors a bit more. > > Advantages: > > o smoother reclaim: soft limit reclaim is a separate stage before > global reclaim, whose result is not communicated down the line and > so overreclaim of the groups in excess is very likely. After this > patch, soft limit reclaim is fully integrated into regular reclaim > and each memcg is considered exactly once per cycle. > > o true hierarchy support: soft limits are only considered when > kswapd does global reclaim, but after this patch, targetted > reclaim of a memcg will mind the soft limit settings of its child > groups. > > o code size: soft limit reclaim requires a lot of code to maintain > the per-node per-zone rb-trees to quickly find the biggest > offender, dedicated paths for soft limit reclaim etc. while this > new implementation gets away without all that. > > Test: > > The test consists of two concurrent kernel build jobs in separate > source trees, the master and the slave. The two jobs get along nicely > on 600MB of available memory, so this is the zero overcommit control > case. When available memory is decreased, the overcommit is > compensated by decreasing the soft limit of the slave by the same > amount, in the hope that the slave takes the hit and the master stays > unaffected. 
> > 600M-0M-vanilla 600M-0M-patched > Master walltime (s) 552.65 ( +0.00%) 552.38 ( -0.05%) > Master walltime (stddev) 1.25 ( +0.00%) 0.92 ( -14.66%) > Master major faults 204.38 ( +0.00%) 205.38 ( +0.49%) > Master major faults (stddev) 27.16 ( +0.00%) 13.80 ( -47.43%) > Master reclaim 31.88 ( +0.00%) 37.75 ( +17.87%) > Master reclaim (stddev) 34.01 ( +0.00%) 75.88 (+119.59%) > Master scan 31.88 ( +0.00%) 37.75 ( +17.87%) > Master scan (stddev) 34.01 ( +0.00%) 75.88 (+119.59%) > Master kswapd reclaim 33922.12 ( +0.00%) 33887.12 ( -0.10%) > Master kswapd reclaim (stddev) 969.08 ( +0.00%) 492.22 ( -49.16%) > Master kswapd scan 34085.75 ( +0.00%) 33985.75 ( -0.29%) > Master kswapd scan (stddev) 1101.07 ( +0.00%) 563.33 ( -48.79%) > Slave walltime (s) 552.68 ( +0.00%) 552.12 ( -0.10%) > Slave walltime (stddev) 0.79 ( +0.00%) 1.05 ( +14.76%) > Slave major faults 212.50 ( +0.00%) 204.50 ( -3.75%) > Slave major faults (stddev) 26.90 ( +0.00%) 13.17 ( -49.20%) > Slave reclaim 26.12 ( +0.00%) 35.00 ( +32.72%) > Slave reclaim (stddev) 29.42 ( +0.00%) 74.91 (+149.55%) > Slave scan 31.38 ( +0.00%) 35.00 ( +11.20%) > Slave scan (stddev) 33.31 ( +0.00%) 74.91 (+121.24%) > Slave kswapd reclaim 34259.00 ( +0.00%) 33469.88 ( -2.30%) > Slave kswapd reclaim (stddev) 925.15 ( +0.00%) 565.07 ( -38.88%) > Slave kswapd scan 34354.62 ( +0.00%) 33555.75 ( -2.33%) > Slave kswapd scan (stddev) 969.62 ( +0.00%) 581.70 ( -39.97%) > > In the control case, the differences in elapsed time, number of major > faults taken, and reclaim statistics are within the noise for both the > master and the slave job. 
> > 600M-280M-vanilla 600M-280M-patched > Master walltime (s) 595.13 ( +0.00%) 553.19 ( -7.04%) > Master walltime (stddev) 8.31 ( +0.00%) 2.57 ( -61.64%) > Master major faults 3729.75 ( +0.00%) 783.25 ( -78.98%) > Master major faults (stddev) 258.79 ( +0.00%) 226.68 ( -12.36%) > Master reclaim 705.00 ( +0.00%) 29.50 ( -95.68%) > Master reclaim (stddev) 232.87 ( +0.00%) 44.72 ( -80.45%) > Master scan 714.88 ( +0.00%) 30.00 ( -95.67%) > Master scan (stddev) 237.44 ( +0.00%) 45.39 ( -80.54%) > Master kswapd reclaim 114.75 ( +0.00%) 50.00 ( -55.94%) > Master kswapd reclaim (stddev) 128.51 ( +0.00%) 9.45 ( -91.93%) > Master kswapd scan 115.75 ( +0.00%) 50.00 ( -56.32%) > Master kswapd scan (stddev) 130.31 ( +0.00%) 9.45 ( -92.04%) > Slave walltime (s) 631.18 ( +0.00%) 577.68 ( -8.46%) > Slave walltime (stddev) 9.89 ( +0.00%) 3.63 ( -57.47%) > Slave major faults 28401.75 ( +0.00%) 14656.75 ( -48.39%) > Slave major faults (stddev) 2629.97 ( +0.00%) 1911.81 ( -27.30%) > Slave reclaim 65400.62 ( +0.00%) 1479.62 ( -97.74%) > Slave reclaim (stddev) 11623.02 ( +0.00%) 1482.13 ( -87.24%) > Slave scan 9050047.88 ( +0.00%) 95968.25 ( -98.94%) > Slave scan (stddev) 1912786.94 ( +0.00%) 93390.71 ( -95.12%) > Slave kswapd reclaim 327894.50 ( +0.00%) 227099.88 ( -30.74%) > Slave kswapd reclaim (stddev) 22289.43 ( +0.00%) 16113.14 ( -27.71%) > Slave kswapd scan 34987335.75 ( +0.00%) 1362367.12 ( -96.11%) > Slave kswapd scan (stddev) 2523642.98 ( +0.00%) 156754.74 ( -93.79%) > > Here, the available memory is limited to 320 MB, the machine is > overcommitted by 280 MB. The soft limit of the master is 300 MB, that > of the slave merely 20 MB. > > Looking at the slave job first, it is much better off with the patched > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by > a third. The result is much fewer major faults taken, which in turn > lets the job finish quicker. 
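The percentage column in the table above can be sanity-checked with a plain relative-change formula. The sketch below is only an illustration: the helper name relative_change_pct is invented here, and the thread does not say how the table was actually generated. For the large slave counters the plain formula lands within rounding of the quoted figures; for small values it may not match exactly.

```c
#include <assert.h>

/*
 * Relative change from "before" to "after", in percent.
 * A negative result means the value went down.
 */
static double relative_change_pct(double before, double after)
{
	assert(before != 0.0);	/* the ratio is undefined for a zero baseline */
	return (after - before) / before * 100.0;
}
```

For example, feeding in the slave kswapd reclaim pair (327894.50 before, 227099.88 after) gives roughly -30.74, the "decreased by a third" mentioned in the text, and the slave direct reclaim pair (65400.62 before, 1479.62 after) gives roughly -97.74.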
> > It would be a zero-sum game if the improvement happened at the cost of > the master but looking at the numbers, even the master performs better > with the patched kernel. In fact, the master job is almost unaffected > on the patched kernel compared to the control case. > > This is an odd phenomenon, as the patch does not directly change how > the master is reclaimed. An explanation for this is that the severe > overreclaim of the slave in the unpatched kernel results in the master > growing bigger than in the patched case. Combining the fact that > memcgs are scanned according to their size with the increased refault > rate of the overreclaimed slave triggering global reclaim more often > means that overall pressure on the master job is higher in the > unpatched kernel. > > At any rate, the patched kernel seems to do a much better job at both > overall resource allocation under soft limit overcommit as well as the > requested prioritization of the master job. > > Signed-off-by: Johannes Weiner Thank you for your work; the result seems attractive and the code is much simpler. My small concerns are: 1. This approach may increase the latency of direct reclaim because of priority=0. 2. When a NUMA-spread/interleaved application runs in its own container, pages on a node may be paged out again and again because of priority=0 if some other application runs on that node. It seems difficult to use soft limits with NUMA-aware applications. Do you have suggestions? Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id 0BC046B004D for ; Thu, 12 Jan 2012 03:59:19 -0500 (EST) Date: Thu, 12 Jan 2012 09:59:04 +0100 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120112085904.GG24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Ying Han Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: > On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > > Right now, memcg soft limits are implemented by having a sorted tree > > of memcgs that are in excess of their limits. Under global memory > > pressure, kswapd first reclaims from the biggest excessor and then > > proceeds to do regular global reclaim. The result of this is that > > pages are reclaimed from all memcgs, but more scanning happens against > > those above their soft limit. > > > > With global reclaim doing memcg-aware hierarchical reclaim by default, > > this is a lot easier to implement: everytime a memcg is reclaimed > > from, scan more aggressively (per tradition with a priority of 0) if > > it's above its soft limit. With the same end result of scanning > > everybody, but soft limit excessors a bit more. 
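The rule described in the quoted changelog (scan a group at priority 0 whenever it is above its soft limit) can be modeled in a few lines of userspace C. This is a sketch under stated assumptions, not the code from patch 2/2, which is not shown here: struct memcg_model and all function names are hypothetical, and it assumes the number of pages to scan is the LRU size shifted right by the priority, with 12 as the starting priority.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one memory cgroup; sizes are in pages. */
struct memcg_model {
	unsigned long usage;		/* pages currently charged */
	unsigned long soft_limit;	/* soft limit in pages */
	unsigned long lru_pages;	/* pages on the group's LRU lists */
};

static bool over_soft_limit(const struct memcg_model *m)
{
	return m->usage > m->soft_limit;
}

/* Groups above their soft limit are always scanned at priority 0. */
static int effective_priority(const struct memcg_model *m, int priority)
{
	return over_soft_limit(m) ? 0 : priority;
}

/*
 * Assumed scan sizing: the LRU size shifted right by the priority, so a
 * lower priority value means a larger share of the list is scanned.
 */
static unsigned long scan_target(const struct memcg_model *m, int priority)
{
	assert(priority >= 0 && priority < (int)(sizeof(unsigned long) * 8));
	return m->lru_pages >> effective_priority(m, priority);
}
```

With 4096 pages on the LRU and a starting priority of 12, a group below its soft limit is asked for a single page per pass, while a group above it has the whole list considered. That gap is the extra pressure the changelog refers to, and it is also the source of the latency concern raised about priority 0.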
> > > > Advantages: > > > > o smoother reclaim: soft limit reclaim is a separate stage before > > global reclaim, whose result is not communicated down the line and > > so overreclaim of the groups in excess is very likely. After this > > patch, soft limit reclaim is fully integrated into regular reclaim > > and each memcg is considered exactly once per cycle. > > > > o true hierarchy support: soft limits are only considered when > > kswapd does global reclaim, but after this patch, targetted > > reclaim of a memcg will mind the soft limit settings of its child > > groups. > > Why we add soft limit reclaim into target reclaim? -> A hard limit 10G, usage 10G -> A1 soft limit 8G, usage 5G -> A2 soft limit 2G, usage 5G When A hits its hard limit, A2 will experience more pressure than A1. Soft limits are already applied hierarchically: the memcg that is picked from the tree is reclaimed hierarchically. What I wanted to add is the soft limit also being /triggerable/ from non-global hierarchy levels. > Based on the discussions, my understanding is that the soft limit only > take effect while the whole machine is under memory contention. We > don't want to add extra pressure on a cgroup if there is free memory > on the system even the cgroup is above its limit. If a hierarchy is under pressure, we will reclaim that hierarchy. We allow groups to be prioritized under global pressure, why not allow it for local pressure as well? I am not quite sure what you are objecting to. > > o code size: soft limit reclaim requires a lot of code to maintain > > the per-node per-zone rb-trees to quickly find the biggest > > offender, dedicated paths for soft limit reclaim etc. while this > > new implementation gets away without all that. > > > > Test: > > > > The test consists of two concurrent kernel build jobs in separate > > source trees, the master and the slave. The two jobs get along nicely > > on 600MB of available memory, so this is the zero overcommit control > > case. 
When available memory is decreased, the overcommit is
> > compensated by decreasing the soft limit of the slave by the same
> > amount, in the hope that the slave takes the hit and the master stays
> > unaffected.
> >
> >                                       600M-0M-vanilla         600M-0M-patched
> > Master walltime (s)                 552.65 (  +0.00%)       552.38 (  -0.05%)
> > Master walltime (stddev)              1.25 (  +0.00%)         0.92 ( -14.66%)
> > Master major faults                 204.38 (  +0.00%)       205.38 (  +0.49%)
> > Master major faults (stddev)         27.16 (  +0.00%)        13.80 ( -47.43%)
> > Master reclaim                       31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master reclaim (stddev)              34.01 (  +0.00%)        75.88 (+119.59%)
> > Master scan                          31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master scan (stddev)                 34.01 (  +0.00%)        75.88 (+119.59%)
> > Master kswapd reclaim             33922.12 (  +0.00%)     33887.12 (  -0.10%)
> > Master kswapd reclaim (stddev)      969.08 (  +0.00%)       492.22 ( -49.16%)
> > Master kswapd scan                34085.75 (  +0.00%)     33985.75 (  -0.29%)
> > Master kswapd scan (stddev)        1101.07 (  +0.00%)       563.33 ( -48.79%)
> > Slave walltime (s)                  552.68 (  +0.00%)       552.12 (  -0.10%)
> > Slave walltime (stddev)               0.79 (  +0.00%)         1.05 ( +14.76%)
> > Slave major faults                  212.50 (  +0.00%)       204.50 (  -3.75%)
> > Slave major faults (stddev)          26.90 (  +0.00%)        13.17 ( -49.20%)
> > Slave reclaim                        26.12 (  +0.00%)        35.00 ( +32.72%)
> > Slave reclaim (stddev)               29.42 (  +0.00%)        74.91 (+149.55%)
> > Slave scan                           31.38 (  +0.00%)        35.00 ( +11.20%)
> > Slave scan (stddev)                  33.31 (  +0.00%)        74.91 (+121.24%)
> > Slave kswapd reclaim              34259.00 (  +0.00%)     33469.88 (  -2.30%)
> > Slave kswapd reclaim (stddev)       925.15 (  +0.00%)       565.07 ( -38.88%)
> > Slave kswapd scan                 34354.62 (  +0.00%)     33555.75 (  -2.33%)
> > Slave kswapd scan (stddev)          969.62 (  +0.00%)       581.70 ( -39.97%)
> >
> > In the control case, the differences in elapsed time, number of major
> > faults taken, and reclaim statistics are within the noise for both the
> > master and the slave job.
>
> What's the soft limit setting in the controlled case?

300MB for both jobs.

> I assume it is the default RESOURCE_MAX.
So both Master and Slave get
> equal pressure before/after the patch, and no differences on the stats
> should be observed.

Yes.  The control case demonstrates that both jobs can fit
comfortably, don't compete for space and that in general the patch
does not have unexpected negative impact (after all, it modifies
codepaths that were invoked regularly outside of reclaim).

> >                                     600M-280M-vanilla       600M-280M-patched
> > Master walltime (s)                 595.13 (  +0.00%)       553.19 (  -7.04%)
> > Master walltime (stddev)              8.31 (  +0.00%)         2.57 ( -61.64%)
> > Master major faults                3729.75 (  +0.00%)       783.25 ( -78.98%)
> > Master major faults (stddev)        258.79 (  +0.00%)       226.68 ( -12.36%)
> > Master reclaim                      705.00 (  +0.00%)        29.50 ( -95.68%)
> > Master reclaim (stddev)             232.87 (  +0.00%)        44.72 ( -80.45%)
> > Master scan                         714.88 (  +0.00%)        30.00 ( -95.67%)
> > Master scan (stddev)                237.44 (  +0.00%)        45.39 ( -80.54%)
> > Master kswapd reclaim               114.75 (  +0.00%)        50.00 ( -55.94%)
> > Master kswapd reclaim (stddev)      128.51 (  +0.00%)         9.45 ( -91.93%)
> > Master kswapd scan                  115.75 (  +0.00%)        50.00 ( -56.32%)
> > Master kswapd scan (stddev)         130.31 (  +0.00%)         9.45 ( -92.04%)
> > Slave walltime (s)                  631.18 (  +0.00%)       577.68 (  -8.46%)
> > Slave walltime (stddev)               9.89 (  +0.00%)         3.63 ( -57.47%)
> > Slave major faults                28401.75 (  +0.00%)     14656.75 ( -48.39%)
> > Slave major faults (stddev)        2629.97 (  +0.00%)      1911.81 ( -27.30%)
> > Slave reclaim                     65400.62 (  +0.00%)      1479.62 ( -97.74%)
> > Slave reclaim (stddev)            11623.02 (  +0.00%)      1482.13 ( -87.24%)
> > Slave scan                      9050047.88 (  +0.00%)     95968.25 ( -98.94%)
> > Slave scan (stddev)             1912786.94 (  +0.00%)     93390.71 ( -95.12%)
> > Slave kswapd reclaim             327894.50 (  +0.00%)    227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)     22289.43 (  +0.00%)     16113.14 ( -27.71%)
> > Slave kswapd scan              34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)      2523642.98 (  +0.00%)    156754.74 ( -93.79%)
> >
> > Here, the available memory is limited to 320 MB, the machine is
> > overcommitted by 280
MB. The soft limit of the master is 300 MB, that > > of the slave merely 20 MB. > > > > Looking at the slave job first, it is much better off with the patched > > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by > > a third. The result is much fewer major faults taken, which in turn > > lets the job finish quicker. > > What's the setting of the hard limit here? Is the direct reclaim > referring to per-memcg directly reclaim or global one. The machine's memory is limited to 600M, the hard limits are unset. All reclaim is a result of global memory pressure. With the patched kernel, I could have used a dedicated parent cgroup and let master and slave run in children of this group, the soft limits would be taken into account just the same. But this does not work on the unpatched kernel, as soft limits are only recognized on the global level there. > > It would be a zero-sum game if the improvement happened at the cost of > > the master but looking at the numbers, even the master performs better > > with the patched kernel. In fact, the master job is almost unaffected > > on the patched kernel compared to the control case. > > It makes sense since the master job get less affected by the patch > than the slave job under the example. Under the control case, if both > master and slave have RESOURCE_MAX soft limit setting, they are under > equal memory pressure(priority = DEF_PRIORITY) . On the second > example, only the slave pressure being increased by priority = 0, and > the Master got scanned with same priority = DEF_PRIORITY pretty much. > > So I would expect to see more reclaim activities happens in slave on > the patched kernel compared to the control case. It seems match the > testing result. 
Uhm,

> > Slave reclaim                     65400.62 (  +0.00%)      1479.62 ( -97.74%)
> > Slave reclaim (stddev)            11623.02 (  +0.00%)      1482.13 ( -87.24%)
> > Slave scan                      9050047.88 (  +0.00%)     95968.25 ( -98.94%)
> > Slave scan (stddev)             1912786.94 (  +0.00%)     93390.71 ( -95.12%)
> > Slave kswapd reclaim             327894.50 (  +0.00%)    227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)     22289.43 (  +0.00%)     16113.14 ( -27.71%)
> > Slave kswapd scan              34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)      2523642.98 (  +0.00%)    156754.74 ( -93.79%)

Direct reclaim _shrunk_ by 98%, kswapd reclaim by 31%.

> > This is an odd phenomenon, as the patch does not directly change how
> > the master is reclaimed.  An explanation for this is that the severe
> > overreclaim of the slave in the unpatched kernel results in the master
> > growing bigger than in the patched case.  Combining the fact that
> > memcgs are scanned according to their size with the increased refault
> > rate of the overreclaimed slave triggering global reclaim more often
> > means that overall pressure on the master job is higher in the
> > unpatched kernel.
>
> We can check the Master memory.usage_in_bytes while the job is running.

Yep, the plots of cache/rss over time confirmed exactly this.  The
unpatched kernel shows higher spikes in the size of the master job
followed by deeper pits when reclaim kicked in.  The patched kernel is
much smoother in that regard.

> On the other hand, I don't see why we expect the Master being less
> reclaimed in the controlled case? On the unpatched kernel, the Master
> is being reclaimed under global pressure each time anyway since we
> ignore the return value of softlimit.

I didn't expect that, I expected both jobs to perform equally in the
control case.  And in the pressurized case, the master being
unaffected and the slave taking the hit.  The patched kernel does
this, the unpatched one does not.
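
The relative changes cited in the reply above (direct reclaim down by
roughly 98%, kswapd reclaim down by roughly 31%) follow from the mean
values in the 600M-280M table.  A minimal sketch of that arithmetic in
userspace C; the helper name rel_change_pct is made up for this
illustration and is not part of the patch:

```c
/*
 * Relative change of the patched mean against the vanilla mean, in
 * percent.  Negative values mean the patched kernel did less of it.
 * Hypothetical helper for illustration only.
 */
static double rel_change_pct(double vanilla, double patched)
{
	return (patched - vanilla) / vanilla * 100.0;
}
```

Fed with the slave job's means, rel_change_pct(65400.62, 1479.62)
comes out at about -97.7 and rel_change_pct(327894.50, 227099.88) at
about -30.7, which matches the percentages printed in the table.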
> > @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > > struct zone *zone); > > struct zone_reclaim_stat* > > mem_cgroup_get_reclaim_stat_from_page(struct page *page); > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *); > > Maybe something like "mem_cgroup_over_soft_limit()" ? Probably more consistent, yeah. Will do. > > @@ -343,7 +314,6 @@ static bool move_file(void) > > * limit reclaim to prevent infinite loops, if they ever occur. > > */ > > #define MEM_CGROUP_MAX_RECLAIM_LOOPS (100) > > -#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2) > > You might need to remove the comment above as well. Oops, will fix. > > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > return margin >> PAGE_SHIFT; > > } > > > > +/** > > + * mem_cgroup_over_softlimit > > + * @root: hierarchy root > > + * @memcg: child of @root to test > > + * > > + * Returns %true if @memcg exceeds its own soft limit or contributes > > + * to the soft limit excess of one of its parents up to and including > > + * @root. > > + */ > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > > + struct mem_cgroup *memcg) > > +{ > > + if (mem_cgroup_disabled()) > > + return false; > > + > > + if (!root) > > + root = root_mem_cgroup; > > + > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { > > + /* root_mem_cgroup does not have a soft limit */ > > + if (memcg == root_mem_cgroup) > > + break; > > + if (res_counter_soft_limit_excess(&memcg->res)) > > + return true; > > + if (memcg == root) > > + break; > > + } > > Here it adds pressure on a cgroup if one of its parents exceeds soft > limit, although the cgroup itself is under soft limit. It does change > my understanding of soft limit, and might introduce regression of our > existing use cases. > > Here is an example: > > Machine capacity 32G and we over-commit by 8G. 
>
> root
> -> A (hard limit 20G, soft limit 15G, usage 16G)
>    -> A1 (soft limit 5G, usage 4G)
>    -> A2 (soft limit 10G, usage 12G)
> -> B (hard limit 20G, soft limit 10G, usage 16G)
>
> under global reclaim, we don't want to add pressure on A1 although its
> parent A exceeds its soft limit. Assume that if we set the soft limit
> corresponding to each cgroup's working set size (hot memory), and it
> will introduce regression to A1 in that case.
>
> In my existing implementation, i am checking the cgroup's soft limit
> standalone w/o looking its ancestors.

Why do you set the soft limit of A in the first place if you don't
want it to be enforced?

This is not really new behaviour, soft limit reclaim has always been
operating hierarchically on the biggest excessor.  In your case, the
excess of A is smaller than the excess of A2 and so that weird "only
pick the biggest excessor" behaviour hides it, but consider this:

        -> A soft 30G, usage 39G
           -> A1 soft 5G, usage 4G
           -> A2 soft 10G, usage 15G
           -> A3 soft 15G, usage 20G

Upstream would pick A from the soft limit tree and reclaim its
children with priority 0, including A1.

On the other hand, if you don't consider ancestral soft limits, you
break perfectly reasonable setups like these

        -> A soft 10G, usage 20G
           -> A1 usage 10G
           -> A2 usage 10G
        -> B soft 10G, usage 11G

where upstream would pick A and reclaim it recursively, but your
version would only apply higher pressure to B.

If you would just not set the soft limit of A in your case:

        -> A (hard limit 20G, usage 16G)
           -> A1 (soft limit 5G, usage 4G)
           -> A2 (soft limit 10G, usage 12G)
        -> B (hard limit 20G, soft limit 10G, usage 16G)

only A2 and B would experience higher pressure upon global pressure.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx168.postini.com [74.125.245.168]) by kanga.kvack.org (Postfix) with SMTP id 5C7576B004D for ; Thu, 12 Jan 2012 04:17:36 -0500 (EST) Date: Thu, 12 Jan 2012 10:17:21 +0100 From: Johannes Weiner Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics Message-ID: <20120112091721.GH24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> <20120111003020.GD24386@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Ying Han Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Wed, Jan 11, 2012 at 02:33:59PM -0800, Ying Han wrote: > On Tue, Jan 10, 2012 at 4:30 PM, Johannes Weiner wrote: > > On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote: > >> Thank you for the patch and the stats looks reasonable to me, few > >> questions as below: > >> > >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > >> > With the single per-zone LRU gone and global reclaim scanning > >> > individual memcgs, it's straight-forward to collect meaningful and > >> > accurate per-memcg reclaim statistics. > >> > > >> > This adds the following items to memory.stat: > >> > >> Some of the previous discussions including patches have similar stats > >> in memory.vmscan_stat API, which collects all the per-memcg vmscan > >> stats. I would like to understand more why we add into memory.stat > >> instead, and do we have plan to keep extending memory.stat for those > >> vmstat like stats? 
> >
> > I think they were put into an extra file in particular to be able to
> > write to this file to reset the statistics.  But in my opinion, it's
> > trivial to calculate a delta from before and after running a workload,
> > so I didn't really like adding kernel code for that.
> >
> > Did you have another reason for a separate file in mind?
>
> Another reason I had them in separate file is easier to extend. I
> don't know if we have plan to have something like memory.vmstat, or
> just keep adding stuff into memory.stat. In general, I wanted to keep
> the memory.stat being reasonable size including only the basic
> statistics. In my existing vmscan_stat path, i have breakdowns of
> reclaim stats into file/anon which will make the memory.stat even
> larger.

Do you think it's a problem of presentation, where we want to allow
admins to figure out the memcg parameters at a glance when looking at
memory.stat but be able to debug malfunction by looking at the more
extensive vmstat file?

> >> >  Reclaim activity from kswapd due to the memcg's own limit.  Only
> >> >  applicable to the root memcg for now since kswapd is only triggered
> >> >  by physical limits, but kswapd-style reclaim based on memcg hard
> >> >  limits is being developped.
> >> >
> >> > hierarchy_pgreclaim
> >> > hierarchy_pgscan
> >> > hierarchy_kswapd_pgreclaim
> >> > hierarchy_kswapd_pgscan
> >>
> >> "pgsteal_hierarchy"
> >> "pgsteal_kswapd_hierarchy"
> >> ..
> >>
> >> No strong option on the naming, but try to make it more consistent to
> >> existing API.
> >
> > I swear I tried, but the existing naming is pretty screwed up :(
> >
> > For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
> > scan rates of direct reclaim vs. kswapd reclaim.  To get the total
> > number of pages reclaimed, you sum them up.
> > > > On the other hand, pgsteal_* does not differentiate between direct > > reclaim and kswapd, so to get direct reclaim numbers, you add up the > > pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?), > > which is in turn not available at zone granularity. > > agree and that always confuses me. I just have scripts that present it as 'Direct page reclaimed' and 'Kswapd page reclaimed' when evaluating data so I don't have to remember anymore :-) But I think the wish for consistency is a bit misguided when we end up with something like pgpgin that means something completely different in memcg than it does on the global level. Likewise, I don't want to use pgsteal_* and pgsteal_kswapd_* because of their similarity to /proc/vmstat while the numbers represent something different. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id 1338E6B004F for ; Fri, 13 Jan 2012 07:04:10 -0500 (EST) Date: Fri, 13 Jan 2012 13:04:06 +0100 From: Michal Hocko Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120113120406.GC17060@tiehlicka.suse.cz> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue 10-01-12 16:02:52, Johannes Weiner 
wrote: > Right now, memcg soft limits are implemented by having a sorted tree > of memcgs that are in excess of their limits. Under global memory > pressure, kswapd first reclaims from the biggest excessor and then > proceeds to do regular global reclaim. The result of this is that > pages are reclaimed from all memcgs, but more scanning happens against > those above their soft limit. > > With global reclaim doing memcg-aware hierarchical reclaim by default, > this is a lot easier to implement: everytime a memcg is reclaimed > from, scan more aggressively (per tradition with a priority of 0) if > it's above its soft limit. With the same end result of scanning > everybody, but soft limit excessors a bit more. > > Advantages: > > o smoother reclaim: soft limit reclaim is a separate stage before > global reclaim, whose result is not communicated down the line and > so overreclaim of the groups in excess is very likely. After this > patch, soft limit reclaim is fully integrated into regular reclaim > and each memcg is considered exactly once per cycle. > > o true hierarchy support: soft limits are only considered when > kswapd does global reclaim, but after this patch, targetted > reclaim of a memcg will mind the soft limit settings of its child > groups. Yes it makes sense. At first I was thinking that soft limit should be considered only under global mem. pressure (at least documentation says so) but now it makes sense. We can push on over-soft limit groups more because they told us they could sacrifice something... Anyway documentation needs an update as well. But we have to be little bit careful here. I am still quite confuses how we should handle hierarchies vs. subtrees. See bellow. > > o code size: soft limit reclaim requires a lot of code to maintain > the per-node per-zone rb-trees to quickly find the biggest > offender, dedicated paths for soft limit reclaim etc. while this > new implementation gets away without all that. 
on my i386 pae setup (including swap extension enabled): Before text data bss dec hex filename 310086 29970 35372 375428 5ba84 mm/built-in.o After size mm/built-in.o text data bss dec hex filename 309048 30030 35372 374450 5b6b2 mm/built-in.o I would expect a bigger difference but still good. > Test: Will look into results later. [...] > Signed-off-by: Johannes Weiner > --- > include/linux/memcontrol.h | 18 +-- > mm/memcontrol.c | 412 ++++---------------------------------------- > mm/vmscan.c | 80 +-------- > 3 files changed, 48 insertions(+), 462 deletions(-) Really nice to see [...] > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 170dff4..d4f7ae5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c [...] > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > return margin >> PAGE_SHIFT; > } > > +/** > + * mem_cgroup_over_softlimit > + * @root: hierarchy root > + * @memcg: child of @root to test > + * > + * Returns %true if @memcg exceeds its own soft limit or contributes > + * to the soft limit excess of one of its parents up to and including > + * @root. > + */ > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > + struct mem_cgroup *memcg) > +{ > + if (mem_cgroup_disabled()) > + return false; > + > + if (!root) > + root = root_mem_cgroup; > + > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { > + /* root_mem_cgroup does not have a soft limit */ > + if (memcg == root_mem_cgroup) > + break; > + if (res_counter_soft_limit_excess(&memcg->res)) > + return true; > + if (memcg == root) > + break; > + } > + return false; > +} Well, this might be little bit tricky. We do not check whether memcg and root are in a hierarchy (in terms of use_hierarchy) relation. If we are under global reclaim then we iterate over all memcgs and so there is no guarantee that there is a hierarchical relation between the given memcg and its parent. While, on the other hand, if we are doing memcg reclaim then we have this guarantee. 
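
The ancestor walk in the quoted mem_cgroup_over_softlimit() can be
modelled in userspace to see which groups it flags.  This is only a
sketch: struct group, NO_SOFT_LIMIT, soft_limit_excess() and
over_softlimit() are hypothetical stand-ins, not kernel interfaces,
and the parent pointer is assumed to be set only where the kernel's
parent_mem_cgroup() would return a parent.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption of this model: "no soft limit configured" is the maximum value. */
#define NO_SOFT_LIMIT ULLONG_MAX

/* Hypothetical userspace stand-in for struct mem_cgroup. */
struct group {
	const char *name;
	struct group *parent;		/* NULL at the top of the tree */
	unsigned long long usage;	/* bytes charged, children included */
	unsigned long long soft_limit;	/* NO_SOFT_LIMIT if unset */
};

/* Plays the role of res_counter_soft_limit_excess(): 0 when not over. */
static unsigned long long soft_limit_excess(const struct group *g)
{
	return g->usage > g->soft_limit ? g->usage - g->soft_limit : 0;
}

/*
 * Same loop shape as the quoted function: walk from @g towards @root
 * and report whether @g or any ancestor up to and including @root is
 * over its soft limit.  @top stands in for root_mem_cgroup, which is
 * never tested because it has no soft limit.
 */
static bool over_softlimit(const struct group *top, const struct group *root,
			   const struct group *g)
{
	if (!root)
		root = top;

	for (; g; g = g->parent) {
		if (g == top)
			break;
		if (soft_limit_excess(g))
			return true;
		if (g == root)
			break;
	}
	return false;
}
```

Taking a hierarchy such as A (soft limit 30G, usage 39G) with a child
A1 (soft limit 5G, usage 4G), over_softlimit(&top, NULL, &a1) is true
because the walk reaches A, which is in excess, while
over_softlimit(&top, &a1, &a1) is false because the walk stops at A1
itself.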
Why should we punish a group (subtree) which is perfectly under its soft limit just because some other subtree contributes to the common parent's usage and makes it over its limit? Should we check memcg->use_hierarchy here? Does it even makes sense to setup soft limit on a parent group without hierarchies? Well I have to admit that hierarchies makes me headache. > + > int mem_cgroup_swappiness(struct mem_cgroup *memcg) > { > struct cgroup *cgrp = memcg->css.cgroup; [...] > diff --git a/mm/vmscan.c b/mm/vmscan.c > index e3fd8a7..4279549 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, > .mem_cgroup = memcg, > .zone = zone, > }; > + int epriority = priority; > + /* > + * Put more pressure on hierarchies that exceed their > + * soft limit, to push them back harder than their > + * well-behaving siblings. > + */ > + if (mem_cgroup_over_softlimit(root, memcg)) > + epriority = 0; This sounds too aggressive to me. Shouldn't we just double the pressure or something like that? Previously we always had nr_to_reclaim == SWAP_CLUSTER_MAX when we did memcg reclaim but this is not the case now. For the kswapd we have nr_to_reclaim == ULONG_MAX so we will not break out of the reclaim early and we have to scan a lot. Direct reclaim (shrink or hard limit) shouldn't be affected here. > > - shrink_mem_cgroup_zone(priority, &mz, sc); > + shrink_mem_cgroup_zone(epriority, &mz, sc); > > mem_cgroup_account_reclaim(root, memcg, > sc->nr_reclaimed - nr_reclaimed, -- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx118.postini.com [74.125.245.118]) by kanga.kvack.org (Postfix) with SMTP id EF6A26B004F for ; Fri, 13 Jan 2012 10:50:10 -0500 (EST) Date: Fri, 13 Jan 2012 16:50:01 +0100 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120113155001.GB1653@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120113120406.GC17060@tiehlicka.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: > > Right now, memcg soft limits are implemented by having a sorted tree > > of memcgs that are in excess of their limits. Under global memory > > pressure, kswapd first reclaims from the biggest excessor and then > > proceeds to do regular global reclaim. The result of this is that > > pages are reclaimed from all memcgs, but more scanning happens against > > those above their soft limit. > > > > With global reclaim doing memcg-aware hierarchical reclaim by default, > > this is a lot easier to implement: everytime a memcg is reclaimed > > from, scan more aggressively (per tradition with a priority of 0) if > > it's above its soft limit. With the same end result of scanning > > everybody, but soft limit excessors a bit more. 
> > > > Advantages: > > > > o smoother reclaim: soft limit reclaim is a separate stage before > > global reclaim, whose result is not communicated down the line and > > so overreclaim of the groups in excess is very likely. After this > > patch, soft limit reclaim is fully integrated into regular reclaim > > and each memcg is considered exactly once per cycle. > > > > o true hierarchy support: soft limits are only considered when > > kswapd does global reclaim, but after this patch, targetted > > reclaim of a memcg will mind the soft limit settings of its child > > groups. > > Yes it makes sense. At first I was thinking that soft limit should be > considered only under global mem. pressure (at least documentation says > so) but now it makes sense. > We can push on over-soft limit groups more because they told us they > could sacrifice something... Anyway documentation needs an update as > well. You are right, I'll look into it. > But we have to be little bit careful here. I am still quite confuses how > we should handle hierarchies vs. subtrees. See bellow. > > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > return margin >> PAGE_SHIFT; > > } > > > > +/** > > + * mem_cgroup_over_softlimit > > + * @root: hierarchy root > > + * @memcg: child of @root to test > > + * > > + * Returns %true if @memcg exceeds its own soft limit or contributes > > + * to the soft limit excess of one of its parents up to and including > > + * @root. 
> > + */
> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> > +			       struct mem_cgroup *memcg)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return false;
> > +
> > +	if (!root)
> > +		root = root_mem_cgroup;
> > +
> > +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > +		/* root_mem_cgroup does not have a soft limit */
> > +		if (memcg == root_mem_cgroup)
> > +			break;
> > +		if (res_counter_soft_limit_excess(&memcg->res))
> > +			return true;
> > +		if (memcg == root)
> > +			break;
> > +	}
> > +	return false;
> > +}
>
> Well, this might be little bit tricky. We do not check whether memcg and
> root are in a hierarchy (in terms of use_hierarchy) relation.
>
> If we are under global reclaim then we iterate over all memcgs and so
> there is no guarantee that there is a hierarchical relation between the
> given memcg and its parent. While, on the other hand, if we are doing
> memcg reclaim then we have this guarantee.
>
> Why should we punish a group (subtree) which is perfectly under its soft
> limit just because some other subtree contributes to the common parent's
> usage and makes it over its limit?
> Should we check memcg->use_hierarchy here?

We do, actually.  parent_mem_cgroup() checks the res_counter parent,
which is only set when ->use_hierarchy is also set.  The loop should
never walk upwards outside of a hierarchy.

And yes, if you have this:

        A
       / \
      B   C

and configured a soft limit for A, you asked for both B and C to be
responsible when this limit is exceeded, that's not new behaviour.

> Does it even makes sense to setup soft limit on a parent group without
> hierarchies?
> Well I have to admit that hierarchies makes me headache.

There is no parent without a hierarchy.
It is insofar pretty confusing
that you can actually create a directory hierarchy that does not
reflect a memcg hierarchy:

	# pwd
	/sys/fs/cgroup/memory/foo/bar
	# cat memory.usage_in_bytes
	450560
	# cat ../memory.usage_in_bytes
	0

there is no accounting/limiting/whatever parent-child relationship
between foo and bar.

> > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
> >  			.mem_cgroup = memcg,
> >  			.zone = zone,
> >  		};
> > +		int epriority = priority;
> > +		/*
> > +		 * Put more pressure on hierarchies that exceed their
> > +		 * soft limit, to push them back harder than their
> > +		 * well-behaving siblings.
> > +		 */
> > +		if (mem_cgroup_over_softlimit(root, memcg))
> > +			epriority = 0;
>
> This sounds too aggressive to me. Shouldn't we just double the pressure
> or something like that?

That's the historical value.  When I tried priority - 1, it was not
aggressive enough.

> Previously we always had nr_to_reclaim == SWAP_CLUSTER_MAX when we did
> memcg reclaim but this is not the case now. For the kswapd we have
> nr_to_reclaim == ULONG_MAX so we will not break out of the reclaim early
> and we have to scan a lot.
> Direct reclaim (shrink or hard limit) shouldn't be affected here.

It took me a while: we had SWAP_CLUSTER_MAX in _soft limit reclaim_,
which means that even with priority 0 we would bail after reclaiming
SWAP_CLUSTER_MAX from each lru of a zone.  But it's now happening with
kswapd's own scan_control, so the overreclaim protection is gone.

That is indeed a change in behaviour I haven't noticed, good catch!

I will look into it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx154.postini.com [74.125.245.154]) by kanga.kvack.org (Postfix) with SMTP id B3C4B6B004F for ; Fri, 13 Jan 2012 11:34:26 -0500 (EST) Date: Fri, 13 Jan 2012 17:34:23 +0100 From: Michal Hocko Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120113163423.GG17060@tiehlicka.suse.cz> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120113155001.GB1653@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Fri 13-01-12 16:50:01, Johannes Weiner wrote: > On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: > > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: [...] > > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > > > + struct mem_cgroup *memcg) > > > +{ > > > + if (mem_cgroup_disabled()) > > > + return false; > > > + > > > + if (!root) > > > + root = root_mem_cgroup; > > > + > > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { > > > + /* root_mem_cgroup does not have a soft limit */ > > > + if (memcg == root_mem_cgroup) > > > + break; > > > + if (res_counter_soft_limit_excess(&memcg->res)) > > > + return true; > > > + if (memcg == root) > > > + break; > > > + } > > > + return false; > > > +} > > > > Well, this might be little bit tricky. We do not check whether memcg and > > root are in a hierarchy (in terms of use_hierarchy) relation. 
> > > > If we are under global reclaim then we iterate over all memcgs and so > > there is no guarantee that there is a hierarchical relation between the > > given memcg and its parent. While, on the other hand, if we are doing > > memcg reclaim then we have this guarantee. > > > > Why should we punish a group (subtree) which is perfectly under its soft > > limit just because some other subtree contributes to the common parent's > > usage and makes it over its limit? > > Should we check memcg->use_hierarchy here? > > We do, actually. parent_mem_cgroup() checks the res_counter parent, > which is only set when ->use_hierarchy is also set. Of course I am blind.. We do not setup res_counter parent for !use_hierarchy case. Sorry for noise... Now it makes much better sense. I was wondering how !use_hierarchy could ever work, this should be a signal that I am overlooking something terribly. [...] > > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, > > > .mem_cgroup = memcg, > > > .zone = zone, > > > }; > > > + int epriority = priority; > > > + /* > > > + * Put more pressure on hierarchies that exceed their > > > + * soft limit, to push them back harder than their > > > + * well-behaving siblings. > > > + */ > > > + if (mem_cgroup_over_softlimit(root, memcg)) > > > + epriority = 0; > > > > This sounds too aggressive to me. Shouldn't we just double the pressure > > or something like that? > > That's the historical value. When I tried priority - 1, it was not > aggressive enough. Probably because we want to reclaim too much. Maybe we should do reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain priority level as Ying suggested in her patchset. -- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx178.postini.com [74.125.245.178])
	by kanga.kvack.org (Postfix) with SMTP id D5C216B004F
	for ; Fri, 13 Jan 2012 16:31:17 -0500 (EST)
Received: by qadb10 with SMTP id b10so26611qad.14
	for ; Fri, 13 Jan 2012 13:31:16 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20120112085904.GG24386@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
	<1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
	<20120112085904.GG24386@cmpxchg.org>
Date: Fri, 13 Jan 2012 13:31:16 -0800
Message-ID:
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
From: Ying Han
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote:
> On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
>> > Right now, memcg soft limits are implemented by having a sorted tree
>> > of memcgs that are in excess of their limits.  Under global memory
>> > pressure, kswapd first reclaims from the biggest excessor and then
>> > proceeds to do regular global reclaim.  The result of this is that
>> > pages are reclaimed from all memcgs, but more scanning happens against
>> > those above their soft limit.
>> >
>> > With global reclaim doing memcg-aware hierarchical reclaim by default,
>> > this is a lot easier to implement: everytime a memcg is reclaimed
>> > from, scan more aggressively (per tradition with a priority of 0) if
>> > it's above its soft limit.
>> > With the same end result of scanning
>> > everybody, but soft limit excessors a bit more.
>> >
>> > Advantages:
>> >
>> >  o smoother reclaim: soft limit reclaim is a separate stage before
>> >    global reclaim, whose result is not communicated down the line and
>> >    so overreclaim of the groups in excess is very likely.  After this
>> >    patch, soft limit reclaim is fully integrated into regular reclaim
>> >    and each memcg is considered exactly once per cycle.
>> >
>> >  o true hierarchy support: soft limits are only considered when
>> >    kswapd does global reclaim, but after this patch, targetted
>> >    reclaim of a memcg will mind the soft limit settings of its child
>> >    groups.
>>
>> Why we add soft limit reclaim into target reclaim?
>
>        -> A hard limit 10G, usage 10G
>           -> A1 soft limit 8G, usage 5G
>           -> A2 soft limit 2G, usage 5G
>
> When A hits its hard limit, A2 will experience more pressure than A1.
>
> Soft limits are already applied hierarchically: the memcg that is
> picked from the tree is reclaimed hierarchically.  What I wanted to
> add is the soft limit also being /triggerable/ from non-global
> hierarchy levels.
>
>> Based on the discussions, my understanding is that the soft limit only
>> take effect while the whole machine is under memory contention. We
>> don't want to add extra pressure on a cgroup if there is free memory
>> on the system even the cgroup is above its limit.
>
> If a hierarchy is under pressure, we will reclaim that hierarchy.  We
> allow groups to be prioritized under global pressure, why not allow it
> for local pressure as well?
>
> I am not quite sure what you are objecting to.
>
>> >  o code size: soft limit reclaim requires a lot of code to maintain
>> >    the per-node per-zone rb-trees to quickly find the biggest
>> >    offender, dedicated paths for soft limit reclaim etc.
>> >    while this new implementation gets away without all that.
>> >
>> > Test:
>> >
>> > The test consists of two concurrent kernel build jobs in separate
>> > source trees, the master and the slave.  The two jobs get along nicely
>> > on 600MB of available memory, so this is the zero overcommit control
>> > case.  When available memory is decreased, the overcommit is
>> > compensated by decreasing the soft limit of the slave by the same
>> > amount, in the hope that the slave takes the hit and the master stays
>> > unaffected.
>> >
>> >                                        600M-0M-vanilla         600M-0M-patched
>> > Master walltime (s)                  552.65 (  +0.00%)       552.38 (  -0.05%)
>> > Master walltime (stddev)               1.25 (  +0.00%)         0.92 ( -14.66%)
>> > Master major faults                  204.38 (  +0.00%)       205.38 (  +0.49%)
>> > Master major faults (stddev)          27.16 (  +0.00%)        13.80 ( -47.43%)
>> > Master reclaim                        31.88 (  +0.00%)        37.75 ( +17.87%)
>> > Master reclaim (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
>> > Master scan                           31.88 (  +0.00%)        37.75 ( +17.87%)
>> > Master scan (stddev)                  34.01 (  +0.00%)        75.88 (+119.59%)
>> > Master kswapd reclaim              33922.12 (  +0.00%)     33887.12 (  -0.10%)
>> > Master kswapd reclaim (stddev)       969.08 (  +0.00%)       492.22 ( -49.16%)
>> > Master kswapd scan                 34085.75 (  +0.00%)     33985.75 (  -0.29%)
>> > Master kswapd scan (stddev)         1101.07 (  +0.00%)       563.33 ( -48.79%)
>> > Slave walltime (s)                   552.68 (  +0.00%)       552.12 (  -0.10%)
>> > Slave walltime (stddev)                0.79 (  +0.00%)         1.05 ( +14.76%)
>> > Slave major faults                   212.50 (  +0.00%)       204.50 (  -3.75%)
>> > Slave major faults (stddev)           26.90 (  +0.00%)        13.17 ( -49.20%)
>> > Slave reclaim                         26.12 (  +0.00%)        35.00 ( +32.72%)
>> > Slave reclaim (stddev)                29.42 (  +0.00%)        74.91 (+149.55%)
>> > Slave scan                            31.38 (  +0.00%)        35.00 ( +11.20%)
>> > Slave scan (stddev)                   33.31 (  +0.00%)        74.91 (+121.24%)
>> > Slave kswapd reclaim               34259.00 (  +0.00%)     33469.88 (  -2.30%)
>> > Slave kswapd reclaim (stddev)        925.15 (  +0.00%)       565.07 ( -38.88%)
>> > Slave kswapd scan                  34354.62 (  +0.00%)     33555.75 (  -2.33%)
>> > Slave kswapd scan (stddev)           969.62 (  +0.00%)       581.70 ( -39.97%)
>> >
>> > In the control case, the differences in elapsed time, number of major
>> > faults taken, and reclaim statistics are within the noise for both the
>> > master and the slave job.
>>
>> What's the soft limit setting in the controlled case?
>
> 300MB for both jobs.
>
>> I assume it is the default RESOURCE_MAX. So both Master and Slave get
>> equal pressure before/after the patch, and no differences on the stats
>> should be observed.
>
> Yes.  The control case demonstrates that both jobs can fit
> comfortably, don't compete for space and that in general the patch
> does not have unexpected negative impact (after all, it modifies
> codepaths that were invoked regularly outside of reclaim).
>
>> >                                      600M-280M-vanilla       600M-280M-patched
>> > Master walltime (s)                  595.13 (  +0.00%)       553.19 (  -7.04%)
>> > Master walltime (stddev)               8.31 (  +0.00%)         2.57 ( -61.64%)
>> > Master major faults                 3729.75 (  +0.00%)       783.25 ( -78.98%)
>> > Master major faults (stddev)         258.79 (  +0.00%)       226.68 ( -12.36%)
>> > Master reclaim                       705.00 (  +0.00%)        29.50 ( -95.68%)
>> > Master reclaim (stddev)              232.87 (  +0.00%)        44.72 ( -80.45%)
>> > Master scan                          714.88 (  +0.00%)        30.00 ( -95.67%)
>> > Master scan (stddev)                 237.44 (  +0.00%)        45.39 ( -80.54%)
>> > Master kswapd reclaim                114.75 (  +0.00%)        50.00 ( -55.94%)
>> > Master kswapd reclaim (stddev)       128.51 (  +0.00%)         9.45 ( -91.93%)
>> > Master kswapd scan                   115.75 (  +0.00%)        50.00 ( -56.32%)
>> > Master kswapd scan (stddev)          130.31 (  +0.00%)         9.45 ( -92.04%)
>> > Slave walltime (s)                   631.18 (  +0.00%)       577.68 (  -8.46%)
>> > Slave walltime (stddev)                9.89 (  +0.00%)         3.63 ( -57.47%)
>> > Slave major faults                 28401.75 (  +0.00%)     14656.75 ( -48.39%)
>> > Slave major faults (stddev)         2629.97 (  +0.00%)      1911.81 ( -27.30%)
>> > Slave reclaim                      65400.62 (  +0.00%)      1479.62 ( -97.74%)
>> > Slave reclaim (stddev)             11623.02 (  +0.00%)      1482.13 ( -87.24%)
>> > Slave scan                       9050047.88 (  +0.00%)     95968.25 ( -98.94%)
>> > Slave scan (stddev)              1912786.94 (  +0.00%)     93390.71 ( -95.12%)
>> > Slave kswapd reclaim              327894.50 (  +0.00%)    227099.88 ( -30.74%)
>> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)     16113.14 ( -27.71%)
>> > Slave kswapd scan               34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
>> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)    156754.74 ( -93.79%)
>> >
>> > Here, the available memory is limited to 320 MB, the machine is
>> > overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
>> > of the slave merely 20 MB.
>> >
>> > Looking at the slave job first, it is much better off with the patched
>> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
>> > a third.  The result is much fewer major faults taken, which in turn
>> > lets the job finish quicker.
>>
>> What's the setting of the hard limit here? Is the direct reclaim
>> referring to per-memcg directly reclaim or global one.
>
> The machine's memory is limited to 600M, the hard limits are unset.
> All reclaim is a result of global memory pressure.
>
> With the patched kernel, I could have used a dedicated parent cgroup
> and let master and slave run in children of this group, the soft
> limits would be taken into account just the same.  But this does not
> work on the unpatched kernel, as soft limits are only recognized on
> the global level there.
>
>> > It would be a zero-sum game if the improvement happened at the cost of
>> > the master but looking at the numbers, even the master performs better
>> > with the patched kernel.
>> > In fact, the master job is almost unaffected
>> > on the patched kernel compared to the control case.
>>
>> It makes sense since the master job get less affected by the patch
>> than the slave job under the example. Under the control case, if both
>> master and slave have RESOURCE_MAX soft limit setting, they are under
>> equal memory pressure(priority = DEF_PRIORITY) . On the second
>> example, only the slave pressure being increased by priority = 0, and
>> the Master got scanned with same priority = DEF_PRIORITY pretty much.
>>
>> So I would expect to see more reclaim activities happens in slave on
>> the patched kernel compared to the control case. It seems match the
>> testing result.
>
> Uhm,
>
>> > Slave reclaim                      65400.62 (  +0.00%)      1479.62 ( -97.74%)
>> > Slave reclaim (stddev)             11623.02 (  +0.00%)      1482.13 ( -87.24%)
>> > Slave scan                       9050047.88 (  +0.00%)     95968.25 ( -98.94%)
>> > Slave scan (stddev)              1912786.94 (  +0.00%)     93390.71 ( -95.12%)
>> > Slave kswapd reclaim              327894.50 (  +0.00%)    227099.88 ( -30.74%)
>> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)     16113.14 ( -27.71%)
>> > Slave kswapd scan               34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
>> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)    156754.74 ( -93.79%)
>
> Direct reclaim _shrunk_ by 98%, kswapd reclaim by 31%.
>
>> > This is an odd phenomenon, as the patch does not directly change how
>> > the master is reclaimed.  An explanation for this is that the severe
>> > overreclaim of the slave in the unpatched kernel results in the master
>> > growing bigger than in the patched case.
>> > Combining the fact that
>> > memcgs are scanned according to their size with the increased refault
>> > rate of the overreclaimed slave triggering global reclaim more often
>> > means that overall pressure on the master job is higher in the
>> > unpatched kernel.
>>
>> We can check the Master memory.usage_in_bytes while the job is running.
>
> Yep, the plots of cache/rss over time confirmed exactly this.  The
> unpatched kernel shows higher spikes in the size of the master job
> followed by deeper pits when reclaim kicked in.  The patched kernel is
> much smoother in that regard.
>
>> On the other hand, I don't see why we expect the Master being less
>> reclaimed in the controlled case? On the unpatched kernel, the Master
>> is being reclaimed under global pressure each time anyway since we
>> ignore the return value of softlimit.
>
> I didn't expect that, I expected both jobs to perform equally in the
> control case.  And in the pressurized case, the master being
> unaffected and the slave taking the hit.  The patched kernel does
> this, the unpatched one does not.
>
>> > @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>> >                                                       struct zone *zone);
>> >  struct zone_reclaim_stat*
>> >  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
>> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *);
>>
>> Maybe something like "mem_cgroup_over_soft_limit()" ?
>
> Probably more consistent, yeah.  Will do.
>
>> > @@ -343,7 +314,6 @@ static bool move_file(void)
>> >   * limit reclaim to prevent infinite loops, if they ever occur.
>> >   */
>> >  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100)
>> > -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
>>
>> You might need to remove the comment above as well.
>
> Oops, will fix.
>
>> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>> >        return margin >> PAGE_SHIFT;
>> >  }
>> >
>> > +/**
>> > + * mem_cgroup_over_softlimit
>> > + * @root: hierarchy root
>> > + * @memcg: child of @root to test
>> > + *
>> > + * Returns %true if @memcg exceeds its own soft limit or contributes
>> > + * to the soft limit excess of one of its parents up to and including
>> > + * @root.
>> > + */
>> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> > +                              struct mem_cgroup *memcg)
>> > +{
>> > +       if (mem_cgroup_disabled())
>> > +               return false;
>> > +
>> > +       if (!root)
>> > +               root = root_mem_cgroup;
>> > +
>> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> > +               /* root_mem_cgroup does not have a soft limit */
>> > +               if (memcg == root_mem_cgroup)
>> > +                       break;
>> > +               if (res_counter_soft_limit_excess(&memcg->res))
>> > +                       return true;
>> > +               if (memcg == root)
>> > +                       break;
>> > +       }
>>
>> Here it adds pressure on a cgroup if one of its parents exceeds soft
>> limit, although the cgroup itself is under soft limit. It does change
>> my understanding of soft limit, and might introduce regression of our
>> existing use cases.
>>
>> Here is an example:
>>
>> Machine capacity 32G and we over-commit by 8G.
>>
>> root
>>   -> A (hard limit 20G, soft limit 15G, usage 16G)
>>        -> A1 (soft limit 5G, usage 4G)
>>        -> A2 (soft limit 10G, usage 12G)
>>   -> B (hard limit 20G, soft limit 10G, usage 16G)
>>
>> under global reclaim, we don't want to add pressure on A1 although its
>> parent A exceeds its soft limit. Assume that if we set the soft limit
>> corresponding to each cgroup's working set size (hot memory), and it
>> will introduce regression to A1 in that case.
>>
>> In my existing implementation, i am checking the cgroup's soft limit
>> standalone w/o looking its ancestors.
>
> Why do you set the soft limit of A in the first place if you don't
> want it to be enforced?

The soft limit should be enforced under certain conditions, not always.
The soft limit of A is set to be enforced when the parent of A and B
is under memory pressure. For example:

Machine capacity 32G and we over-commit by 8G

root
  -> A (hard limit 20G, soft limit 12G, usage 20G)
       -> A1 (soft limit 2G, usage 1G)
       -> A2 (soft limit 10G, usage 19G)
  -> B (hard limit 20G, soft limit 10G, usage 0G)

Now, A is under memory pressure since the total usage is hitting its
hard limit. Then we start hierarchical reclaim under A, and each
cgroup under A also takes its soft limit into consideration. In this
case, we should only set priority = 0 for A2, since it contributes to
A's charge as well as exceeding its own soft limit. Why also punish A1
(set priority = 0), whose usage is under its soft limit? I can imagine
this will introduce regressions in existing environments where the
soft limit is set based on the working set size of the cgroup.

To answer the question of why we set a soft limit on A: it is used to
over-commit the host while sharing the resource with its sibling (B in
this case). If the machine is under memory contention, we would like
to push down memory to A or B depending on their usage and soft limit.

--Ying

>
> This is not really new behaviour, soft limit reclaim has always been
> operating hierarchically on the biggest excessor.  In your case, the
> excess of A is smaller than the excess of A2 and so that weird "only
> pick the biggest excessor" behaviour hides it, but consider this:
>
>        -> A soft 30G, usage 39G
>           -> A1 soft 5G, usage 4G
>           -> A2 soft 10G, usage 15G
>           -> A3 soft 15G, usage 20G
>
> Upstream would pick A from the soft limit tree and reclaim its
> children with priority 0, including A1.
>
> On the other hand, if you don't consider ancestral soft limits, you
> break perfectly reasonable setups like these
>
>        -> A soft 10G, usage 20G
>           -> A1 usage 10G
>           -> A2 usage 10G
>        -> B soft 10G, usage 11G
>
> where upstream would pick A and reclaim it recursively, but your
> version would only apply higher pressure to B.
>
> If you would just not set the soft limit of A in your case:
>
>        -> A (hard limit 20G, usage 16G)
>           -> A1 (soft limit 5G, usage 4G)
>           -> A2 (soft limit 10G, usage 12G)
>        -> B (hard limit 20G, soft limit 10G, usage 16G)
>
> only A2 and B would experience higher pressure upon global pressure.
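
To make the disagreement above concrete, here is a small userspace model
(not kernel code) of the walk in the posted patch next to a standalone
check, applied to Ying's first example hierarchy. struct cg and both
helpers are invented for this sketch, sizes are in GB, and a soft limit
of 0 stands in for "not set".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; sizes are in GB. */
struct cg {
	const char *name;
	unsigned long soft_limit;	/* 0 means "no soft limit set" */
	unsigned long usage;
	struct cg *parent;		/* NULL for the top-level group */
};

/* Standalone check: only the group's own soft limit counts. */
static bool exceeds_own_soft_limit(const struct cg *cg)
{
	return cg->soft_limit && cg->usage > cg->soft_limit;
}

/*
 * Walk as in the posted patch: test cg and every ancestor up to and
 * including root, skipping the top-level group, which has no soft limit.
 */
static bool over_softlimit_inclusive(const struct cg *root, const struct cg *cg)
{
	for (; cg; cg = cg->parent) {
		if (!cg->parent)
			break;
		if (exceeds_own_soft_limit(cg))
			return true;
		if (cg == root)
			break;
	}
	return false;
}

/* Ying's first example hierarchy. */
static struct cg top = { "root", 0, 32, NULL };
static struct cg A   = { "A",   15, 16, &top };
static struct cg A1  = { "A1",   5,  4, &A };
static struct cg A2  = { "A2",  10, 12, &A };
static struct cg B   = { "B",   10, 16, &top };
```

Under global reclaim (root = &top), the inclusive walk reports A1 as over
its soft limit because A is, while exceeds_own_soft_limit(&A1) is false.
That difference is the behaviour being debated.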
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161])
	by kanga.kvack.org (Postfix) with SMTP id 88C636B004F
	for ; Fri, 13 Jan 2012 16:45:31 -0500 (EST)
Received: by qcsg13 with SMTP id g13so643386qcs.14
	for ; Fri, 13 Jan 2012 13:45:30 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20120113163423.GG17060@tiehlicka.suse.cz>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
	<1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
	<20120113120406.GC17060@tiehlicka.suse.cz>
	<20120113155001.GB1653@cmpxchg.org>
	<20120113163423.GG17060@tiehlicka.suse.cz>
Date: Fri, 13 Jan 2012 13:45:30 -0800
Message-ID:
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
From: Ying Han
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko wrote:
> On Fri 13-01-12 16:50:01, Johannes Weiner wrote:
>> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
>> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
> [...]
>> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> > > +			       struct mem_cgroup *memcg)
>> > > +{
>> > > +	if (mem_cgroup_disabled())
>> > > +		return false;
>> > > +
>> > > +	if (!root)
>> > > +		root = root_mem_cgroup;
>> > > +
>> > > +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> > > +		/* root_mem_cgroup does not have a soft limit */
>> > > +		if (memcg == root_mem_cgroup)
>> > > +			break;
>> > > +		if (res_counter_soft_limit_excess(&memcg->res))
>> > > +			return true;
>> > > +		if (memcg == root)
>> > > +			break;
>> > > +	}
>> > > +	return false;
>> > > +}
>> >
>> > Well, this might be little bit tricky. We do not check whether memcg and
>> > root are in a hierarchy (in terms of use_hierarchy) relation.
>> >
>> > If we are under global reclaim then we iterate over all memcgs and so
>> > there is no guarantee that there is a hierarchical relation between the
>> > given memcg and its parent. While, on the other hand, if we are doing
>> > memcg reclaim then we have this guarantee.
>> >
>> > Why should we punish a group (subtree) which is perfectly under its soft
>> > limit just because some other subtree contributes to the common parent's
>> > usage and makes it over its limit?
>> > Should we check memcg->use_hierarchy here?
>>
>> We do, actually.  parent_mem_cgroup() checks the res_counter parent,
>> which is only set when ->use_hierarchy is also set.
>
> Of course I am blind.. We do not setup res_counter parent for
> !use_hierarchy case. Sorry for noise...
> Now it makes much better sense. I was wondering how !use_hierarchy could
> ever work, this should be a signal that I am overlooking something
> terribly.
>
> [...]
>> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>> > >  			.mem_cgroup = memcg,
>> > >  			.zone = zone,
>> > >  		};
>> > > +		int epriority = priority;
>> > > +		/*
>> > > +		 * Put more pressure on hierarchies that exceed their
>> > > +		 * soft limit, to push them back harder than their
>> > > +		 * well-behaving siblings.
>> > > +		 */
>> > > +		if (mem_cgroup_over_softlimit(root, memcg))
>> > > +			epriority = 0;
>> >
>> > This sounds too aggressive to me. Shouldn't we just double the pressure
>> > or something like that?
>>
>> That's the historical value.  When I tried priority - 1, it was not
>> aggressive enough.
>
> Probably because we want to reclaim too much. Maybe we should do
> reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain
> priority level as Ying suggested in her patchset.

I plan to post that change on top of this; this patch set does the
basic work that allows us to make further improvements.

I still like the design to skip over_soft_limit cgroups until a certain
priority. One way to set up the soft limit for each cgroup is to base
it on its actual working set size, and we prefer to punish A first,
which has lots of page cache (cold file pages above its soft limit),
rather than reclaiming anon pages from B (below its soft limit). Only
if we cannot get enough pages reclaimed from A will we start
reclaiming from B.

This might not be the ideal solution, but it should be a good start.
Thoughts?

--Ying

> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
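
Ying's patchset is not part of this thread, so the following is only a
guess at the general shape of such a policy, not her implementation:
groups within their soft limit are spared during the early, gentle
passes and only become eligible once the priority has dropped to some
cut-off. SOFT_LIMIT_SKIP_PRIORITY and should_reclaim_from() are invented
names, and the cut-off value of 8 is arbitrary.

```c
#include <stdbool.h>

/* Reclaim starts at this priority and counts down towards 0. */
#define DEF_PRIORITY 12

/* Hypothetical cut-off, not taken from any posted patch. */
#define SOFT_LIMIT_SKIP_PRIORITY 8

/*
 * Decide whether a group takes part in the current reclaim pass.
 * Groups above their soft limit always do; groups below it only once
 * earlier passes have failed and the priority reached the cut-off.
 */
static bool should_reclaim_from(bool over_soft_limit, int priority)
{
	if (over_soft_limit)
		return true;
	return priority <= SOFT_LIMIT_SKIP_PRIORITY;
}
```

In the A/B example above, A (over its soft limit) would be scanned from
DEF_PRIORITY onward, while B would only be touched if reclaim from A
alone did not free enough pages by the time the cut-off is reached.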
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx152.postini.com [74.125.245.152])
	by kanga.kvack.org (Postfix) with SMTP id 6E72A6B004F
	for ; Fri, 13 Jan 2012 17:44:41 -0500 (EST)
Date: Fri, 13 Jan 2012 23:44:24 +0100
From: Johannes Weiner
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Message-ID: <20120113224424.GC1653@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
	<1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
	<20120112085904.GG24386@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ying Han
Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
> On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote:
> > On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
> >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> >> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >> >        return margin >> PAGE_SHIFT;
> >> >  }
> >> >
> >> > +/**
> >> > + * mem_cgroup_over_softlimit
> >> > + * @root: hierarchy root
> >> > + * @memcg: child of @root to test
> >> > + *
> >> > + * Returns %true if @memcg exceeds its own soft limit or contributes
> >> > + * to the soft limit excess of one of its parents up to and including
> >> > + * @root.
> >> > + */
> >> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> >> > +                              struct mem_cgroup *memcg)
> >> > +{
> >> > +       if (mem_cgroup_disabled())
> >> > +               return false;
> >> > +
> >> > +       if (!root)
> >> > +               root = root_mem_cgroup;
> >> > +
> >> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >> > +               /* root_mem_cgroup does not have a soft limit */
> >> > +               if (memcg == root_mem_cgroup)
> >> > +                       break;
> >> > +               if (res_counter_soft_limit_excess(&memcg->res))
> >> > +                       return true;
> >> > +               if (memcg == root)
> >> > +                       break;
> >> > +       }
> >>
> >> Here it adds pressure on a cgroup if one of its parents exceeds soft
> >> limit, although the cgroup itself is under soft limit. It does change
> >> my understanding of soft limit, and might introduce regression of our
> >> existing use cases.
> >>
> >> Here is an example:
> >>
> >> Machine capacity 32G and we over-commit by 8G.
> >>
> >> root
> >>   -> A (hard limit 20G, soft limit 15G, usage 16G)
> >>        -> A1 (soft limit 5G, usage 4G)
> >>        -> A2 (soft limit 10G, usage 12G)
> >>   -> B (hard limit 20G, soft limit 10G, usage 16G)
> >>
> >> under global reclaim, we don't want to add pressure on A1 although its
> >> parent A exceeds its soft limit. Assume that if we set the soft limit
> >> corresponding to each cgroup's working set size (hot memory), and it
> >> will introduce regression to A1 in that case.
> >>
> >> In my existing implementation, i am checking the cgroup's soft limit
> >> standalone w/o looking its ancestors.
> >
> > Why do you set the soft limit of A in the first place if you don't
> > want it to be enforced?
>
> The soft limit should be enforced under certain condition, not always.
> The soft limit of A is set to be enforced when the parent of A and B
> is under memory pressure.
> For example:
>
> Machine capacity 32G and we over-commit by 8G
>
> root
>   -> A (hard limit 20G, soft limit 12G, usage 20G)
>        -> A1 (soft limit 2G, usage 1G)
>        -> A2 (soft limit 10G, usage 19G)
>   -> B (hard limit 20G, soft limit 10G, usage 0G)
>
> Now, A is under memory pressure since the total usage is hitting its
> hard limit. Then we start hierarchical reclaim under A, and each
> cgroup under A also takes consideration of soft limit. In this case,
> we should only set priority = 0 to A2 since it contributes to A's
> charge as well as exceeding its own soft limit. Why punishing A1 (set
> priority = 0) also which has usage under its soft limit ? I can
> imagine it will introduce regression to existing environment which the
> soft limit is set based on the working set size of the cgroup.
>
> To answer the question why we set soft limit to A, it is used to
> over-commit the host while sharing the resource with its sibling (B in
> this case). If the machine is under memory contention, we would like
> to push down memory to A or B depends on their usage and soft limit.

D'oh, I think the problem is just that we walk up the hierarchy one
level too many when checking whether a group exceeds a soft limit.  The
soft limit is a signal to distribute pressure that comes from above,
it's meaningless and should indeed be ignored on the level the pressure
originates from.

Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
but not including root, wouldn't that do exactly what we both want?

Example:

1. If global memory is short, we reclaim with root=root_mem_cgroup.
   A1 and A2 get soft limit reclaimed because of A's soft limit
   excess, just like the current kernel would do.

2. If A hits its hard limit, we reclaim with root=A, so we only mind
   the soft limits of A1 and A2.  A1 is below its soft limit, all
   good.  A2 is above its soft limit, gets treated accordingly.  This
   is new behaviour, the current kernel would just reclaim them
   equally.
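
The two cases can be traced with a small userspace model (not kernel
code) of a walk that stops before root, run against Ying's second
example hierarchy. struct cg and the helper names are made up for this
sketch, sizes are in GB, and a soft limit of 0 stands in for "not set".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; sizes are in GB. */
struct cg {
	const char *name;
	unsigned long soft_limit;	/* 0 means "no soft limit set" */
	unsigned long usage;
	struct cg *parent;		/* NULL for the top-level group */
};

static bool exceeds_own_soft_limit(const struct cg *cg)
{
	return cg->soft_limit && cg->usage > cg->soft_limit;
}

/* Test cg and its ancestors up to, but not including, root. */
static bool over_soft_limit_exclusive(const struct cg *root, const struct cg *cg)
{
	for (; cg; cg = cg->parent) {
		if (cg == root)
			break;
		if (exceeds_own_soft_limit(cg))
			return true;
	}
	return false;
}

/* Ying's second example hierarchy. */
static struct cg top = { "root", 0, 20, NULL };
static struct cg A   = { "A",   12, 20, &top };
static struct cg A1  = { "A1",   2,  1, &A };
static struct cg A2  = { "A2",  10, 19, &A };
static struct cg B   = { "B",   10,  0, &top };
```

With root = &top (case 1), A1 and A2 both report an excess because of A,
and B does not. With root = &A (case 2), only A2 does.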

Code:

bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
				struct mem_cgroup *memcg)
{
	if (mem_cgroup_disabled())
		return false;

	if (!root)
		root = root_mem_cgroup;

	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		if (memcg == root)
			break;
		if (res_counter_soft_limit_excess(&memcg->res))
			return true;
	}
	return false;
}

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131])
	by kanga.kvack.org (Postfix) with SMTP id 9A7976B00C0
	for ; Tue, 17 Jan 2012 09:22:35 -0500 (EST)
Received: by ggnp4 with SMTP id p4so3989814ggn.14
	for ; Tue, 17 Jan 2012 06:22:34 -0800 (PST)
Message-ID: <4F158418.2090509@gmail.com>
Date: Tue, 17 Jan 2012 22:22:16 +0800
From: Sha
MIME-Version: 1.0
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
	<1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
	<20120112085904.GG24386@cmpxchg.org>
	<20120113224424.GC1653@cmpxchg.org>
In-Reply-To: <20120113224424.GC1653@cmpxchg.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: Ying Han, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 01/14/2012 06:44 AM, Johannes Weiner wrote:
> On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
>> On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote:
>>> On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
>>>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
>>>>> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct
mem_cgroup *memcg) >>>>> return margin>> PAGE_SHIFT; >>>>> } >>>>> >>>>> +/** >>>>> + * mem_cgroup_over_softlimit >>>>> + * @root: hierarchy root >>>>> + * @memcg: child of @root to test >>>>> + * >>>>> + * Returns %true if @memcg exceeds its own soft limit or contributes >>>>> + * to the soft limit excess of one of its parents up to and including >>>>> + * @root. >>>>> + */ >>>>> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >>>>> + struct mem_cgroup *memcg) >>>>> +{ >>>>> + if (mem_cgroup_disabled()) >>>>> + return false; >>>>> + >>>>> + if (!root) >>>>> + root = root_mem_cgroup; >>>>> + >>>>> + for (; memcg; memcg = parent_mem_cgroup(memcg)) { >>>>> + /* root_mem_cgroup does not have a soft limit */ >>>>> + if (memcg == root_mem_cgroup) >>>>> + break; >>>>> + if (res_counter_soft_limit_excess(&memcg->res)) >>>>> + return true; >>>>> + if (memcg == root) >>>>> + break; >>>>> + } >>>> Here it adds pressure on a cgroup if one of its parents exceeds soft >>>> limit, although the cgroup itself is under soft limit. It does change >>>> my understanding of soft limit, and might introduce regression of our >>>> existing use cases. >>>> >>>> Here is an example: >>>> >>>> Machine capacity 32G and we over-commit by 8G. >>>> >>>> root >>>> -> A (hard limit 20G, soft limit 15G, usage 16G) >>>> -> A1 (soft limit 5G, usage 4G) >>>> -> A2 (soft limit 10G, usage 12G) >>>> -> B (hard limit 20G, soft limit 10G, usage 16G) >>>> >>>> under global reclaim, we don't want to add pressure on A1 although its >>>> parent A exceeds its soft limit. Assume that if we set the soft limit >>>> corresponding to each cgroup's working set size (hot memory), and it >>>> will introduce regression to A1 in that case. >>>> >>>> In my existing implementation, i am checking the cgroup's soft limit >>>> standalone w/o looking its ancestors. >>> Why do you set the soft limit of A in the first place if you don't >>> want it to be enforced? 
>> The soft limit should be enforced under certain condition, not always. >> The soft limit of A is set to be enforced when the parent of A and B >> is under memory pressure. For example: >> >> Machine capacity 32G and we over-commit by 8G >> >> root >> -> A (hard limit 20G, soft limit 12G, usage 20G) >> -> A1 (soft limit 2G, usage 1G) >> -> A2 (soft limit 10G, usage 19G) >> -> B (hard limit 20G, soft limit 10G, usage 0G) >> >> Now, A is under memory pressure since the total usage is hitting its >> hard limit. Then we start hierarchical reclaim under A, and each >> cgroup under A also takes consideration of soft limit. In this case, >> we should only set priority = 0 to A2 since it contributes to A's >> charge as well as exceeding its own soft limit. Why punishing A1 (set >> priority = 0) also which has usage under its soft limit ? I can >> imagine it will introduce regression to existing environment which the >> soft limit is set based on the working set size of the cgroup. >> >> To answer the question why we set soft limit to A, it is used to >> over-commit the host while sharing the resource with its sibling (B in >> this case). If the machine is under memory contention, we would like >> to push down memory to A or B depends on their usage and soft limit. > D'oh, I think the problem is just that we walk up the hierarchy one > too many when checking whether a group exceeds a soft limit. The soft > limit is a signal to distribute pressure that comes from above, it's > meaningless and should indeed be ignored on the level the pressure > originates from. > > Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to > but not including root, wouldn't that do exactly what we both want? > > Example: > > 1. If global memory is short, we reclaim with root=root_mem_cgroup. > A1 and A2 get soft limit reclaimed because of A's soft limit > excess, just like the current kernel would do. > > 2. 
If A hits its hard limit, we reclaim with root=A, so we only mind > the soft limits of A1 and A2. A1 is below its soft limit, all > good. A2 is above its soft limit, gets treated accordingly. This > is new behaviour, the current kernel would just reclaim them > equally. > > Code: > > bool mem_cgroup_over_soft_limit(struct mem_cgroup *root, > struct mem_cgroup *memcg) > { > if (mem_cgroup_disabled()) > return false; > > if (!root) > root = root_mem_cgroup; > > for (; memcg; memcg = parent_mem_cgroup(memcg)) { > if (memcg == root) > break; > if (res_counter_soft_limit_excess(&memcg->res)) > return true; > } > return false; > } Hi Johannes, I don't think it solve the root of the problem, example: root -> A (hard limit 20G, soft limit 12G, usage 20G) -> A1 ( soft limit 2G, usage 1G) -> A2 ( soft limit 10G, usage 19G) ->B1 (soft limit 5G, usage 4G) ->B2 (soft limit 5G, usage 15G) Now A is hitting its hard limit and start hierarchical reclaim under A. If we choose B1 to go through mem_cgroup_over_soft_limit, it will return true because its parent A2 has a large usage and will lead to priority=0 reclaiming. But in fact it should be B2 to be punished. IMHO, it may checking the cgroup's soft limit standalone without looking up its ancestors just as Ying said. Thanks, Sha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx133.postini.com [74.125.245.133]) by kanga.kvack.org (Postfix) with SMTP id 866326B00C2 for ; Tue, 17 Jan 2012 09:54:08 -0500 (EST) Date: Tue, 17 Jan 2012 15:53:48 +0100 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120117145348.GA3144@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F158418.2090509@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sha Cc: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: > On 01/14/2012 06:44 AM, Johannes Weiner wrote: > >On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote: > >>On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote: > >>>On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: > >>>>On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > >>>>>@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > >>>>> return margin>> PAGE_SHIFT; > >>>>> } > >>>>> > >>>>>+/** > >>>>>+ * mem_cgroup_over_softlimit > >>>>>+ * @root: hierarchy root > >>>>>+ * @memcg: child of @root to test > >>>>>+ * > >>>>>+ * Returns %true if @memcg exceeds its own soft limit or contributes > >>>>>+ * to the soft limit excess of one of its parents up to and including > >>>>>+ * @root. 
> >>>>>+ */ > >>>>>+bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > >>>>>+ struct mem_cgroup *memcg) > >>>>>+{ > >>>>>+ if (mem_cgroup_disabled()) > >>>>>+ return false; > >>>>>+ > >>>>>+ if (!root) > >>>>>+ root = root_mem_cgroup; > >>>>>+ > >>>>>+ for (; memcg; memcg = parent_mem_cgroup(memcg)) { > >>>>>+ /* root_mem_cgroup does not have a soft limit */ > >>>>>+ if (memcg == root_mem_cgroup) > >>>>>+ break; > >>>>>+ if (res_counter_soft_limit_excess(&memcg->res)) > >>>>>+ return true; > >>>>>+ if (memcg == root) > >>>>>+ break; > >>>>>+ } > >>>>Here it adds pressure on a cgroup if one of its parents exceeds soft > >>>>limit, although the cgroup itself is under soft limit. It does change > >>>>my understanding of soft limit, and might introduce regression of our > >>>>existing use cases. > >>>> > >>>>Here is an example: > >>>> > >>>>Machine capacity 32G and we over-commit by 8G. > >>>> > >>>>root > >>>> -> A (hard limit 20G, soft limit 15G, usage 16G) > >>>> -> A1 (soft limit 5G, usage 4G) > >>>> -> A2 (soft limit 10G, usage 12G) > >>>> -> B (hard limit 20G, soft limit 10G, usage 16G) > >>>> > >>>>under global reclaim, we don't want to add pressure on A1 although its > >>>>parent A exceeds its soft limit. Assume that if we set the soft limit > >>>>corresponding to each cgroup's working set size (hot memory), and it > >>>>will introduce regression to A1 in that case. > >>>> > >>>>In my existing implementation, i am checking the cgroup's soft limit > >>>>standalone w/o looking its ancestors. > >>>Why do you set the soft limit of A in the first place if you don't > >>>want it to be enforced? > >>The soft limit should be enforced under certain condition, not always. > >>The soft limit of A is set to be enforced when the parent of A and B > >>is under memory pressure. 
For example: > >> > >>Machine capacity 32G and we over-commit by 8G > >> > >>root > >>-> A (hard limit 20G, soft limit 12G, usage 20G) > >> -> A1 (soft limit 2G, usage 1G) > >> -> A2 (soft limit 10G, usage 19G) > >>-> B (hard limit 20G, soft limit 10G, usage 0G) > >> > >>Now, A is under memory pressure since the total usage is hitting its > >>hard limit. Then we start hierarchical reclaim under A, and each > >>cgroup under A also takes consideration of soft limit. In this case, > >>we should only set priority = 0 to A2 since it contributes to A's > >>charge as well as exceeding its own soft limit. Why punishing A1 (set > >>priority = 0) also which has usage under its soft limit ? I can > >>imagine it will introduce regression to existing environment which the > >>soft limit is set based on the working set size of the cgroup. > >> > >>To answer the question why we set soft limit to A, it is used to > >>over-commit the host while sharing the resource with its sibling (B in > >>this case). If the machine is under memory contention, we would like > >>to push down memory to A or B depends on their usage and soft limit. > >D'oh, I think the problem is just that we walk up the hierarchy one > >too many when checking whether a group exceeds a soft limit. The soft > >limit is a signal to distribute pressure that comes from above, it's > >meaningless and should indeed be ignored on the level the pressure > >originates from. > > > >Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to > >but not including root, wouldn't that do exactly what we both want? > > > >Example: > > > >1. If global memory is short, we reclaim with root=root_mem_cgroup. > > A1 and A2 get soft limit reclaimed because of A's soft limit > > excess, just like the current kernel would do. > > > >2. If A hits its hard limit, we reclaim with root=A, so we only mind > > the soft limits of A1 and A2. A1 is below its soft limit, all > > good. A2 is above its soft limit, gets treated accordingly. 
This > > is new behaviour, the current kernel would just reclaim them > > equally. > > > >Code: > > > >bool mem_cgroup_over_soft_limit(struct mem_cgroup *root, > > struct mem_cgroup *memcg) > >{ > > if (mem_cgroup_disabled()) > > return false; > > > > if (!root) > > root = root_mem_cgroup; > > > > for (; memcg; memcg = parent_mem_cgroup(memcg)) { > > if (memcg == root) > > break; > > if (res_counter_soft_limit_excess(&memcg->res)) > > return true; > > } > > return false; > >} > Hi Johannes, > > I don't think it solve the root of the problem, example: > root > -> A (hard limit 20G, soft limit 12G, usage 20G) > -> A1 ( soft limit 2G, usage 1G) > -> A2 ( soft limit 10G, usage 19G) > ->B1 (soft limit 5G, usage 4G) > ->B2 (soft limit 5G, usage 15G) > > Now A is hitting its hard limit and start hierarchical reclaim under A. > If we choose B1 to go through mem_cgroup_over_soft_limit, it will > return true because its parent A2 has a large usage and will lead to > priority=0 reclaiming. But in fact it should be B2 to be punished. Because A2 is over its soft limit, the whole hierarchy below it should be preferred over A1, so both B1 and B2 should be soft limit reclaimed to be consistent with behaviour at the root level. > IMHO, it may checking the cgroup's soft limit standalone without > looking up its ancestors just as Ying said. Again, this would be a regression as soft limits have been applied hierarchically forever. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx189.postini.com [74.125.245.189]) by kanga.kvack.org (Postfix) with SMTP id 1F6226B004F for ; Tue, 17 Jan 2012 15:25:33 -0500 (EST)
Received: by qcsf14 with SMTP id f14so1073721qcs.14 for ; Tue, 17 Jan 2012 12:25:31 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20120117145348.GA3144@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org>
Date: Tue, 17 Jan 2012 12:25:31 -0800
Message-ID:
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
From: Ying Han
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: Sha , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner wrote:
> On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote:
>> On 01/14/2012 06:44 AM, Johannes Weiner wrote:
>> >On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
>> >>On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote:
>> >>>On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
>> >>>>On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
>> >>>>>@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>> >>>>>        return margin>>  PAGE_SHIFT;
>> >>>>>  }
>> >>>>>
>> >>>>>+/**
>> >>>>>+ * mem_cgroup_over_softlimit
>> >>>>>+ * @root: hierarchy root
>> >>>>>+ * @memcg: child of @root to test
>> >>>>>+ *
>> >>>>>+ * Returns %true if @memcg exceeds its own soft limit or contributes
>> >>>>>+ * to the soft limit excess of one of its parents up to and including
>> >>>>>+ * @root.
>> >>>>>+ */
>> >>>>>+bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> >>>>>+                              struct mem_cgroup *memcg)
>> >>>>>+{
>> >>>>>+       if (mem_cgroup_disabled())
>> >>>>>+               return false;
>> >>>>>+
>> >>>>>+       if (!root)
>> >>>>>+               root = root_mem_cgroup;
>> >>>>>+
>> >>>>>+       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> >>>>>+               /* root_mem_cgroup does not have a soft limit */
>> >>>>>+               if (memcg == root_mem_cgroup)
>> >>>>>+                       break;
>> >>>>>+               if (res_counter_soft_limit_excess(&memcg->res))
>> >>>>>+                       return true;
>> >>>>>+               if (memcg == root)
>> >>>>>+                       break;
>> >>>>>+       }
>> >>>>Here it adds pressure on a cgroup if one of its parents exceeds soft
>> >>>>limit, although the cgroup itself is under soft limit. It does change
>> >>>>my understanding of soft limit, and might introduce regression of our
>> >>>>existing use cases.
>> >>>>
>> >>>>Here is an example:
>> >>>>
>> >>>>Machine capacity 32G and we over-commit by 8G.
>> >>>>
>> >>>>root
>> >>>>   ->  A (hard limit 20G, soft limit 15G, usage 16G)
>> >>>>        ->  A1 (soft limit 5G, usage 4G)
>> >>>>        ->  A2 (soft limit 10G, usage 12G)
>> >>>>   ->  B (hard limit 20G, soft limit 10G, usage 16G)
>> >>>>
>> >>>>under global reclaim, we don't want to add pressure on A1 although its
>> >>>>parent A exceeds its soft limit. Assume that if we set the soft limit
>> >>>>corresponding to each cgroup's working set size (hot memory), and it
>> >>>>will introduce regression to A1 in that case.
>> >>>>
>> >>>>In my existing implementation, i am checking the cgroup's soft limit
>> >>>>standalone w/o looking its ancestors.
>> >>>Why do you set the soft limit of A in the first place if you don't
>> >>>want it to be enforced?
>> >>The soft limit should be enforced under certain condition, not always.
>> >>The soft limit of A is set to be enforced when the parent of A and B
>> >>is under memory pressure. For example:
>> >>
>> >>Machine capacity 32G and we over-commit by 8G
>> >>
>> >>root
>> >>->  A (hard limit 20G, soft limit 12G, usage 20G)
>> >>        ->  A1 (soft limit 2G, usage 1G)
>> >>        ->  A2 (soft limit 10G, usage 19G)
>> >>->  B (hard limit 20G, soft limit 10G, usage 0G)
>> >>
>> >>Now, A is under memory pressure since the total usage is hitting its
>> >>hard limit. Then we start hierarchical reclaim under A, and each
>> >>cgroup under A also takes consideration of soft limit. In this case,
>> >>we should only set priority = 0 to A2 since it contributes to A's
>> >>charge as well as exceeding its own soft limit. Why punishing A1 (set
>> >>priority = 0) also which has usage under its soft limit ? I can
>> >>imagine it will introduce regression to existing environment which the
>> >>soft limit is set based on the working set size of the cgroup.
>> >>
>> >>To answer the question why we set soft limit to A, it is used to
>> >>over-commit the host while sharing the resource with its sibling (B in
>> >>this case). If the machine is under memory contention, we would like
>> >>to push down memory to A or B depends on their usage and soft limit.
>> >D'oh, I think the problem is just that we walk up the hierarchy one
>> >too many when checking whether a group exceeds a soft limit.  The soft
>> >limit is a signal to distribute pressure that comes from above, it's
>> >meaningless and should indeed be ignored on the level the pressure
>> >originates from.
>> >
>> >Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
>> >but not including root, wouldn't that do exactly what we both want?
>> >
>> >Example:
>> >
>> >1. If global memory is short, we reclaim with root=root_mem_cgroup.
>> >    A1 and A2 get soft limit reclaimed because of A's soft limit
>> >    excess, just like the current kernel would do.
>> >
>> >2. If A hits its hard limit, we reclaim with root=A, so we only mind
>> >    the soft limits of A1 and A2.  A1 is below its soft limit, all
>> >    good.  A2 is above its soft limit, gets treated accordingly.  This
>> >    is new behaviour, the current kernel would just reclaim them
>> >    equally.
>> >
>> >Code:
>> >
>> >bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
>> >                             struct mem_cgroup *memcg)
>> >{
>> >     if (mem_cgroup_disabled())
>> >             return false;
>> >
>> >     if (!root)
>> >             root = root_mem_cgroup;
>> >
>> >     for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> >             if (memcg == root)
>> >                     break;
>> >             if (res_counter_soft_limit_excess(&memcg->res))
>> >                     return true;
>> >     }
>> >     return false;
>> >}
>> Hi Johannes,
>>
>> I don't think it solve the root of the problem, example:
>> root
>> -> A (hard limit 20G, soft limit 12G, usage 20G)
>>     -> A1 ( soft limit 2G,   usage 1G)
>>     -> A2 ( soft limit 10G, usage 19G)
>>            ->B1 (soft limit 5G, usage 4G)
>>            ->B2 (soft limit 5G, usage 15G)
>>
>> Now A is hitting its hard limit and start hierarchical reclaim under A.
>> If we choose B1 to go through mem_cgroup_over_soft_limit, it will
>> return true because its parent A2 has a large usage and will lead to
>> priority=0 reclaiming. But in fact it should be B2 to be punished.
>
> Because A2 is over its soft limit, the whole hierarchy below it should
> be preferred over A1, so both B1 and B2 should be soft limit reclaimed
> to be consistent with behaviour at the root level.
>
>> IMHO, it may checking the cgroup's soft limit standalone without
>> looking up its ancestors just as Ying said.
>
> Again, this would be a regression as soft limits have been applied
> hierarchically forever.

If we are comparing it to the current implementation, agree that the
soft reclaim is applied hierarchically. In the example above, A2 will
be picked for soft reclaim while A is hitting its hard limit, which in
turns reclaim from B1 and B2 regardless of their soft limit setting.
However, I haven't convinced myself this is how we are gonna use the
soft limit.

The soft limit setting for each cgroup is a hit for applying pressure
under memory contention. One way of setting the soft limit is based on
the cgroup's working set size. Thus, we allow cgroup to grow above its
soft limit with cold page cache unless there is a memory pressure
comes from above. Under the hierarchical reclaim, we will exam the
soft limit and only apply extra pressure to the ones above their soft
limit. Here the same example:

root
-> A (hard limit 20G, soft limit 12G, usage 20G)
    -> A1 ( soft limit 2G, usage 1G)
    -> A2 ( soft limit 10G, usage 19G)

           ->B1 (soft limit 5G, usage 4G)
           ->B2 (soft limit 5G, usage 15G)

If A is hitting its hard limit, we will reclaim all the children under
A hierarchically but only adding extra pressure to the ones above
their soft limits (A2, B2). Adding extra pressure to B1 will introduce
known regression based on customer expectation since the 4G usage are
hot memory.
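The selection described above, where only groups over their own soft limit get extra pressure, could be sketched like this in userspace. All names here (`struct group`, `over_own_soft_limit()`, `pick_extra_pressure()`) are made up for illustration and sizes are plain gigabyte counts; this is not the existing implementation mentioned earlier in the thread, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; sizes in GB. */
struct group {
	const char *name;
	unsigned long soft_limit;
	unsigned long usage;
};

/* Standalone check: ancestors are deliberately not consulted. */
static bool over_own_soft_limit(const struct group *g)
{
	return g->usage > g->soft_limit;
}

/*
 * Scan the @n groups below the reclaim root and store those that
 * would get extra pressure in @out (which must hold @n pointers).
 * Returns the number of groups stored.
 */
static size_t pick_extra_pressure(const struct group *groups, size_t n,
				  const struct group **out)
{
	size_t i, picked = 0;

	assert(groups != NULL && out != NULL);

	for (i = 0; i < n; i++)
		if (over_own_soft_limit(&groups[i]))
			out[picked++] = &groups[i];
	return picked;
}
```

For the tree above (A1, A2, B1, B2 below A) this picks A2 and B2 and leaves A1 and B1 alone, even though B1's parent A2 is over its soft limit.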
I am not aware of how the existing soft reclaim being used, i bet there are not a lot. If we are making changes on the current implementation, we should also take the opportunity to think about the initial design as well. Thoughts? --Ying -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx126.postini.com [74.125.245.126]) by kanga.kvack.org (Postfix) with SMTP id 6651D6B004D for ; Tue, 17 Jan 2012 16:56:42 -0500 (EST) Date: Tue, 17 Jan 2012 22:56:26 +0100 From: Johannes Weiner Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120117215626.GA2380@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Ying Han Cc: Sha , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 12:25:31PM -0800, Ying Han wrote: > On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner wrote: > > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: > >> IMHO, it may checking the cgroup's soft limit standalone without > >> looking up its ancestors just as Ying said. > > > > Again, this would be a regression as soft limits have been applied > > hierarchically forever. > > If we are comparing it to the current implementation, agree that the > soft reclaim is applied hierarchically. 
In the example above, A2 will > be picked for soft reclaim while A is hitting its hard limit, which in > turns reclaim from B1 and B2 regardless of their soft limit setting. > However, I haven't convinced myself this is how we are gonna use the > soft limit. Of course I'm comparing it to the current implementation, this is what I'm changing! > The soft limit setting for each cgroup is a hit for applying pressure > under memory contention. One way of setting the soft limit is based on > the cgroup's working set size. Thus, we allow cgroup to grow above its > soft limit with cold page cache unless there is a memory pressure > comes from above. Under the hierarchical reclaim, we will exam the > soft limit and only apply extra pressure to the ones above their soft > limit. Here the same example: > > root > -> A (hard limit 20G, soft limit 12G, usage 20G) > -> A1 ( soft limit 2G, usage 1G) > -> A2 ( soft limit 10G, usage 19G) > > ->B1 (soft limit 5G, usage 4G) > ->B2 (soft limit 5G, usage 15G) > > If A is hitting its hard limit, we will reclaim all the children under > A hierarchically but only adding extra pressure to the ones above > their soft limits (A2, B2). Adding extra pressure to B1 will introduce > known regression based on customer expectation since the 4G usage are > hot memory. I can only repeat myself: A has a soft limit set, so the customer expects global pressure to arise sooner or later. If that happens, A will be soft-limit reclaimed hierarchically in the _existing code_. That's how the soft limit currently works and I don't mean to change it _with this patch_. The customer has to expect that B1 can be reclaimed as a consequence of the soft limit in A or A2 today, so I don't know where this expectation of different behaviour should even come from. How can this be a regression?! > I am not aware of how the existing soft reclaim being used, i bet > there are not a lot. 
If we are making changes on the current > implementation, we should also take the opportunity to think about the > initial design as well. Thoughts?

I agree that these semantics should be up for debate.

And I think changing it to something like you have in mind is indeed a
good idea; to not have soft limits apply hierarchically but instead
follow down the whole chain and only soft limit reclaim those that are
themselves above their soft limit.

But it's an entirely different matter! This patch is supposed to do
only two things:

1. refactor the soft limit implementation, staying as close as
   possible/practical to the current semantics and

2. fix the inconsistency that soft limits are ignored when pressure
   does not originate at the root_mem_cgroup.

If that is too much change in semantics I can easily ditch 2., I just
didn't see the use of maintaining an inconsistency that resulted
purely from the limitations of the current implementation by re-adding
more code and because I think that this would not be surprising
behaviour. It would be as simple as adding an extra check in reclaim
that only minds soft limits upon global pressure:

	if (global_reclaim(sc) && mem_cgroup_over_soft_limit(root, memcg))
		/* resulting action */

and it would have nothing to do how soft limits are actually applied
once triggered. I can include this in the next version, but it won't
fix the problem you seem to be having with the _existing_ behaviour.

I also don't think that my patch will get in the way of what you are
planning to do: in fact, you already have code that easily turns
mem_cgroup_over_soft_limit() into a non-hierarchical predicate. Even
more will change when you invert the soft limits to become actual
guarantees and skip reclaiming memcgs that are below their soft limits
but I don't think this patch is in the way of doing that, either.

I feel that these are all orthogonal changes.
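As a rough illustration of the extra global-pressure check mentioned above, the sketch below models it in userspace. It makes several assumptions: `struct group`, a `struct scan_control` with a single `target` field, `global_reclaim()`, `over_soft_limit()` and `reclaim_priority()` are simplified stand-ins invented for this example, the "resulting action" is taken to be scanning at priority 0 as discussed earlier in the thread, and sizes are plain gigabyte counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12		/* default scan priority, as in the kernel */

/* Hypothetical stand-in for struct mem_cgroup; sizes in GB. */
struct group {
	const struct group *parent;	/* NULL for the topmost group */
	unsigned long soft_limit;
	unsigned long usage;
};

/* Simplified stand-in: target == NULL means global (physical) pressure. */
struct scan_control {
	const struct group *target;
};

static bool global_reclaim(const struct scan_control *sc)
{
	return sc->target == NULL;
}

/* Hierarchical check of everything strictly below @root. */
static bool over_soft_limit(const struct group *root, const struct group *g)
{
	for (; g && g != root; g = g->parent)
		if (g->usage > g->soft_limit)
			return true;
	return false;
}

/* Soft limits are only honoured when the pressure is global. */
static int reclaim_priority(const struct scan_control *sc,
			    const struct group *root, const struct group *g)
{
	assert(sc != NULL);

	if (global_reclaim(sc) && over_soft_limit(root, g))
		return 0;
	return DEF_PRIORITY;
}
```

In this model, under global pressure both A1 and A2 from the earlier example (A: soft 12, usage 20) end up at priority 0 because A exceeds its soft limit. When A hits its hard limit instead, the gate stays closed and both keep the default priority, so soft limits play no role there.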
So if possible, could we take just one step at a time and leave hypothetical behaviour out of it unless the proposed changes clearly get in the way of where we agreed we want to go? If I misunderstood everything completely and you actually believe this patch will get in the way, could you tell me where and how? Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx116.postini.com [74.125.245.116]) by kanga.kvack.org (Postfix) with SMTP id EA0046B004D for ; Wed, 18 Jan 2012 00:27:58 -0500 (EST) Received: from m4.gw.fujitsu.co.jp (unknown [10.0.50.74]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id D3BEE3EE0BB for ; Wed, 18 Jan 2012 14:27:56 +0900 (JST) Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id BC27A45DE50 for ; Wed, 18 Jan 2012 14:27:56 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 9FE3E45DE4D for ; Wed, 18 Jan 2012 14:27:56 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 937321DB802F for ; Wed, 18 Jan 2012 14:27:56 +0900 (JST) Received: from m107.s.css.fujitsu.com (m107.s.css.fujitsu.com [10.240.81.147]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 435171DB8037 for ; Wed, 18 Jan 2012 14:27:56 +0900 (JST) Date: Wed, 18 Jan 2012 14:26:38 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-Id: <20120118142638.11667d2c.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20120113121645.GA1653@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> 
<1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com> <20120113121645.GA1653@cmpxchg.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Fri, 13 Jan 2012 13:16:56 +0100 Johannes Weiner wrote: > On Thu, Jan 12, 2012 at 10:54:27AM +0900, KAMEZAWA Hiroyuki wrote: > > Thank you for your work and the result seems atractive and code is much > > simpler. My small concerns are.. > > > > 1. This approach may increase latency of direct-reclaim because of priority=0. > > I think strictly speaking yes, but note that with kswapd being less > likely to get stuck in hammering on one group, the need for allocators > to enter direct reclaim itself is reduced. > > However, if this really becomes a problem in real world loads, the fix > is pretty easy: just ignore the soft limit for direct reclaim. We can > still consider it from hard limit reclaim and kswapd. > > > 2. In a case numa-spread/interleave application run in its own container, > > pages on a node may paged-out again and again becasue of priority=0 > > if some other application runs in the node. > > It seems difficult to use soft-limit with numa-aware applications. > > Do you have suggestions ? > > This is a question about soft limits in general rather than about this > particular patch, right? > Partially, yes. My concern is related to "1". Assume an application is binded to some cpu/node and try to allocate memory. If its memcg's usage is over softlimit, this application will play bad because newly allocated memory will be reclaim target soon, again.... 
> And if I understand correctly, the problem you are referring to is > this: an application and parts of a soft-limited container share a > node, the soft limit setting means that the container's pages on that > node are reclaimed harder. At that point, the container's share on > that node becomes tiny, but since the soft limit is oblivious to > nodes, the expansion of the other application pushes the soft-limited > container off that node completely as long as the container stays > above its soft limit with the usage on other nodes. > > What would you think about having node-local soft limits that take the > node size into account? > > local_soft_limit = soft_limit * node_size / memcg_size > > The soft limit can be exceeded globally, but the container is no > longer pushed off a node on which it's only occupying a small share of > memory. > Yes, I think this kind of care is required. What is the 'node_size' here ? size of pgdat ? size of per-node usage in the memcg ? > Putting it into proportion of the memcg size, not overall memory size > has the following advantages: > > 1. if the container is sitting on only one of several available > nodes without exceeding the limit globally, the memcg will not be > reclaimed harder just because it has a relatively large share of the > node. > > 2. if the soft limit excess is ridiculously high, the local soft > limits will be pushed down, so the tolerance for smaller shares on > nodes goes down in proportion to the global soft limit excess. > > Example: > > 4G soft limit * 2G node / 4G container = 2G node-local limit > > The container is globally within its soft limit, so the local limit is > at least the size of the node. It's never reclaimed harder compared > to other applications on the node. > > 4G soft limit * 2G node / 5G container = ~1.6G node-local limit > > Here, it will experience more pressure initially, but it will level > off when the shrinking usage and the thereby increasing node-local > soft limit meet. 
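The arithmetic in the two examples above can be checked with a few lines of C. This is just a sketch of the proposed formula: `node_local_soft_limit()` is a hypothetical helper and not an existing kernel function, sizes are given in megabytes, and integer division rounds down.

```c
#include <assert.h>

/*
 * local_soft_limit = soft_limit * node_size / memcg_size
 *
 * All sizes in MB.  Hypothetical helper for illustration only.
 */
static unsigned long long node_local_soft_limit(unsigned long long soft_limit,
						unsigned long long node_size,
						unsigned long long memcg_size)
{
	/* Assumption of this sketch: callers never pass an empty memcg. */
	assert(memcg_size > 0);

	return soft_limit * node_size / memcg_size;
}
```

With a 4096M soft limit and a 2048M node, a 4096M container gets a 2048M node-local limit, and a 5120M container gets 1638M, which is the ~1.6G quoted above.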
> From that point on, the container and the competing
> application will be treated equally during reclaim.
>
> Finally, if the container is 16G in size, i.e. 300% in excess, the
> per-node tolerance is at 512M node-local soft limit, which IMO strikes
> a good balance between zero tolerance and still applying some stress
> to the hugely oversized container when other applications (with
> virtually unlimited soft limits) want to run on the same node.
>
> What do you think?

I like the idea. Another idea is changing 'priority' based on per-node stats,
if that is not too complicated...

Thanks,
-Kame

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx174.postini.com [74.125.245.174]) by kanga.kvack.org (Postfix) with SMTP id 521956B004D for ; Wed, 18 Jan 2012 04:25:38 -0500 (EST)
Date: Wed, 18 Jan 2012 10:25:09 +0100
From: Johannes Weiner
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Message-ID: <20120118092509.GI24386@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Sha
Cc: Ying Han, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Wed, Jan 18, 2012 at 03:17:25PM +0800, Sha wrote:
> > > I don't think it solves the root of the problem, example:
> > > root
> > > -> A (hard limit 20G, soft limit 12G, usage 20G)
> > >    -> A1 (soft limit 2G, usage 1G)
> > >    -> A2 (soft limit 10G, usage 19G)
> > >       -> B1 (soft limit 5G, usage 4G)
> > >       -> B2 (soft limit 5G, usage 15G)
> > >
> > > Now A is hitting its hard limit and starts hierarchical reclaim under A.
> > > If we choose B1 to go through mem_cgroup_over_softlimit, it will
> > > return true because its parent A2 has a large usage and will lead to
> > > priority=0 reclaiming. But in fact it should be B2 to be punished.
>
> > Because A2 is over its soft limit, the whole hierarchy below it should
> > be preferred over A1, so both B1 and B2 should be soft limit reclaimed
> > to be consistent with behaviour at the root level.
>
> Well it is just the behavior that I'm expecting actually. But with my
> humble comprehension, I can't catch the soft-limit-based hierarchical
> reclaiming under the target cgroup (A2) in the current implementation
> or after the patch. Both the current mem_cgroup_soft_reclaim and
> shrink_zone select the victim sub-cgroup by mem_cgroup_iter, but it
> doesn't take the soft limit into consideration. Did I miss anything?

No, currently soft limits are ignored if pressure originates from
below root_mem_cgroup.

But iff soft limits are applied right now, they are applied
hierarchically, see mem_cgroup_soft_limit_reclaim().

In my opinion, the fact that soft limits are ignored when pressure is
triggered sub-root_mem_cgroup is an artifact of the per-zone tree, so
I allowed soft limits to be taken into account below root_mem_cgroup.

But IMO, this is something different from how soft limit reclaim is
applied once triggered: currently, soft limit reclaim applies to a
whole hierarchy, including all children. And this I left unchanged.
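The hierarchical check discussed in this exchange can be tried out in userspace with a toy model. The sketch below is an assumption-laden stand-in, not the kernel implementation: struct toy_memcg and toy_over_softlimit are hypothetical names, sizes are plain GiB counts, and the walk only mirrors the loop of mem_cgroup_over_softlimit() as posted in the patch (climb from the scanned group towards the reclaim root, report true on the first group in excess).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup; sizes in GiB, names hypothetical. */
struct toy_memcg {
    const struct toy_memcg *parent; /* NULL for the top-level root */
    unsigned long soft_limit;       /* unused for the top-level root */
    unsigned long usage;
};

/*
 * Mirrors the loop of the posted mem_cgroup_over_softlimit(): start at
 * @memcg and climb towards @root, returning true as soon as a group on
 * the way exceeds its soft limit. @root itself is tested before the
 * walk stops; the top-level root is never tested.
 */
static bool toy_over_softlimit(const struct toy_memcg *root,
                               const struct toy_memcg *memcg)
{
    for (; memcg; memcg = memcg->parent) {
        if (!memcg->parent)         /* top-level root has no soft limit */
            break;
        if (memcg->usage > memcg->soft_limit)
            return true;
        if (memcg == root)
            break;
    }
    return false;
}

/* The hierarchy from the example in this mail. */
static const struct toy_memcg toy_root = { NULL, 0, 20 };
static const struct toy_memcg A  = { &toy_root, 12, 20 };
static const struct toy_memcg A1 = { &A,   2,  1 };
static const struct toy_memcg A2 = { &A,  10, 19 };
static const struct toy_memcg B1 = { &A2,  5,  4 };
static const struct toy_memcg B2 = { &A2,  5, 15 };
```

In this model, with reclaim rooted at A, B2 reports true through its own excess and B1 through its parent A2, which is the behaviour described in the reply. A1 reports true as well in this toy, because the walk also tests A, the reclaim root, before it stops.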
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id 39AF76B004D for ; Wed, 18 Jan 2012 04:45:32 -0500 (EST)
Date: Wed, 18 Jan 2012 10:45:23 +0100
From: Johannes Weiner
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Message-ID: <20120118094523.GJ24386@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> <20120113163423.GG17060@tiehlicka.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ying Han
Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Fri, Jan 13, 2012 at 01:45:30PM -0800, Ying Han wrote:
> On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko wrote:
> > On Fri 13-01-12 16:50:01, Johannes Weiner wrote:
> >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
> >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
> > [...]
> >> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> >> > > +			       struct mem_cgroup *memcg)
> >> > > +{
> >> > > +	if (mem_cgroup_disabled())
> >> > > +		return false;
> >> > > +
> >> > > +	if (!root)
> >> > > +		root = root_mem_cgroup;
> >> > > +
> >> > > +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >> > > +		/* root_mem_cgroup does not have a soft limit */
> >> > > +		if (memcg == root_mem_cgroup)
> >> > > +			break;
> >> > > +		if (res_counter_soft_limit_excess(&memcg->res))
> >> > > +			return true;
> >> > > +		if (memcg == root)
> >> > > +			break;
> >> > > +	}
> >> > > +	return false;
> >> > > +}
> >> >
> >> > Well, this might be a little bit tricky. We do not check whether memcg and
> >> > root are in a hierarchy (in terms of use_hierarchy) relation.
> >> >
> >> > If we are under global reclaim then we iterate over all memcgs and so
> >> > there is no guarantee that there is a hierarchical relation between the
> >> > given memcg and its parent. While, on the other hand, if we are doing
> >> > memcg reclaim then we have this guarantee.
> >> >
> >> > Why should we punish a group (subtree) which is perfectly under its soft
> >> > limit just because some other subtree contributes to the common parent's
> >> > usage and makes it over its limit?
> >> > Should we check memcg->use_hierarchy here?
> >>
> >> We do, actually. parent_mem_cgroup() checks the res_counter parent,
> >> which is only set when ->use_hierarchy is also set.
> >
> > Of course I am blind.. We do not set up the res_counter parent for the
> > !use_hierarchy case. Sorry for the noise...
> > Now it makes much better sense. I was wondering how !use_hierarchy could
> > ever work, this should be a signal that I am overlooking something
> > terribly.
> >
> > [...]
> >> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
> >> > >  			.mem_cgroup = memcg,
> >> > >  			.zone = zone,
> >> > >  		};
> >> > > +		int epriority = priority;
> >> > > +		/*
> >> > > +		 * Put more pressure on hierarchies that exceed their
> >> > > +		 * soft limit, to push them back harder than their
> >> > > +		 * well-behaving siblings.
> >> > > +		 */
> >> > > +		if (mem_cgroup_over_softlimit(root, memcg))
> >> > > +			epriority = 0;
> >> >
> >> > This sounds too aggressive to me. Shouldn't we just double the pressure
> >> > or something like that?
> >>
> >> That's the historical value. When I tried priority - 1, it was not
> >> aggressive enough.
> >
> > Probably because we want to reclaim too much. Maybe we should
> > reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until a certain
> > priority level, as Ying suggested in her patchset.
>
> I plan to post that change on top of this, and this patch set does the
> basic stuff to allow us to do further improvements.
>
> I still like the design to skip over_soft_limit cgroups until a certain
> priority. One way to set up the soft limit for each cgroup is to base it
> on its actual working set size, and we prefer to punish A first with
> lots of page cache (cold file pages above soft limit) rather than reclaiming
> anon pages from B (below soft limit). Unless we cannot get enough
> pages reclaimed from A, we will start reclaiming from B.
>
> This might not be the ideal solution, but should be a good start. Thoughts?

I don't like this design at all because unless you add weird code to
detect if soft limits apply to any memcgs on the reclaimed hierarchy,
you may iterate over the same bunch of memcgs several times doing
nothing.

For example, in the default case of no soft limits set anywhere, you
repeatedly walk ALL memcgs in the system doing jack until you reach
your threshold priority level. Elegant is something else in my book.
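The objection above can be put into numbers with a small counting sketch. It is a simplified model under stated assumptions, not the design from Ying's patchset: toy_group, toy_wasted_visits and the threshold parameter are hypothetical, each priority level is assumed to walk every group once, and groups within their soft limit are assumed to be skipped without any scanning while the priority is still above the threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DEF_PRIORITY 12 /* same starting value as the kernel's DEF_PRIORITY */

/* Hypothetical, minimal view of a memcg for this model. */
struct toy_group {
    bool over_soft_limit;
};

/*
 * Count group visits that do no reclaim work under a "skip groups
 * within their soft limit until a threshold priority" policy.
 * @threshold is a hypothetical tunable: on passes whose priority is
 * still above it, only groups over their soft limit are scanned.
 */
static unsigned int toy_wasted_visits(const struct toy_group *groups,
                                      size_t nr, int threshold)
{
    unsigned int wasted = 0;

    for (int prio = TOY_DEF_PRIORITY; prio > threshold; prio--)
        for (size_t i = 0; i < nr; i++)
            if (!groups[i].over_soft_limit)
                wasted++; /* visited, but skipped without scanning */
    return wasted;
}
```

With no soft limits set anywhere, no group is in excess, so every visit above the threshold is wasted: four groups and a threshold of 9 give three full walks, or 12 no-op visits, before any page is scanned.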
Once we invert soft limits to mean guarantees and make the default
soft limit not infinity but zero, then we can ignore memcgs below
their soft limit for a few priority levels just fine because being
below the soft limit is the exception. But I don't really want to
make this quite invasive behavioural change a requirement for a
refactoring patch if possible.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756486Ab2AJPD2 (ORCPT ); Tue, 10 Jan 2012 10:03:28 -0500
Received: from zene.cmpxchg.org ([85.214.230.12]:41401 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756379Ab2AJPDO (ORCPT ); Tue, 10 Jan 2012 10:03:14 -0500
From: Johannes Weiner
To: Andrew Morton
Cc: Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Date: Tue, 10 Jan 2012 16:02:52 +0100
Message-Id: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
X-Mailer: git-send-email 1.7.7.5
In-Reply-To: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Right now, memcg soft limits are implemented by having a sorted tree
of memcgs that are in excess of their limits.
Under global memory pressure, kswapd first reclaims from the biggest
excessor and then proceeds to do regular global reclaim. The result
of this is that pages are reclaimed from all memcgs, but more scanning
happens against those above their soft limit.

With global reclaim doing memcg-aware hierarchical reclaim by default,
this is a lot easier to implement: every time a memcg is reclaimed
from, scan more aggressively (per tradition with a priority of 0) if
it's above its soft limit. The end result is the same: everybody is
scanned, but soft limit excessors a bit more.

Advantages:

  o smoother reclaim: currently, soft limit reclaim is a separate
    stage before global reclaim, whose result is not communicated down
    the line and so overreclaim of the groups in excess is very
    likely. After this patch, soft limit reclaim is fully integrated
    into regular reclaim and each memcg is considered exactly once per
    cycle.

  o true hierarchy support: currently, soft limits are only considered
    when kswapd does global reclaim, but after this patch, targeted
    reclaim of a memcg will mind the soft limit settings of its child
    groups.

  o code size: currently, soft limit reclaim requires a lot of code to
    maintain the per-node per-zone rb-trees to quickly find the
    biggest offender, dedicated paths for soft limit reclaim etc.,
    while this new implementation gets away without all that.

Test:

The test consists of two concurrent kernel build jobs in separate
source trees, the master and the slave. The two jobs get along nicely
on 600MB of available memory, so this is the zero overcommit control
case. When available memory is decreased, the overcommit is
compensated by decreasing the soft limit of the slave by the same
amount, in the hope that the slave takes the hit and the master stays
unaffected.
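To make the "priority of 0" wording in the description above concrete, here is a rough sketch of how the scan priority translates into a scan target. It assumes the usual reclaim rule that a pass looks at roughly the list size shifted right by the priority, and it leaves out the anon/file balancing the kernel applies on top. The helper toy_scan_target is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DEF_PRIORITY 12 /* same starting value as the kernel's DEF_PRIORITY */

/*
 * Hypothetical helper: base number of pages to scan on one LRU list.
 * Assumes a pass targets roughly (list size >> priority) pages; the
 * anon/file proportioning done by the kernel is not modelled.
 */
static unsigned long toy_scan_target(unsigned long lru_pages, int priority,
                                     bool over_soft_limit)
{
    /* as in the patch: groups over their soft limit get priority 0 */
    int epriority = over_soft_limit ? 0 : priority;

    assert(epriority >= 0 && epriority <= TOY_DEF_PRIORITY);
    return lru_pages >> epriority;
}
```

For a list of 2^20 pages, the first pass at priority 12 targets 256 pages if the group is within its soft limit and the whole list if it is not; lowering the priority by one only doubles the target to 512 pages.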
                                        600M-0M-vanilla         600M-0M-patched
Master walltime (s)                   552.65 (  +0.00%)       552.38 (  -0.05%)
Master walltime (stddev)                1.25 (  +0.00%)         0.92 ( -14.66%)
Master major faults                   204.38 (  +0.00%)       205.38 (  +0.49%)
Master major faults (stddev)           27.16 (  +0.00%)        13.80 ( -47.43%)
Master reclaim                         31.88 (  +0.00%)        37.75 ( +17.87%)
Master reclaim (stddev)                34.01 (  +0.00%)        75.88 (+119.59%)
Master scan                            31.88 (  +0.00%)        37.75 ( +17.87%)
Master scan (stddev)                   34.01 (  +0.00%)        75.88 (+119.59%)
Master kswapd reclaim               33922.12 (  +0.00%)     33887.12 (  -0.10%)
Master kswapd reclaim (stddev)        969.08 (  +0.00%)       492.22 ( -49.16%)
Master kswapd scan                  34085.75 (  +0.00%)     33985.75 (  -0.29%)
Master kswapd scan (stddev)          1101.07 (  +0.00%)       563.33 ( -48.79%)
Slave walltime (s)                    552.68 (  +0.00%)       552.12 (  -0.10%)
Slave walltime (stddev)                 0.79 (  +0.00%)         1.05 ( +14.76%)
Slave major faults                    212.50 (  +0.00%)       204.50 (  -3.75%)
Slave major faults (stddev)            26.90 (  +0.00%)        13.17 ( -49.20%)
Slave reclaim                          26.12 (  +0.00%)        35.00 ( +32.72%)
Slave reclaim (stddev)                 29.42 (  +0.00%)        74.91 (+149.55%)
Slave scan                             31.38 (  +0.00%)        35.00 ( +11.20%)
Slave scan (stddev)                    33.31 (  +0.00%)        74.91 (+121.24%)
Slave kswapd reclaim                34259.00 (  +0.00%)     33469.88 (  -2.30%)
Slave kswapd reclaim (stddev)         925.15 (  +0.00%)       565.07 ( -38.88%)
Slave kswapd scan                   34354.62 (  +0.00%)     33555.75 (  -2.33%)
Slave kswapd scan (stddev)            969.62 (  +0.00%)       581.70 ( -39.97%)

In the control case, the differences in elapsed time, number of major
faults taken, and reclaim statistics are within the noise for both the
master and the slave job.
                                      600M-280M-vanilla       600M-280M-patched
Master walltime (s)                   595.13 (  +0.00%)       553.19 (  -7.04%)
Master walltime (stddev)                8.31 (  +0.00%)         2.57 ( -61.64%)
Master major faults                  3729.75 (  +0.00%)       783.25 ( -78.98%)
Master major faults (stddev)          258.79 (  +0.00%)       226.68 ( -12.36%)
Master reclaim                        705.00 (  +0.00%)        29.50 ( -95.68%)
Master reclaim (stddev)               232.87 (  +0.00%)        44.72 ( -80.45%)
Master scan                           714.88 (  +0.00%)        30.00 ( -95.67%)
Master scan (stddev)                  237.44 (  +0.00%)        45.39 ( -80.54%)
Master kswapd reclaim                 114.75 (  +0.00%)        50.00 ( -55.94%)
Master kswapd reclaim (stddev)        128.51 (  +0.00%)         9.45 ( -91.93%)
Master kswapd scan                    115.75 (  +0.00%)        50.00 ( -56.32%)
Master kswapd scan (stddev)           130.31 (  +0.00%)         9.45 ( -92.04%)
Slave walltime (s)                    631.18 (  +0.00%)       577.68 (  -8.46%)
Slave walltime (stddev)                 9.89 (  +0.00%)         3.63 ( -57.47%)
Slave major faults                  28401.75 (  +0.00%)     14656.75 ( -48.39%)
Slave major faults (stddev)          2629.97 (  +0.00%)      1911.81 ( -27.30%)
Slave reclaim                       65400.62 (  +0.00%)      1479.62 ( -97.74%)
Slave reclaim (stddev)              11623.02 (  +0.00%)      1482.13 ( -87.24%)
Slave scan                        9050047.88 (  +0.00%)     95968.25 ( -98.94%)
Slave scan (stddev)               1912786.94 (  +0.00%)     93390.71 ( -95.12%)
Slave kswapd reclaim               327894.50 (  +0.00%)    227099.88 ( -30.74%)
Slave kswapd reclaim (stddev)       22289.43 (  +0.00%)     16113.14 ( -27.71%)
Slave kswapd scan                34987335.75 (  +0.00%)   1362367.12 ( -96.11%)
Slave kswapd scan (stddev)        2523642.98 (  +0.00%)    156754.74 ( -93.79%)

Here, the available memory is limited to 320 MB, the machine is
overcommitted by 280 MB. The soft limit of the master is 300 MB, that
of the slave merely 20 MB.

Looking at the slave job first, it is much better off with the patched
kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
a third. The result is much fewer major faults taken, which in turn
lets the job finish quicker.

It would be a zero-sum game if the improvement happened at the cost of
the master but looking at the numbers, even the master performs better
with the patched kernel.
In fact, the master job is almost unaffected on the patched kernel compared to the control case. This is an odd phenomenon, as the patch does not directly change how the master is reclaimed. An explanation for this is that the severe overreclaim of the slave in the unpatched kernel results in the master growing bigger than in the patched case. Combining the fact that memcgs are scanned according to their size with the increased refault rate of the overreclaimed slave triggering global reclaim more often means that overall pressure on the master job is higher in the unpatched kernel. At any rate, the patched kernel seems to do a much better job at both overall resource allocation under soft limit overcommit as well as the requested prioritization of the master job. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 18 +-- mm/memcontrol.c | 412 ++++---------------------------------------- mm/vmscan.c | 80 +-------- 3 files changed, 48 insertions(+), 462 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6c1d69e..72368b7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone); struct zone_reclaim_stat* mem_cgroup_get_reclaim_stat_from_page(struct page *page); +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *); void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *, unsigned long, unsigned long, bool); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, @@ -155,9 +156,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); u64 mem_cgroup_get_limit(struct mem_cgroup *memcg); void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); @@ 
-362,22 +360,20 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_inc_page_stat(struct page *page, - enum mem_cgroup_page_stat_item idx) +static inline bool +mem_cgroup_over_softlimit(struct mem_cgroup *root, struct mem_cgroup *memcg) { + return false; } -static inline void mem_cgroup_dec_page_stat(struct page *page, +static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { } -static inline -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) +static inline void mem_cgroup_dec_page_stat(struct page *page, + enum mem_cgroup_page_stat_item idx) { - return 0; } static inline diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 170dff4..d4f7ae5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -35,7 +35,6 @@ #include #include #include -#include #include #include #include @@ -118,12 +117,10 @@ enum mem_cgroup_events_index { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, - MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; #define THRESHOLDS_EVENTS_TARGET (128) -#define SOFTLIMIT_EVENTS_TARGET (1024) #define NUMAINFO_EVENTS_TARGET (1024) struct mem_cgroup_stat_cpu { @@ -149,12 +146,6 @@ struct mem_cgroup_per_zone { struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; struct zone_reclaim_stat reclaim_stat; - struct rb_node tree_node; /* RB tree node */ - unsigned long long usage_in_excess;/* Set to the value by which */ - /* the soft limit is exceeded*/ - bool on_tree; - struct mem_cgroup *mem; /* Back pointer, we cannot */ - /* use container_of */ }; /* Macro for accessing counter */ #define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) @@ -167,26 +158,6 @@ struct mem_cgroup_lru_info { struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES]; }; -/* - * Cgroups above their limits are maintained in a RB-Tree, independent of - * their hierarchy 
representation - */ - -struct mem_cgroup_tree_per_zone { - struct rb_root rb_root; - spinlock_t lock; -}; - -struct mem_cgroup_tree_per_node { - struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES]; -}; - -struct mem_cgroup_tree { - struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; -}; - -static struct mem_cgroup_tree soft_limit_tree __read_mostly; - struct mem_cgroup_threshold { struct eventfd_ctx *eventfd; u64 threshold; @@ -343,7 +314,6 @@ static bool move_file(void) * limit reclaim to prevent infinite loops, if they ever occur. */ #define MEM_CGROUP_MAX_RECLAIM_LOOPS (100) -#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2) enum charge_type { MEM_CGROUP_CHARGE_TYPE_CACHE = 0, @@ -398,164 +368,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page) return mem_cgroup_zoneinfo(memcg, nid, zid); } -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_node_zone(int nid, int zid) -{ - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_from_page(struct page *page) -{ - int nid = page_to_nid(page); - int zid = page_zonenum(page); - - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static void -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz, - unsigned long long new_usage_in_excess) -{ - struct rb_node **p = &mctz->rb_root.rb_node; - struct rb_node *parent = NULL; - struct mem_cgroup_per_zone *mz_node; - - if (mz->on_tree) - return; - - mz->usage_in_excess = new_usage_in_excess; - if (!mz->usage_in_excess) - return; - while (*p) { - parent = *p; - mz_node = rb_entry(parent, struct mem_cgroup_per_zone, - tree_node); - if (mz->usage_in_excess < mz_node->usage_in_excess) - p = &(*p)->rb_left; - /* - * We can't avoid mem cgroups that are over their soft - * limit by the same amount - */ - else if (mz->usage_in_excess >= 
mz_node->usage_in_excess) - p = &(*p)->rb_right; - } - rb_link_node(&mz->tree_node, parent, p); - rb_insert_color(&mz->tree_node, &mctz->rb_root); - mz->on_tree = true; -} - -static void -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - if (!mz->on_tree) - return; - rb_erase(&mz->tree_node, &mctz->rb_root); - mz->on_tree = false; -} - -static void -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - spin_lock(&mctz->lock); - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - spin_unlock(&mctz->lock); -} - - -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) -{ - unsigned long long excess; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - int nid = page_to_nid(page); - int zid = page_zonenum(page); - mctz = soft_limit_tree_from_page(page); - - /* - * Necessary to update all ancestors when hierarchy is used. - * because their event counter is not touched. - */ - for (; memcg; memcg = parent_mem_cgroup(memcg)) { - mz = mem_cgroup_zoneinfo(memcg, nid, zid); - excess = res_counter_soft_limit_excess(&memcg->res); - /* - * We have to update the tree if mz is on RB-tree or - * mem is over its softlimit. - */ - if (excess || mz->on_tree) { - spin_lock(&mctz->lock); - /* if on-tree, remove it */ - if (mz->on_tree) - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - /* - * Insert again. mz->usage_in_excess will be updated. - * If excess is 0, no tree ops. 
- */ - __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - } - } -} - -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) -{ - int node, zone; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - - for_each_node_state(node, N_POSSIBLE) { - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - mz = mem_cgroup_zoneinfo(memcg, node, zone); - mctz = soft_limit_tree_node_zone(node, zone); - mem_cgroup_remove_exceeded(memcg, mz, mctz); - } - } -} - -static struct mem_cgroup_per_zone * -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct rb_node *rightmost = NULL; - struct mem_cgroup_per_zone *mz; - -retry: - mz = NULL; - rightmost = rb_last(&mctz->rb_root); - if (!rightmost) - goto done; /* Nothing to reclaim from */ - - mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node); - /* - * Remove the node now but someone else can add it back, - * we will to add it back at the end of reclaim to its correct - * position in the tree. - */ - __mem_cgroup_remove_exceeded(mz->mem, mz, mctz); - if (!res_counter_soft_limit_excess(&mz->mem->res) || - !css_tryget(&mz->mem->css)) - goto retry; -done: - return mz; -} - -static struct mem_cgroup_per_zone * -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct mem_cgroup_per_zone *mz; - - spin_lock(&mctz->lock); - mz = __mem_cgroup_largest_soft_limit_node(mctz); - spin_unlock(&mctz->lock); - return mz; -} - /* * Implementation Note: reading percpu statistics for memcg. 
* @@ -696,9 +508,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; - case MEM_CGROUP_TARGET_SOFTLIMIT: - next = val + SOFTLIMIT_EVENTS_TARGET; - break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -718,13 +527,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) { preempt_disable(); - /* threshold event is triggered in finer grain than soft limit */ + /* threshold event is triggered in finer grain than numa info */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { - bool do_softlimit, do_numainfo; + bool do_numainfo; - do_softlimit = mem_cgroup_event_ratelimit(memcg, - MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -732,8 +539,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); - if (unlikely(do_softlimit)) - mem_cgroup_update_tree(memcg, page); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) return margin >> PAGE_SHIFT; } +/** + * mem_cgroup_over_softlimit + * @root: hierarchy root + * @memcg: child of @root to test + * + * Returns %true if @memcg exceeds its own soft limit or contributes + * to the soft limit excess of one of its parents up to and including + * @root. 
+ */ +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, + struct mem_cgroup *memcg) +{ + if (mem_cgroup_disabled()) + return false; + + if (!root) + root = root_mem_cgroup; + + for (; memcg; memcg = parent_mem_cgroup(memcg)) { + /* root_mem_cgroup does not have a soft limit */ + if (memcg == root_mem_cgroup) + break; + if (res_counter_soft_limit_excess(&memcg->res)) + return true; + if (memcg == root) + break; + } + return false; +} + int mem_cgroup_swappiness(struct mem_cgroup *memcg) { struct cgroup *cgrp = memcg->css.cgroup; @@ -1687,64 +1522,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) } #endif -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, - struct zone *zone, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - struct mem_cgroup *victim = NULL; - int total = 0; - int loop = 0; - unsigned long excess; - unsigned long nr_scanned; - struct mem_cgroup_reclaim_cookie reclaim = { - .zone = zone, - .priority = 0, - }; - - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; - - while (1) { - unsigned long nr_reclaimed; - - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); - if (!victim) { - loop++; - if (loop >= 2) { - /* - * If we have not been able to reclaim - * anything, it might because there are - * no reclaimable pages under this hierarchy - */ - if (!total) - break; - /* - * We want to do more targeted reclaim. 
- * excess >> 2 is not to excessive so as to - * reclaim too much, nor too less that we keep - * coming back to reclaim from this cgroup - */ - if (total >= (excess >> 2) || - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) - break; - } - continue; - } - if (!mem_cgroup_reclaimable(victim, false)) - continue; - nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false, - zone, &nr_scanned); - mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed, - nr_scanned, current_is_kswapd()); - total += nr_reclaimed; - *total_scanned += nr_scanned; - if (!res_counter_soft_limit_excess(&root_memcg->res)) - break; - } - mem_cgroup_iter_break(root_memcg, victim); - return total; -} - /* * Check OOM-Killer is already running under our hierarchy. * If someone is running, return false. @@ -2507,8 +2284,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, unlock_page_cgroup(pc); /* * "charge_statistics" updated event counter. Then, check it. - * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. - * if they exceeds softlimit. 
*/ memcg_check_events(memcg, page); } @@ -3578,98 +3353,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, return ret; } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - unsigned long nr_reclaimed = 0; - struct mem_cgroup_per_zone *mz, *next_mz = NULL; - unsigned long reclaimed; - int loop = 0; - struct mem_cgroup_tree_per_zone *mctz; - unsigned long long excess; - unsigned long nr_scanned; - - if (order > 0) - return 0; - - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); - /* - * This loop can run a while, specially if mem_cgroup's continuously - * keep exceeding their soft limit and putting the system under - * pressure - */ - do { - if (next_mz) - mz = next_mz; - else - mz = mem_cgroup_largest_soft_limit_node(mctz); - if (!mz) - break; - - nr_scanned = 0; - reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, - gfp_mask, &nr_scanned); - nr_reclaimed += reclaimed; - *total_scanned += nr_scanned; - spin_lock(&mctz->lock); - - /* - * If we failed to reclaim anything from this memory cgroup - * it is time to move on to the next cgroup - */ - next_mz = NULL; - if (!reclaimed) { - do { - /* - * Loop until we find yet another one. - * - * By the time we get the soft_limit lock - * again, someone might have aded the - * group back on the RB tree. Iterate to - * make sure we get a different mem. - * mem_cgroup_largest_soft_limit_node returns - * NULL if no other cgroup is present on - * the tree - */ - next_mz = - __mem_cgroup_largest_soft_limit_node(mctz); - if (next_mz == mz) - css_put(&next_mz->mem->css); - else /* next_mz == NULL or other memcg */ - break; - } while (1); - } - __mem_cgroup_remove_exceeded(mz->mem, mz, mctz); - excess = res_counter_soft_limit_excess(&mz->mem->res); - /* - * One school of thought says that we should not add - * back the node to the tree if reclaim returns 0. 
- * But our reclaim could return 0, simply because due - * to priority we are exposing a smaller subset of - * memory to reclaim from. Consider this as a longer - * term TODO. - */ - /* If excess == 0, no tree ops */ - __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess); - spin_unlock(&mctz->lock); - css_put(&mz->mem->css); - loop++; - /* - * Could not reclaim anything and there are no more - * mem cgroups to try or we seem to be looping without - * reclaiming anything. - */ - if (!nr_reclaimed && - (next_mz == NULL || - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) - break; - } while (!nr_reclaimed); - if (next_mz) - css_put(&next_mz->mem->css); - return nr_reclaimed; -} - /* * This routine traverse page_cgroup in given list and drop them all. * *And* this routine doesn't reclaim page itself, just removes page_cgroup. @@ -4816,9 +4499,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) mz = &pn->zoneinfo[zone]; for_each_lru(l) INIT_LIST_HEAD(&mz->lruvec.lists[l]); - mz->usage_in_excess = 0; - mz->on_tree = false; - mz->mem = memcg; } memcg->info.nodeinfo[node] = pn; return 0; @@ -4872,7 +4552,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) { int node; - mem_cgroup_remove_from_trees(memcg); free_css_id(&mem_cgroup_subsys, &memcg->css); for_each_node_state(node, N_POSSIBLE) @@ -4927,31 +4606,6 @@ static void __init enable_swap_cgroup(void) } #endif -static int mem_cgroup_soft_limit_tree_init(void) -{ - struct mem_cgroup_tree_per_node *rtpn; - struct mem_cgroup_tree_per_zone *rtpz; - int tmp, node, zone; - - for_each_node_state(node, N_POSSIBLE) { - tmp = node; - if (!node_state(node, N_NORMAL_MEMORY)) - tmp = -1; - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); - if (!rtpn) - return 1; - - soft_limit_tree.rb_tree_per_node[node] = rtpn; - - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - rtpz = &rtpn->rb_tree_per_zone[zone]; - rtpz->rb_root = RB_ROOT; - spin_lock_init(&rtpz->lock); - } - } - return 0; -} - 
static struct cgroup_subsys_state * __ref mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) { @@ -4973,8 +4627,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) enable_swap_cgroup(); parent = NULL; root_mem_cgroup = memcg; - if (mem_cgroup_soft_limit_tree_init()) - goto free_out; for_each_possible_cpu(cpu) { struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); diff --git a/mm/vmscan.c b/mm/vmscan.c index e3fd8a7..4279549 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, .mem_cgroup = memcg, .zone = zone, }; + int epriority = priority; + /* + * Put more pressure on hierarchies that exceed their + * soft limit, to push them back harder than their + * well-behaving siblings. + */ + if (mem_cgroup_over_softlimit(root, memcg)) + epriority = 0; - shrink_mem_cgroup_zone(priority, &mz, sc); + shrink_mem_cgroup_zone(epriority, &mz, sc); mem_cgroup_account_reclaim(root, memcg, sc->nr_reclaimed - nr_reclaimed, @@ -2171,8 +2179,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, { struct zoneref *z; struct zone *zone; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; bool should_abort_reclaim = false; for_each_zone_zonelist_nodemask(zone, z, zonelist, @@ -2205,19 +2211,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, continue; } } - /* - * This steals pages from memory cgroups over softlimit - * and returns the number of reclaimed pages and - * scanned pages. This works for global memory pressure - * and balancing, not for a memcg's limit. 
- */ - nr_soft_scanned = 0; - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - sc->order, sc->gfp_mask, - &nr_soft_scanned); - sc->nr_reclaimed += nr_soft_reclaimed; - sc->nr_scanned += nr_soft_scanned; - /* need some check for avoid more shrink_zone() */ } shrink_zone(priority, zone, sc); @@ -2393,48 +2386,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, } #ifdef CONFIG_CGROUP_MEM_RES_CTLR - -unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, - gfp_t gfp_mask, bool noswap, - struct zone *zone, - unsigned long *nr_scanned) -{ - struct scan_control sc = { - .nr_scanned = 0, - .nr_to_reclaim = SWAP_CLUSTER_MAX, - .may_writepage = !laptop_mode, - .may_unmap = 1, - .may_swap = !noswap, - .order = 0, - .target_mem_cgroup = memcg, - }; - struct mem_cgroup_zone mz = { - .mem_cgroup = memcg, - .zone = zone, - }; - - sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | - (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); - - trace_mm_vmscan_memcg_softlimit_reclaim_begin(0, - sc.may_writepage, - sc.gfp_mask); - - /* - * NOTE: Although we can get the priority field, using it - * here is not a good idea, since it limits the pages we can scan. - * if we don't reclaim here, the shrink_zone from balance_pgdat - * will pick up pages from other mem cgroup's as well. We hack - * the priority and make it zero. - */ - shrink_mem_cgroup_zone(0, &mz, &sc); - - trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); - - *nr_scanned = sc.nr_scanned; - return sc.nr_reclaimed; -} - unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, gfp_t gfp_mask, bool noswap) @@ -2609,8 +2560,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ unsigned long total_scanned; struct reclaim_state *reclaim_state = current->reclaim_state; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .may_unmap = 1, @@ -2701,17 +2650,6 @@ loop_again: continue; sc.nr_scanned = 0; - - nr_soft_scanned = 0; - /* - * Call soft limit reclaim before calling shrink_zone. - */ - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - order, sc.gfp_mask, - &nr_soft_scanned); - sc.nr_reclaimed += nr_soft_reclaimed; - total_scanned += nr_soft_scanned; - /* * We put equal pressure on every zone, unless * one zone has way too many pages free -- 1.7.7.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756648Ab2AJXyK (ORCPT ); Tue, 10 Jan 2012 18:54:10 -0500 Received: from mail-qy0-f174.google.com ([209.85.216.174]:36592 "EHLO mail-qy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752104Ab2AJXyI convert rfc822-to-8bit (ORCPT ); Tue, 10 Jan 2012 18:54:08 -0500 MIME-Version: 1.0 In-Reply-To: <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> Date: Tue, 10 Jan 2012 15:54:05 -0800 Message-ID: Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics From: Ying Han To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org X-System-Of-Record: true Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Thank you for the patch and the stats looks reasonable to me, few questions as below: On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > With the single per-zone LRU gone and global reclaim 
scanning > individual memcgs, it's straight-forward to collect meaningful and > accurate per-memcg reclaim statistics. > > This adds the following items to memory.stat: Some of the previous discussions, including patches, have similar stats in the memory.vmscan_stat API, which collects all the per-memcg vmscan stats. I would like to understand better why we add them to memory.stat instead, and do we have a plan to keep extending memory.stat for those vmstat-like stats? > > pgreclaim Not sure if we want to keep this more consistent with /proc/vmstat; then it would be "pgsteal"? > pgscan > >  Number of pages reclaimed/scanned from that memcg due to its own >  hard limit (or physical limit in case of the root memcg) by the >  allocating task. > > kswapd_pgreclaim > kswapd_pgscan We have "pgscan_kswapd_*" in vmstat, so maybe? "pgsteal_kswapd" "pgscan_kswapd" > >  Reclaim activity from kswapd due to the memcg's own limit.  Only >  applicable to the root memcg for now since kswapd is only triggered >  by physical limits, but kswapd-style reclaim based on memcg hard >  limits is being developped. > > hierarchy_pgreclaim > hierarchy_pgscan > hierarchy_kswapd_pgreclaim > hierarchy_kswapd_pgscan "pgsteal_hierarchy" "pgsteal_kswapd_hierarchy" .. No strong opinion on the naming, but let's try to make it more consistent with the existing API. > >  Reclaim activity due to limitations in one of the memcg's parents. 
> > Signed-off-by: Johannes Weiner > --- >  Documentation/cgroups/memory.txt |    4 ++ >  include/linux/memcontrol.h       |   10 +++++ >  mm/memcontrol.c                  |   84 +++++++++++++++++++++++++++++++++++++- >  mm/vmscan.c                      |    7 +++ >  4 files changed, 103 insertions(+), 2 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > index cc0ebc5..eb9e982 100644 > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -389,6 +389,10 @@ mapped_file        - # of bytes of mapped file (includes tmpfs/shmem) >  pgpgin         - # of pages paged in (equivalent to # of charging events). >  pgpgout                - # of pages paged out (equivalent to # of uncharging events). >  swap           - # of bytes of swap usage > +pgreclaim      - # of pages reclaimed due to this memcg's limit > +pgscan         - # of pages scanned due to this memcg's limit > +kswapd_*       - # reclaim activity by background daemon due to this memcg's limit > +hierarchy_*    - # reclaim activity due to pressure from parental memcg >  inactive_anon  - # of bytes of anonymous memory and swap cache memory on >                LRU list. 
>  active_anon    - # of bytes of anonymous and swap cache memory on active > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index bd3b102..6c1d69e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -121,6 +121,8 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, >                                                      struct zone *zone); >  struct zone_reclaim_stat* >  mem_cgroup_get_reclaim_stat_from_page(struct page *page); > +void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *, > +                               unsigned long, unsigned long, bool); >  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, >                                        struct task_struct *p); >  extern void mem_cgroup_replace_page_cache(struct page *oldpage, > @@ -347,6 +349,14 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page) >        return NULL; >  } > > +static inline void mem_cgroup_account_reclaim(struct mem_cgroup *root, > +                                             struct mem_cgroup *memcg, > +                                             unsigned long nr_reclaimed, > +                                             unsigned long nr_scanned, > +                                             bool kswapd) > +{ > +} > + >  static inline void >  mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) >  { > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 8e2a80d..170dff4 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index { >        MEM_CGROUP_STAT_NSTATS, >  }; > > +#define MEM_CGROUP_EVENTS_KSWAPD 2 > +#define MEM_CGROUP_EVENTS_HIERARCHY 4 > + >  enum mem_cgroup_events_index { >        MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */ >        MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */ >        MEM_CGROUP_EVENTS_COUNT,        /* # of pages paged in/out */ >     
   MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */ >        MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */ > +       MEM_CGROUP_EVENTS_PGRECLAIM, > +       MEM_CGROUP_EVENTS_PGSCAN, > +       MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM, > +       MEM_CGROUP_EVENTS_KSWAPD_PGSCAN, > +       MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM, > +       MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN, > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM, > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN, missing comment here? >        MEM_CGROUP_EVENTS_NSTATS, >  }; >  /* > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) >        return (memcg == root_mem_cgroup); >  } > > +/** > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics > + * @root: memcg that triggered reclaim > + * @memcg: memcg that is actually being scanned > + * @nr_reclaimed: number of pages reclaimed from @memcg > + * @nr_scanned: number of pages scanned from @memcg > + * @kswapd: whether reclaiming task is kswapd or allocator itself > + */ > +void mem_cgroup_account_reclaim(struct mem_cgroup *root, > +                               struct mem_cgroup *memcg, > +                               unsigned long nr_reclaimed, > +                               unsigned long nr_scanned, > +                               bool kswapd) > +{ > +       unsigned int offset = 0; > + > +       if (!root) > +               root = root_mem_cgroup; > + > +       if (kswapd) > +               offset += MEM_CGROUP_EVENTS_KSWAPD; > +       if (root != memcg) > +               offset += MEM_CGROUP_EVENTS_HIERARCHY; Just to be clear, here root cgroup has hierarchy_* stats always 0 ? Also, we might want to consider renaming the root here, something like target? The root is confusing with root_mem_cgroup. 
--Ying > + > +       preempt_disable(); > +       __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGRECLAIM + offset], > +                      nr_reclaimed); > +       __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGSCAN + offset], > +                      nr_scanned); > +       preempt_enable(); > +} > + >  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx) >  { >        struct mem_cgroup *memcg; > @@ -1662,6 +1705,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, >        excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; > >        while (1) { > +               unsigned long nr_reclaimed; > + >                victim = mem_cgroup_iter(root_memcg, victim, &reclaim); >                if (!victim) { >                        loop++; > @@ -1687,8 +1732,11 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, >                } >                if (!mem_cgroup_reclaimable(victim, false)) >                        continue; > -               total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > -                                                    zone, &nr_scanned); > +               nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false, > +                                                          zone, &nr_scanned); > +               mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed, > +                                          nr_scanned, current_is_kswapd()); > +               total += nr_reclaimed; >                *total_scanned += nr_scanned; >                if (!res_counter_soft_limit_excess(&root_memcg->res)) >                        break; > @@ -4023,6 +4071,14 @@ enum { >        MCS_SWAP, >        MCS_PGFAULT, >        MCS_PGMAJFAULT, > +       MCS_PGRECLAIM, > +       MCS_PGSCAN, > +       MCS_KSWAPD_PGRECLAIM, > +       MCS_KSWAPD_PGSCAN, > +       MCS_HIERARCHY_PGRECLAIM, > +       MCS_HIERARCHY_PGSCAN, > +       
MCS_HIERARCHY_KSWAPD_PGRECLAIM, > +       MCS_HIERARCHY_KSWAPD_PGSCAN, >        MCS_INACTIVE_ANON, >        MCS_ACTIVE_ANON, >        MCS_INACTIVE_FILE, > @@ -4047,6 +4103,14 @@ struct { >        {"swap", "total_swap"}, >        {"pgfault", "total_pgfault"}, >        {"pgmajfault", "total_pgmajfault"}, > +       {"pgreclaim", "total_pgreclaim"}, > +       {"pgscan", "total_pgscan"}, > +       {"kswapd_pgreclaim", "total_kswapd_pgreclaim"}, > +       {"kswapd_pgscan", "total_kswapd_pgscan"}, > +       {"hierarchy_pgreclaim", "total_hierarchy_pgreclaim"}, > +       {"hierarchy_pgscan", "total_hierarchy_pgscan"}, > +       {"hierarchy_kswapd_pgreclaim", "total_hierarchy_kswapd_pgreclaim"}, > +       {"hierarchy_kswapd_pgscan", "total_hierarchy_kswapd_pgscan"}, >        {"inactive_anon", "total_inactive_anon"}, >        {"active_anon", "total_active_anon"}, >        {"inactive_file", "total_inactive_file"}, > @@ -4079,6 +4143,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *memcg, struct mcs_total_stat *s) >        s->stat[MCS_PGFAULT] += val; >        val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT); >        s->stat[MCS_PGMAJFAULT] += val; > +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGRECLAIM); > +       s->stat[MCS_PGRECLAIM] += val; > +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGSCAN); > +       s->stat[MCS_PGSCAN] += val; > +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM); > +       s->stat[MCS_KSWAPD_PGRECLAIM] += val; > +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN); > +       s->stat[MCS_KSWAPD_PGSCAN] += val; > +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM); > +       s->stat[MCS_HIERARCHY_PGRECLAIM] += val; > +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN); > +       s->stat[MCS_HIERARCHY_PGSCAN] += val; > +       val = mem_cgroup_read_events(memcg, 
MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM); > +       s->stat[MCS_HIERARCHY_KSWAPD_PGRECLAIM] += val; > +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN); > +       s->stat[MCS_HIERARCHY_KSWAPD_PGSCAN] += val; > >        /* per zone stat */ >        val = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON)); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c631234..e3fd8a7 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2115,12 +2115,19 @@ static void shrink_zone(int priority, struct zone *zone, > >        memcg = mem_cgroup_iter(root, NULL, &reclaim); >        do { > +               unsigned long nr_reclaimed = sc->nr_reclaimed; > +               unsigned long nr_scanned = sc->nr_scanned; >                struct mem_cgroup_zone mz = { >                        .mem_cgroup = memcg, >                        .zone = zone, >                }; > >                shrink_mem_cgroup_zone(priority, &mz, sc); > + > +               mem_cgroup_account_reclaim(root, memcg, > +                                          sc->nr_reclaimed - nr_reclaimed, > +                                          sc->nr_scanned - nr_scanned, > +                                          current_is_kswapd()); >                /* >                 * Limit reclaim has historically picked one memcg and >                 * scanned it with decreasing priority levels until > -- > 1.7.7.5 > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757255Ab2AKAap (ORCPT ); Tue, 10 Jan 2012 19:30:45 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:56748 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755767Ab2AKAan (ORCPT ); Tue, 10 Jan 2012 19:30:43 -0500 Date: Wed, 11 Jan 2012 01:30:20 +0100 From: Johannes Weiner To: Ying Han Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, 
linux-kernel@vger.kernel.org
Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics
Message-ID: <20120111003020.GD24386@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote:
> Thank you for the patch and the stats looks reasonable to me, few
> questions as below:
>
> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> > With the single per-zone LRU gone and global reclaim scanning
> > individual memcgs, it's straight-forward to collect meaningful and
> > accurate per-memcg reclaim statistics.
> >
> > This adds the following items to memory.stat:
>
> Some of the previous discussions including patches have similar stats
> in memory.vmscan_stat API, which collects all the per-memcg vmscan
> stats. I would like to understand more why we add into memory.stat
> instead, and do we have plan to keep extending memory.stat for those
> vmstat like stats?

I think they were put into an extra file in particular to be able to
write to this file to reset the statistics.  But in my opinion, it's
trivial to calculate a delta from before and after running a workload,
so I didn't really like adding kernel code for that.

Did you have another reason for a separate file in mind?

> > pgreclaim
>
> Not sure if we want to keep this more consistent to /proc/vmstat, then
> it will be "pgsteal"?

The problem with that was that we didn't like to call pages stolen
when they were reclaimed from within the cgroup, so we had pgfree for
inner reclaim and pgsteal for outer reclaim, respectively.

I found it cleaner to just go with pgreclaim, it's unambiguous and
straight-forward.  Outer reclaim is designated by the hierarchy_
prefix.

> > pgscan
> >
> >  Number of pages reclaimed/scanned from that memcg due to its own
> >  hard limit (or physical limit in case of the root memcg) by the
> >  allocating task.
> >
> > kswapd_pgreclaim
> > kswapd_pgscan
>
> we have "pgscan_kswapd_*" in vmstat, so maybe ?
> "pgsteal_kswapd"
> "pgscan_kswapd"
>
> >  Reclaim activity from kswapd due to the memcg's own limit.  Only
> >  applicable to the root memcg for now since kswapd is only triggered
> >  by physical limits, but kswapd-style reclaim based on memcg hard
> >  limits is being developped.
> >
> > hierarchy_pgreclaim
> > hierarchy_pgscan
> > hierarchy_kswapd_pgreclaim
> > hierarchy_kswapd_pgscan
>
> "pgsteal_hierarchy"
> "pgsteal_kswapd_hierarchy"
> ..
>
> No strong option on the naming, but try to make it more consistent to
> existing API.

I swear I tried, but the existing naming is pretty screwed up :(

For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
scan rates of direct reclaim vs. kswapd reclaim.  To get the total
number of pages reclaimed, you sum them up.

On the other hand, pgsteal_* does not differentiate between direct
reclaim and kswapd, so to get direct reclaim numbers, you add up the
pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
which is in turn not available at zone granularity.

> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4

These two function as namespaces, that's why I put hierarchy_ and
kswapd_ at the beginning of the names.

Given that we have kswapd_steal, would you be okay with doing it like
this?  I mean, at least my naming conforms to ONE of the standards in
/proc/vmstat, right? ;-)

> > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
> >        MEM_CGROUP_STAT_NSTATS,
> >  };
> >
> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
> > +
> >  enum mem_cgroup_events_index {
> >        MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
> >        MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
> >        MEM_CGROUP_EVENTS_COUNT,        /* # of pages paged in/out */
> >        MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */
> >        MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */
> > +       MEM_CGROUP_EVENTS_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_PGSCAN,
> > +       MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,
>
> missing comment here?

As if the lines weren't long enough already ;-) I'll add some.

> >        MEM_CGROUP_EVENTS_NSTATS,
> >  };
> >  /*
> > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> >        return (memcg == root_mem_cgroup);
> >  }
> >
> > +/**
> > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
> > + * @root: memcg that triggered reclaim
> > + * @memcg: memcg that is actually being scanned
> > + * @nr_reclaimed: number of pages reclaimed from @memcg
> > + * @nr_scanned: number of pages scanned from @memcg
> > + * @kswapd: whether reclaiming task is kswapd or allocator itself
> > + */
> > +void mem_cgroup_account_reclaim(struct mem_cgroup *root,
> > +                               struct mem_cgroup *memcg,
> > +                               unsigned long nr_reclaimed,
> > +                               unsigned long nr_scanned,
> > +                               bool kswapd)
> > +{
> > +       unsigned int offset = 0;
> > +
> > +       if (!root)
> > +               root = root_mem_cgroup;
> > +
> > +       if (kswapd)
> > +               offset += MEM_CGROUP_EVENTS_KSWAPD;
> > +       if (root != memcg)
> > +               offset += MEM_CGROUP_EVENTS_HIERARCHY;
>
> Just to be clear, here root cgroup has hierarchy_* stats always 0 ?

That's correct, there can't be any hierarchical pressure on the
topmost parent.

> Also, we might want to consider renaming the root here, something like
> target? The root is confusing with root_mem_cgroup.

It's the same naming scheme I used for the iterator functions
(mem_cgroup_iter() and friends), so if we change it, I'd like to
change it consistently.

Having target and memcg as parameters is even more confusing and
non-descriptive, IMO.

Other places use mem_over_limit, which is a bit better, but quite
long.

Any other ideas for great names for parameters that designate a
hierarchy root and a memcg in that hierarchy?
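
As an illustration of the "namespace" offsets discussed in the reply above, here is a small userspace sketch. It is not kernel code: struct memcg_model, the trimmed-down enum (renumbered from 0) and the counter update in account_reclaim() are assumptions made for the example, and the NULL-root fallback of the real function is left out. Only the two offset values and the offset selection are taken from the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Offsets as quoted from the patch; each one selects a block of counters. */
#define EVENTS_KSWAPD    2
#define EVENTS_HIERARCHY 4

/* Reclaim-related counters only, renumbered from 0 for this sketch. */
enum reclaim_event {
	EV_PGRECLAIM,
	EV_PGSCAN,
	EV_KSWAPD_PGRECLAIM,
	EV_KSWAPD_PGSCAN,
	EV_HIERARCHY_PGRECLAIM,
	EV_HIERARCHY_PGSCAN,
	EV_HIERARCHY_KSWAPD_PGRECLAIM,
	EV_HIERARCHY_KSWAPD_PGSCAN,
	EV_NR,
};

/* Hypothetical stand-in for struct mem_cgroup: just the counters. */
struct memcg_model {
	unsigned long events[EV_NR];
};

/* Same selection logic as the quoted mem_cgroup_account_reclaim(). */
static unsigned int reclaim_offset(const struct memcg_model *root,
				   const struct memcg_model *memcg,
				   bool kswapd)
{
	unsigned int offset = 0;

	if (kswapd)
		offset += EVENTS_KSWAPD;
	if (root != memcg)
		offset += EVENTS_HIERARCHY;
	return offset;
}

/* Assumed update step: bump the reclaim/scan pair picked by the offset. */
static void account_reclaim(const struct memcg_model *root,
			    struct memcg_model *memcg,
			    unsigned long nr_reclaimed,
			    unsigned long nr_scanned,
			    bool kswapd)
{
	unsigned int offset;

	assert(root != NULL && memcg != NULL);
	offset = reclaim_offset(root, memcg, kswapd);
	memcg->events[EV_PGRECLAIM + offset] += nr_reclaimed;
	memcg->events[EV_PGSCAN + offset] += nr_scanned;
}
```

In this model, direct reclaim triggered by the memcg's own limit gives offset 0 (the plain pgreclaim/pgscan pair), while kswapd reclaim caused by a parent gives 2 + 4 = 6, which lands on the hierarchy_kswapd_* pair.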
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933996Ab2AKVmf (ORCPT ); Wed, 11 Jan 2012 16:42:35 -0500
Received: from mail-pw0-f46.google.com ([209.85.160.46]:53482 "EHLO mail-pw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758047Ab2AKVmc convert rfc822-to-8bit (ORCPT ); Wed, 11 Jan 2012 16:42:32 -0500
MIME-Version: 1.0
In-Reply-To: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
Date: Wed, 11 Jan 2012 13:42:31 -0800
Message-ID:
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
From: Ying Han
To: Johannes Weiner
Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
X-System-Of-Record: true
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits.  Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim.  The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
>
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit.  With the same end result of scanning
> everybody, but soft limit excessors a bit more.
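
To make the scheme in the quoted changelog concrete, here is a rough userspace model. The ancestor walk mirrors the mem_cgroup_over_softlimit() helper quoted further down in this mail. Everything else is made up for the sketch: struct cg, soft_limit_excess(), scan_priority() and the GiB units are hypothetical, the global root is modeled as "the group without a parent", and how the real mm/vmscan.c applies priority 0 is not shown in this part of the thread, so that part is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12	/* default scan priority in the kernel */

/* Hypothetical stand-in for struct mem_cgroup; sizes in GiB. */
struct cg {
	struct cg *parent;		/* NULL for the global root */
	unsigned long usage;
	unsigned long soft_limit;
};

/* Modeled on res_counter_soft_limit_excess(): usage above the soft limit. */
static unsigned long soft_limit_excess(const struct cg *cg)
{
	return cg->usage > cg->soft_limit ? cg->usage - cg->soft_limit : 0;
}

/*
 * True if @memcg, or any ancestor up to and including @root, exceeds
 * its soft limit.  The global root (no parent) has no soft limit.
 */
static bool over_softlimit(const struct cg *root, const struct cg *memcg)
{
	assert(root != NULL);
	for (; memcg; memcg = memcg->parent) {
		if (!memcg->parent)	/* reached the global root */
			break;
		if (soft_limit_excess(memcg))
			return true;
		if (memcg == root)
			break;
	}
	return false;
}

/* Assumed policy: groups in excess are scanned with priority 0. */
static int scan_priority(const struct cg *root, const struct cg *memcg,
			 int priority)
{
	return over_softlimit(root, memcg) ? 0 : priority;
}
```

Fed with the A/A1 hierarchy from the example later in this mail (A: soft limit 15G, usage 16G; A1: soft limit 5G, usage 4G), over_softlimit() returns true for A1 because its parent A is in excess, which is the behavior questioned there.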
>
> Advantages:
>
>  o smoother reclaim: soft limit reclaim is a separate stage before
>    global reclaim, whose result is not communicated down the line and
>    so overreclaim of the groups in excess is very likely.  After this
>    patch, soft limit reclaim is fully integrated into regular reclaim
>    and each memcg is considered exactly once per cycle.
>
>  o true hierarchy support: soft limits are only considered when
>    kswapd does global reclaim, but after this patch, targetted
>    reclaim of a memcg will mind the soft limit settings of its child
>    groups.

Why do we add soft limit reclaim to targeted reclaim? Based on the
discussions, my understanding is that the soft limit only takes effect
while the whole machine is under memory contention. We don't want to
add extra pressure on a cgroup if there is free memory on the system,
even if the cgroup is above its limit.

>
>  o code size: soft limit reclaim requires a lot of code to maintain
>    the per-node per-zone rb-trees to quickly find the biggest
>    offender, dedicated paths for soft limit reclaim etc. while this
>    new implementation gets away without all that.
>
> Test:
>
> The test consists of two concurrent kernel build jobs in separate
> source trees, the master and the slave.  The two jobs get along nicely
> on 600MB of available memory, so this is the zero overcommit control
> case.  When available memory is decreased, the overcommit is
> compensated by decreasing the soft limit of the slave by the same
> amount, in the hope that the slave takes the hit and the master stays
> unaffected.
>
>                                    600M-0M-vanilla         600M-0M-patched
> Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
> Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
> Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
> Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
> Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
> Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
> Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
> Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
> Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
> Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
> Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
> Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
> Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
> Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
> Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
> Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
> Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
> Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
> Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
> Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
> Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
> Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
> Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)
>
> In the control case, the differences in elapsed time, number of major
> faults taken, and reclaim statistics are within the noise for both the
> master and the slave job.

What's the soft limit setting in the control case? I assume it is the
default RESOURCE_MAX, so both Master and Slave get equal pressure
before and after the patch, and no differences in the stats should be
observed.

>                                     600M-280M-vanilla      600M-280M-patched
> Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
> Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
> Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
> Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
> Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
> Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
> Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
> Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
> Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
> Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
> Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
> Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
> Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
> Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
> Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
> Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
> Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
> Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
> Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
> Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
> Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
>
> Here, the available memory is limited to 320 MB, the machine is
> overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> of the slave merely 20 MB.
>
> Looking at the slave job first, it is much better off with the patched
> kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> a third.  The result is much fewer major faults taken, which in turn
> lets the job finish quicker.

What's the hard limit setting here? Does "direct reclaim" refer to
per-memcg direct reclaim or the global one?

>
> It would be a zero-sum game if the improvement happened at the cost of
> the master but looking at the numbers, even the master performs better
> with the patched kernel.  In fact, the master job is almost unaffected
> on the patched kernel compared to the control case.

It makes sense, since in this example the master job is less affected
by the patch than the slave job. In the control case, if both master
and slave have a soft limit of RESOURCE_MAX, they are under equal
memory pressure (priority = DEF_PRIORITY). In the second example, only
the pressure on the slave is increased, by scanning it with
priority = 0, while the master is still scanned with pretty much the
same priority = DEF_PRIORITY. So I would expect to see more reclaim
activity in the slave on the patched kernel compared to the control
case, which seems to match the test results.

>
> This is an odd phenomenon, as the patch does not directly change how
> the master is reclaimed.  An explanation for this is that the severe
> overreclaim of the slave in the unpatched kernel results in the master
> growing bigger than in the patched case.  Combining the fact that
> memcgs are scanned according to their size with the increased refault
> rate of the overreclaimed slave triggering global reclaim more often
> means that overall pressure on the master job is higher in the
> unpatched kernel.

We can check the Master's memory.usage_in_bytes while the job is
running. On the other hand, I don't see why we would expect the Master
to be reclaimed less in the control case. On the unpatched kernel, the
Master is reclaimed under global pressure each time anyway, since we
ignore the return value of the soft limit reclaim.

>
> At any rate, the patched kernel seems to do a much better job at both
> overall resource allocation under soft limit overcommit as well as the
> requested prioritization of the master job.
>
> Signed-off-by: Johannes Weiner
> ---
>  include/linux/memcontrol.h |   18 +--
>  mm/memcontrol.c            |  412 ++++----------------------------------------
>  mm/vmscan.c                |   80 +--------
>  3 files changed, 48 insertions(+), 462 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c1d69e..72368b7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>                                                      struct zone *zone);
>  struct zone_reclaim_stat*
>  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *);

Maybe something like "mem_cgroup_over_soft_limit()" ?

>  void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *, >                                unsigned long, unsigned long, bool); >  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > @@ -155,9 +156,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, >        mem_cgroup_update_page_stat(page, idx, -1); >  } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > -                                               gfp_t gfp_mask, > -                                               unsigned long *total_scanned); >  u64 mem_cgroup_get_limit(struct mem_cgroup *memcg); > >  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); > @@ -362,22 +360,20 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) >  { >  } > > -static inline void mem_cgroup_inc_page_stat(struct page *page, > -                                           enum mem_cgroup_page_stat_item idx) > +static inline bool > +mem_cgroup_over_softlimit(struct mem_cgroup *root, struct mem_cgroup *memcg) >  { > +       return false; >  } > > -static inline void mem_cgroup_dec_page_stat(struct page *page, > +static inline void mem_cgroup_inc_page_stat(struct page *page, >                                            enum mem_cgroup_page_stat_item idx) >  { >  } > > -static inline > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > -                                           gfp_t gfp_mask, > -                                           unsigned long *total_scanned) > +static inline void mem_cgroup_dec_page_stat(struct page *page, > +                                           enum mem_cgroup_page_stat_item idx) >  { > -       return 0; >  } > >  static inline > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 170dff4..d4f7ae5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -35,7 +35,6 @@ >  #include >  #include >  #include > -#include >  #include >  #include >  
#include > @@ -118,12 +117,10 @@ enum mem_cgroup_events_index { >  */ >  enum mem_cgroup_events_target { >        MEM_CGROUP_TARGET_THRESH, > -       MEM_CGROUP_TARGET_SOFTLIMIT, >        MEM_CGROUP_TARGET_NUMAINFO, >        MEM_CGROUP_NTARGETS, >  }; >  #define THRESHOLDS_EVENTS_TARGET (128) > -#define SOFTLIMIT_EVENTS_TARGET (1024) >  #define NUMAINFO_EVENTS_TARGET (1024) > >  struct mem_cgroup_stat_cpu { > @@ -149,12 +146,6 @@ struct mem_cgroup_per_zone { >        struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; > >        struct zone_reclaim_stat reclaim_stat; > -       struct rb_node          tree_node;      /* RB tree node */ > -       unsigned long long      usage_in_excess;/* Set to the value by which */ > -                                               /* the soft limit is exceeded*/ > -       bool                    on_tree; > -       struct mem_cgroup       *mem;           /* Back pointer, we cannot */ > -                                               /* use container_of        */ >  }; >  /* Macro for accessing counter */ >  #define MEM_CGROUP_ZSTAT(mz, idx)      ((mz)->count[(idx)]) > @@ -167,26 +158,6 @@ struct mem_cgroup_lru_info { >        struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES]; >  }; > > -/* > - * Cgroups above their limits are maintained in a RB-Tree, independent of > - * their hierarchy representation > - */ > - > -struct mem_cgroup_tree_per_zone { > -       struct rb_root rb_root; > -       spinlock_t lock; > -}; > - > -struct mem_cgroup_tree_per_node { > -       struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES]; > -}; > - > -struct mem_cgroup_tree { > -       struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; > -}; > - > -static struct mem_cgroup_tree soft_limit_tree __read_mostly; > - >  struct mem_cgroup_threshold { >        struct eventfd_ctx *eventfd; >        u64 threshold; > @@ -343,7 +314,6 @@ static bool move_file(void) >  * limit reclaim to prevent infinite loops, if they ever occur. 
>  */ >  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100) > -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2) You might need to remove the comment above as well. > >  enum charge_type { >        MEM_CGROUP_CHARGE_TYPE_CACHE = 0, > @@ -398,164 +368,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page) >        return mem_cgroup_zoneinfo(memcg, nid, zid); >  } > > -static struct mem_cgroup_tree_per_zone * > -soft_limit_tree_node_zone(int nid, int zid) > -{ > -       return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; > -} > - > -static struct mem_cgroup_tree_per_zone * > -soft_limit_tree_from_page(struct page *page) > -{ > -       int nid = page_to_nid(page); > -       int zid = page_zonenum(page); > - > -       return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; > -} > - > -static void > -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg, > -                               struct mem_cgroup_per_zone *mz, > -                               struct mem_cgroup_tree_per_zone *mctz, > -                               unsigned long long new_usage_in_excess) > -{ > -       struct rb_node **p = &mctz->rb_root.rb_node; > -       struct rb_node *parent = NULL; > -       struct mem_cgroup_per_zone *mz_node; > - > -       if (mz->on_tree) > -               return; > - > -       mz->usage_in_excess = new_usage_in_excess; > -       if (!mz->usage_in_excess) > -               return; > -       while (*p) { > -               parent = *p; > -               mz_node = rb_entry(parent, struct mem_cgroup_per_zone, > -                                       tree_node); > -               if (mz->usage_in_excess < mz_node->usage_in_excess) > -                       p = &(*p)->rb_left; > -               /* > -                * We can't avoid mem cgroups that are over their soft > -                * limit by the same amount > -                */ > -               else if (mz->usage_in_excess >= 
mz_node->usage_in_excess) > -                       p = &(*p)->rb_right; > -       } > -       rb_link_node(&mz->tree_node, parent, p); > -       rb_insert_color(&mz->tree_node, &mctz->rb_root); > -       mz->on_tree = true; > -} > - > -static void > -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, > -                               struct mem_cgroup_per_zone *mz, > -                               struct mem_cgroup_tree_per_zone *mctz) > -{ > -       if (!mz->on_tree) > -               return; > -       rb_erase(&mz->tree_node, &mctz->rb_root); > -       mz->on_tree = false; > -} > - > -static void > -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, > -                               struct mem_cgroup_per_zone *mz, > -                               struct mem_cgroup_tree_per_zone *mctz) > -{ > -       spin_lock(&mctz->lock); > -       __mem_cgroup_remove_exceeded(memcg, mz, mctz); > -       spin_unlock(&mctz->lock); > -} > - > - > -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) > -{ > -       unsigned long long excess; > -       struct mem_cgroup_per_zone *mz; > -       struct mem_cgroup_tree_per_zone *mctz; > -       int nid = page_to_nid(page); > -       int zid = page_zonenum(page); > -       mctz = soft_limit_tree_from_page(page); > - > -       /* > -        * Necessary to update all ancestors when hierarchy is used. > -        * because their event counter is not touched. > -        */ > -       for (; memcg; memcg = parent_mem_cgroup(memcg)) { > -               mz = mem_cgroup_zoneinfo(memcg, nid, zid); > -               excess = res_counter_soft_limit_excess(&memcg->res); > -               /* > -                * We have to update the tree if mz is on RB-tree or > -                * mem is over its softlimit. 
> -                */ > -               if (excess || mz->on_tree) { > -                       spin_lock(&mctz->lock); > -                       /* if on-tree, remove it */ > -                       if (mz->on_tree) > -                               __mem_cgroup_remove_exceeded(memcg, mz, mctz); > -                       /* > -                        * Insert again. mz->usage_in_excess will be updated. > -                        * If excess is 0, no tree ops. > -                        */ > -                       __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess); > -                       spin_unlock(&mctz->lock); > -               } > -       } > -} > - > -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) > -{ > -       int node, zone; > -       struct mem_cgroup_per_zone *mz; > -       struct mem_cgroup_tree_per_zone *mctz; > - > -       for_each_node_state(node, N_POSSIBLE) { > -               for (zone = 0; zone < MAX_NR_ZONES; zone++) { > -                       mz = mem_cgroup_zoneinfo(memcg, node, zone); > -                       mctz = soft_limit_tree_node_zone(node, zone); > -                       mem_cgroup_remove_exceeded(memcg, mz, mctz); > -               } > -       } > -} > - > -static struct mem_cgroup_per_zone * > -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) > -{ > -       struct rb_node *rightmost = NULL; > -       struct mem_cgroup_per_zone *mz; > - > -retry: > -       mz = NULL; > -       rightmost = rb_last(&mctz->rb_root); > -       if (!rightmost) > -               goto done;              /* Nothing to reclaim from */ > - > -       mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node); > -       /* > -        * Remove the node now but someone else can add it back, > -        * we will to add it back at the end of reclaim to its correct > -        * position in the tree. 
> -        */ > -       __mem_cgroup_remove_exceeded(mz->mem, mz, mctz); > -       if (!res_counter_soft_limit_excess(&mz->mem->res) || > -               !css_tryget(&mz->mem->css)) > -               goto retry; > -done: > -       return mz; > -} > - > -static struct mem_cgroup_per_zone * > -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) > -{ > -       struct mem_cgroup_per_zone *mz; > - > -       spin_lock(&mctz->lock); > -       mz = __mem_cgroup_largest_soft_limit_node(mctz); > -       spin_unlock(&mctz->lock); > -       return mz; > -} > - >  /* >  * Implementation Note: reading percpu statistics for memcg. >  * > @@ -696,9 +508,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, >                case MEM_CGROUP_TARGET_THRESH: >                        next = val + THRESHOLDS_EVENTS_TARGET; >                        break; > -               case MEM_CGROUP_TARGET_SOFTLIMIT: > -                       next = val + SOFTLIMIT_EVENTS_TARGET; > -                       break; >                case MEM_CGROUP_TARGET_NUMAINFO: >                        next = val + NUMAINFO_EVENTS_TARGET; >                        break; > @@ -718,13 +527,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, >  static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) >  { >        preempt_disable(); > -       /* threshold event is triggered in finer grain than soft limit */ > +       /* threshold event is triggered in finer grain than numa info */ >        if (unlikely(mem_cgroup_event_ratelimit(memcg, >                                                MEM_CGROUP_TARGET_THRESH))) { > -               bool do_softlimit, do_numainfo; > +               bool do_numainfo; > > -               do_softlimit = mem_cgroup_event_ratelimit(memcg, > -                                               MEM_CGROUP_TARGET_SOFTLIMIT); >  #if MAX_NUMNODES > 1 >                do_numainfo = mem_cgroup_event_ratelimit(memcg, >             
                                   MEM_CGROUP_TARGET_NUMAINFO); > @@ -732,8 +539,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) >                preempt_enable(); > >                mem_cgroup_threshold(memcg); > -               if (unlikely(do_softlimit)) > -                       mem_cgroup_update_tree(memcg, page); >  #if MAX_NUMNODES > 1 >                if (unlikely(do_numainfo)) >                        atomic_inc(&memcg->numainfo_events); > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) >        return margin >> PAGE_SHIFT; >  } > > +/** > + * mem_cgroup_over_softlimit > + * @root: hierarchy root > + * @memcg: child of @root to test > + * > + * Returns %true if @memcg exceeds its own soft limit or contributes > + * to the soft limit excess of one of its parents up to and including > + * @root. > + */ > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > +                              struct mem_cgroup *memcg) > +{ > +       if (mem_cgroup_disabled()) > +               return false; > + > +       if (!root) > +               root = root_mem_cgroup; > + > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) { > +               /* root_mem_cgroup does not have a soft limit */ > +               if (memcg == root_mem_cgroup) > +                       break; > +               if (res_counter_soft_limit_excess(&memcg->res)) > +                       return true; > +               if (memcg == root) > +                       break; > +       } Here it adds pressure on a cgroup if one of its parents exceeds soft limit, although the cgroup itself is under soft limit. It does change my understanding of soft limit, and might introduce regression of our existing use cases. Here is an example: Machine capacity 32G and we over-commit by 8G. 
root
  -> A (hard limit 20G, soft limit 15G, usage 16G)
     -> A1 (soft limit 5G, usage 4G)
     -> A2 (soft limit 10G, usage 12G)
  -> B (hard limit 20G, soft limit 10G, usage 16G)

Under global reclaim, we don't want to add pressure on A1, even though
its parent A exceeds its soft limit. Assuming the soft limit is set to
match each cgroup's working set size (hot memory), this would
introduce a regression for A1.

In my existing implementation, I check each cgroup's soft limit on its
own, without looking at its ancestors.

> +       return false;
> +}
> +
>  int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
>        struct cgroup *cgrp = memcg->css.cgroup;
> @@ -1687,64 +1522,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
>  }
>  #endif
>
> -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> -                                  struct zone *zone,
> -                                  gfp_t gfp_mask,
> -                                  unsigned long *total_scanned)
> -{
> -       struct mem_cgroup *victim = NULL;
> -       int total = 0;
> -       int loop = 0;
> -       unsigned long excess;
> -       unsigned long nr_scanned;
> -       struct mem_cgroup_reclaim_cookie reclaim = {
> -               .zone = zone,
> -               .priority = 0,
> -       };
> -
> -       excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
> -
> -       while (1) {
> -               unsigned long nr_reclaimed;
> -
> -               victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
> -               if (!victim) {
> -                       loop++;
> -                       if (loop >= 2) {
> -                               /*
> -                                * If we have not been able to reclaim
> -                                * anything, it might because there are
> -                                * no reclaimable pages under this hierarchy
> -                                */
> -                               if (!total)
> -                                       break;
> -                               /*
> -                                * We want to do more targeted reclaim.
> -                                * excess >> 2 is not to excessive so as to
> -                                * reclaim too much, nor too less that we keep
> -                                * coming back to reclaim from this cgroup
> -                                */
> -                               if (total >= (excess >> 2) ||
> -                                       (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> -                                       break;
> -                       }
> -                       continue;
> -               }
> -               if (!mem_cgroup_reclaimable(victim, false))
> -                       continue;
> -               nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
> -                                                          zone, &nr_scanned);
> -               mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed,
> -                                          nr_scanned, current_is_kswapd());
> -               total += nr_reclaimed;
> -               *total_scanned += nr_scanned;
> -               if (!res_counter_soft_limit_excess(&root_memcg->res))
> -                       break;
> -       }
> -       mem_cgroup_iter_break(root_memcg, victim);
> -       return total;
> -}
> -
>  /*
>  * Check OOM-Killer is already running under our hierarchy.
>  * If someone is running, return false.
> @@ -2507,8 +2284,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>        unlock_page_cgroup(pc);
>        /*
>         * "charge_statistics" updated event counter. Then, check it.
> -        * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> -        * if they exceeds softlimit.
>         */ >        memcg_check_events(memcg, page); >  } > @@ -3578,98 +3353,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, >        return ret; >  } > > -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > -                                           gfp_t gfp_mask, > -                                           unsigned long *total_scanned) > -{ > -       unsigned long nr_reclaimed = 0; > -       struct mem_cgroup_per_zone *mz, *next_mz = NULL; > -       unsigned long reclaimed; > -       int loop = 0; > -       struct mem_cgroup_tree_per_zone *mctz; > -       unsigned long long excess; > -       unsigned long nr_scanned; > - > -       if (order > 0) > -               return 0; > - > -       mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); > -       /* > -        * This loop can run a while, specially if mem_cgroup's continuously > -        * keep exceeding their soft limit and putting the system under > -        * pressure > -        */ > -       do { > -               if (next_mz) > -                       mz = next_mz; > -               else > -                       mz = mem_cgroup_largest_soft_limit_node(mctz); > -               if (!mz) > -                       break; > - > -               nr_scanned = 0; > -               reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, > -                                                   gfp_mask, &nr_scanned); > -               nr_reclaimed += reclaimed; > -               *total_scanned += nr_scanned; > -               spin_lock(&mctz->lock); > - > -               /* > -                * If we failed to reclaim anything from this memory cgroup > -                * it is time to move on to the next cgroup > -                */ > -               next_mz = NULL; > -               if (!reclaimed) { > -                       do { > -                               /* > -                                * Loop until we find yet another one. 
> -                                * > -                                * By the time we get the soft_limit lock > -                                * again, someone might have aded the > -                                * group back on the RB tree. Iterate to > -                                * make sure we get a different mem. > -                                * mem_cgroup_largest_soft_limit_node returns > -                                * NULL if no other cgroup is present on > -                                * the tree > -                                */ > -                               next_mz = > -                               __mem_cgroup_largest_soft_limit_node(mctz); > -                               if (next_mz == mz) > -                                       css_put(&next_mz->mem->css); > -                               else /* next_mz == NULL or other memcg */ > -                                       break; > -                       } while (1); > -               } > -               __mem_cgroup_remove_exceeded(mz->mem, mz, mctz); > -               excess = res_counter_soft_limit_excess(&mz->mem->res); > -               /* > -                * One school of thought says that we should not add > -                * back the node to the tree if reclaim returns 0. > -                * But our reclaim could return 0, simply because due > -                * to priority we are exposing a smaller subset of > -                * memory to reclaim from. Consider this as a longer > -                * term TODO. 
> -                */ > -               /* If excess == 0, no tree ops */ > -               __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess); > -               spin_unlock(&mctz->lock); > -               css_put(&mz->mem->css); > -               loop++; > -               /* > -                * Could not reclaim anything and there are no more > -                * mem cgroups to try or we seem to be looping without > -                * reclaiming anything. > -                */ > -               if (!nr_reclaimed && > -                       (next_mz == NULL || > -                       loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) > -                       break; > -       } while (!nr_reclaimed); > -       if (next_mz) > -               css_put(&next_mz->mem->css); > -       return nr_reclaimed; > -} > - >  /* >  * This routine traverse page_cgroup in given list and drop them all. >  * *And* this routine doesn't reclaim page itself, just removes page_cgroup. > @@ -4816,9 +4499,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) >                mz = &pn->zoneinfo[zone]; >                for_each_lru(l) >                        INIT_LIST_HEAD(&mz->lruvec.lists[l]); > -               mz->usage_in_excess = 0; > -               mz->on_tree = false; > -               mz->mem = memcg; >        } >        memcg->info.nodeinfo[node] = pn; >        return 0; > @@ -4872,7 +4552,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) >  { >        int node; > > -       mem_cgroup_remove_from_trees(memcg); >        free_css_id(&mem_cgroup_subsys, &memcg->css); > >        for_each_node_state(node, N_POSSIBLE) > @@ -4927,31 +4606,6 @@ static void __init enable_swap_cgroup(void) >  } >  #endif > > -static int mem_cgroup_soft_limit_tree_init(void) > -{ > -       struct mem_cgroup_tree_per_node *rtpn; > -       struct mem_cgroup_tree_per_zone *rtpz; > -       int tmp, node, zone; > - > -       for_each_node_state(node, N_POSSIBLE) { > 
-               tmp = node; > -               if (!node_state(node, N_NORMAL_MEMORY)) > -                       tmp = -1; > -               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); > -               if (!rtpn) > -                       return 1; > - > -               soft_limit_tree.rb_tree_per_node[node] = rtpn; > - > -               for (zone = 0; zone < MAX_NR_ZONES; zone++) { > -                       rtpz = &rtpn->rb_tree_per_zone[zone]; > -                       rtpz->rb_root = RB_ROOT; > -                       spin_lock_init(&rtpz->lock); > -               } > -       } > -       return 0; > -} > - >  static struct cgroup_subsys_state * __ref >  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) >  { > @@ -4973,8 +4627,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) >                enable_swap_cgroup(); >                parent = NULL; >                root_mem_cgroup = memcg; > -               if (mem_cgroup_soft_limit_tree_init()) > -                       goto free_out; >                for_each_possible_cpu(cpu) { >                        struct memcg_stock_pcp *stock = >                                                &per_cpu(memcg_stock, cpu); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index e3fd8a7..4279549 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, >                        .mem_cgroup = memcg, >                        .zone = zone, >                }; > +               int epriority = priority; > +               /* > +                * Put more pressure on hierarchies that exceed their > +                * soft limit, to push them back harder than their > +                * well-behaving siblings. 
> +                */ > +               if (mem_cgroup_over_softlimit(root, memcg)) > +                       epriority = 0; > > -               shrink_mem_cgroup_zone(priority, &mz, sc); > +               shrink_mem_cgroup_zone(epriority, &mz, sc); > >                mem_cgroup_account_reclaim(root, memcg, >                                           sc->nr_reclaimed - nr_reclaimed, > @@ -2171,8 +2179,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, >  { >        struct zoneref *z; >        struct zone *zone; > -       unsigned long nr_soft_reclaimed; > -       unsigned long nr_soft_scanned; >        bool should_abort_reclaim = false; > >        for_each_zone_zonelist_nodemask(zone, z, zonelist, > @@ -2205,19 +2211,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, >                                        continue; >                                } >                        } > -                       /* > -                        * This steals pages from memory cgroups over softlimit > -                        * and returns the number of reclaimed pages and > -                        * scanned pages. This works for global memory pressure > -                        * and balancing, not for a memcg's limit. 
> -                        */ > -                       nr_soft_scanned = 0; > -                       nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, > -                                               sc->order, sc->gfp_mask, > -                                               &nr_soft_scanned); > -                       sc->nr_reclaimed += nr_soft_reclaimed; > -                       sc->nr_scanned += nr_soft_scanned; > -                       /* need some check for avoid more shrink_zone() */ >                } > >                shrink_zone(priority, zone, sc); > @@ -2393,48 +2386,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, >  } > >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR > - > -unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, > -                                               gfp_t gfp_mask, bool noswap, > -                                               struct zone *zone, > -                                               unsigned long *nr_scanned) > -{ > -       struct scan_control sc = { > -               .nr_scanned = 0, > -               .nr_to_reclaim = SWAP_CLUSTER_MAX, > -               .may_writepage = !laptop_mode, > -               .may_unmap = 1, > -               .may_swap = !noswap, > -               .order = 0, > -               .target_mem_cgroup = memcg, > -       }; > -       struct mem_cgroup_zone mz = { > -               .mem_cgroup = memcg, > -               .zone = zone, > -       }; > - > -       sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | > -                       (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); > - > -       trace_mm_vmscan_memcg_softlimit_reclaim_begin(0, > -                                                     sc.may_writepage, > -                                                     sc.gfp_mask); > - > -       /* > -        * NOTE: Although we can get the priority field, using it > -        * here is not a good idea, since it limits the pages we can scan. 
> -        * if we don't reclaim here, the shrink_zone from balance_pgdat > -        * will pick up pages from other mem cgroup's as well. We hack > -        * the priority and make it zero. > -        */ > -       shrink_mem_cgroup_zone(0, &mz, &sc); > - > -       trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); > - > -       *nr_scanned = sc.nr_scanned; > -       return sc.nr_reclaimed; > -} > - >  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, >                                           gfp_t gfp_mask, >                                           bool noswap) > @@ -2609,8 +2560,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, >        int end_zone = 0;       /* Inclusive.  0 = ZONE_DMA */ >        unsigned long total_scanned; >        struct reclaim_state *reclaim_state = current->reclaim_state; > -       unsigned long nr_soft_reclaimed; > -       unsigned long nr_soft_scanned; >        struct scan_control sc = { >                .gfp_mask = GFP_KERNEL, >                .may_unmap = 1, > @@ -2701,17 +2650,6 @@ loop_again: >                                continue; > >                        sc.nr_scanned = 0; > - > -                       nr_soft_scanned = 0; > -                       /* > -                        * Call soft limit reclaim before calling shrink_zone. 
> -                        */ > -                       nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, > -                                                       order, sc.gfp_mask, > -                                                       &nr_soft_scanned); > -                       sc.nr_reclaimed += nr_soft_reclaimed; > -                       total_scanned += nr_soft_scanned; > - >                        /* >                         * We put equal pressure on every zone, unless >                         * one zone has way too many pages free > -- > 1.7.7.5 > --Ying From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934179Ab2AKWeJ (ORCPT ); Wed, 11 Jan 2012 17:34:09 -0500 Received: from mail-qy0-f174.google.com ([209.85.216.174]:39854 "EHLO mail-qy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934164Ab2AKWeA convert rfc822-to-8bit (ORCPT ); Wed, 11 Jan 2012 17:34:00 -0500 MIME-Version: 1.0 In-Reply-To: <20120111003020.GD24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> <20120111003020.GD24386@cmpxchg.org> Date: Wed, 11 Jan 2012 14:33:59 -0800 Message-ID: Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics From: Ying Han To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org X-System-Of-Record: true Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 10, 2012 at 4:30 PM, Johannes Weiner wrote: > On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote: >> Thank you for the patch and the stats looks reasonable to me, few >> questions as below: >> >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: >> > 
With the single per-zone LRU gone and global reclaim scanning
>> > individual memcgs, it's straight-forward to collect meaningful and
>> > accurate per-memcg reclaim statistics.
>> >
>> > This adds the following items to memory.stat:
>>
>> Some of the previous discussions including patches have similar stats
>> in memory.vmscan_stat API, which collects all the per-memcg vmscan
>> stats. I would like to understand more why we add into memory.stat
>> instead, and do we have plan to keep extending memory.stat for those
>> vmstat like stats?
>
> I think they were put into an extra file in particular to be able to
> write to this file to reset the statistics.  But in my opinion, it's
> trivial to calculate a delta from before and after running a workload,
> so I didn't really like adding kernel code for that.
>
> Did you have another reason for a separate file in mind?

Another reason I had them in a separate file is that it is easier to
extend. I don't know if we plan to have something like memory.vmstat,
or to just keep adding stuff to memory.stat. In general, I wanted to
keep memory.stat at a reasonable size, including only the basic
statistics. In my existing vmscan_stat patch, I have breakdowns of
reclaim stats into file/anon, which would make memory.stat even larger.

>> > pgreclaim
>>
>> Not sure if we want to keep this more consistent to /proc/vmstat, then
>> it will be "pgsteal"?
>
> The problem with that was that we didn't like to call pages stolen
> when they were reclaimed from within the cgroup, so we had pgfree for
> inner reclaim and pgsteal for outer reclaim, respectively.
>
> I found it cleaner to just go with pgreclaim, it's unambiguous and
> straight-forward.  Outer reclaim is designated by the hierarchy_
> prefix.
>
>> > pgscan
>> >
>> >  Number of pages reclaimed/scanned from that memcg due to its own
>> >  hard limit (or physical limit in case of the root memcg) by the
>> >  allocating task.
>> >
>> > kswapd_pgreclaim
>> > kswapd_pgscan
>>
>> we have "pgscan_kswapd_*" in vmstat, so maybe ?
>> "pgsteal_kswapd"
>> "pgscan_kswapd"
>>
>> >  Reclaim activity from kswapd due to the memcg's own limit.  Only
>> >  applicable to the root memcg for now since kswapd is only triggered
>> >  by physical limits, but kswapd-style reclaim based on memcg hard
>> >  limits is being developped.
>> >
>> > hierarchy_pgreclaim
>> > hierarchy_pgscan
>> > hierarchy_kswapd_pgreclaim
>> > hierarchy_kswapd_pgscan
>>
>> "pgsteal_hierarchy"
>> "pgsteal_kswapd_hierarchy"
>> ..
>>
>> No strong option on the naming, but try to make it more consistent to
>> existing API.
>
> I swear I tried, but the existing naming is pretty screwed up :(
>
> For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
> scan rates of direct reclaim vs. kswapd reclaim.  To get the total
> number of pages reclaimed, you sum them up.
>
> On the other hand, pgsteal_* does not differentiate between direct
> reclaim and kswapd, so to get direct reclaim numbers, you add up the
> pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
> which is in turn not available at zone granularity.

Agreed, and that always confuses me.

>
>> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
>> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
>
> These two function as namespaces, that's why I put hierarchy_ and
> kswapd_ at the beginning of the names.
>
> Given that we have kswapd_steal, would you be okay with doing it like
> this?  I mean, at least my naming conforms to ONE of the standards in
> /proc/vmstat, right? ;-)

I don't have much of a problem with the existing naming scheme, as long
as we document it well and make it less confusing.
>
>> > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
>> >         MEM_CGROUP_STAT_NSTATS,
>> >  };
>> >
>> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
>> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
>> > +
>> >  enum mem_cgroup_events_index {
>> >         MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
>> >         MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
>> >         MEM_CGROUP_EVENTS_COUNT,        /* # of pages paged in/out */
>> >         MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */
>> >         MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */
>> > +       MEM_CGROUP_EVENTS_PGRECLAIM,
>> > +       MEM_CGROUP_EVENTS_PGSCAN,
>> > +       MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
>> > +       MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
>> > +       MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
>> > +       MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
>> > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
>> > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,
>>
>> missing comment here?
>
> As if the lines weren't long enough already ;-) I'll add some.

Thanks.
>
>> >         MEM_CGROUP_EVENTS_NSTATS,
>> >  };
>> >  /*
>> > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>> >         return (memcg == root_mem_cgroup);
>> >  }
>> >
>> > +/**
>> > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
>> > + * @root: memcg that triggered reclaim
>> > + * @memcg: memcg that is actually being scanned
>> > + * @nr_reclaimed: number of pages reclaimed from @memcg
>> > + * @nr_scanned: number of pages scanned from @memcg
>> > + * @kswapd: whether reclaiming task is kswapd or allocator itself
>> > + */
>> > +void mem_cgroup_account_reclaim(struct mem_cgroup *root,
>> > +                               struct mem_cgroup *memcg,
>> > +                               unsigned long nr_reclaimed,
>> > +                               unsigned long nr_scanned,
>> > +                               bool kswapd)
>> > +{
>> > +       unsigned int offset = 0;
>> > +
>> > +       if (!root)
>> > +               root = root_mem_cgroup;
>> > +
>> > +       if (kswapd)
>> > +               offset += MEM_CGROUP_EVENTS_KSWAPD;
>> > +       if (root != memcg)
>> > +               offset += MEM_CGROUP_EVENTS_HIERARCHY;
>>
>> Just to be clear, here root cgroup has hierarchy_* stats always 0 ?
>
> That's correct, there can't be any hierarchical pressure on the
> topmost parent.

Thank you for clarifying.

>
>> Also, we might want to consider renaming the root here, something like
>> target? The root is confusing with root_mem_cgroup.
>
> It's the same naming scheme I used for the iterator functions
> (mem_cgroup_iter() and friends), so if we change it, I'd like to
> change it consistently.

That sounds good, and the change is separate from this effort.

>
> Having target and memcg as parameters is even more confusing and
> non-descriptive, IMO.
>
> Other places use mem_over_limit, which is a bit better, but quite
> long.
>
> Any other ideas for great names for parameters that designate a
> hierarchy root and a memcg in that hierarchy?

I don't have a better name than "target", which matches the naming in
scan_control as well. Or in this case, we can avoid passing both target
and memcg by doing something like:

+static inline void mem_cgroup_account_reclaim(
+                               struct mem_cgroup *memcg,
+                               unsigned long nr_reclaimed,
+                               unsigned long nr_scanned,
+                               bool kswapd,
+                               bool hierarchy)
+{
+}
+
+               mem_cgroup_account_reclaim(victim, nr_reclaimed,
+                                       nr_scanned, current_is_kswapd(),
+                                       target != victim);

Then we need to do something on the root_mem_cgroup before that. Just a
thought.

--Ying

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752122Ab2ALBzr (ORCPT ); Wed, 11 Jan 2012 20:55:47 -0500
Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:33531 "EHLO
        fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751890Ab2ALBzq (ORCPT ); Wed, 11 Jan 2012 20:55:46 -0500
X-SecurityPolicyCheck-FJ: OK by FujitsuOutboundMailChecker v1.3.1
Date: Thu, 12 Jan 2012 10:54:27 +0900
From: KAMEZAWA Hiroyuki
To: Johannes Weiner
Cc: Andrew Morton , Michal Hocko , Balbir Singh , Ying Han ,
        cgroups@vger.kernel.org, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
Message-Id: <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
        <1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
Organization: FUJITSU Co. LTD.
X-Mailer: Sylpheed 3.1.1 (GTK+ 2.10.14; i686-pc-mingw32)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 10 Jan 2012 16:02:52 +0100 Johannes Weiner wrote:

> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits. Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim. The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
>
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit. With the same end result of scanning
> everybody, but soft limit excessors a bit more.
>
> Advantages:
>
>  o smoother reclaim: soft limit reclaim is a separate stage before
>    global reclaim, whose result is not communicated down the line and
>    so overreclaim of the groups in excess is very likely. After this
>    patch, soft limit reclaim is fully integrated into regular reclaim
>    and each memcg is considered exactly once per cycle.
>
>  o true hierarchy support: soft limits are only considered when
>    kswapd does global reclaim, but after this patch, targetted
>    reclaim of a memcg will mind the soft limit settings of its child
>    groups.
>
>  o code size: soft limit reclaim requires a lot of code to maintain
>    the per-node per-zone rb-trees to quickly find the biggest
>    offender, dedicated paths for soft limit reclaim etc. while this
>    new implementation gets away without all that.
>
> Test:
>
> The test consists of two concurrent kernel build jobs in separate
> source trees, the master and the slave. The two jobs get along nicely
> on 600MB of available memory, so this is the zero overcommit control
> case. When available memory is decreased, the overcommit is
> compensated by decreasing the soft limit of the slave by the same
> amount, in the hope that the slave takes the hit and the master stays
> unaffected.
>
>                                    600M-0M-vanilla         600M-0M-patched
> Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
> Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
> Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
> Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
> Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
> Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
> Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
> Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
> Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
> Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
> Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
> Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
> Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
> Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
> Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
> Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
> Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
> Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
> Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
> Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
> Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
> Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
> Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)
>
> In the control case, the differences in elapsed time, number of major
> faults taken, and reclaim statistics are within the noise for both the
> master and the slave job.
>
>                                     600M-280M-vanilla      600M-280M-patched
> Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
> Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
> Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
> Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
> Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
> Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
> Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
> Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
> Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
> Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
> Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
> Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
> Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
> Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
> Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
> Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
> Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
> Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
> Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
> Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
> Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
>
> Here, the available memory is limited to 320 MB, the machine is
> overcommitted by 280 MB. The soft limit of the master is 300 MB, that
> of the slave merely 20 MB.
>
> Looking at the slave job first, it is much better off with the patched
> kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> a third. The result is much fewer major faults taken, which in turn
> lets the job finish quicker.
>
> It would be a zero-sum game if the improvement happened at the cost of
> the master but looking at the numbers, even the master performs better
> with the patched kernel. In fact, the master job is almost unaffected
> on the patched kernel compared to the control case.
>
> This is an odd phenomenon, as the patch does not directly change how
> the master is reclaimed. An explanation for this is that the severe
> overreclaim of the slave in the unpatched kernel results in the master
> growing bigger than in the patched case. Combining the fact that
> memcgs are scanned according to their size with the increased refault
> rate of the overreclaimed slave triggering global reclaim more often
> means that overall pressure on the master job is higher in the
> unpatched kernel.
>
> At any rate, the patched kernel seems to do a much better job at both
> overall resource allocation under soft limit overcommit as well as the
> requested prioritization of the master job.
>
> Signed-off-by: Johannes Weiner

Thank you for your work. The results look attractive and the code is
much simpler. My small concerns are:

1. This approach may increase the latency of direct reclaim because of
   priority=0.

2. When a NUMA-spread/interleaved application runs in its own container,
   pages on a node may be paged out again and again because of
   priority=0 if some other application runs on that node.
   It seems difficult to use soft limits with NUMA-aware applications.

Do you have suggestions?
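
To make concern 1 concrete, here is a rough userspace sketch (not
kernel code) of how the reclaim priority bounds the amount scanned in
one pass. It assumes the scan target is roughly the LRU size shifted
right by the priority, with the default priority being 12;
scan_target() and its lru_pages parameter are hypothetical names used
only for illustration.

```c
/* Default reclaim priority; reclaim starts here and counts down to 0. */
#define DEF_PRIORITY 12

/*
 * Hypothetical model of the per-pass scan target: each priority step
 * doubles the window, so priority 0 exposes the whole list at once.
 */
static unsigned long scan_target(unsigned long lru_pages, int priority)
{
	return lru_pages >> priority;
}
```

Under these assumptions, a list of 1048576 pages gives a target of 256
pages at DEF_PRIORITY and all 1048576 pages at priority 0, which is why
a direct reclaimer hitting a group over its soft limit may stall longer.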
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753194Ab2ALI7T (ORCPT ); Thu, 12 Jan 2012 03:59:19 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:57672 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753156Ab2ALI7R (ORCPT ); Thu, 12 Jan 2012 03:59:17 -0500 Date: Thu, 12 Jan 2012 09:59:04 +0100 From: Johannes Weiner To: Ying Han Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120112085904.GG24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: > On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > > Right now, memcg soft limits are implemented by having a sorted tree > > of memcgs that are in excess of their limits.  Under global memory > > pressure, kswapd first reclaims from the biggest excessor and then > > proceeds to do regular global reclaim.  The result of this is that > > pages are reclaimed from all memcgs, but more scanning happens against > > those above their soft limit. > > > > With global reclaim doing memcg-aware hierarchical reclaim by default, > > this is a lot easier to implement: everytime a memcg is reclaimed > > from, scan more aggressively (per tradition with a priority of 0) if > > it's above its soft limit.  With the same end result of scanning > > everybody, but soft limit excessors a bit more. 
> >
> > Advantages:
> >
> >  o smoother reclaim: soft limit reclaim is a separate stage before
> >    global reclaim, whose result is not communicated down the line and
> >    so overreclaim of the groups in excess is very likely.  After this
> >    patch, soft limit reclaim is fully integrated into regular reclaim
> >    and each memcg is considered exactly once per cycle.
> >
> >  o true hierarchy support: soft limits are only considered when
> >    kswapd does global reclaim, but after this patch, targetted
> >    reclaim of a memcg will mind the soft limit settings of its child
> >    groups.
>
> Why we add soft limit reclaim into target reclaim?

	-> A hard limit 10G, usage 10G
	   -> A1 soft limit 8G, usage 5G
	   -> A2 soft limit 2G, usage 5G

When A hits its hard limit, A2 will experience more pressure than A1.

Soft limits are already applied hierarchically: the memcg that is
picked from the tree is reclaimed hierarchically.  What I wanted to
add is the soft limit also being /triggerable/ from non-global
hierarchy levels.

> Based on the discussions, my understanding is that the soft limit only
> take effect while the whole machine is under memory contention. We
> don't want to add extra pressure on a cgroup if there is free memory
> on the system even the cgroup is above its limit.

If a hierarchy is under pressure, we will reclaim that hierarchy.  We
allow groups to be prioritized under global pressure, why not allow it
for local pressure as well?

I am not quite sure what you are objecting to.

> >  o code size: soft limit reclaim requires a lot of code to maintain
> >    the per-node per-zone rb-trees to quickly find the biggest
> >    offender, dedicated paths for soft limit reclaim etc. while this
> >    new implementation gets away without all that.
> >
> > Test:
> >
> > The test consists of two concurrent kernel build jobs in separate
> > source trees, the master and the slave.
The two jobs get along nicely > > on 600MB of available memory, so this is the zero overcommit control > > case.  When available memory is decreased, the overcommit is > > compensated by decreasing the soft limit of the slave by the same > > amount, in the hope that the slave takes the hit and the master stays > > unaffected. > > > >                                    600M-0M-vanilla         600M-0M-patched > > Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%) > > Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%) > > Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%) > > Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%) > > Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%) > > Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%) > > Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%) > > Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%) > > Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%) > > Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%) > > Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%) > > Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%) > > Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%) > > Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%) > > Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%) > > Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%) > > Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%) > > Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%) > > Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%) > > Slave scan (stddev)                33.31 (  +0.00%)        74.91 
(+121.24%) > > Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%) > > Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%) > > Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%) > > Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%) > > > > In the control case, the differences in elapsed time, number of major > > faults taken, and reclaim statistics are within the noise for both the > > master and the slave job. > > What's the soft limit setting in the controlled case? 300MB for both jobs. > I assume it is the default RESOURCE_MAX. So both Master and Slave get > equal pressure before/after the patch, and no differences on the stats > should be observed. Yes. The control case demonstrates that both jobs can fit comfortably, don't compete for space and that in general the patch does not have unexpected negative impact (after all, it modifies codepaths that were invoked regularly outside of reclaim). 
> >                                     600M-280M-vanilla      600M-280M-patched > > Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%) > > Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%) > > Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%) > > Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%) > > Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%) > > Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%) > > Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%) > > Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%) > > Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%) > > Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%) > > Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%) > > Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%) > > Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%) > > Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%) > > Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%) > > Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%) > > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%) > > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%) > > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%) > > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%) > > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%) > > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%) > > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%) > > Slave kswapd scan (stddev)   
    2523642.98 (  +0.00%)   156754.74 ( -93.79%) > > > > Here, the available memory is limited to 320 MB, the machine is > > overcommitted by 280 MB.  The soft limit of the master is 300 MB, that > > of the slave merely 20 MB. > > > > Looking at the slave job first, it is much better off with the patched > > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by > > a third.  The result is much fewer major faults taken, which in turn > > lets the job finish quicker. > > What's the setting of the hard limit here? Is the direct reclaim > referring to per-memcg directly reclaim or global one. The machine's memory is limited to 600M, the hard limits are unset. All reclaim is a result of global memory pressure. With the patched kernel, I could have used a dedicated parent cgroup and let master and slave run in children of this group, the soft limits would be taken into account just the same. But this does not work on the unpatched kernel, as soft limits are only recognized on the global level there. > > It would be a zero-sum game if the improvement happened at the cost of > > the master but looking at the numbers, even the master performs better > > with the patched kernel.  In fact, the master job is almost unaffected > > on the patched kernel compared to the control case. > > It makes sense since the master job get less affected by the patch > than the slave job under the example. Under the control case, if both > master and slave have RESOURCE_MAX soft limit setting, they are under > equal memory pressure(priority = DEF_PRIORITY) . On the second > example, only the slave pressure being increased by priority = 0, and > the Master got scanned with same priority = DEF_PRIORITY pretty much. > > So I would expect to see more reclaim activities happens in slave on > the patched kernel compared to the control case. It seems match the > testing result. 
Uhm, > > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%) > > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%) > > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%) > > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%) > > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%) > > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%) > > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%) > > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%) Direct reclaim _shrunk_ by 98%, kswapd reclaim by 31%. > > This is an odd phenomenon, as the patch does not directly change how > > the master is reclaimed.  An explanation for this is that the severe > > overreclaim of the slave in the unpatched kernel results in the master > > growing bigger than in the patched case.  Combining the fact that > > memcgs are scanned according to their size with the increased refault > > rate of the overreclaimed slave triggering global reclaim more often > > means that overall pressure on the master job is higher in the > > unpatched kernel. > > We can check the Master memory.usage_in_bytes while the job is running. Yep, the plots of cache/rss over time confirmed exactly this. The unpatched kernel shows higher spikes in the size of the master job followed by deeper pits when reclaim kicked in. The patched kernel is much smoother in that regard. > On the other hand, I don't see why we expect the Master being less > reclaimed in the controlled case? On the unpatched kernel, the Master > is being reclaimed under global pressure each time anyway since we > ignore the return value of softlimit. I didn't expect that, I expected both jobs to perform equally in the control case. And in the pressurized case, the master being unaffected and the slave taking the hit. 
The patched kernel does this, the unpatched one does not. > > @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > >                                                      struct zone *zone); > >  struct zone_reclaim_stat* > >  mem_cgroup_get_reclaim_stat_from_page(struct page *page); > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *); > > Maybe something like "mem_cgroup_over_soft_limit()" ? Probably more consistent, yeah. Will do. > > @@ -343,7 +314,6 @@ static bool move_file(void) > >  * limit reclaim to prevent infinite loops, if they ever occur. > >  */ > >  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100) > > -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2) > > You might need to remove the comment above as well. Oops, will fix. > > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > >        return margin >> PAGE_SHIFT; > >  } > > > > +/** > > + * mem_cgroup_over_softlimit > > + * @root: hierarchy root > > + * @memcg: child of @root to test > > + * > > + * Returns %true if @memcg exceeds its own soft limit or contributes > > + * to the soft limit excess of one of its parents up to and including > > + * @root. 
> > + */ > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > > +                              struct mem_cgroup *memcg) > > +{ > > +       if (mem_cgroup_disabled()) > > +               return false; > > + > > +       if (!root) > > +               root = root_mem_cgroup; > > + > > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) { > > +               /* root_mem_cgroup does not have a soft limit */ > > +               if (memcg == root_mem_cgroup) > > +                       break; > > +               if (res_counter_soft_limit_excess(&memcg->res)) > > +                       return true; > > +               if (memcg == root) > > +                       break; > > +       } > > Here it adds pressure on a cgroup if one of its parents exceeds soft > limit, although the cgroup itself is under soft limit. It does change > my understanding of soft limit, and might introduce regression of our > existing use cases. > > Here is an example: > > Machine capacity 32G and we over-commit by 8G. > > root > -> A (hard limit 20G, soft limit 15G, usage 16G) > -> A1 (soft limit 5G, usage 4G) > -> A2 (soft limit 10G, usage 12G) > -> B (hard limit 20G, soft limit 10G, usage 16G) > > under global reclaim, we don't want to add pressure on A1 although its > parent A exceeds its soft limit. Assume that if we set the soft limit > corresponding to each cgroup's working set size (hot memory), and it > will introduce regression to A1 in that case. > > In my existing implementation, i am checking the cgroup's soft limit > standalone w/o looking its ancestors. Why do you set the soft limit of A in the first place if you don't want it to be enforced? This is not really new behaviour, soft limit reclaim has always been operating hierarchically on the biggest excessor. 
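For readers following the A/A1/A2/B example, here is a minimal userspace sketch of the ancestor walk that the quoted mem_cgroup_over_softlimit() performs. It is a simplified model, not kernel code: struct group, soft_limit_excess() and over_soft_limit() are hypothetical stand-ins, sizes are in whole gigabytes, and a soft limit of UINT64_MAX is assumed to mean "not set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct mem_cgroup. */
struct group {
	const struct group *parent;	/* NULL for the top-level group */
	uint64_t soft_limit_gb;		/* UINT64_MAX models "no soft limit set" */
	uint64_t usage_gb;		/* hierarchical usage, children included */
};

/* Amount by which a single group exceeds its own soft limit, or 0. */
static uint64_t soft_limit_excess(const struct group *g)
{
	return g->usage_gb > g->soft_limit_gb ?
	       g->usage_gb - g->soft_limit_gb : 0;
}

/*
 * True if @g exceeds its own soft limit, or if any ancestor up to and
 * including @root exceeds its soft limit.  A NULL @root walks all the
 * way to the top-level group.
 */
static bool over_soft_limit(const struct group *root, const struct group *g)
{
	assert(g != NULL);

	for (; g; g = g->parent) {
		if (soft_limit_excess(g))
			return true;
		if (g == root)
			break;
	}
	return false;
}
```

Fed with the example hierarchy (A: soft 15G, usage 16G; A1: soft 5G, usage 4G), over_soft_limit() reports A1 as over the limit because of A, while soft_limit_excess() on A1 alone returns 0. That is the difference between the hierarchical check and the standalone check discussed here.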
In your case, the excess of A is smaller than the excess of A2 and so that weird "only pick the biggest excessor" behaviour hides it, but consider this: -> A soft 30G, usage 39G -> A1 soft 5G, usage 4G -> A2 soft 10G, usage 15G -> A3 soft 15G, usage 20G Upstream would pick A from the soft limit tree and reclaim its children with priority 0, including A1. On the other hand, if you don't consider ancestral soft limits, you break perfectly reasonable setups like these -> A soft 10G, usage 20G -> A1 usage 10G -> A2 usage 10G -> B soft 10G, usage 11G where upstream would pick A and reclaim it recursively, but your version would only apply higher pressure to B. If you would just not set the soft limit of A in your case: -> A (hard limit 20G, usage 16G) -> A1 (soft limit 5G, usage 4G) -> A2 (soft limit 10G, usage 12G) -> B (hard limit 20G, soft limit 10G, usage 16G) only A2 and B would experience higher pressure upon global pressure. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753084Ab2ALJRi (ORCPT ); Thu, 12 Jan 2012 04:17:38 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:60261 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751715Ab2ALJRe (ORCPT ); Thu, 12 Jan 2012 04:17:34 -0500 Date: Thu, 12 Jan 2012 10:17:21 +0100 From: Johannes Weiner To: Ying Han Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics Message-ID: <20120112091721.GH24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-2-git-send-email-hannes@cmpxchg.org> <20120111003020.GD24386@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 11, 2012 at 02:33:59PM -0800, Ying Han wrote: > On Tue, Jan 10, 2012 at 4:30 PM, Johannes Weiner wrote: > > On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote: > >> Thank you for the patch and the stats looks reasonable to me, few > >> questions as below: > >> > >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > >> > With the single per-zone LRU gone and global reclaim scanning > >> > individual memcgs, it's straight-forward to collect meaningful and > >> > accurate per-memcg reclaim statistics. > >> > > >> > This adds the following items to memory.stat: > >> > >> Some of the previous discussions including patches have similar stats > >> in memory.vmscan_stat API, which collects all the per-memcg vmscan > >> stats. I would like to understand more why we add into memory.stat > >> instead, and do we have plan to keep extending memory.stat for those > >> vmstat like stats? > > > > I think they were put into an extra file in particular to be able to > > write to this file to reset the statistics.  But in my opinion, it's > > trivial to calculate a delta from before and after running a workload, > > so I didn't really like adding kernel code for that. > > > > Did you have another reason for a separate file in mind? > > Another reason I had them in separate file is easier to extend. I > don't know if we have plan to have something like memory.vmstat, or > just keep adding stuff into memory.stat. In general, I wanted to keep > the memory.stat being reasonable size including only the basic > statistics. In my existing vmscan_stat path, i have breakdowns of > reclaim stats into file/anon which will make the memory.stat even > larger. Do you think it's a problem of presentation, where we want to allow admins to figure out the memcg parameters at a glance when looking at memory.stat but be able to debug malfunction by looking at the more extensive vmstat file? 
> >> >  Reclaim activity from kswapd due to the memcg's own limit.  Only
> >> >  applicable to the root memcg for now since kswapd is only triggered
> >> >  by physical limits, but kswapd-style reclaim based on memcg hard
> >> >  limits is being developped.
> >> >
> >> > hierarchy_pgreclaim
> >> > hierarchy_pgscan
> >> > hierarchy_kswapd_pgreclaim
> >> > hierarchy_kswapd_pgscan
> >>
> >> "pgsteal_hierarchy"
> >> "pgsteal_kswapd_hierarchy"
> >> ..
> >>
> >> No strong option on the naming, but try to make it more consistent to
> >> existing API.
> >
> > I swear I tried, but the existing naming is pretty screwed up :(
> >
> > For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
> > scan rates of direct reclaim vs. kswapd reclaim.  To get the total
> > number of pages reclaimed, you sum them up.
> >
> > On the other hand, pgsteal_* does not differentiate between direct
> > reclaim and kswapd, so to get direct reclaim numbers, you add up the
> > pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
> > which is in turn not available at zone granularity.
>
> agree and that always confuses me.

I just have scripts that present it as 'Direct page reclaimed' and
'Kswapd page reclaimed' when evaluating data so I don't have to
remember anymore :-)

But I think the wish for consistency is a bit misguided when we end up
with something like pgpgin that means something completely different
in memcg than it does on the global level.

Likewise, I don't want to use pgsteal_* and pgsteal_kswapd_* because
of their similarity to /proc/vmstat while the numbers represent
something different.
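The counter arithmetic quoted above can be sketched in a few lines. This is only an illustration: the function name and the array-of-zones layout are made up here, and the only thing taken from the mail is the rule "add up the pgsteal_* counters and subtract kswapd_steal".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Derive the number of pages reclaimed by direct reclaim from
 * global vmstat-style counters: sum the per-zone pgsteal_* values
 * (which cover direct reclaim and kswapd together) and subtract
 * kswapd_steal.  The array layout is a hypothetical convenience.
 */
static uint64_t direct_reclaimed(const uint64_t *pgsteal_per_zone,
				 size_t nr_zones, uint64_t kswapd_steal)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr_zones; i++)
		total += pgsteal_per_zone[i];

	/* kswapd's share can never exceed the combined total. */
	assert(total >= kswapd_steal);
	return total - kswapd_steal;
}
```

With per-zone values of 100, 2000 and 300 pages and a kswapd_steal of 1900, the function attributes the remaining 500 pages to direct reclaim.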
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758044Ab2AMMRO (ORCPT ); Fri, 13 Jan 2012 07:17:14 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:47501 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753099Ab2AMMRJ (ORCPT ); Fri, 13 Jan 2012 07:17:09 -0500 Date: Fri, 13 Jan 2012 13:16:56 +0100 From: Johannes Weiner To: KAMEZAWA Hiroyuki Cc: Andrew Morton , Michal Hocko , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120113121645.GA1653@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 12, 2012 at 10:54:27AM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 10 Jan 2012 16:02:52 +0100 > Johannes Weiner wrote: > > > Right now, memcg soft limits are implemented by having a sorted tree > > of memcgs that are in excess of their limits. Under global memory > > pressure, kswapd first reclaims from the biggest excessor and then > > proceeds to do regular global reclaim. The result of this is that > > pages are reclaimed from all memcgs, but more scanning happens against > > those above their soft limit. > > > > With global reclaim doing memcg-aware hierarchical reclaim by default, > > this is a lot easier to implement: everytime a memcg is reclaimed > > from, scan more aggressively (per tradition with a priority of 0) if > > it's above its soft limit. 
With the same end result of scanning > > everybody, but soft limit excessors a bit more. > > > > Advantages: > > > > o smoother reclaim: soft limit reclaim is a separate stage before > > global reclaim, whose result is not communicated down the line and > > so overreclaim of the groups in excess is very likely. After this > > patch, soft limit reclaim is fully integrated into regular reclaim > > and each memcg is considered exactly once per cycle. > > > > o true hierarchy support: soft limits are only considered when > > kswapd does global reclaim, but after this patch, targetted > > reclaim of a memcg will mind the soft limit settings of its child > > groups. > > > > o code size: soft limit reclaim requires a lot of code to maintain > > the per-node per-zone rb-trees to quickly find the biggest > > offender, dedicated paths for soft limit reclaim etc. while this > > new implementation gets away without all that. > > > > Test: > > > > The test consists of two concurrent kernel build jobs in separate > > source trees, the master and the slave. The two jobs get along nicely > > on 600MB of available memory, so this is the zero overcommit control > > case. When available memory is decreased, the overcommit is > > compensated by decreasing the soft limit of the slave by the same > > amount, in the hope that the slave takes the hit and the master stays > > unaffected. 
> >
> >                                    600M-0M-vanilla         600M-0M-patched
> > Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
> > Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
> > Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
> > Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
> > Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
> > Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> > Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
> > Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
> > Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
> > Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
> > Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
> > Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
> > Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
> > Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
> > Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
> > Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
> > Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
> > Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
> > Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
> > Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
> > Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
> > Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)
> >
> > In the control case, the differences in elapsed time, number of major
> > faults taken, and reclaim statistics are within the noise for both the
> > master and the slave job.
> >
> >                                     600M-280M-vanilla      600M-280M-patched
> > Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
> > Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
> > Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
> > Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
> > Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
> > Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
> > Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
> > Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
> > Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
> > Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
> > Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
> > Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
> > Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
> > Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
> > Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
> > Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
> > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
> > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
> > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
> > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
> >
> > Here, the available memory is limited to 320 MB, the machine is
> > overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> > of the slave merely 20 MB.
> >
> > Looking at the slave job first, it is much better off with the patched
> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> > a third.  The result is much fewer major faults taken, which in turn
> > lets the job finish quicker.
> > > > It would be a zero-sum game if the improvement happened at the cost of > > the master but looking at the numbers, even the master performs better > > with the patched kernel. In fact, the master job is almost unaffected > > on the patched kernel compared to the control case. > > > > This is an odd phenomenon, as the patch does not directly change how > > the master is reclaimed. An explanation for this is that the severe > > overreclaim of the slave in the unpatched kernel results in the master > > growing bigger than in the patched case. Combining the fact that > > memcgs are scanned according to their size with the increased refault > > rate of the overreclaimed slave triggering global reclaim more often > > means that overall pressure on the master job is higher in the > > unpatched kernel. > > > > At any rate, the patched kernel seems to do a much better job at both > > overall resource allocation under soft limit overcommit as well as the > > requested prioritization of the master job. > > > > Signed-off-by: Johannes Weiner > > Thank you for your work and the result seems atractive and code is much > simpler. My small concerns are.. > > 1. This approach may increase latency of direct-reclaim because of priority=0. I think strictly speaking yes, but note that with kswapd being less likely to get stuck in hammering on one group, the need for allocators to enter direct reclaim itself is reduced. However, if this really becomes a problem in real world loads, the fix is pretty easy: just ignore the soft limit for direct reclaim. We can still consider it from hard limit reclaim and kswapd. > 2. In a case numa-spread/interleave application run in its own container, > pages on a node may paged-out again and again becasue of priority=0 > if some other application runs in the node. > It seems difficult to use soft-limit with numa-aware applications. > Do you have suggestions ? 

This is a question about soft limits in general rather than about this
particular patch, right?

And if I understand correctly, the problem you are referring to is
this: an application and parts of a soft-limited container share a
node, the soft limit setting means that the container's pages on that
node are reclaimed harder.  At that point, the container's share on
that node becomes tiny, but since the soft limit is oblivious to
nodes, the expansion of the other application pushes the soft-limited
container off that node completely as long as the container stays
above its soft limit with the usage on other nodes.

What would you think about having node-local soft limits that take the
node size into account?

	local_soft_limit = soft_limit * node_size / memcg_size

The soft limit can be exceeded globally, but the container is no
longer pushed off a node on which it's only occupying a small share of
memory.

Putting it into proportion of the memcg size, not overall memory size
has the following advantages:

  1. if the container is sitting on only one of several available
     nodes without exceeding the limit globally, the memcg will not be
     reclaimed harder just because it has a relatively large share of
     the node.

  2. if the soft limit excess is ridiculously high, the local soft
     limits will be pushed down, so the tolerance for smaller shares
     on nodes goes down in proportion to the global soft limit excess.

Example:

	4G soft limit * 2G node / 4G container = 2G node-local limit

The container is globally within its soft limit, so the local limit is
at least the size of the node.  It's never reclaimed harder compared
to other applications on the node.

	4G soft limit * 2G node / 5G container = ~1.6G node-local limit

Here, it will experience more pressure initially, but it will level
off when the shrinking usage and the thereby increasing node-local
soft limit meet.  From that point on, the container and the competing
application will be treated equally during reclaim.

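The proposed formula is easy to check numerically. The following is only a sketch of the arithmetic, not a proposed kernel interface: the function name is made up, and sizes are assumed to be in MiB so that the 64-bit product cannot overflow for realistic machine sizes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * local_soft_limit = soft_limit * node_size / memcg_size
 *
 * All sizes in MiB (an assumption made here to keep the intermediate
 * product well inside 64 bits).  Integer division rounds down.
 */
static uint64_t node_local_soft_limit_mb(uint64_t soft_limit_mb,
					 uint64_t node_size_mb,
					 uint64_t memcg_size_mb)
{
	/* An empty memcg has nothing to reclaim; reject division by zero. */
	assert(memcg_size_mb > 0);
	return soft_limit_mb * node_size_mb / memcg_size_mb;
}
```

For a 4G soft limit on a 2G node this yields 2048 MiB for a 4G container, 1638 MiB (roughly 1.6G) for a 5G container, and 512 MiB for a 16G container, matching the figures given in this mail.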
Finally, if the container is 16G in size, i.e. 300% in excess, the per-node tolerance is at 512M node-local soft limit, which IMO strikes a good balance between zero tolerance and still applying some stress to the hugely oversized container when other applications (with virtually unlimited soft limits) want to run on the same node. What do you think? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758014Ab2AMMEM (ORCPT ); Fri, 13 Jan 2012 07:04:12 -0500 Received: from cantor2.suse.de ([195.135.220.15]:50334 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752643Ab2AMMEJ (ORCPT ); Fri, 13 Jan 2012 07:04:09 -0500 Date: Fri, 13 Jan 2012 13:04:06 +0100 From: Michal Hocko To: Johannes Weiner Cc: Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120113120406.GC17060@tiehlicka.suse.cz> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 10-01-12 16:02:52, Johannes Weiner wrote: > Right now, memcg soft limits are implemented by having a sorted tree > of memcgs that are in excess of their limits. Under global memory > pressure, kswapd first reclaims from the biggest excessor and then > proceeds to do regular global reclaim. The result of this is that > pages are reclaimed from all memcgs, but more scanning happens against > those above their soft limit. 
>
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit.  With the same end result of scanning
> everybody, but soft limit excessors a bit more.
>
> Advantages:
>
>   o smoother reclaim: soft limit reclaim is a separate stage before
>     global reclaim, whose result is not communicated down the line and
>     so overreclaim of the groups in excess is very likely.  After this
>     patch, soft limit reclaim is fully integrated into regular reclaim
>     and each memcg is considered exactly once per cycle.
>
>   o true hierarchy support: soft limits are only considered when
>     kswapd does global reclaim, but after this patch, targetted
>     reclaim of a memcg will mind the soft limit settings of its child
>     groups.

Yes, it makes sense. At first I was thinking that the soft limit should
be considered only under global memory pressure (at least the
documentation says so) but now it makes sense. We can push on
over-soft-limit groups more because they told us they could sacrifice
something... Anyway, the documentation needs an update as well.

But we have to be a little bit careful here. I am still quite confused
about how we should handle hierarchies vs. subtrees. See below.

>
>   o code size: soft limit reclaim requires a lot of code to maintain
>     the per-node per-zone rb-trees to quickly find the biggest
>     offender, dedicated paths for soft limit reclaim etc. while this
>     new implementation gets away without all that.

On my i386 PAE setup (including swap extension enabled):

Before
   text	   data	    bss	    dec	    hex	filename
 310086	  29970	  35372	 375428	  5ba84	mm/built-in.o

After
size mm/built-in.o
   text	   data	    bss	    dec	    hex	filename
 309048	  30030	  35372	 374450	  5b6b2	mm/built-in.o

I would expect a bigger difference but still good.

> Test:

Will look into results later.

[...]
> Signed-off-by: Johannes Weiner
> ---
>  include/linux/memcontrol.h |   18 +--
>  mm/memcontrol.c            |  412 ++++----------------------------------------
>  mm/vmscan.c                |   80 +--------
>  3 files changed, 48 insertions(+), 462 deletions(-)

Really nice to see

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 170dff4..d4f7ae5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  	return margin >> PAGE_SHIFT;
>  }
>
> +/**
> + * mem_cgroup_over_softlimit
> + * @root: hierarchy root
> + * @memcg: child of @root to test
> + *
> + * Returns %true if @memcg exceeds its own soft limit or contributes
> + * to the soft limit excess of one of its parents up to and including
> + * @root.
> + */
> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> +			       struct mem_cgroup *memcg)
> +{
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	if (!root)
> +		root = root_mem_cgroup;
> +
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		/* root_mem_cgroup does not have a soft limit */
> +		if (memcg == root_mem_cgroup)
> +			break;
> +		if (res_counter_soft_limit_excess(&memcg->res))
> +			return true;
> +		if (memcg == root)
> +			break;
> +	}
> +	return false;
> +}

Well, this might be a little bit tricky. We do not check whether memcg
and root are in a hierarchy (in terms of use_hierarchy) relation.

If we are under global reclaim then we iterate over all memcgs and so
there is no guarantee that there is a hierarchical relation between the
given memcg and its parent. While, on the other hand, if we are doing
memcg reclaim then we have this guarantee.

Why should we punish a group (subtree) which is perfectly under its soft
limit just because some other subtree contributes to the common parent's
usage and makes it over its limit?
Should we check memcg->use_hierarchy here?

Does it even make sense to set up a soft limit on a parent group without
hierarchies?
Well, I have to admit that hierarchies give me a headache.

> +
>  int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
>  	struct cgroup *cgrp = memcg->css.cgroup;
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e3fd8a7..4279549 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>  			.mem_cgroup = memcg,
>  			.zone = zone,
>  		};
> +		int epriority = priority;
> +		/*
> +		 * Put more pressure on hierarchies that exceed their
> +		 * soft limit, to push them back harder than their
> +		 * well-behaving siblings.
> +		 */
> +		if (mem_cgroup_over_softlimit(root, memcg))
> +			epriority = 0;

This sounds too aggressive to me. Shouldn't we just double the pressure
or something like that?
Previously we always had nr_to_reclaim == SWAP_CLUSTER_MAX when we did
memcg reclaim but this is not the case now. For kswapd we have
nr_to_reclaim == ULONG_MAX so we will not break out of the reclaim early
and we have to scan a lot.
Direct reclaim (shrink or hard limit) shouldn't be affected here.

>
> -		shrink_mem_cgroup_zone(priority, &mz, sc);
> +		shrink_mem_cgroup_zone(epriority, &mz, sc);
>
>  		mem_cgroup_account_reclaim(root, memcg,
>  					   sc->nr_reclaimed - nr_reclaimed,

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12 190 00 Praha 9 Czech Republic From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932350Ab2AMQe2 (ORCPT ); Fri, 13 Jan 2012 11:34:28 -0500 Received: from cantor2.suse.de ([195.135.220.15]:36780 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758486Ab2AMQe0 (ORCPT ); Fri, 13 Jan 2012 11:34:26 -0500 Date: Fri, 13 Jan 2012 17:34:23 +0100 From: Michal Hocko To: Johannes Weiner Cc: Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120113163423.GG17060@tiehlicka.suse.cz> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120113155001.GB1653@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 13-01-12 16:50:01, Johannes Weiner wrote: > On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: > > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: [...] 
> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > > > + struct mem_cgroup *memcg) > > > +{ > > > + if (mem_cgroup_disabled()) > > > + return false; > > > + > > > + if (!root) > > > + root = root_mem_cgroup; > > > + > > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { > > > + /* root_mem_cgroup does not have a soft limit */ > > > + if (memcg == root_mem_cgroup) > > > + break; > > > + if (res_counter_soft_limit_excess(&memcg->res)) > > > + return true; > > > + if (memcg == root) > > > + break; > > > + } > > > + return false; > > > +} > > > > Well, this might be little bit tricky. We do not check whether memcg and > > root are in a hierarchy (in terms of use_hierarchy) relation. > > > > If we are under global reclaim then we iterate over all memcgs and so > > there is no guarantee that there is a hierarchical relation between the > > given memcg and its parent. While, on the other hand, if we are doing > > memcg reclaim then we have this guarantee. > > > > Why should we punish a group (subtree) which is perfectly under its soft > > limit just because some other subtree contributes to the common parent's > > usage and makes it over its limit? > > Should we check memcg->use_hierarchy here? > > We do, actually. parent_mem_cgroup() checks the res_counter parent, > which is only set when ->use_hierarchy is also set. Of course I am blind.. We do not setup res_counter parent for !use_hierarchy case. Sorry for noise... Now it makes much better sense. I was wondering how !use_hierarchy could ever work, this should be a signal that I am overlooking something terribly. [...] > > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, > > > .mem_cgroup = memcg, > > > .zone = zone, > > > }; > > > + int epriority = priority; > > > + /* > > > + * Put more pressure on hierarchies that exceed their > > > + * soft limit, to push them back harder than their > > > + * well-behaving siblings. 
> > > +		 */
> > > +		if (mem_cgroup_over_softlimit(root, memcg))
> > > +			epriority = 0;
> >
> > This sounds too aggressive to me. Shouldn't we just double the pressure
> > or something like that?
>
> That's the historical value. When I tried priority - 1, it was not
> aggressive enough.

Probably because we want to reclaim too much. Maybe we should reduce
nr_to_reclaim (ugly) or reclaim only overlimit groups until a certain
priority level, as Ying suggested in her patchset.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759255Ab2AMVbU (ORCPT ); Fri, 13 Jan 2012 16:31:20 -0500
Received: from mail-qw0-f46.google.com ([209.85.216.46]:50157 "EHLO mail-qw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759225Ab2AMVbR convert rfc822-to-8bit (ORCPT ); Fri, 13 Jan 2012 16:31:17 -0500
MIME-Version: 1.0
In-Reply-To: <20120112085904.GG24386@cmpxchg.org>
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org>
Date: Fri, 13 Jan 2012 13:31:16 -0800
Message-ID: 
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
From: Ying Han
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
X-System-Of-Record: true
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote:
> On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
>> > Right now, memcg soft limits are implemented by having a sorted tree
>> > of memcgs that are in excess of
their limits.  Under global memory >> > pressure, kswapd first reclaims from the biggest excessor and then >> > proceeds to do regular global reclaim.  The result of this is that >> > pages are reclaimed from all memcgs, but more scanning happens against >> > those above their soft limit. >> > >> > With global reclaim doing memcg-aware hierarchical reclaim by default, >> > this is a lot easier to implement: everytime a memcg is reclaimed >> > from, scan more aggressively (per tradition with a priority of 0) if >> > it's above its soft limit.  With the same end result of scanning >> > everybody, but soft limit excessors a bit more. >> > >> > Advantages: >> > >> >  o smoother reclaim: soft limit reclaim is a separate stage before >> >    global reclaim, whose result is not communicated down the line and >> >    so overreclaim of the groups in excess is very likely.  After this >> >    patch, soft limit reclaim is fully integrated into regular reclaim >> >    and each memcg is considered exactly once per cycle. >> > >> >  o true hierarchy support: soft limits are only considered when >> >    kswapd does global reclaim, but after this patch, targetted >> >    reclaim of a memcg will mind the soft limit settings of its child >> >    groups. >> >> Why we add soft limit reclaim into target reclaim? > >        -> A hard limit 10G, usage 10G >           -> A1 soft limit 8G, usage 5G >           -> A2 soft limit 2G, usage 5G > > When A hits its hard limit, A2 will experience more pressure than A1. > > Soft limits are already applied hierarchically: the memcg that is > picked from the tree is reclaimed hierarchically.  What I wanted to > add is the soft limit also being /triggerable/ from non-global > hierarchy levels. > >> Based on the discussions, my understanding is that the soft limit only >> take effect while the whole machine is under memory contention. 
We >> don't want to add extra pressure on a cgroup if there is free memory >> on the system even the cgroup is above its limit. > > If a hierarchy is under pressure, we will reclaim that hierarchy.  We > allow groups to be prioritized under global pressure, why not allow it > for local pressure as well? > > I am not quite sure what you are objecting to. > >> >  o code size: soft limit reclaim requires a lot of code to maintain >> >    the per-node per-zone rb-trees to quickly find the biggest >> >    offender, dedicated paths for soft limit reclaim etc. while this >> >    new implementation gets away without all that. >> > >> > Test: >> > >> > The test consists of two concurrent kernel build jobs in separate >> > source trees, the master and the slave.  The two jobs get along nicely >> > on 600MB of available memory, so this is the zero overcommit control >> > case.  When available memory is decreased, the overcommit is >> > compensated by decreasing the soft limit of the slave by the same >> > amount, in the hope that the slave takes the hit and the master stays >> > unaffected. 
>> > >> >                                    600M-0M-vanilla         600M-0M-patched >> > Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%) >> > Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%) >> > Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%) >> > Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%) >> > Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%) >> > Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%) >> > Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%) >> > Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%) >> > Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%) >> > Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%) >> > Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%) >> > Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%) >> > Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%) >> > Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%) >> > Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%) >> > Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%) >> > Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%) >> > Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%) >> > Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%) >> > Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%) >> > Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%) >> > Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%) >> > Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%) >> > Slave kswapd scan (stddev)        969.62 (  
+0.00%)       581.70 ( -39.97%) >> > >> > In the control case, the differences in elapsed time, number of major >> > faults taken, and reclaim statistics are within the noise for both the >> > master and the slave job. >> >> What's the soft limit setting in the controlled case? > > 300MB for both jobs. > >> I assume it is the default RESOURCE_MAX. So both Master and Slave get >> equal pressure before/after the patch, and no differences on the stats >> should be observed. > > Yes.  The control case demonstrates that both jobs can fit > comfortably, don't compete for space and that in general the patch > does not have unexpected negative impact (after all, it modifies > codepaths that were invoked regularly outside of reclaim). > >> >                                     600M-280M-vanilla      600M-280M-patched >> > Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%) >> > Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%) >> > Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%) >> > Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%) >> > Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%) >> > Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%) >> > Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%) >> > Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%) >> > Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%) >> > Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%) >> > Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%) >> > Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%) >> > Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%) >> > Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%) >> > Slave major 
faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%) >> > Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%) >> > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%) >> > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%) >> > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%) >> > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%) >> > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%) >> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%) >> > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%) >> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%) >> > >> > Here, the available memory is limited to 320 MB, the machine is >> > overcommitted by 280 MB.  The soft limit of the master is 300 MB, that >> > of the slave merely 20 MB. >> > >> > Looking at the slave job first, it is much better off with the patched >> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by >> > a third.  The result is much fewer major faults taken, which in turn >> > lets the job finish quicker. >> >> What's the setting of the hard limit here? Is the direct reclaim >> referring to per-memcg directly reclaim or global one. > > The machine's memory is limited to 600M, the hard limits are unset. > All reclaim is a result of global memory pressure. > > With the patched kernel, I could have used a dedicated parent cgroup > and let master and slave run in children of this group, the soft > limits would be taken into account just the same.  But this does not > work on the unpatched kernel, as soft limits are only recognized on > the global level there. 
> >> > It would be a zero-sum game if the improvement happened at the cost of >> > the master but looking at the numbers, even the master performs better >> > with the patched kernel.  In fact, the master job is almost unaffected >> > on the patched kernel compared to the control case. >> >> It makes sense since the master job get less affected by the patch >> than the slave job under the example. Under the control case, if both >> master and slave have RESOURCE_MAX soft limit setting, they are under >> equal memory pressure(priority = DEF_PRIORITY) . On the second >> example, only the slave pressure being increased by priority = 0, and >> the Master got scanned with same priority = DEF_PRIORITY pretty much. >> >> So I would expect to see more reclaim activities happens in slave on >> the patched kernel compared to the control case. It seems match the >> testing result. > > Uhm, > >> > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%) >> > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%) >> > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%) >> > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%) >> > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%) >> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%) >> > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%) >> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%) > > Direct reclaim _shrunk_ by 98%, kswapd reclaim by 31%. > >> > This is an odd phenomenon, as the patch does not directly change how >> > the master is reclaimed.  An explanation for this is that the severe >> > overreclaim of the slave in the unpatched kernel results in the master >> > growing bigger than in the patched case.  
Combining the fact that >> > memcgs are scanned according to their size with the increased refault >> > rate of the overreclaimed slave triggering global reclaim more often >> > means that overall pressure on the master job is higher in the >> > unpatched kernel. >> >> We can check the Master memory.usage_in_bytes while the job is running. > > Yep, the plots of cache/rss over time confirmed exactly this.  The > unpatched kernel shows higher spikes in the size of the master job > followed by deeper pits when reclaim kicked in.  The patched kernel is > much smoother in that regard. > >> On the other hand, I don't see why we expect the Master being less >> reclaimed in the controlled case? On the unpatched kernel, the Master >> is being reclaimed under global pressure each time anyway since we >> ignore the return value of softlimit. > > I didn't expect that, I expected both jobs to perform equally in the > control case.  And in the pressurized case, the master being > unaffected and the slave taking the hit.  The patched kernel does > this, the unpatched one does not. > >> > @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, >> >                                                      struct zone *zone); >> >  struct zone_reclaim_stat* >> >  mem_cgroup_get_reclaim_stat_from_page(struct page *page); >> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *); >> >> Maybe something like "mem_cgroup_over_soft_limit()" ? > > Probably more consistent, yeah.  Will do. > >> > @@ -343,7 +314,6 @@ static bool move_file(void) >> >  * limit reclaim to prevent infinite loops, if they ever occur. >> >  */ >> >  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100) >> > -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2) >> >> You might need to remove the comment above as well. > > Oops, will fix. 
> >> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) >> >        return margin >> PAGE_SHIFT; >> >  } >> > >> > +/** >> > + * mem_cgroup_over_softlimit >> > + * @root: hierarchy root >> > + * @memcg: child of @root to test >> > + * >> > + * Returns %true if @memcg exceeds its own soft limit or contributes >> > + * to the soft limit excess of one of its parents up to and including >> > + * @root. >> > + */ >> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >> > +                              struct mem_cgroup *memcg) >> > +{ >> > +       if (mem_cgroup_disabled()) >> > +               return false; >> > + >> > +       if (!root) >> > +               root = root_mem_cgroup; >> > + >> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) { >> > +               /* root_mem_cgroup does not have a soft limit */ >> > +               if (memcg == root_mem_cgroup) >> > +                       break; >> > +               if (res_counter_soft_limit_excess(&memcg->res)) >> > +                       return true; >> > +               if (memcg == root) >> > +                       break; >> > +       } >> >> Here it adds pressure on a cgroup if one of its parents exceeds soft >> limit, although the cgroup itself is under soft limit. It does change >> my understanding of soft limit, and might introduce regression of our >> existing use cases. >> >> Here is an example: >> >> Machine capacity 32G and we over-commit by 8G. >> >> root >>   -> A (hard limit 20G, soft limit 15G, usage 16G) >>        -> A1 (soft limit 5G, usage 4G) >>        -> A2 (soft limit 10G, usage 12G) >>   -> B (hard limit 20G, soft limit 10G, usage 16G) >> >> under global reclaim, we don't want to add pressure on A1 although its >> parent A exceeds its soft limit. Assume that if we set the soft limit >> corresponding to each cgroup's working set size (hot memory), and it >> will introduce regression to A1 in that case. 
>> >> In my existing implementation, i am checking the cgroup's soft limit >> standalone w/o looking its ancestors. > > Why do you set the soft limit of A in the first place if you don't > want it to be enforced? The soft limit should be enforced under certain condition, not always. The soft limit of A is set to be enforced when the parent of A and B is under memory pressure. For example: Machine capacity 32G and we over-commit by 8G root -> A (hard limit 20G, soft limit 12G, usage 20G)        -> A1 (soft limit 2G, usage 1G)        -> A2 (soft limit 10G, usage 19G) -> B (hard limit 20G, soft limit 10G, usage 0G) Now, A is under memory pressure since the total usage is hitting its hard limit. Then we start hierarchical reclaim under A, and each cgroup under A also takes consideration of soft limit. In this case, we should only set priority = 0 to A2 since it contributes to A's charge as well as exceeding its own soft limit. Why punishing A1 (set priority = 0) also which has usage under its soft limit ? I can imagine it will introduce regression to existing environment which the soft limit is set based on the working set size of the cgroup. To answer the question why we set soft limit to A, it is used to over-commit the host while sharing the resource with its sibling (B in this case). If the machine is under memory contention, we would like to push down memory to A or B depends on their usage and soft limit. --Ying > > This is not really new behaviour, soft limit reclaim has always been > operating hierarchically on the biggest excessor.  In your case, the > excess of A is smaller than the excess of A2 and so that weird "only > pick the biggest excessor" behaviour hides it, but consider this: > >        -> A soft 30G, usage 39G >           -> A1 soft 5G, usage 4G >           -> A2 soft 10G, usage 15G >           -> A3 soft 15G, usage 20G > > Upstream would pick A from the soft limit tree and reclaim its > children with priority 0, including A1. 
> > On the other hand, if you don't consider ancestral soft limits, you > break perfectly reasonable setups like these > >        -> A soft 10G, usage 20G >           -> A1 usage 10G >           -> A2 usage 10G >        -> B soft 10G, usage 11G > > where upstream would pick A and reclaim it recursively, but your > version would only apply higher pressure to B. > > If you would just not set the soft limit of A in your case: > >        -> A (hard limit 20G, usage 16G) >           -> A1 (soft limit 5G, usage 4G) >           -> A2 (soft limit 10G, usage 12G) >        -> B (hard limit 20G, soft limit 10G, usage 16G) > > only A2 and B would experience higher pressure upon global pressure. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759291Ab2AMVpe (ORCPT ); Fri, 13 Jan 2012 16:45:34 -0500 Received: from mail-qy0-f174.google.com ([209.85.216.174]:57102 "EHLO mail-qy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759269Ab2AMVpb convert rfc822-to-8bit (ORCPT ); Fri, 13 Jan 2012 16:45:31 -0500 MIME-Version: 1.0 In-Reply-To: <20120113163423.GG17060@tiehlicka.suse.cz> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> <20120113163423.GG17060@tiehlicka.suse.cz> Date: Fri, 13 Jan 2012 13:45:30 -0800 Message-ID: Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim From: Ying Han To: Michal Hocko Cc: Johannes Weiner , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org X-System-Of-Record: true Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko wrote: > On Fri 13-01-12 
16:50:01, Johannes Weiner wrote: >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: > [...] >> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >> > > +                        struct mem_cgroup *memcg) >> > > +{ >> > > + if (mem_cgroup_disabled()) >> > > +         return false; >> > > + >> > > + if (!root) >> > > +         root = root_mem_cgroup; >> > > + >> > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { >> > > +         /* root_mem_cgroup does not have a soft limit */ >> > > +         if (memcg == root_mem_cgroup) >> > > +                 break; >> > > +         if (res_counter_soft_limit_excess(&memcg->res)) >> > > +                 return true; >> > > +         if (memcg == root) >> > > +                 break; >> > > + } >> > > + return false; >> > > +} >> > >> > Well, this might be little bit tricky. We do not check whether memcg and >> > root are in a hierarchy (in terms of use_hierarchy) relation. >> > >> > If we are under global reclaim then we iterate over all memcgs and so >> > there is no guarantee that there is a hierarchical relation between the >> > given memcg and its parent. While, on the other hand, if we are doing >> > memcg reclaim then we have this guarantee. >> > >> > Why should we punish a group (subtree) which is perfectly under its soft >> > limit just because some other subtree contributes to the common parent's >> > usage and makes it over its limit? >> > Should we check memcg->use_hierarchy here? >> >> We do, actually.  parent_mem_cgroup() checks the res_counter parent, >> which is only set when ->use_hierarchy is also set. > > Of course I am blind.. We do not setup res_counter parent for > !use_hierarchy case. Sorry for noise... > Now it makes much better sense. I was wondering how !use_hierarchy could > ever work, this should be a signal that I am overlooking something > terribly. > > [...] 
>> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, >> > >                   .mem_cgroup = memcg, >> > >                   .zone = zone, >> > >           }; >> > > +         int epriority = priority; >> > > +         /* >> > > +          * Put more pressure on hierarchies that exceed their >> > > +          * soft limit, to push them back harder than their >> > > +          * well-behaving siblings. >> > > +          */ >> > > +         if (mem_cgroup_over_softlimit(root, memcg)) >> > > +                 epriority = 0; >> > >> > This sounds too aggressive to me. Shouldn't we just double the pressure >> > or something like that? >> >> That's the historical value.  When I tried priority - 1, it was not >> aggressive enough. > > Probably because we want to reclaim too much. Maybe we should do > reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain > priority level as Ying suggested in her patchset. I plan to post that change on top of this, and this patch set does the basic stuff to allow us doing further improvement. I still like the design to skip over_soft_limit cgroups until certain priority. One way to set up the soft limit for each cgroup is to base on its actual working set size, and we prefer to punish A first with lots of page cache ( cold file pages above soft limit) than reclaiming anon pages from B ( below soft limit ). Unless we can not get enough pages reclaimed from A, we will start reclaiming from B. This might not be the ideal solution, but should be a good start. Thoughts? --Ying > -- > Michal Hocko > SUSE Labs > SUSE LINUX s.r.o. 
> Lihovarska 1060/12 > 190 00 Praha 9 > Czech Republic From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932939Ab2AMWok (ORCPT ); Fri, 13 Jan 2012 17:44:40 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:33264 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932798Ab2AMWoj (ORCPT ); Fri, 13 Jan 2012 17:44:39 -0500 Date: Fri, 13 Jan 2012 23:44:24 +0100 From: Johannes Weiner To: Ying Han Cc: Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120113224424.GC1653@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote: > On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote: > > On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: > >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > >> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > >> >        return margin >> PAGE_SHIFT; > >> >  } > >> > > >> > +/** > >> > + * mem_cgroup_over_softlimit > >> > + * @root: hierarchy root > >> > + * @memcg: child of @root to test > >> > + * > >> > + * Returns %true if @memcg exceeds its own soft limit or contributes > >> > + * to the soft limit excess of one of its parents up to and including > >> > + * @root. 
> >> > + */ > >> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > >> > +                              struct mem_cgroup *memcg) > >> > +{ > >> > +       if (mem_cgroup_disabled()) > >> > +               return false; > >> > + > >> > +       if (!root) > >> > +               root = root_mem_cgroup; > >> > + > >> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) { > >> > +               /* root_mem_cgroup does not have a soft limit */ > >> > +               if (memcg == root_mem_cgroup) > >> > +                       break; > >> > +               if (res_counter_soft_limit_excess(&memcg->res)) > >> > +                       return true; > >> > +               if (memcg == root) > >> > +                       break; > >> > +       } > >> > >> Here it adds pressure on a cgroup if one of its parents exceeds soft > >> limit, although the cgroup itself is under soft limit. It does change > >> my understanding of soft limit, and might introduce regression of our > >> existing use cases. > >> > >> Here is an example: > >> > >> Machine capacity 32G and we over-commit by 8G. > >> > >> root > >>   -> A (hard limit 20G, soft limit 15G, usage 16G) > >>        -> A1 (soft limit 5G, usage 4G) > >>        -> A2 (soft limit 10G, usage 12G) > >>   -> B (hard limit 20G, soft limit 10G, usage 16G) > >> > >> under global reclaim, we don't want to add pressure on A1 although its > >> parent A exceeds its soft limit. Assume that if we set the soft limit > >> corresponding to each cgroup's working set size (hot memory), and it > >> will introduce regression to A1 in that case. > >> > >> In my existing implementation, i am checking the cgroup's soft limit > >> standalone w/o looking its ancestors. > > > > Why do you set the soft limit of A in the first place if you don't > > want it to be enforced? > > The soft limit should be enforced under certain condition, not always. > The soft limit of A is set to be enforced when the parent of A and B > is under memory pressure. 
For example: > > Machine capacity 32G and we over-commit by 8G > > root > -> A (hard limit 20G, soft limit 12G, usage 20G) >        -> A1 (soft limit 2G, usage 1G) >        -> A2 (soft limit 10G, usage 19G) > -> B (hard limit 20G, soft limit 10G, usage 0G) > > Now, A is under memory pressure since the total usage is hitting its > hard limit. Then we start hierarchical reclaim under A, and each > cgroup under A also takes consideration of soft limit. In this case, > we should only set priority = 0 to A2 since it contributes to A's > charge as well as exceeding its own soft limit. Why punishing A1 (set > priority = 0) also which has usage under its soft limit ? I can > imagine it will introduce regression to existing environment which the > soft limit is set based on the working set size of the cgroup. > > To answer the question why we set soft limit to A, it is used to > over-commit the host while sharing the resource with its sibling (B in > this case). If the machine is under memory contention, we would like > to push down memory to A or B depends on their usage and soft limit. D'oh, I think the problem is just that we walk up the hierarchy one too many when checking whether a group exceeds a soft limit. The soft limit is a signal to distribute pressure that comes from above, it's meaningless and should indeed be ignored on the level the pressure originates from. Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to but not including root, wouldn't that do exactly what we both want? Example: 1. If global memory is short, we reclaim with root=root_mem_cgroup. A1 and A2 get soft limit reclaimed because of A's soft limit excess, just like the current kernel would do. 2. If A hits its hard limit, we reclaim with root=A, so we only mind the soft limits of A1 and A2. A1 is below its soft limit, all good. A2 is above its soft limit, gets treated accordingly. This is new behaviour, the current kernel would just reclaim them equally. 
Code:

bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
				struct mem_cgroup *memcg)
{
	if (mem_cgroup_disabled())
		return false;

	if (!root)
		root = root_mem_cgroup;

	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		if (memcg == root)
			break;
		if (res_counter_soft_limit_excess(&memcg->res))
			return true;
	}
	return false;
}

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754408Ab2AQOWh (ORCPT ); Tue, 17 Jan 2012 09:22:37 -0500
Received: from mail-iy0-f174.google.com ([209.85.210.174]:42406 "EHLO mail-iy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753763Ab2AQOWf (ORCPT ); Tue, 17 Jan 2012 09:22:35 -0500
Message-ID: <4F158418.2090509@gmail.com>
Date: Tue, 17 Jan 2012 22:22:16 +0800
From: Sha
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110922 Thunderbird/3.1.15
MIME-Version: 1.0
To: Johannes Weiner
CC: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org>
	<1326207772-16762-3-git-send-email-hannes@cmpxchg.org>
	<20120112085904.GG24386@cmpxchg.org>
	<20120113224424.GC1653@cmpxchg.org>
In-Reply-To: <20120113224424.GC1653@cmpxchg.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 01/14/2012 06:44 AM, Johannes Weiner wrote:
> On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
>> On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote:
>>> On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
>>>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote:
>>>>> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>>>>>
return margin>> PAGE_SHIFT; >>>>> } >>>>> >>>>> +/** >>>>> + * mem_cgroup_over_softlimit >>>>> + * @root: hierarchy root >>>>> + * @memcg: child of @root to test >>>>> + * >>>>> + * Returns %true if @memcg exceeds its own soft limit or contributes >>>>> + * to the soft limit excess of one of its parents up to and including >>>>> + * @root. >>>>> + */ >>>>> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >>>>> + struct mem_cgroup *memcg) >>>>> +{ >>>>> + if (mem_cgroup_disabled()) >>>>> + return false; >>>>> + >>>>> + if (!root) >>>>> + root = root_mem_cgroup; >>>>> + >>>>> + for (; memcg; memcg = parent_mem_cgroup(memcg)) { >>>>> + /* root_mem_cgroup does not have a soft limit */ >>>>> + if (memcg == root_mem_cgroup) >>>>> + break; >>>>> + if (res_counter_soft_limit_excess(&memcg->res)) >>>>> + return true; >>>>> + if (memcg == root) >>>>> + break; >>>>> + } >>>> Here it adds pressure on a cgroup if one of its parents exceeds soft >>>> limit, although the cgroup itself is under soft limit. It does change >>>> my understanding of soft limit, and might introduce regression of our >>>> existing use cases. >>>> >>>> Here is an example: >>>> >>>> Machine capacity 32G and we over-commit by 8G. >>>> >>>> root >>>> -> A (hard limit 20G, soft limit 15G, usage 16G) >>>> -> A1 (soft limit 5G, usage 4G) >>>> -> A2 (soft limit 10G, usage 12G) >>>> -> B (hard limit 20G, soft limit 10G, usage 16G) >>>> >>>> under global reclaim, we don't want to add pressure on A1 although its >>>> parent A exceeds its soft limit. Assume that if we set the soft limit >>>> corresponding to each cgroup's working set size (hot memory), and it >>>> will introduce regression to A1 in that case. >>>> >>>> In my existing implementation, i am checking the cgroup's soft limit >>>> standalone w/o looking its ancestors. >>> Why do you set the soft limit of A in the first place if you don't >>> want it to be enforced? >> The soft limit should be enforced under certain condition, not always. 
>> The soft limit of A is set to be enforced when the parent of A and B >> is under memory pressure. For example: >> >> Machine capacity 32G and we over-commit by 8G >> >> root >> -> A (hard limit 20G, soft limit 12G, usage 20G) >> -> A1 (soft limit 2G, usage 1G) >> -> A2 (soft limit 10G, usage 19G) >> -> B (hard limit 20G, soft limit 10G, usage 0G) >> >> Now, A is under memory pressure since the total usage is hitting its >> hard limit. Then we start hierarchical reclaim under A, and each >> cgroup under A also takes consideration of soft limit. In this case, >> we should only set priority = 0 to A2 since it contributes to A's >> charge as well as exceeding its own soft limit. Why punishing A1 (set >> priority = 0) also which has usage under its soft limit ? I can >> imagine it will introduce regression to existing environment which the >> soft limit is set based on the working set size of the cgroup. >> >> To answer the question why we set soft limit to A, it is used to >> over-commit the host while sharing the resource with its sibling (B in >> this case). If the machine is under memory contention, we would like >> to push down memory to A or B depends on their usage and soft limit. > D'oh, I think the problem is just that we walk up the hierarchy one > too many when checking whether a group exceeds a soft limit. The soft > limit is a signal to distribute pressure that comes from above, it's > meaningless and should indeed be ignored on the level the pressure > originates from. > > Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to > but not including root, wouldn't that do exactly what we both want? > > Example: > > 1. If global memory is short, we reclaim with root=root_mem_cgroup. > A1 and A2 get soft limit reclaimed because of A's soft limit > excess, just like the current kernel would do. > > 2. If A hits its hard limit, we reclaim with root=A, so we only mind > the soft limits of A1 and A2. A1 is below its soft limit, all > good. 
A2 is above its soft limit, gets treated accordingly. This > is new behaviour, the current kernel would just reclaim them > equally. > > Code: > > bool mem_cgroup_over_soft_limit(struct mem_cgroup *root, > struct mem_cgroup *memcg) > { > if (mem_cgroup_disabled()) > return false; > > if (!root) > root = root_mem_cgroup; > > for (; memcg; memcg = parent_mem_cgroup(memcg)) { > if (memcg == root) > break; > if (res_counter_soft_limit_excess(&memcg->res)) > return true; > } > return false; > } Hi Johannes, I don't think it solve the root of the problem, example: root -> A (hard limit 20G, soft limit 12G, usage 20G) -> A1 ( soft limit 2G, usage 1G) -> A2 ( soft limit 10G, usage 19G) ->B1 (soft limit 5G, usage 4G) ->B2 (soft limit 5G, usage 15G) Now A is hitting its hard limit and start hierarchical reclaim under A. If we choose B1 to go through mem_cgroup_over_soft_limit, it will return true because its parent A2 has a large usage and will lead to priority=0 reclaiming. But in fact it should be B2 to be punished. IMHO, it may checking the cgroup's soft limit standalone without looking up its ancestors just as Ying said. 
Thanks, Sha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754622Ab2AQOyH (ORCPT ); Tue, 17 Jan 2012 09:54:07 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:33571 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754306Ab2AQOyG (ORCPT ); Tue, 17 Jan 2012 09:54:06 -0500 Date: Tue, 17 Jan 2012 15:53:48 +0100 From: Johannes Weiner To: Sha Cc: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120117145348.GA3144@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F158418.2090509@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: > On 01/14/2012 06:44 AM, Johannes Weiner wrote: > >On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote: > >>On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner wrote: > >>>On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: > >>>>On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner wrote: > >>>>>@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > >>>>> return margin>> PAGE_SHIFT; > >>>>> } > >>>>> > >>>>>+/** > >>>>>+ * mem_cgroup_over_softlimit > >>>>>+ * @root: hierarchy root > >>>>>+ * @memcg: child of @root to test > >>>>>+ * > >>>>>+ * Returns %true if @memcg exceeds its own soft limit or contributes > >>>>>+ * to the soft limit excess of one of its parents up to and including > >>>>>+ * @root. 
> >>>>>+ */ > >>>>>+bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > >>>>>+ struct mem_cgroup *memcg) > >>>>>+{ > >>>>>+ if (mem_cgroup_disabled()) > >>>>>+ return false; > >>>>>+ > >>>>>+ if (!root) > >>>>>+ root = root_mem_cgroup; > >>>>>+ > >>>>>+ for (; memcg; memcg = parent_mem_cgroup(memcg)) { > >>>>>+ /* root_mem_cgroup does not have a soft limit */ > >>>>>+ if (memcg == root_mem_cgroup) > >>>>>+ break; > >>>>>+ if (res_counter_soft_limit_excess(&memcg->res)) > >>>>>+ return true; > >>>>>+ if (memcg == root) > >>>>>+ break; > >>>>>+ } > >>>>Here it adds pressure on a cgroup if one of its parents exceeds soft > >>>>limit, although the cgroup itself is under soft limit. It does change > >>>>my understanding of soft limit, and might introduce regression of our > >>>>existing use cases. > >>>> > >>>>Here is an example: > >>>> > >>>>Machine capacity 32G and we over-commit by 8G. > >>>> > >>>>root > >>>> -> A (hard limit 20G, soft limit 15G, usage 16G) > >>>> -> A1 (soft limit 5G, usage 4G) > >>>> -> A2 (soft limit 10G, usage 12G) > >>>> -> B (hard limit 20G, soft limit 10G, usage 16G) > >>>> > >>>>under global reclaim, we don't want to add pressure on A1 although its > >>>>parent A exceeds its soft limit. Assume that if we set the soft limit > >>>>corresponding to each cgroup's working set size (hot memory), and it > >>>>will introduce regression to A1 in that case. > >>>> > >>>>In my existing implementation, i am checking the cgroup's soft limit > >>>>standalone w/o looking its ancestors. > >>>Why do you set the soft limit of A in the first place if you don't > >>>want it to be enforced? > >>The soft limit should be enforced under certain condition, not always. > >>The soft limit of A is set to be enforced when the parent of A and B > >>is under memory pressure. 
For example: > >> > >>Machine capacity 32G and we over-commit by 8G > >> > >>root > >>-> A (hard limit 20G, soft limit 12G, usage 20G) > >> -> A1 (soft limit 2G, usage 1G) > >> -> A2 (soft limit 10G, usage 19G) > >>-> B (hard limit 20G, soft limit 10G, usage 0G) > >> > >>Now, A is under memory pressure since the total usage is hitting its > >>hard limit. Then we start hierarchical reclaim under A, and each > >>cgroup under A also takes consideration of soft limit. In this case, > >>we should only set priority = 0 to A2 since it contributes to A's > >>charge as well as exceeding its own soft limit. Why punishing A1 (set > >>priority = 0) also which has usage under its soft limit ? I can > >>imagine it will introduce regression to existing environment which the > >>soft limit is set based on the working set size of the cgroup. > >> > >>To answer the question why we set soft limit to A, it is used to > >>over-commit the host while sharing the resource with its sibling (B in > >>this case). If the machine is under memory contention, we would like > >>to push down memory to A or B depends on their usage and soft limit. > >D'oh, I think the problem is just that we walk up the hierarchy one > >too many when checking whether a group exceeds a soft limit. The soft > >limit is a signal to distribute pressure that comes from above, it's > >meaningless and should indeed be ignored on the level the pressure > >originates from. > > > >Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to > >but not including root, wouldn't that do exactly what we both want? > > > >Example: > > > >1. If global memory is short, we reclaim with root=root_mem_cgroup. > > A1 and A2 get soft limit reclaimed because of A's soft limit > > excess, just like the current kernel would do. > > > >2. If A hits its hard limit, we reclaim with root=A, so we only mind > > the soft limits of A1 and A2. A1 is below its soft limit, all > > good. A2 is above its soft limit, gets treated accordingly. 
This > > is new behaviour, the current kernel would just reclaim them > > equally. > > > >Code: > > > >bool mem_cgroup_over_soft_limit(struct mem_cgroup *root, > > struct mem_cgroup *memcg) > >{ > > if (mem_cgroup_disabled()) > > return false; > > > > if (!root) > > root = root_mem_cgroup; > > > > for (; memcg; memcg = parent_mem_cgroup(memcg)) { > > if (memcg == root) > > break; > > if (res_counter_soft_limit_excess(&memcg->res)) > > return true; > > } > > return false; > >} > Hi Johannes, > > I don't think it solve the root of the problem, example: > root > -> A (hard limit 20G, soft limit 12G, usage 20G) > -> A1 ( soft limit 2G, usage 1G) > -> A2 ( soft limit 10G, usage 19G) > ->B1 (soft limit 5G, usage 4G) > ->B2 (soft limit 5G, usage 15G) > > Now A is hitting its hard limit and start hierarchical reclaim under A. > If we choose B1 to go through mem_cgroup_over_soft_limit, it will > return true because its parent A2 has a large usage and will lead to > priority=0 reclaiming. But in fact it should be B2 to be punished. Because A2 is over its soft limit, the whole hierarchy below it should be preferred over A1, so both B1 and B2 should be soft limit reclaimed to be consistent with behaviour at the root level. > IMHO, it may checking the cgroup's soft limit standalone without > looking up its ancestors just as Ying said. Again, this would be a regression as soft limits have been applied hierarchically forever. 
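To make the two positions in this exchange concrete, here is a minimal userspace sketch that replays Sha's example hierarchy against both predicates. It is only an illustration under stated assumptions, not kernel code: struct memcg_model, soft_limit_excess(), over_soft_limit(), over_own_soft_limit() and check_example() are hypothetical names, sizes are whole gigabytes, a soft limit of 0 stands for "no soft limit set", and the mem_cgroup_disabled() and NULL-root handling of the proposed function is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct mem_cgroup; sizes in GB. */
struct memcg_model {
	const char *name;
	struct memcg_model *parent;
	unsigned long soft_limit;	/* 0: no soft limit (model assumption) */
	unsigned long usage;
};

/* Sha's example hierarchy; "top" plays the role of root_mem_cgroup. */
struct memcg_model top = { "root", NULL, 0, 20 };
struct memcg_model A   = { "A",  &top, 12, 20 };
struct memcg_model A1  = { "A1", &A,    2,  1 };
struct memcg_model A2  = { "A2", &A,   10, 19 };
struct memcg_model B1  = { "B1", &A2,   5,  4 };
struct memcg_model B2  = { "B2", &A2,   5, 15 };

/* Stand-in for res_counter_soft_limit_excess(): usage above soft limit. */
unsigned long soft_limit_excess(const struct memcg_model *m)
{
	if (!m->soft_limit || m->usage <= m->soft_limit)
		return 0;
	return m->usage - m->soft_limit;
}

/* Hierarchical predicate: test memcg and its ancestors, excluding root. */
bool over_soft_limit(const struct memcg_model *root,
		     const struct memcg_model *memcg)
{
	for (; memcg; memcg = memcg->parent) {
		if (memcg == root)
			break;
		if (soft_limit_excess(memcg))
			return true;
	}
	return false;
}

/* Standalone predicate: only the group's own soft limit counts. */
bool over_own_soft_limit(const struct memcg_model *memcg)
{
	return soft_limit_excess(memcg) > 0;
}

/* Reclaim with root=A, as when A hits its hard limit. */
void check_example(void)
{
	assert(!over_soft_limit(&A, &A1));	/* A1 is below its soft limit */
	assert(over_soft_limit(&A, &A2));	/* A2 exceeds its own limit */
	assert(over_soft_limit(&A, &B1));	/* inherited from parent A2 */
	assert(over_soft_limit(&A, &B2));
	assert(!over_own_soft_limit(&B1));	/* standalone check spares B1 */
	assert(over_own_soft_limit(&B2));
}
```

Calling check_example() from a main() passes silently. With root=A the hierarchical predicate flags B1 because of A2's excess, which is the behaviour defended above, while the standalone check flags only B2 among A2's children, which is what Sha and Ying ask for.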
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755815Ab2AQUZe (ORCPT ); Tue, 17 Jan 2012 15:25:34 -0500 Received: from mail-qw0-f53.google.com ([209.85.216.53]:54922 "EHLO mail-qw0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753396Ab2AQUZc convert rfc822-to-8bit (ORCPT ); Tue, 17 Jan 2012 15:25:32 -0500 MIME-Version: 1.0 In-Reply-To: <20120117145348.GA3144@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> Date: Tue, 17 Jan 2012 12:25:31 -0800 Message-ID: Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim From: Ying Han To: Johannes Weiner Cc: Sha , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org X-System-Of-Record: true Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner wrote: > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: >> On 01/14/2012 06:44 AM, Johannes Weiner wrote: >> >On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote: >> >>On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner  wrote: >> >>>On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote: >> >>>>On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner  wrote: >> >>>>>@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) >> >>>>>        return margin>>  PAGE_SHIFT; >> >>>>>  } >> >>>>> >> >>>>>+/** >> >>>>>+ * mem_cgroup_over_softlimit >> >>>>>+ * @root: hierarchy root >> >>>>>+ * @memcg: child of @root to test >> >>>>>+ * >> >>>>>+ * Returns %true if @memcg 
exceeds its own soft limit or contributes >> >>>>>+ * to the soft limit excess of one of its parents up to and including >> >>>>>+ * @root. >> >>>>>+ */ >> >>>>>+bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >> >>>>>+                              struct mem_cgroup *memcg) >> >>>>>+{ >> >>>>>+       if (mem_cgroup_disabled()) >> >>>>>+               return false; >> >>>>>+ >> >>>>>+       if (!root) >> >>>>>+               root = root_mem_cgroup; >> >>>>>+ >> >>>>>+       for (; memcg; memcg = parent_mem_cgroup(memcg)) { >> >>>>>+               /* root_mem_cgroup does not have a soft limit */ >> >>>>>+               if (memcg == root_mem_cgroup) >> >>>>>+                       break; >> >>>>>+               if (res_counter_soft_limit_excess(&memcg->res)) >> >>>>>+                       return true; >> >>>>>+               if (memcg == root) >> >>>>>+                       break; >> >>>>>+       } >> >>>>Here it adds pressure on a cgroup if one of its parents exceeds soft >> >>>>limit, although the cgroup itself is under soft limit. It does change >> >>>>my understanding of soft limit, and might introduce regression of our >> >>>>existing use cases. >> >>>> >> >>>>Here is an example: >> >>>> >> >>>>Machine capacity 32G and we over-commit by 8G. >> >>>> >> >>>>root >> >>>>   ->  A (hard limit 20G, soft limit 15G, usage 16G) >> >>>>        ->  A1 (soft limit 5G, usage 4G) >> >>>>        ->  A2 (soft limit 10G, usage 12G) >> >>>>   ->  B (hard limit 20G, soft limit 10G, usage 16G) >> >>>> >> >>>>under global reclaim, we don't want to add pressure on A1 although its >> >>>>parent A exceeds its soft limit. Assume that if we set the soft limit >> >>>>corresponding to each cgroup's working set size (hot memory), and it >> >>>>will introduce regression to A1 in that case. >> >>>> >> >>>>In my existing implementation, i am checking the cgroup's soft limit >> >>>>standalone w/o looking its ancestors. 
>> >>>Why do you set the soft limit of A in the first place if you don't >> >>>want it to be enforced? >> >>The soft limit should be enforced under certain condition, not always. >> >>The soft limit of A is set to be enforced when the parent of A and B >> >>is under memory pressure. For example: >> >> >> >>Machine capacity 32G and we over-commit by 8G >> >> >> >>root >> >>->  A (hard limit 20G, soft limit 12G, usage 20G) >> >>        ->  A1 (soft limit 2G, usage 1G) >> >>        ->  A2 (soft limit 10G, usage 19G) >> >>->  B (hard limit 20G, soft limit 10G, usage 0G) >> >> >> >>Now, A is under memory pressure since the total usage is hitting its >> >>hard limit. Then we start hierarchical reclaim under A, and each >> >>cgroup under A also takes consideration of soft limit. In this case, >> >>we should only set priority = 0 to A2 since it contributes to A's >> >>charge as well as exceeding its own soft limit. Why punishing A1 (set >> >>priority = 0) also which has usage under its soft limit ? I can >> >>imagine it will introduce regression to existing environment which the >> >>soft limit is set based on the working set size of the cgroup. >> >> >> >>To answer the question why we set soft limit to A, it is used to >> >>over-commit the host while sharing the resource with its sibling (B in >> >>this case). If the machine is under memory contention, we would like >> >>to push down memory to A or B depends on their usage and soft limit. >> >D'oh, I think the problem is just that we walk up the hierarchy one >> >too many when checking whether a group exceeds a soft limit.  The soft >> >limit is a signal to distribute pressure that comes from above, it's >> >meaningless and should indeed be ignored on the level the pressure >> >originates from. >> > >> >Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to >> >but not including root, wouldn't that do exactly what we both want? >> > >> >Example: >> > >> >1. 
If global memory is short, we reclaim with root=root_mem_cgroup. >> >    A1 and A2 get soft limit reclaimed because of A's soft limit >> >    excess, just like the current kernel would do. >> > >> >2. If A hits its hard limit, we reclaim with root=A, so we only mind >> >    the soft limits of A1 and A2.  A1 is below its soft limit, all >> >    good.  A2 is above its soft limit, gets treated accordingly.  This >> >    is new behaviour, the current kernel would just reclaim them >> >    equally. >> > >> >Code: >> > >> >bool mem_cgroup_over_soft_limit(struct mem_cgroup *root, >> >                             struct mem_cgroup *memcg) >> >{ >> >     if (mem_cgroup_disabled()) >> >             return false; >> > >> >     if (!root) >> >             root = root_mem_cgroup; >> > >> >     for (; memcg; memcg = parent_mem_cgroup(memcg)) { >> >             if (memcg == root) >> >                     break; >> >             if (res_counter_soft_limit_excess(&memcg->res)) >> >                     return true; >> >     } >> >     return false; >> >} >> Hi Johannes, >> >> I don't think it solve the root of the problem, example: >> root >> -> A (hard limit 20G, soft limit 12G, usage 20G) >>     -> A1 ( soft limit 2G,   usage 1G) >>     -> A2 ( soft limit 10G, usage 19G) >>            ->B1 (soft limit 5G, usage 4G) >>            ->B2 (soft limit 5G, usage 15G) >> >> Now A is hitting its hard limit and start hierarchical reclaim under A. >> If we choose B1 to go through mem_cgroup_over_soft_limit, it will >> return true because its parent A2 has a large usage and will lead to >> priority=0 reclaiming. But in fact it should be B2 to be punished. > > Because A2 is over its soft limit, the whole hierarchy below it should > be preferred over A1, so both B1 and B2 should be soft limit reclaimed > to be consistent with behaviour at the root level. > >> IMHO, it may checking the cgroup's soft limit standalone without >> looking up its ancestors just as Ying said. 
> > Again, this would be a regression as soft limits have been applied
> > hierarchically forever.

If we are comparing it to the current implementation, I agree that the
soft reclaim is applied hierarchically. In the example above, A2 will
be picked for soft reclaim while A is hitting its hard limit, which in
turn reclaims from B1 and B2 regardless of their soft limit settings.
However, I haven't convinced myself this is how we are gonna use the
soft limit.

The soft limit setting for each cgroup is a hint for applying pressure
under memory contention. One way of setting the soft limit is based on
the cgroup's working set size. Thus, we allow a cgroup to grow above its
soft limit with cold page cache unless memory pressure comes from
above. Under hierarchical reclaim, we will examine the soft limit and
only apply extra pressure to the ones above their soft limit. Here is
the same example:

root
-> A (hard limit 20G, soft limit 12G, usage 20G)
   -> A1 (soft limit 2G, usage 1G)
   -> A2 (soft limit 10G, usage 19G)
          -> B1 (soft limit 5G, usage 4G)
          -> B2 (soft limit 5G, usage 15G)

If A is hitting its hard limit, we will reclaim all the children under
A hierarchically but only add extra pressure to the ones above
their soft limits (A2, B2). Adding extra pressure to B1 will introduce
a known regression based on customer expectation, since the 4G of usage
is hot memory.

I am not aware of how the existing soft reclaim is being used; I bet
there are not a lot of users. If we are making changes to the current
implementation, we should also take the opportunity to think about the
initial design as well. Thoughts?
--Ying From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756080Ab2AQV4l (ORCPT ); Tue, 17 Jan 2012 16:56:41 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:49086 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755118Ab2AQV4k (ORCPT ); Tue, 17 Jan 2012 16:56:40 -0500 Date: Tue, 17 Jan 2012 22:56:26 +0100 From: Johannes Weiner To: Ying Han Cc: Sha , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120117215626.GA2380@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 12:25:31PM -0800, Ying Han wrote: > On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner wrote: > > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: > >> IMHO, it may checking the cgroup's soft limit standalone without > >> looking up its ancestors just as Ying said. > > > > Again, this would be a regression as soft limits have been applied > > hierarchically forever. > > If we are comparing it to the current implementation, agree that the > soft reclaim is applied hierarchically. In the example above, A2 will > be picked for soft reclaim while A is hitting its hard limit, which in > turns reclaim from B1 and B2 regardless of their soft limit setting. > However, I haven't convinced myself this is how we are gonna use the > soft limit. 
Of course I'm comparing it to the current implementation, this is what
I'm changing!

> The soft limit setting for each cgroup is a hit for applying pressure
> under memory contention. One way of setting the soft limit is based on
> the cgroup's working set size. Thus, we allow cgroup to grow above its
> soft limit with cold page cache unless there is a memory pressure
> comes from above. Under the hierarchical reclaim, we will exam the
> soft limit and only apply extra pressure to the ones above their soft
> limit. Here the same example:
>
> root
> -> A (hard limit 20G, soft limit 12G, usage 20G)
>    -> A1 ( soft limit 2G, usage 1G)
>    -> A2 ( soft limit 10G, usage 19G)
>
>           ->B1 (soft limit 5G, usage 4G)
>           ->B2 (soft limit 5G, usage 15G)
>
> If A is hitting its hard limit, we will reclaim all the children under
> A hierarchically but only adding extra pressure to the ones above
> their soft limits (A2, B2). Adding extra pressure to B1 will introduce
> known regression based on customer expectation since the 4G usage are
> hot memory.

I can only repeat myself: A has a soft limit set, so the customer
expects global pressure to arise sooner or later. If that happens, A
will be soft-limit reclaimed hierarchically in the _existing code_.
That's how the soft limit currently works and I don't mean to change
it _with this patch_. The customer has to expect that B1 can be
reclaimed as a consequence of the soft limit in A or A2 today, so I
don't know where this expectation of different behaviour should even
come from. How can this be a regression?!

> I am not aware of how the existing soft reclaim being used, i bet
> there are not a lot. If we are making changes on the current
> implementation, we should also take the opportunity to think about the
> initial design as well. Thoughts?

I agree that these semantics should be up for debate.
And I think changing it to something like you have in mind is indeed a
good idea; to not have soft limits apply hierarchically but instead
follow down the whole chain and only soft limit reclaim those that are
themselves above their soft limit. But it's an entirely different
matter!

This patch is supposed to do only two things: 1. refactor the soft
limit implementation, staying as close as possible/practical to the
current semantics and 2. fix the inconsistency that soft limits are
ignored when pressure does not originate at the root_mem_cgroup. If
that is too much change in semantics I can easily ditch 2. I just
didn't see the use of re-adding more code to maintain an inconsistency
that resulted purely from the limitations of the current
implementation, and I think that the new behaviour would not be
surprising. It would be as simple as adding an extra check in reclaim
that only minds soft limits upon global pressure:

	if (global_reclaim(sc) && mem_cgroup_over_soft_limit(root, memcg))
		/* resulting action */

and it would have nothing to do with how soft limits are actually
applied once triggered. I can include this in the next version, but it
won't fix the problem you seem to be having with the _existing_
behaviour.

I also don't think that my patch will get in the way of what you are
planning to do: in fact, you already have code that easily turns
mem_cgroup_over_soft_limit() into a non-hierarchical predicate. Even
more will change when you invert the soft limits to become actual
guarantees and skip reclaiming memcgs that are below their soft limits,
but I don't think this patch is in the way of doing that, either.

I feel that these are all orthogonal changes. So if possible, could we
take just one step at a time and leave hypothetical behaviour out of it
unless the proposed changes clearly get in the way of where we agreed
we want to go?
If I misunderstood everything completely and you actually believe this patch will get in the way, could you tell me where and how? Thanks. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932115Ab2AQXjI (ORCPT ); Tue, 17 Jan 2012 18:39:08 -0500 Received: from mail-qy0-f174.google.com ([209.85.216.174]:57310 "EHLO mail-qy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756261Ab2AQXjF convert rfc822-to-8bit (ORCPT ); Tue, 17 Jan 2012 18:39:05 -0500 MIME-Version: 1.0 In-Reply-To: <20120117215626.GA2380@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> <20120117215626.GA2380@cmpxchg.org> Date: Tue, 17 Jan 2012 15:39:04 -0800 Message-ID: Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim From: Ying Han To: Johannes Weiner Cc: Sha , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org X-System-Of-Record: true Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 17, 2012 at 1:56 PM, Johannes Weiner wrote: > On Tue, Jan 17, 2012 at 12:25:31PM -0800, Ying Han wrote: >> On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner wrote: >> > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote: >> >> IMHO, it may checking the cgroup's soft limit standalone without >> >> looking up its ancestors just as Ying said. >> > >> > Again, this would be a regression as soft limits have been applied >> > hierarchically forever. 
>> >> If we are comparing it to the current implementation, agree that the >> soft reclaim is applied hierarchically. In the example above, A2 will >> be picked for soft reclaim while A is hitting its hard limit, which in >> turns reclaim from B1 and B2 regardless of their soft limit setting. >> However, I haven't convinced myself this is how we are gonna use the >> soft limit. > > Of course I'm comparing it to the current implementation, this is what > I'm changing! Thank you for clarifying it. Apparently i confused myself by comparing this patch with the one I had last time. >> The soft limit setting for each cgroup is a hit for applying pressure >> under memory contention. One way of setting the soft limit is based on >> the cgroup's working set size. Thus, we allow cgroup to grow above its >> soft limit with cold page cache unless there is a memory pressure >> comes from above. Under the hierarchical reclaim, we will exam the >> soft limit and only apply extra pressure to the ones above their soft >> limit. Here the same example: >> >> root >> -> A (hard limit 20G, soft limit 12G, usage 20G) >>    -> A1 ( soft limit 2G,   usage 1G) >>    -> A2 ( soft limit 10G, usage 19G) >> >>           ->B1 (soft limit 5G, usage 4G) >>           ->B2 (soft limit 5G, usage 15G) >> >> If A is hitting its hard limit, we will reclaim all the children under >> A hierarchically but only adding extra pressure to the ones above >> their soft limits (A2, B2). Adding extra pressure to B1 will introduce >> known regression based on customer expectation since the 4G usage are >> hot memory. > > I can only repeat myself: A has a soft limit set, so the customer > expects global pressure to arise sooner or later.  If that happens, A > will be soft-limit reclaimed hierarchically in the _existing code_. > That's how the soft limit currently works and I don't mean to change > it _with this patch_.  
The customer has to expect that B1 can be > reclaimed as a consequence of the soft limit in A or A2 today, so I > don't know where this expectation of different behaviour should even > come from.  How can this be a regression?! sorry for the confusion :( I wasn't comparing this patch to the current implementation, maybe I should. If the goal of this patch set is to bring as close as possible to the current implementation, I don't have objections. > >> I am not aware of how the existing soft reclaim being used, i bet >> there are not a lot. If we are making changes on the current >> implementation, we should also take the opportunity to think about the >> initial design as well. Thoughts? > > I agree that these semantics should be up for debate.  And I think > changing it to something like you have in mind is indeed a good idea; > to not have soft limits apply hierarchically but instead follow down > the whole chain and only soft limit reclaim those that are themselves > above their soft limit.  But it's an entirely different matter! thanks, agree that patch could come after this. > > This patch is supposed to do only two things: 1. refactor the soft > limit implementation, staying as close as possible/practical to the > current semantics and 2. fix the inconsistency that soft limits are > ignored when pressure does not originate at the root_mem_cgroup.  If > that is too much change in semantics I can easily ditch 2., It would be nice to split the two into separate patches. The second which adds soft reclaim into per-memcg reclaim is a new functionality from the current implementation. I just > didn't see the use of maintaining an inconsistency that resulted > purely from the limitations of the current implementation by re-adding > more code and because I think that this would not be surprising > behaviour. agree. 
It would be as simple as adding an extra check in reclaim > that only minds soft limits upon global pressure: > >        if (global_reclaim(sc) && mem_cgroup_over_soft_limit(root, memcg)) >                /* resulting action */ > > and it would have nothing to do how soft limits are actually applied > once triggered.  I can include this in the next version, but it won't > fix the problem you seem to be having with the _existing_ behaviour. No, it won't solve all the problems but close. It looks pretty much as what I have, except the priority part. We can leave it for the following patch to further improve soft limit reclaim. I have no strong opinion to whether include the global_reclaim() or not, however it might bring your patch closer to the _existing_ implementation. (considers soft reclaim only under global reclaim ). > I also don't think that my patch will get in the way of what you are > planning to do: in fact, you already have code that easily turns > mem_cgroup_over_soft_limit() into a non-hierarchical predicate. > > Even more will change when you invert the soft limits to become actual > guarantees and skip reclaiming memcgs that are below their soft limits > but I don't think this patch is in the way of doing that, either. > I feel that these are all orthogonal changes.  So if possible, could > we take just one step at a time and leave hypothetical behaviour out > of it unless the proposed changes clearly get in the way of where we > agreed we want to go? > > If I misunderstood everything completely and you actually believe this > patch will get in the way, could you tell me where and how? The change by itself is easy to apply on top of yours. The hierarchical part took some of my time to understand, which now is clear to make it as close as possible to the _existing_ code. Feel free to post the updated patch whenever they are ready. Thanks --Ying > Thanks. 
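As a rough illustration of the check discussed above, the following userspace C sketch models gating the soft limit on global pressure and shows why dropping to priority 0 is so much stronger than priority - 1. It assumes that the scan target for an LRU list is the list size shifted right by the priority and that DEF_PRIORITY is 12, as in mm/vmscan.c of that period. effective_priority() and scan_target() are hypothetical helper names, not kernel functions, and the two boolean parameters merely stand in for global_reclaim(sc) and mem_cgroup_over_soft_limit(root, memcg).

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12	/* assumed default (lowest-pressure) reclaim priority */

/*
 * Hypothetical helper: pick the priority used for one memcg.
 * global_reclaim stands in for global_reclaim(sc), over_soft_limit
 * for mem_cgroup_over_soft_limit(root, memcg).
 */
int effective_priority(int priority, bool global_reclaim, bool over_soft_limit)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	if (global_reclaim && over_soft_limit)
		return 0;	/* historical soft limit value: maximum pressure */
	return priority;
}

/*
 * Hypothetical helper: number of pages to scan on one LRU list,
 * assuming the list size is shifted right by the priority.
 */
unsigned long scan_target(unsigned long lru_pages, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return lru_pages >> priority;
}
```

With a list of 2^20 pages, scan_target() yields 256 pages at priority 12, 512 pages at priority 11 and the whole list at priority 0: a single priority step only doubles the pressure, whereas priority 0 scans everything. Dropping the global_reclaim argument from the condition gives the variant in which soft limits are also honoured for pressure that originates below root_mem_cgroup.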
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757142Ab2ARF2A (ORCPT ); Wed, 18 Jan 2012 00:28:00 -0500 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:55405 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750790Ab2ARF16 (ORCPT ); Wed, 18 Jan 2012 00:27:58 -0500 X-SecurityPolicyCheck-FJ: OK by FujitsuOutboundMailChecker v1.3.1 Date: Wed, 18 Jan 2012 14:26:38 +0900 From: KAMEZAWA Hiroyuki To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , Balbir Singh , Ying Han , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-Id: <20120118142638.11667d2c.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20120113121645.GA1653@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112105427.4b80437b.kamezawa.hiroyu@jp.fujitsu.com> <20120113121645.GA1653@cmpxchg.org> Organization: FUJITSU Co. LTD. X-Mailer: Sylpheed 3.1.1 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 13 Jan 2012 13:16:56 +0100 Johannes Weiner wrote: > On Thu, Jan 12, 2012 at 10:54:27AM +0900, KAMEZAWA Hiroyuki wrote: > > Thank you for your work and the result seems atractive and code is much > > simpler. My small concerns are.. > > > > 1. This approach may increase latency of direct-reclaim because of priority=0. > > I think strictly speaking yes, but note that with kswapd being less > likely to get stuck in hammering on one group, the need for allocators > to enter direct reclaim itself is reduced. 
> > However, if this really becomes a problem in real world loads, the fix > is pretty easy: just ignore the soft limit for direct reclaim. We can > still consider it from hard limit reclaim and kswapd. > > > 2. When a NUMA-spread/interleaved application runs in its own container, > > pages on a node may be paged out again and again because of priority=0 > > if some other application runs on the node. > > It seems difficult to use soft-limit with numa-aware applications. > > Do you have suggestions? > > This is a question about soft limits in general rather than about this > particular patch, right? > Partially, yes. My concern is related to "1". Assume an application is bound to some cpu/node and tries to allocate memory. If its memcg's usage is over the soft limit, this application will perform badly because newly allocated memory will soon become a reclaim target again.... > And if I understand correctly, the problem you are referring to is > this: an application and parts of a soft-limited container share a > node, the soft limit setting means that the container's pages on that > node are reclaimed harder. At that point, the container's share on > that node becomes tiny, but since the soft limit is oblivious to > nodes, the expansion of the other application pushes the soft-limited > container off that node completely as long as the container stays > above its soft limit with the usage on other nodes. > > What would you think about having node-local soft limits that take the > node size into account? > > local_soft_limit = soft_limit * node_size / memcg_size > > The soft limit can be exceeded globally, but the container is no > longer pushed off a node on which it's only occupying a small share of > memory. > Yes, I think this kind of care is required. What is the 'node_size' here? The size of the pgdat? The size of the per-node usage in the memcg? > Putting it into proportion of the memcg size, not overall memory size > has the following advantages: > > 1.
if the container is sitting on only one of several available > nodes without exceeding the limit globally, the memcg will not be > reclaimed harder just because it has a relatively large share of the > node. > > 2. if the soft limit excess is ridiculously high, the local soft > limits will be pushed down, so the tolerance for smaller shares on > nodes goes down in proportion to the global soft limit excess. > > Example: > > 4G soft limit * 2G node / 4G container = 2G node-local limit > > The container is globally within its soft limit, so the local limit is > at least the size of the node. It's never reclaimed harder compared > to other applications on the node. > > 4G soft limit * 2G node / 5G container = ~1.6G node-local limit > > Here, it will experience more pressure initially, but it will level > off when the shrinking usage and the thereby increasing node-local > soft limit meet. From that point on, the container and the competing > application will be treated equally during reclaim. > > Finally, if the container is 16G in size, i.e. 300% in excess, the > per-node tolerance is at 512M node-local soft limit, which IMO strikes > a good balance between zero tolerance and still applying some stress > to the hugely oversized container when other applications (with > virtually unlimited soft limits) want to run on the same node. > > What do you think? I like the idea. Another idea is changing 'priority' based on per-node stats if not too complicated... 
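The node-local soft limit arithmetic quoted above can be checked with a small userspace C sketch. This is only an illustration of the proposed formula, not kernel code: local_soft_limit_mb() is a hypothetical name, all values are in megabytes to keep the intermediate product small, and node_size is left as a plain parameter because the thread does not settle what it should measure.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper implementing the proposal from the thread:
 *
 *	local_soft_limit = soft_limit * node_size / memcg_size
 *
 * All arguments and the result are in megabytes; the division
 * truncates towards zero.
 */
uint64_t local_soft_limit_mb(uint64_t soft_limit_mb, uint64_t node_size_mb,
			     uint64_t memcg_size_mb)
{
	assert(memcg_size_mb > 0);	/* an empty memcg has nothing to reclaim */
	return soft_limit_mb * node_size_mb / memcg_size_mb;
}
```

For a 4G soft limit and a 2G node, this gives 2048M for a 4G container, 1638M (roughly 1.6G) for a 5G container and 512M for a 16G container, matching the three examples in the quoted text.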
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757233Ab2ARJZ1 (ORCPT ); Wed, 18 Jan 2012 04:25:27 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:41613 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757028Ab2ARJZX (ORCPT ); Wed, 18 Jan 2012 04:25:23 -0500 Date: Wed, 18 Jan 2012 10:25:09 +0100 From: Johannes Weiner To: Sha Cc: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120118092509.GI24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 18, 2012 at 03:17:25PM +0800, Sha wrote: > > > I don't think it solve the root of the problem, example: > > > root > > > -> A (hard limit 20G, soft limit 12G, usage 20G) > > > -> A1 ( soft limit 2G, usage 1G) > > > -> A2 ( soft limit 10G, usage 19G) > > > ->B1 (soft limit 5G, usage 4G) > > > ->B2 (soft limit 5G, usage 15G) > > > > > > Now A is hitting its hard limit and start hierarchical reclaim under A. > > > If we choose B1 to go through mem_cgroup_over_soft_limit, it will > > > return true because its parent A2 has a large usage and will lead to > > > priority=0 reclaiming. But in fact it should be B2 to be punished. 
> > > Because A2 is over its soft limit, the whole hierarchy below it should > > be preferred over A1, so both B1 and B2 should be soft limit reclaimed > > to be consistent with behaviour at the root level. > > Well it is just the behavior that I'm expecting actually. But with my > humble comprehension, I can't catch the soft-limit-based hierarchical > reclaiming under the target cgroup (A2) in the current implementation > or after the patch. Both the current mem_cgroup_soft_reclaim or > shrink_zone select victim sub-cgroup by mem_cgroup_iter, but it > doesn't take soft limit into consideration, do I left anything ? No, currently soft limits are ignored if pressure originates from below root_mem_cgroup. But iff soft limits are applied right now, they are applied hierarchically, see mem_cgroup_soft_limit_reclaim(). In my opinion, the fact that soft limits are ignored when pressure is triggered sub-root_mem_cgroup is an artifact of the per-zone tree, so I allowed soft limits to be taken into account below root_mem_cgroup. But IMO, this is something different from how soft limit reclaim is applied once triggered: currently, soft limit reclaim applies to a whole hierarchy, including all children. And this I left unchanged. 
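The hierarchical behaviour described above can be sketched in userspace C as follows. The walk is modelled on the mem_cgroup_over_softlimit() hunk quoted elsewhere in this thread, but struct memcg, its fields and both function names here are simplified stand-ins rather than the kernel's definitions, and sizes are in whole gigabytes. The tree is the example from the discussion (A, A1, A2, B1, B2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a memory cgroup; not the kernel structure. */
struct memcg {
	const char *name;
	struct memcg *parent;		/* NULL for the top-level root */
	unsigned long usage_gb;
	unsigned long soft_limit_gb;
};

/* Example hierarchy from the thread; the root has no soft limit. */
struct memcg root_cg = { "root", NULL, 20, 0 };
struct memcg A  = { "A",  &root_cg, 20, 12 };
struct memcg A1 = { "A1", &A,  1,  2 };
struct memcg A2 = { "A2", &A, 19, 10 };
struct memcg B1 = { "B1", &A2, 4,  5 };
struct memcg B2 = { "B2", &A2, 15, 5 };

/* Non-hierarchical check: does this group exceed its own soft limit? */
bool exceeds_own_soft_limit(const struct memcg *memcg)
{
	assert(memcg != NULL);
	return memcg->usage_gb > memcg->soft_limit_gb;
}

/*
 * Hierarchical check: true if memcg or any ancestor up to and
 * including the reclaim root exceeds its soft limit. A NULL root
 * means the walk goes all the way up to the top-level root.
 */
bool over_soft_limit(const struct memcg *root, const struct memcg *memcg)
{
	for (; memcg; memcg = memcg->parent) {
		if (memcg->parent == NULL)	/* top-level root: no soft limit */
			break;
		if (exceeds_own_soft_limit(memcg))
			return true;
		if (memcg == root)
			break;
	}
	return false;
}
```

In this model over_soft_limit(&A, &B1) is true because the parent A2 is over its soft limit, even though exceeds_own_soft_limit(&B1) is false. Swapping one predicate for the other is the difference between the hierarchical semantics kept by the patch and the standalone check suggested in the thread.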
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757331Ab2ARJpd (ORCPT ); Wed, 18 Jan 2012 04:45:33 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39928 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757286Ab2ARJpa (ORCPT ); Wed, 18 Jan 2012 04:45:30 -0500 Date: Wed, 18 Jan 2012 10:45:23 +0100 From: Johannes Weiner To: Ying Han Cc: Michal Hocko , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120118094523.GJ24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> <20120113163423.GG17060@tiehlicka.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 13, 2012 at 01:45:30PM -0800, Ying Han wrote: > On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko wrote: > > On Fri 13-01-12 16:50:01, Johannes Weiner wrote: > >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: > >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: > > [...] 
> >> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, > >> > > +                        struct mem_cgroup *memcg) > >> > > +{ > >> > > + if (mem_cgroup_disabled()) > >> > > +         return false; > >> > > + > >> > > + if (!root) > >> > > +         root = root_mem_cgroup; > >> > > + > >> > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { > >> > > +         /* root_mem_cgroup does not have a soft limit */ > >> > > +         if (memcg == root_mem_cgroup) > >> > > +                 break; > >> > > +         if (res_counter_soft_limit_excess(&memcg->res)) > >> > > +                 return true; > >> > > +         if (memcg == root) > >> > > +                 break; > >> > > + } > >> > > + return false; > >> > > +} > >> > > >> > Well, this might be little bit tricky. We do not check whether memcg and > >> > root are in a hierarchy (in terms of use_hierarchy) relation. > >> > > >> > If we are under global reclaim then we iterate over all memcgs and so > >> > there is no guarantee that there is a hierarchical relation between the > >> > given memcg and its parent. While, on the other hand, if we are doing > >> > memcg reclaim then we have this guarantee. > >> > > >> > Why should we punish a group (subtree) which is perfectly under its soft > >> > limit just because some other subtree contributes to the common parent's > >> > usage and makes it over its limit? > >> > Should we check memcg->use_hierarchy here? > >> > >> We do, actually.  parent_mem_cgroup() checks the res_counter parent, > >> which is only set when ->use_hierarchy is also set. > > > > Of course I am blind.. We do not setup res_counter parent for > > !use_hierarchy case. Sorry for noise... > > Now it makes much better sense. I was wondering how !use_hierarchy could > > ever work, this should be a signal that I am overlooking something > > terribly. > > > > [...] 
> >> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, > >> > >                   .mem_cgroup = memcg, > >> > >                   .zone = zone, > >> > >           }; > >> > > +         int epriority = priority; > >> > > +         /* > >> > > +          * Put more pressure on hierarchies that exceed their > >> > > +          * soft limit, to push them back harder than their > >> > > +          * well-behaving siblings. > >> > > +          */ > >> > > +         if (mem_cgroup_over_softlimit(root, memcg)) > >> > > +                 epriority = 0; > >> > > >> > This sounds too aggressive to me. Shouldn't we just double the pressure > >> > or something like that? > >> > >> That's the historical value.  When I tried priority - 1, it was not > >> aggressive enough. > > > > Probably because we want to reclaim too much. Maybe we should do > > reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain > > priority level as Ying suggested in her patchset. > > I plan to post that change on top of this, and this patch set does the > basic stuff to allow us doing further improvement. > > I still like the design to skip over_soft_limit cgroups until certain > priority. One way to set up the soft limit for each cgroup is to base > on its actual working set size, and we prefer to punish A first with > lots of page cache ( cold file pages above soft limit) than reclaiming > anon pages from B ( below soft limit ). Unless we can not get enough > pages reclaimed from A, we will start reclaiming from B. > > This might not be the ideal solution, but should be a good start. Thoughts? I don't like this design at all because unless you add weird code to detect if soft limits apply to any memcgs on the reclaimed hierarchy you may iterate over the same bunch of memcgs doing nothing for several times. 
For example in the default case of no softlimits set anywhere and you repeatedly walk ALL memcgs in the system doing jack until you reach your threshold priority level. Elegant is something else in my book. Once we invert soft limits to mean guarantees and make the default soft limit not infinity but zero, then we can ignore memcgs below their soft limit for a few priority levels just fine because being below the soft limit is the exception. But I don't really want to make this quite invasive behavioural change a requirement for a refactoring patch if possible. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756005Ab2ARLZk (ORCPT ); Wed, 18 Jan 2012 06:25:40 -0500 Received: from mail-iy0-f174.google.com ([209.85.210.174]:47337 "EHLO mail-iy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752275Ab2ARLZi (ORCPT ); Wed, 18 Jan 2012 06:25:38 -0500 Message-ID: <4F16AC27.1080906@gmail.com> Date: Wed, 18 Jan 2012 19:25:27 +0800 From: Sha User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110922 Thunderbird/3.1.15 MIME-Version: 1.0 To: Johannes Weiner CC: Ying Han , Andrew Morton , Michal Hocko , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> <20120118092509.GI24386@cmpxchg.org> In-Reply-To: <20120118092509.GI24386@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/18/2012 05:25 PM, Johannes Weiner wrote: 
> On Wed, Jan 18, 2012 at 03:17:25PM +0800, Sha wrote: >>>> I don't think it solve the root of the problem, example: >>>> root >>>> -> A (hard limit 20G, soft limit 12G, usage 20G) >>>> -> A1 ( soft limit 2G, usage 1G) >>>> -> A2 ( soft limit 10G, usage 19G) >>>> ->B1 (soft limit 5G, usage 4G) >>>> ->B2 (soft limit 5G, usage 15G) >>>> >>>> Now A is hitting its hard limit and start hierarchical reclaim under A. >>>> If we choose B1 to go through mem_cgroup_over_soft_limit, it will >>>> return true because its parent A2 has a large usage and will lead to >>>> priority=0 reclaiming. But in fact it should be B2 to be punished. >>> Because A2 is over its soft limit, the whole hierarchy below it should >>> be preferred over A1, so both B1 and B2 should be soft limit reclaimed >>> to be consistent with behaviour at the root level. >> Well it is just the behavior that I'm expecting actually. But with my >> humble comprehension, I can't catch the soft-limit-based hierarchical >> reclaiming under the target cgroup (A2) in the current implementation >> or after the patch. Both the current mem_cgroup_soft_reclaim or >> shrink_zone select victim sub-cgroup by mem_cgroup_iter, but it >> doesn't take soft limit into consideration, do I left anything ? > No, currently soft limits are ignored if pressure originates from > below root_mem_cgroup. > > But iff soft limits are applied right now, they are applied > hierarchically, see mem_cgroup_soft_limit_reclaim(). Er... I'm even more confused: mem_cgroup_soft_limit_reclaim indeed choses the biggest soft-limit excessor first, but in the succeeding reclaim mem_cgroup_hierarchical_reclaim just selects a child cgroup by css_id which has nothing to do with soft limit (see mem_cgroup_select_victim). IMHO, it's not a genuine hierarchical reclaim. I check this from the latest memcg-devel git tree (branch since-3.1)... 
> In my opinion, the fact that soft limits are ignored when pressure is > triggered sub-root_mem_cgroup is an artifact of the per-zone tree, so > I allowed soft limits to be taken into account below root_mem_cgroup. > > But IMO, this is something different from how soft limit reclaim is > applied once triggered: currently, soft limit reclaim applies to a > whole hierarchy, including all children. And this I left unchanged. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757935Ab2ARP1N (ORCPT ); Wed, 18 Jan 2012 10:27:13 -0500 Received: from cantor2.suse.de ([195.135.220.15]:37428 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757891Ab2ARP1K (ORCPT ); Wed, 18 Jan 2012 10:27:10 -0500 Date: Wed, 18 Jan 2012 16:27:08 +0100 From: Michal Hocko To: Sha Cc: Johannes Weiner , Ying Han , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim Message-ID: <20120118152708.GG31112@tiehlicka.suse.cz> References: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> <20120118092509.GI24386@cmpxchg.org> <4F16AC27.1080906@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F16AC27.1080906@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 18-01-12 19:25:27, Sha wrote: [...] > Er... 
I'm even more confused: mem_cgroup_soft_limit_reclaim indeed > chooses the biggest soft-limit excessor first, but in the succeeding reclaim > mem_cgroup_hierarchical_reclaim just selects a child cgroup by css_id mem_cgroup_soft_limit_reclaim picks the hierarchy root (the one with the largest excess) and mem_cgroup_hierarchical_reclaim reclaims from that subtree. It doesn't care who exceeds the soft limit under that hierarchy; it just tries to push the root under its limit as much as it can. This is what Johannes tried to explain in the other email in the thread. > which has nothing to do with soft limit (see mem_cgroup_select_victim). > IMHO, it's not a genuine hierarchical reclaim. It is hierarchical because it iterates over the hierarchy; it is not and never was recursively soft-hierarchical... -- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757979Ab2ARUi5 (ORCPT ); Wed, 18 Jan 2012 15:38:57 -0500 Received: from mail-qw0-f53.google.com ([209.85.216.53]:51207 "EHLO mail-qw0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757135Ab2ARUiz convert rfc822-to-8bit (ORCPT ); Wed, 18 Jan 2012 15:38:55 -0500 MIME-Version: 1.0 In-Reply-To: <20120118094523.GJ24386@cmpxchg.org> References: <1326207772-16762-1-git-send-email-hannes@cmpxchg.org> <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120113120406.GC17060@tiehlicka.suse.cz> <20120113155001.GB1653@cmpxchg.org> <20120113163423.GG17060@tiehlicka.suse.cz> <20120118094523.GJ24386@cmpxchg.org> Date: Wed, 18 Jan 2012 12:38:54 -0800 Message-ID: Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim From: Ying Han To: Johannes Weiner Cc: Michal Hocko , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org X-System-Of-Record: true Content-Type: text/plain;
charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 18, 2012 at 1:45 AM, Johannes Weiner wrote: > On Fri, Jan 13, 2012 at 01:45:30PM -0800, Ying Han wrote: >> On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko wrote: >> > On Fri 13-01-12 16:50:01, Johannes Weiner wrote: >> >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote: >> >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote: >> > [...] >> >> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root, >> >> > > +                        struct mem_cgroup *memcg) >> >> > > +{ >> >> > > + if (mem_cgroup_disabled()) >> >> > > +         return false; >> >> > > + >> >> > > + if (!root) >> >> > > +         root = root_mem_cgroup; >> >> > > + >> >> > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) { >> >> > > +         /* root_mem_cgroup does not have a soft limit */ >> >> > > +         if (memcg == root_mem_cgroup) >> >> > > +                 break; >> >> > > +         if (res_counter_soft_limit_excess(&memcg->res)) >> >> > > +                 return true; >> >> > > +         if (memcg == root) >> >> > > +                 break; >> >> > > + } >> >> > > + return false; >> >> > > +} >> >> > >> >> > Well, this might be little bit tricky. We do not check whether memcg and >> >> > root are in a hierarchy (in terms of use_hierarchy) relation. >> >> > >> >> > If we are under global reclaim then we iterate over all memcgs and so >> >> > there is no guarantee that there is a hierarchical relation between the >> >> > given memcg and its parent. While, on the other hand, if we are doing >> >> > memcg reclaim then we have this guarantee. >> >> > >> >> > Why should we punish a group (subtree) which is perfectly under its soft >> >> > limit just because some other subtree contributes to the common parent's >> >> > usage and makes it over its limit? >> >> > Should we check memcg->use_hierarchy here? 
>> >> >> >> We do, actually.  parent_mem_cgroup() checks the res_counter parent, >> >> which is only set when ->use_hierarchy is also set. >> > >> > Of course I am blind.. We do not setup res_counter parent for >> > !use_hierarchy case. Sorry for noise... >> > Now it makes much better sense. I was wondering how !use_hierarchy could >> > ever work, this should be a signal that I am overlooking something >> > terribly. >> > >> > [...] >> >> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone, >> >> > >                   .mem_cgroup = memcg, >> >> > >                   .zone = zone, >> >> > >           }; >> >> > > +         int epriority = priority; >> >> > > +         /* >> >> > > +          * Put more pressure on hierarchies that exceed their >> >> > > +          * soft limit, to push them back harder than their >> >> > > +          * well-behaving siblings. >> >> > > +          */ >> >> > > +         if (mem_cgroup_over_softlimit(root, memcg)) >> >> > > +                 epriority = 0; >> >> > >> >> > This sounds too aggressive to me. Shouldn't we just double the pressure >> >> > or something like that? >> >> >> >> That's the historical value.  When I tried priority - 1, it was not >> >> aggressive enough. >> > >> > Probably because we want to reclaim too much. Maybe we should do >> > reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain >> > priority level as Ying suggested in her patchset. >> >> I plan to post that change on top of this, and this patch set does the >> basic stuff to allow us doing further improvement. >> >> I still like the design to skip over_soft_limit cgroups until certain >> priority. One way to set up the soft limit for each cgroup is to base >> on its actual working set size, and we prefer to punish A first with >> lots of page cache ( cold file pages above soft limit) than reclaiming >> anon pages from B ( below soft limit ). 
Unless we can not get enough >> pages reclaimed from A, we will start reclaiming from B. >> >> This might not be the ideal solution, but should be a good start. Thoughts? > > I don't like this design at all because unless you add weird code to > detect if soft limits apply to any memcgs on the reclaimed hierarchy > you may iterate over the same bunch of memcgs doing nothing for > several times.  For example in the default case of no softlimits set > anywhere and you repeatedly walk ALL memcgs in the system doing jack > until you reach your threshold priority level.  Elegant is something > else in my book. Agree that change isn't ready until the default soft limit is changed to "0". > Once we invert soft limits to mean guarantees and make the default > soft limit not infinity but zero, then we can ignore memcgs below > their soft limit for a few priority levels just fine because being > below the soft limit is the exception.  But I don't really want to > make this quite invasive behavioural change a requirement for a > refactoring patch if possible. Sounds reasonable to me. 
--Ying From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752114Ab2ASGiW (ORCPT ); Thu, 19 Jan 2012 01:38:22 -0500 Received: from mail-iy0-f174.google.com ([209.85.210.174]:42286 "EHLO mail-iy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750803Ab2ASGiS (ORCPT ); Thu, 19 Jan 2012 01:38:18 -0500 Message-ID: <4F17BA58.2090403@gmail.com> Date: Thu, 19 Jan 2012 14:38:16 +0800 From: Sha User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110922 Thunderbird/3.1.15 MIME-Version: 1.0 To: Michal Hocko CC: Johannes Weiner , Ying Han , Andrew Morton , KAMEZAWA Hiroyuki , Balbir Singh , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim References: <1326207772-16762-3-git-send-email-hannes@cmpxchg.org> <20120112085904.GG24386@cmpxchg.org> <20120113224424.GC1653@cmpxchg.org> <4F158418.2090509@gmail.com> <20120117145348.GA3144@cmpxchg.org> <20120118092509.GI24386@cmpxchg.org> <4F16AC27.1080906@gmail.com> <20120118152708.GG31112@tiehlicka.suse.cz> In-Reply-To: <20120118152708.GG31112@tiehlicka.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/18/2012 11:27 PM, Michal Hocko wrote: > On Wed 18-01-12 19:25:27, Sha wrote: > [...] >> Er... I'm even more confused: mem_cgroup_soft_limit_reclaim indeed >> choses the biggest soft-limit excessor first, but in the succeeding reclaim >> mem_cgroup_hierarchical_reclaim just selects a child cgroup by css_id > mem_cgroup_soft_limit_reclaim picks up the hierarchy root (most > excessing one) and mem_cgroup_hierarchical_reclaim reclaims from that > subtree). 
It doesn't care who exceeds the soft limit under that > hierarchy; it just tries to push the root under its limit as much as it > can. This is what Johannes tried to explain in the other email in the > thread. Yeah, I finally twig what he meant... I'm not quite familiar with this part. Thanks a lot for the explanation. :-) Sha >> which has nothing to do with soft limit (see mem_cgroup_select_victim). >> IMHO, it's not a genuine hierarchical reclaim. > It is hierarchical because it iterates over the hierarchy; it is not and > never was recursively soft-hierarchical... >
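To make the distinction drawn above concrete, here is a minimal userspace C sketch of the two steps as they are described in this exchange: first pick the group with the largest soft limit excess, then treat its whole subtree as reclaim candidates without looking at the children's own soft limits. It is a simplified model under stated assumptions: the kernel of that time keeps excessors in a per-zone tree, whereas this sketch scans a flat array, and struct memcg, largest_excessor() and in_subtree() are hypothetical names. Sizes are in whole gigabytes and the tree is the example from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a memory cgroup; not the kernel structure. */
struct memcg {
	const char *name;
	struct memcg *parent;		/* NULL for the top-level root */
	unsigned long usage_gb;
	unsigned long soft_limit_gb;
};

struct memcg root_cg = { "root", NULL, 20, 0 };
struct memcg A  = { "A",  &root_cg, 20, 12 };
struct memcg A1 = { "A1", &A,  1,  2 };
struct memcg A2 = { "A2", &A, 19, 10 };
struct memcg B1 = { "B1", &A2, 4,  5 };
struct memcg B2 = { "B2", &A2, 15, 5 };

/* Every group that has a soft limit, i.e. all but the root. */
struct memcg *all_groups[] = { &A, &A1, &A2, &B1, &B2 };

/* Amount by which a group exceeds its soft limit, 0 if it does not. */
unsigned long soft_limit_excess(const struct memcg *memcg)
{
	if (memcg->usage_gb > memcg->soft_limit_gb)
		return memcg->usage_gb - memcg->soft_limit_gb;
	return 0;
}

/* Step 1: pick the group with the largest excess, NULL if none exceeds. */
struct memcg *largest_excessor(struct memcg **groups, size_t n)
{
	struct memcg *victim = NULL;
	unsigned long max = 0;
	size_t i;

	assert(groups != NULL || n == 0);
	for (i = 0; i < n; i++) {
		unsigned long excess = soft_limit_excess(groups[i]);

		if (excess > max) {
			max = excess;
			victim = groups[i];
		}
	}
	return victim;
}

/*
 * Step 2: a group is a reclaim candidate if it is the chosen victim
 * root or one of its descendants; its own soft limit is not consulted.
 */
bool in_subtree(const struct memcg *victim_root, const struct memcg *memcg)
{
	for (; memcg; memcg = memcg->parent)
		if (memcg == victim_root)
			return true;
	return false;
}
```

With these numbers B2 has the largest excess (10G) and is picked first. When A2 (9G excess) is the victim root, in_subtree(&A2, &B1) is true, so B1 is reclaimed along with B2 although B1 is below its own soft limit, which is the "hierarchical but not recursively soft-hierarchical" behaviour described above.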