From: Minchan Kim <minchan@kernel.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-kernel@vger.kernel.org, hillf.zj@alibaba-inc.com,
mgorman@suse.de, vbabka@suse.cz, mm-commits@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
Date: Fri, 13 Jan 2017 10:37:24 +0900 [thread overview]
Message-ID: <20170113013724.GA23494@bbox> (raw)
In-Reply-To: <20170112091016.GE2264@dhcp22.suse.cz>
Hello,
On Thu, Jan 12, 2017 at 10:10:17AM +0100, Michal Hocko wrote:
> On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> > On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > > Hello,
> > > >
> > > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > > [...]
> > > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > > > if (!file && !total_swap_pages)
> > > > > > > return false;
> > > > > > >
> > > > > > > - inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > - active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > + total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > + total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > >
> > > > > >
> > > > > > the decision of deactivating is based on eligible zone's LRU size,
> > > > > > not whole zone so why should we need to get a trace of all zones's LRU?
> > > > >
> > > > > Strictly speaking, the total_ counters are not necessary for making the
> > > > > decision. I found reporting those numbers useful regardless because this
> > > > > will give us also an information how large is the eligible portion of
> > > > > the LRU list. We do not have any other tracepoint which would report
> > > > > that.
> > > >
> > > > The patch doesn't say anything why it's useful. Could you tell why it's
> > > > useful and inactive_list_is_low should be right place?
> > > >
> > > > Don't get me wrong, please. I don't want to bother you.
> > > > I really don't want to add random stuff although it's tracepoint for
> > > > debugging.
> > >
> > > This doesn't sounds random to me. We simply do not have a full picture
> > > on 32b systems without this information. Especially when memcgs are
> > > involved and global numbers spread over different LRUs.
> >
> > Could you elaborate it?
>
> The problem with 32b systems is that you only can consider a part of the
> LRU for the lowmem requests. While we have global counters to see how
> much lowmem inactive/active pages we have, those get distributed to
> memcg LRUs. And that distribution is impossible to guess. So my thinking
> is that it can become a real head scratcher to realize why certain
> active LRUs are aged while others are not. This was the case when I was
> debugging the last issue which triggered all this. All of the sudden I
> have seen many invocations when inactive and active were zero which
> sounded weird, until I realized that those are memcg's lruvec which is
> what total numbers told me...
Hmm, it seems I miss something. AFAIU, what you need is just memcg
identifier, not all lru size. If it isn't, please tell more detail
usecase of all lru size in that particular tracepoint.
>
> Later on I would like to add an memcg identifier to the vmscan
> tracepoints but I didn't get there yet.
>
> > "
> > Currently we have tracepoints for both active and inactive LRU lists
> > reclaim but we do not have any which would tell us why we we decided to
> > age the active list. Without that it is quite hard to diagnose
> > active/inactive lists balancing. Add mm_vmscan_inactive_list_is_low
> > tracepoint to tell us this information.
> > "
> >
> > Your description says "why we decided to age the active list".
> > So, what's needed?
> >
> > >
> > > > > [...]
> > > > > > > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> > > > > > > * lruvec even if it has plenty of old anonymous pages unless the
> > > > > > > * system is under heavy pressure.
> > > > > > > */
> > > > > > > - if (!inactive_list_is_low(lruvec, true, sc) &&
> > > > > > > + if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > > > > >
> > > > > > Hmm, I was curious why you added trace boolean arguement and found it here.
> > > > > > Yes, here is not related to deactivation directly but couldn't we help to
> > > > > > trace it unconditionally?
> > > > >
> > > > > I've had it like that when I was debugging the mentioned bug and found
> > > > > it a bit disturbing. It generated more output than I would like and it
> > > > > wasn't really clear from which code path this has been called from.
> > > >
> > > > Indeed.
> > > >
> > > > Personally, I want to move inactive_list_is_low in shrink_active_list
> > > > and shrink_active_list calls inactive_list_is_low(...., true),
> > > > unconditionally so that it can make code simple/clear but cannot remove
> > > > trace boolean variable , which what I want. So, it's okay if you love
> > > > your version.
> > >
> > > I am not sure I am following. Why is the additional parameter a problem?
> >
> > Well, to me, it's not a elegance. Is it? If we need such boolean variable
> > to control show the trace, it means it's not a good place or think
> > refactoring.
>
> But, even when you refactor the code there will be other callers of
> inactive_list_is_low outside of shrink_active_list...
Yes, that's why I said "it's okay if you love your version". However,
we can do refactoring to remove "bool trace" and even, it makes code
more readable, I believe.
WARNING: multiple messages have this Message-ID (diff)
From: Minchan Kim <minchan@kernel.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-kernel@vger.kernel.org, hillf.zj@alibaba-inc.com,
mgorman@suse.de, vbabka@suse.cz, mm-commits@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
Date: Fri, 13 Jan 2017 10:37:24 +0900 [thread overview]
Message-ID: <20170113013724.GA23494@bbox> (raw)
In-Reply-To: <20170112091016.GE2264@dhcp22.suse.cz>
Hello,
On Thu, Jan 12, 2017 at 10:10:17AM +0100, Michal Hocko wrote:
> On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> > On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > > Hello,
> > > >
> > > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > > [...]
> > > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > > > if (!file && !total_swap_pages)
> > > > > > > return false;
> > > > > > >
> > > > > > > - inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > - active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > + total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > + total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > >
> > > > > >
> > > > > > the decision of deactivating is based on eligible zone's LRU size,
> > > > > > not whole zone so why should we need to get a trace of all zones's LRU?
> > > > >
> > > > > Strictly speaking, the total_ counters are not necessary for making the
> > > > > decision. I found reporting those numbers useful regardless because this
> > > > > will give us also an information how large is the eligible portion of
> > > > > the LRU list. We do not have any other tracepoint which would report
> > > > > that.
> > > >
> > > > The patch doesn't say anything why it's useful. Could you tell why it's
> > > > useful and inactive_list_is_low should be right place?
> > > >
> > > > Don't get me wrong, please. I don't want to bother you.
> > > > I really don't want to add random stuff although it's tracepoint for
> > > > debugging.
> > >
> > > This doesn't sounds random to me. We simply do not have a full picture
> > > on 32b systems without this information. Especially when memcgs are
> > > involved and global numbers spread over different LRUs.
> >
> > Could you elaborate it?
>
> The problem with 32b systems is that you only can consider a part of the
> LRU for the lowmem requests. While we have global counters to see how
> much lowmem inactive/active pages we have, those get distributed to
> memcg LRUs. And that distribution is impossible to guess. So my thinking
> is that it can become a real head scratcher to realize why certain
> active LRUs are aged while others are not. This was the case when I was
> debugging the last issue which triggered all this. All of the sudden I
> have seen many invocations when inactive and active were zero which
> sounded weird, until I realized that those are memcg's lruvec which is
> what total numbers told me...
Hmm, it seems I miss something. AFAIU, what you need is just memcg
identifier, not all lru size. If it isn't, please tell more detail
usecase of all lru size in that particular tracepoint.
>
> Later on I would like to add an memcg identifier to the vmscan
> tracepoints but I didn't get there yet.
>
> > "
> > Currently we have tracepoints for both active and inactive LRU lists
> > reclaim but we do not have any which would tell us why we we decided to
> > age the active list. Without that it is quite hard to diagnose
> > active/inactive lists balancing. Add mm_vmscan_inactive_list_is_low
> > tracepoint to tell us this information.
> > "
> >
> > Your description says "why we decided to age the active list".
> > So, what's needed?
> >
> > >
> > > > > [...]
> > > > > > > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> > > > > > > * lruvec even if it has plenty of old anonymous pages unless the
> > > > > > > * system is under heavy pressure.
> > > > > > > */
> > > > > > > - if (!inactive_list_is_low(lruvec, true, sc) &&
> > > > > > > + if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > > > > >
> > > > > > Hmm, I was curious why you added trace boolean arguement and found it here.
> > > > > > Yes, here is not related to deactivation directly but couldn't we help to
> > > > > > trace it unconditionally?
> > > > >
> > > > > I've had it like that when I was debugging the mentioned bug and found
> > > > > it a bit disturbing. It generated more output than I would like and it
> > > > > wasn't really clear from which code path this has been called from.
> > > >
> > > > Indeed.
> > > >
> > > > Personally, I want to move inactive_list_is_low in shrink_active_list
> > > > and shrink_active_list calls inactive_list_is_low(...., true),
> > > > unconditionally so that it can make code simple/clear but cannot remove
> > > > trace boolean variable , which what I want. So, it's okay if you love
> > > > your version.
> > >
> > > I am not sure I am following. Why is the additional parameter a problem?
> >
> > Well, to me, it's not a elegance. Is it? If we need such boolean variable
> > to control show the trace, it means it's not a good place or think
> > refactoring.
>
> But, even when you refactor the code there will be other callers of
> inactive_list_is_low outside of shrink_active_list...
Yes, that's why I said "it's okay if you love your version". However,
we can do refactoring to remove "bool trace" and even, it makes code
more readable, I believe.
>From 06eb7201d781155a8dee7e72fbb8423ec8175223 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 13 Jan 2017 10:13:36 +0900
Subject: [PATCH] mm: refactoring inactive_list_is_low
Recently, Michal Hocko added tracepoint into inactive_list_is_low
for catching why VM decided to age the active list to know
active/inacive balancing problem. With that, unfortunately, it
added "bool trace" to inactlive_list_is_low to control some place
should be prohibited tracing. It is not elegant to me so this patch
try to clean it up.
Normally, most inactive_list_is_low is used for deciding active list
demotion but one site(i.e., get_scan_count) uses for other purpose
which reclaim file LRU forcefully. Sites for deactivation calls it
with shrink_active_list. It means inactive_list_is_low could be
located in shrink_active_list.
One more thing this patch does is to remove "ratio" in the tracepoint
because we can get it by post processing in script via simple math.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/trace/events/vmscan.h | 9 +++-----
mm/vmscan.c | 51 ++++++++++++++++++++++++-------------------
2 files changed, 31 insertions(+), 29 deletions(-)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 27e8a5c..406ea95 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -432,9 +432,9 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
TP_PROTO(int nid, int reclaim_idx,
unsigned long total_inactive, unsigned long inactive,
unsigned long total_active, unsigned long active,
- unsigned long ratio, int file),
+ int file),
- TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file),
+ TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, file),
TP_STRUCT__entry(
__field(int, nid)
@@ -443,7 +443,6 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
__field(unsigned long, inactive)
__field(unsigned long, total_active)
__field(unsigned long, active)
- __field(unsigned long, ratio)
__field(int, reclaim_flags)
),
@@ -454,16 +453,14 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
__entry->inactive = inactive;
__entry->total_active = total_active;
__entry->active = active;
- __entry->ratio = ratio;
__entry->reclaim_flags = trace_shrink_flags(file) & RECLAIM_WB_LRU;
),
- TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s",
+ TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld flags=%s",
__entry->nid,
__entry->reclaim_idx,
__entry->total_inactive, __entry->inactive,
__entry->total_active, __entry->active,
- __entry->ratio,
show_reclaim_flags(__entry->reclaim_flags))
);
#endif /* _TRACE_VMSCAN_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 75cdf68..6890c21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -150,6 +150,7 @@ unsigned long vm_total_pages;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
+static bool inactive_list_is_low(bool file, unsigned long, unsigned long);
#ifdef CONFIG_MEMCG
static bool global_reclaim(struct scan_control *sc)
@@ -1962,6 +1963,22 @@ static void shrink_active_list(unsigned long nr_to_scan,
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+ unsigned long inactive, active;
+ enum lru_list inactive_lru = file * LRU_FILE;
+ enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
+ bool deactivate;
+
+ inactive = lruvec_lru_size_eligibe_zones(lruvec, file * LRU_FILE,
+ sc->reclaim_idx);
+ active = lruvec_lru_size_eligibe_zones(lruvec, file * LRU_FILE +
+ LRU_ACTIVE, sc->reclaim_idx);
+ deactivate = inactive_list_is_low(file, inactive, active);
+ trace_mm_vmscan_inactive_list_is_low(pgdat->node_id,
+ sc->reclaim_idx,
+ lruvec_lru_size(lruvec, inactive_lru), inactive,
+ lruvec_lru_size(lruvec, active_lru), active, file);
+ if (!deactivate)
+ return;
lru_add_drain();
@@ -2073,13 +2090,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
* 1TB 101 10GB
* 10TB 320 32GB
*/
-static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
- struct scan_control *sc, bool trace)
+static bool inactive_list_is_low(bool file,
+ unsigned long inactive, unsigned long active)
{
unsigned long inactive_ratio;
- unsigned long inactive, active;
- enum lru_list inactive_lru = file * LRU_FILE;
- enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
unsigned long gb;
/*
@@ -2089,22 +2103,12 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!file && !total_swap_pages)
return false;
- inactive = lruvec_lru_size_eligibe_zones(lruvec, inactive_lru, sc->reclaim_idx);
- active = lruvec_lru_size_eligibe_zones(lruvec, active_lru, sc->reclaim_idx);
-
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
inactive_ratio = int_sqrt(10 * gb);
else
inactive_ratio = 1;
- if (trace)
- trace_mm_vmscan_inactive_list_is_low(lruvec_pgdat(lruvec)->node_id,
- sc->reclaim_idx,
- lruvec_lru_size(lruvec, inactive_lru), inactive,
- lruvec_lru_size(lruvec, active_lru), active,
- inactive_ratio, file);
-
return inactive * inactive_ratio < active;
}
@@ -2112,8 +2116,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
if (is_active_lru(lru)) {
- if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
- shrink_active_list(nr_to_scan, lruvec, sc, lru);
+ shrink_active_list(nr_to_scan, lruvec, sc, lru);
return 0;
}
@@ -2153,6 +2156,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
enum lru_list lru;
bool some_scanned;
int pass;
+ unsigned long inactive, active;
/*
* If the zone or memcg is small, nr[l] can be 0. This
@@ -2243,7 +2247,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* lruvec even if it has plenty of old anonymous pages unless the
* system is under heavy pressure.
*/
- if (!inactive_list_is_low(lruvec, true, sc, false) &&
+ inactive = lruvec_lru_size_eligibe_zones(lruvec,
+ LRU_FILE, sc->reclaim_idx);
+ active = lruvec_lru_size_eligibe_zones(lruvec,
+ LRU_FILE + LRU_ACTIVE, sc->reclaim_idx);
+ if (!inactive_list_is_low(true, inactive, active) &&
lruvec_lru_size_eligibe_zones(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
@@ -2468,9 +2476,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
- if (inactive_list_is_low(lruvec, false, sc, true))
- shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
- sc, LRU_ACTIVE_ANON);
+ shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON);
}
/* Use reclaim/compaction for costly allocs or under memory pressure */
@@ -3118,8 +3124,7 @@ static void age_active_anon(struct pglist_data *pgdat,
do {
struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
- if (inactive_list_is_low(lruvec, false, sc, true))
- shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
+ shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
memcg = mem_cgroup_iter(NULL, memcg, NULL);
--
2.7.4
next prev parent reply other threads:[~2017-01-13 1:37 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-01-05 23:46 + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree akpm
2017-01-10 23:52 ` Minchan Kim
2017-01-10 23:52 ` Minchan Kim
2017-01-11 15:52 ` Michal Hocko
2017-01-11 15:52 ` Michal Hocko
2017-01-12 5:12 ` Minchan Kim
2017-01-12 5:12 ` Minchan Kim
2017-01-12 8:15 ` Michal Hocko
2017-01-12 8:15 ` Michal Hocko
2017-01-12 8:48 ` Minchan Kim
2017-01-12 8:48 ` Minchan Kim
2017-01-12 9:10 ` Michal Hocko
2017-01-12 9:10 ` Michal Hocko
2017-01-13 1:37 ` Minchan Kim [this message]
2017-01-13 1:37 ` Minchan Kim
2017-01-13 7:47 ` Michal Hocko
2017-01-13 7:47 ` Michal Hocko
2017-01-13 8:57 ` Minchan Kim
2017-01-13 8:57 ` Minchan Kim
2017-01-13 9:10 ` Michal Hocko
2017-01-13 9:10 ` Michal Hocko
2017-01-17 6:45 ` Minchan Kim
2017-01-17 6:45 ` Minchan Kim
2017-01-17 10:12 ` Michal Hocko
2017-01-17 10:12 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170113013724.GA23494@bbox \
--to=minchan@kernel.org \
--cc=hillf.zj@alibaba-inc.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mhocko@kernel.org \
--cc=mm-commits@vger.kernel.org \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.